Shakespeare Golden Ear Test
The Shakespeare Conference: SHK 18.0492  Wednesday, 1 August 2007

From: 		Ward Elliott
Date: 		Wednesday, 01 Aug 2007 00:24:25 -0700
Subject: 18.0473 Shakespeare Golden Ear Test
Comment: 	RE: SHK 18.0473 Shakespeare Golden Ear Test

Our warmest thanks to the 80 brave SHAKSPERians who took our Golden Ear 
Test, Round 1, and to Hardy for making them all reachable by us.  We 
hope that SHAKSPERians will indulge us in an oversize 20-page posting, 
giving a preliminary analysis of the results of the test, which ran 
for ten days, from July 6 through July 15.  It's now OK to discuss the 
test online, as well as off, but try not to give away too many of the 
test's specifics, to preserve some usability for it in the future.  The 
test is still up, http://goldenear.cmc.edu, but we haven't been 
monitoring it since July 15 and would suppose that any results since 
then are likely to be retakes to get another look at the test.

Our two-sentence conclusion is this:  As individuals, on average, the 
whole SHAKSPER group got almost two out of three unrecognized passages 
right (63%), and the top 30% got almost three out of four right (74%). 
As an aggregated group, the whole group got almost four out of five 
right (79%), and the top 30% got almost five out of six right (82%).

For a two-paragraph conclusion, scroll down to Table II, and then to 
Section XI, Conclusions, and don't miss the cautions and further thanks 
at the very end.

For 20 pages of further detail, read on.

I. Who took the test?

31 of the 80 takers considered themselves Shakespeare pros (39%) and 49 
considered themselves amateurs.  We would guess that almost all were 
from SHAKSPER, though there was also an admixture of takers from HLAS, a 
rowdy, free-wheeling sister group of Shakespeare authorship buffs, most 
of whom also belong to SHAKSPER but are less likely to be pros than 
SHAKSPERians and, on their own turf, are subject to none of Hardy's 
constraints. It's possible that a few of these do not belong to 
SHAKSPER, but we don't think it matters much.  HLAS people are scarcely 
less hooked on Shakespeare, and, hence, almost as high on our list of 
people who might take a test like ours seriously and give us interesting 
results.

26 respondents described themselves as critics, 14 as writers, 33 as 
artists, including performing artists, and 8 as "other humanities [than 
literature], or social, or natural sciences."  80 is about 6% of 
SHAKSPER's membership, a respectable showing.  To encourage wider 
participation, we did not require test takers to give their names, and 
most did not, but 14 takers (18%) did identify themselves as willing, 
and, in some cases, eager, to take our Round 2 test.  12 of these were 
Rated, that is, they scored Bronze or better on the test.  We did not 
encounter many, if any, A-List Shakespeare celebrities, no  Harold 
Blooms or Stephen Greenblatts among the self-identifiers, nor many, if 
any, of SHAKSPER's most vocal past advocates of shutting down your 
computers and listening to your intuition only -- but we can't exclude 
the possibility that some Shakespeare grandees or intuitionists might 
have taken the test anonymously.  Some A-list people helped us design 
the test; we would not expect any of these to have taken it.

II. Gross Accuracy as Found: Two out of three for the group, four out of 
five for the top 30%

Of the 80 who took the test, 24 (30%) were rated Bronze or better, with 
the following distribution:

	Table I.  Rated Takers by Category, All Ratings Gross


		Golden		Silver		Bronze		Total
Identified	4		4		4		12
Unidentified	3		2		7		12

Sh. Pro		6		1		5		12
Sh. Amateur	1		5		6		12

Total		7		6		11		24

Rated players got four out of five identifications right, in gross 
figures, three out of four right in net figures, and did better on 
non-Shakespeare than on Shakespeare.  Gross figures count all correct 
answers, whether recognized from memory or detected by intuition; net 
figures subtract the recognized answers and count intuitive answers 
only.  Net figures are the interesting ones, but they are harder to 
arrive at than gross and are not available for all purposes.

Of the rated takers, neatly, half considered themselves Shakespeare 
pros, the other half amateurs. Half identified themselves, half did not. 
  All of the ratings are slightly inflated because they are based on 
gross, not net accuracy.  Subtracting recognized passages would reduce 
most Golden Ears to Silver, and most Silver to Bronze.

The whole group, on average, got about two out of three identifications 
right, both in gross figures and in net, since the whole group 
recognized fewer passages, on average, than the Rated players.  Like the 
Rated players, the whole group did better on non-Shakespeare than on 
Shakespeare.  The average gross score of all 80 takers was 18.6 of 28 
(66%); their net score equivalent would be about a point lower, 17.6 
(63%).

The 31 pros in the whole group scored between 14 and 25, averaging 18.9 
right of 28 questions (68%, gross). The 49 amateurs scored between 14 
and 24, averaging 18.4 right (66%, gross).  It is not surprising that 
the pros did better, on average, than the amateurs.  It is surprising 
that the gap is so small, especially considering that these are gross 
scores uncorrected for passages recognized by the test-taker, which one 
would expect to be more common among pros than among amateurs. In fact, 
the average gross accuracy scores of every subgroup - critics, writers, 
artists, others -- we tested fell into an extremely narrow range, none 
lower than 18, none as high as 19 (Table X, below).

Should the two-out-of-three or three-out-of-four individual gross 
accuracy levels we found be considered high or low?  Judging from 
post-test lamentations we have heard, scattered mutterings about 
"humbling experience" and "stupid test," and low self-identification 
rates even of high-scoring players, we would guess that many, perhaps 
most, takers were disappointed with their scores, expected to do better, 
and didn't want to put their names to their test results.

Our further guess is that they reacted to our test much as law students 
were expected to do to an experiment/demonstration routinely performed 
in Evidence classes in our day, decades ago.  A two-minute dramatic 
event is staged, and the students are asked to describe what happened. 
Who was chasing whom?  What did they say?  What color were the first 
one's eyes?  How tall was the second one, and what did he weigh?  What 
color were his socks?  The answers, when collected, always turn out to 
be all over the lot and filled with inaccuracies, and the class is 
supposed to learn from the "humbling experience" that eyewitness 
testimony is not very reliable.

We followed a different model in our Golden Ear experiment (below).  Our 
expectations of individual performances were not so high, and we believe 
that SHAKSPERians have done quite a bit better as a group than most of 
them think.  The average individual outcomes as found struck us as par 
for the course or better; and those for rated players seem particularly 
impressive. Moreover, we now have further confirmation that the scores 
as found can be tweaked by screening and aggregation to reach 
surprisingly high levels of group accuracy, which have never, to our 
knowledge, been demonstrated before on this scale.

Let's start with the outcomes as found.  Two-out-of-three average 
accuracy is far from perfect, but it is better than chance, better on 
non-Shakespeare than most of our individual computer tests on sizable 
texts (though not better than all of them combined), and better than all 
our computer tests combined on the very short, sonnet-length passages we 
used in the test.  It is also, we would guess, better than three of our 
four past pilot-study groups, mostly of Claremont Colleges students 
(below). The mean for the SHAKSPER group is barely a point and a half 
short of a Bronze.  For reference, our preset range boundaries were: 
Golden Ear, 24-28 out of 28; Silver, 22-23; Bronze, 20-21. No one got a 
tin ear, 12 or less, because chance tends to pull all scores, high and 
low, toward the mean.  If you don't recognize anything at all and guess 
at random, you still have 50-50 odds of getting each guess right, and 
you will much more likely get a 15 or a 16 than a zero.  Getting a zero 
would be a remarkable feat, implying powers of discrimination comparable 
to what is needed to max the test with a 28.  With a 35% average failure 
rate, about what SHAKSPER's was, you would expect 1.1 Tin Ears (below 12) 
purely by chance; we got none.  You would also expect 1.1 Golden Ears 
(24+) purely by chance; we got seven, and would conclude that their 
success has to be more than pure luck.
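
For readers who want to check the chance arithmetic, a minimal sketch along 
the following lines (in Python, added here for illustration, not part of the 
test itself) should come out close to the 1.1 figures above, under the 
assumption that every taker answers all 28 yes/no questions independently 
with the same 65% success rate; the thresholds are the rating boundaries 
just given.

from math import comb

def binom_range_prob(n, p, lo, hi):
    # Probability of getting between lo and hi questions right (inclusive)
    # out of n, if each is an independent yes/no guess with success rate p.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

N_QUESTIONS = 28
N_TAKERS = 80
P_CORRECT = 0.65   # assumed per-question success rate (the ~35% failure rate above)

p_golden = binom_range_prob(N_QUESTIONS, P_CORRECT, 24, 28)  # Golden Ear: 24-28 right
p_tin = binom_range_prob(N_QUESTIONS, P_CORRECT, 0, 12)      # Tin Ear: 12 or fewer right

print(f"Expected Golden Ears by chance alone: {N_TAKERS * p_golden:.1f}")
print(f"Expected Tin Ears by chance alone:    {N_TAKERS * p_tin:.1f}")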

Of the 24 rated players, Bronze or better, the pro gross average was 
22.7, amateur average, 21.3, all rated combined, 22.0.  These amount to 
81%, 76%, and 79% gross accuracy, respectively.  Stated differently, the 
average rated pro player had a Silver Ear and could correctly ascribe 
four out of five passages; the average rated amateur could get three out 
of four - some of which in both cases, however, were from memory, not 
from intuitive detection.  Again, we know of no computer test, or 
combination of computer tests, which could reach this level of accuracy 
in identifying such short passages.

III. Gross accuracy tweaked upwards by aggregation and screening.

Now let's get to the tweaking.  We knew from our pilot study of a dozen 
Claremont students in a 2002 Shakespeare class that you could bring out 
a group's latent accuracy by screening for the best and averaging for 
the group. The best student in the group got 79% right; the top half of 
the group, horizontally aggregated by majority rule for each question, 
got 85% right; and the whole class, so aggregated, got 89% right.  These 
are both gross and net figures because student recognition of the 
passages was close to zero.  Could we do the same with a larger, more 
experienced panel of SHAKSPERians?   The short answer is not quite the 
students' remarkable level of nine out of ten, but still an impressive 
five out of six, net, after removing all recognized passages.

In our day the classic demonstration of the benefits of aggregation, 
since revealingly elaborated by James Surowiecki in The Wisdom of 
Crowds, 2004, was performed routinely in business school classes.  The 
class would be asked to guess the number of beans in a jar, and, despite 
their best efforts to get it right, would find that their answers were 
all over the lot, just like the law students.  But, far from dismissing 
it as another "humbling experience," the professor would go on to plot 
the guesses.  If the class took the task seriously, as they generally do 
in business schools, the usual outcome was this:  the guesses would form 
a bell curve, with the peak of the curve within five or ten percent of 
the actual number of beans in the jar.  Thanks to the Wisdom of Crowds, 
the class as a whole would often equal or surpass the accuracy of its 
best member, and with far greater reliability and predictability, since 
no one can tell in advance who is going to make the best guess.  Chance 
is much more of a factor with one considered guess than with many.  Our 
adaptation of the Wisdom of Crowds to the Golden Ear test was not just 
to average takers' individual scores, but also to average their answers 
to each question and draw from these "horizontal" averages a collective 
group score.  For the SHAKSPER takers, the two tweaks combined raised 
gross group accuracy dramatically, from an untweaked individual average 
accuracy of 66% to a tweaked collective average accuracy of almost 93%.

Here's how we did it:

We have seen that the gross vertical average of final individual scores 
of all takers was 18.6, with 66% correct answers.  The collective gross 
score of all takers, first finding the group's majority horizontally on 
each question and then scoring the whole group's averaged answers 
vertically, just like a single test, was 22, or 79% correct answers. 
Simply aggregating the whole group's guesses on each question raised 
their gross accuracy from two out of three to four out of five.
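
For concreteness, here is a small sketch (Python; the five "takers" and four 
questions are invented for illustration, not actual test data) of the two 
scoring methods just described: the vertical average of individual scores, 
and the horizontal, majority-rule-per-question group score.  Even in this toy 
case, the aggregated group outscores its average member.

from collections import Counter

def average_individual_score(answers, key):
    # "Vertical" average: each taker's fraction right, averaged over takers.
    scores = [sum(a == k for a, k in zip(row, key)) / len(key) for row in answers]
    return sum(scores) / len(scores)

def aggregated_group_score(answers, key):
    # "Horizontal" aggregation: take the majority answer to each question,
    # then score that single group ballot against the key.
    ballot = []
    for q in range(len(key)):
        votes = Counter(row[q] for row in answers)
        ballot.append(votes.most_common(1)[0][0])   # ties broken arbitrarily
    return sum(b == k for b, k in zip(ballot, key)) / len(key)

# Toy data only: 5 takers, 4 questions ("S" = Shakespeare, "N" = not Shakespeare).
key = ["S", "N", "S", "N"]
answers = [
    ["S", "N", "N", "N"],
    ["S", "S", "S", "N"],
    ["N", "N", "S", "N"],
    ["S", "N", "S", "S"],
    ["S", "N", "N", "N"],
]

print(average_individual_score(answers, key))   # 0.75: every taker got 3 of 4
print(aggregated_group_score(answers, key))     # 1.00: the majority got all 4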

As it happens, the gross average of individual scores of all Rated (that 
is, Bronze or better) takers was also 22, exactly equal to the 
four-out-of-five accuracy of the whole group.  Aggregation raised the 
whole group's gross accuracy 13 percentage points (i.e., from 66% to 79%), 
to the same level as its top 30% unaggregated.  Aggregating the top 30%, 
in turn, raised its gross accuracy another 13.9 points, to 26 out of 28, 
or 92.9%, one point better, as it happens, than the anonymous 
top-scoring individual.  As it 
further happens, gross aggregated accuracy was identical for Shakespeare 
and for non-Shakespeare. Unaggregated, the top players got a gross 
average of four out of five right; aggregated they got 13 out of 14 
right. Both of the tweaks combined raised the group's gross collective 
score from 18.6 to an astonishing 26, and its gross collective accuracy 
from 66% to almost 93%, and it seems that both tweaks contributed about 
equally to the improvement.  After these two tweaks, the experience 
might seem not quite so humbling.

IV. Net Accuracy:  After removing recognized passages, still tweakable 
to almost Five out of Six.

But we need at least one further tweak, and this one brings the average 
back down a bit, to four out of five for the group, and five out of six 
for the Top-30% Rated players.  All the gross numbers discussed so far 
make no allowance for recognition and treat every correct answer as if 
it came from intuition.  Unless recognition is zero, as it almost was 
for our student pilot panels, but surely was not for our SHAKSPER panel, 
this is bound to overstate the power of intuition.

We tried to avoid familiar passages, identify them, and exclude them 
where found.

"Avoid" means that we tried to pick the least familiar passages, 
especially Shakespeare passages, we thought we could find.  Jim 
Carroll's complaint (8 July) that our "Shakespearean passages have been 
chosen to minimize what is distinctive about Shakespeare: his 
interesting diction and his constant stretching for metaphorical 
expression" probably reflects this precaution, but we thought it far 
preferable to exclude archetypical passages like "Friends, Romans, 
countrymen" or "Shall I compare thee to a summer's day?" from a test of 
intuitive detection than to include them and either miscount them as 
true detection or toss them out as obviously remembered, in which case, 
you would have to wonder why we bothered to include them in the first 
place.  We would guess that Shakespeare's most archetypical passages are 
also his most familiar, and they are more likely to be tests of memory 
than of intuition.  If we succeeded in choosing unfamiliar passages, the 
ones we chose are no doubt less distinctive than passages randomly 
selected from Shakespeare -- but the alternatives, choosing familiar 
passages, or even choosing at random without regard to familiarity, 
would have been too big a waste of our takers' time and our own. 
Steering clear of the most familiar passages alone was enough in our 
pilot studies to make sure that only one of the students recognized even 
one passage.  The students' net accuracy was therefore essentially the 
same as their gross accuracy because, for them, almost every passage was 
a case of first impression.  However, as we shall see, it wasn't enough 
for SHAKSPER, especially for Shakespeare passages, and we had to correct 
for it.

"Identify" means that we asked takers to tell us outright whether they 
recognized each passage.  The responses showed us that, with a group as 
sophisticated as SHAKSPER, our efforts to avoid familiar Shakespeare 
passages were not always successful.  Our worst choice from this 
standpoint was a passage from Twelfth Night, which was recognized by 45% 
of all the takers and 75% of the Rated takers.  Two other play passages 
and one Shakespeare sonnet got 20-30% recognition from the whole group 
and 40-50% recognition from the Rated players.  The other Shakespeare 
questions averaged maybe 7% recognition for the whole group, 18% for the 
rated group.  The overall average recognition rate was three or four 
times higher for Shakespeare than for non-Shakespeare, and twice as high 
for rated players as for the group as a whole.  That is, on average, 15% 
of the whole group and 29% of the rated group recognized our Shakespeare 
passages, and 4% of the whole group, and 8% of the rated group 
recognized the non-Shakespeare passages.  By the same token, however, 
they did not recognize 70-85% of our Shakespeare passages, and 92-96% of 
our non-Shakespeare passages, making these fully and properly testable by 
our methods.

"Exclude" means we tried, using simplifying assumptions, to find a way 
to exclude recognized passages from our accuracy estimates.  To get from 
gross accuracy, which is inflated by recognition, to net accuracy, which 
is not, we assumed that essentially all recognition identifications 
would be correct and subtracted them both from the group's total correct 
answers and from its total valid takes.  Spot-checking the top two 
categories, who got a quarter of the recognitions, indicates that the 
assumption is not quite true, only 96% true.  Four percent of their 
supposed recognitions were wrong.  But the actuality is close enough to 
100% to allow us to use it as a simplifying assumption, which would very 
slightly overstate net accuracy.  It is preferable to calculating the 
impact of each recognition separately and manually for 2,240 answers, 
and far preferable to using gross accuracy only.
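
In code, the simplifying assumption looks something like the sketch below 
(Python; the counts shown are illustrative stand-ins, not our actual tallies): 
every self-reported recognition is treated as correct and dropped from both 
the correct answers and the valid takes.

def gross_accuracy(correct, takes):
    return correct / takes

def net_accuracy(correct, takes, recognized):
    # Simplifying assumption: every self-reported recognition was answered
    # correctly, so it is removed from both correct answers and valid takes.
    return (correct - recognized) / (takes - recognized)

# Illustrative counts only: 80 takers x 28 questions = 2,240 takes;
# 1,488 correct answers (about 66% gross); an assumed 210 recognitions.
correct, takes, recognized = 1488, 2240, 210

print(f"gross: {gross_accuracy(correct, takes):.0%}")             # about 66%
print(f"net:   {net_accuracy(correct, takes, recognized):.0%}")   # about 63%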

This exclusion process lowered the average of individual percentages by 
one to eight percent, with the greatest reductions for Shakespeare and 
rated players, where recognition was very high, and the least for 
non-Shakespeare and the whole group, where recognition was lower. 
Overall reductions from gross averaged individual accuracy to net were 
6% for the rated group and 4% for the whole group.  Here is a global 
summary of all our averages, with averaged individual accuracy 
percentages above and aggregated group accuracy below:

	Table II.  Average Gross and Net Accuracy Rates, Individual and Group

			All, gross	All, net	Rated, gross	Rated, net

Shakespeare		66%		60%		76%		66%
Non-Shakespeare		66%		66%		81%		80%
All			67%		63%		79%		74%

Aggregated (Group)	79%		79%		93%		82%

This is our most important, bottom-line table, which gives the vertical 
average of individual scores (that is, the sum of all correct answers 
divided by the sum of all unrecognized takes), above, and the 
horizontally, majority-rule-for-each question aggregated group score, 
below, gross figures to the left, net to the right.  The key figures are 
now the net ones.  What leaps out from it to our eye is (1) that netting 
out the recognized answers, unsurprisingly, cuts the Shakespeare 
accuracy percentages much more than the non-Shakespeare; (2) it narrows 
the gap between the Rated players' averaged individual scores and those 
of the whole group slightly, from 12 points to 11 points, thanks mostly 
to lower net Shakespeare recognition, and (3) surprisingly, it cuts the 
gap between the two groups' aggregated group scores from 14 percentage 
points to only three.  Netting for recognition made no difference at all 
for the whole group's aggregated accuracy score of 79%, but cut the 
Rated group's aggregated accuracy score from a dizzying 93% to 82% -- 
only slightly higher than the whole group's despite the rated group's 
much higher individual accuracy.

I haven't fully worked this out with Valenza, but he thinks the basic 
forces at work are mathematical.  For a fixed error rate the majority 
decision gets better as you increase the size of the voting population, 
and the convergence benefits should be less pronounced with a smaller 
population, even if it is more skilled, especially if you have already 
squeezed out most of the group's latent accuracy by aggregation.  If you 
are innocent, he notes, you would be better off in principle with a jury 
of four 60%-accurate jurors and a 1% chance of conviction than with a 
jury of two 80%-accurate jurors and a 4% chance of conviction.  In 
practice, majority rule, different levels of skill among the 
test-takers, and different levels of difficulty among the passages 
complicate this.  Again Valenza:  "If students vote on true/false 
answers on a math test, and certain questions are out of the reach of 
all, aggregation won't help at all on such questions, so their aggregate 
hit rate will converge on the number of easy questions and stick there." 
See Section VI below for some striking examples of convergence among 
SHAKSPER respondents, both in getting right many passages that now look 
easy, and in getting wrong a few passages that now look hard.
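
A rough way to see the size effect Valenza describes is to compute 
majority-vote accuracy directly for panels of different sizes at a fixed 
individual accuracy, as in the sketch below (Python).  The panel sizes and 
the 60% and 80% accuracy levels are illustrative only, and real passages, 
unlike this fixed-accuracy model, vary in difficulty.

from math import comb

def majority_accuracy(n, p):
    # Probability that a strict majority of n independent voters, each right
    # with probability p, reaches the correct answer; even-split ties
    # (possible when n is even) are counted as half right.
    need = n // 2 + 1
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))
    if n % 2 == 0:
        prob += comb(n, n // 2) * p**(n // 2) * (1 - p)**(n // 2) / 2
    return prob

for p in (0.60, 0.80):
    for n in (1, 3, 5, 11, 31, 81):
        print(f"individual accuracy {p:.0%}, panel of {n:>2}: "
              f"majority right {majority_accuracy(n, p):.1%}")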

Whatever this says about the value of screening, it suggests that 
aggregation can still boost a group's net accuracy significantly, 16 
percentage points for the whole group, six for the rated group, and that 
aggregated group accuracy is almost four out of five for the whole group 
and almost five out of six for the rated group.

V. Other discounts:  Honor system, replicability, choice of samples, and 
sample size.

There were several important differences between our SHAKSPER panel and 
our prior student pilot panels.  Though most of the students had read or 
seen several Shakespeare high school favorite plays like Julius Caesar, 
their recognition of our supposedly obscure passages was much, much 
lower.  So were their stakes in the outcome, their eagerness to take the 
test, their expectations of their own performance, and their overall 
Shakespeare investments.  They had nothing to lose from a low score, no 
incentive to pump up their scores, and little opportunity to do so 
either, since they all took the same test on paper at more or less the 
same time in more or less the same place and didn't get the answers till 
the tests were all in.

SHAKSPER was a different matter.  Its members are heavily invested in 
Shakespeare, often with a conspicuous attachment to one side or the 
other of a hot debate. Many of them trust much more to their intuition 
than to our stylometrics.  Many are pros; a few are A-list Shakespeare 
celebrities.  They are highly knowledgeable and perfectionist.  They 
have hard-earned reputations (or at least hopes of them) which they can 
trade on and don't want to jeopardize.  For many issues, especially 
abstract, symbolic ones or obscure, complicated, technical ones like 
Stylometry, it is often reputations, more than evidence, that seem to 
carry the day.  One-upmanship, though less blatant and all-consuming 
than on HLAS, is still very much the coin of the realm on SHAKSPER. 
Many members have a stake in the outcome and much to lose from getting a 
known low score. This means that the incentives not to take the test or, 
taking it, not to rest content with a low score -- far less to let the 
results be bruited around -- are much stronger than they were for our 
students.  It's hardly surprising that Shakespeare A-list grandees did 
not hasten to put their name to our test; they had much to lose and 
little to gain from it. And a web-based test like ours, which doesn't 
take your name before the test, and which tells you the correct answers 
afterward, is easier to take advantage of than a names-taken paper test 
of the same people in the same room at the same time.

Anyone who offers such a test to an audience like SHAKSPER has to deal 
with tradeoffs between what you have to do to get people to take the 
test at all and what you have to do to keep them from giving biased or 
inflated results.  Many social scientists would have wanted us to build 
in hard controls on bias and inflation: strictly randomize the takers; 
have a control group; don't tell anyone the answers, make sure they 
can't easily copy or Google the test; give everybody a name or a code 
and put cookies in their computers to make sure they can't take it 
twice; or, best of all, tell the ones from Canada and the UK not to take 
the test and make the others all come and get tested in the same room at 
the same time with a timer and a monitor present, just like the College 
Board, which has excellent reasons to take such precautions for a 
high-stakes test.

Most of these hard controls seem to us inappropriate for a group like 
SHAKSPER, too off-putting, too impractical, too pointless, or too easy 
to get around.  We chose soft controls. We tried to make the test as 
inviting, non-threatening, non-onerous, and rewarding as we could, 
short, net-based, and with as much anonymity and feedback as anyone 
could want. We tried to keep the perceived stakes as low as we could. 
We limited the experiment to ten days.  We asked people not to retake 
the test or discuss the questions online while the test was going on. 
In short, we relied heavily on the honor system and soft controls to 
keep the test one of first impression.

We think we did the right thing, and believe that test abuse was close 
to zero.  We found only two obvious retakes, both innocent, and both 
later self-identified for us by the takers.  The rest (admittedly a 
partly qualitative judgment) look legitimate to us.  If there are a few 
fudged ones, it is extremely unlikely in a test with this many takers 
that they would change the outcome by more than a percent or so, or, 
perhaps more important, that the change, if any, would overstate the 
group's accuracy.  None of the comments we have seen so far are 
concerned in the least with overstating the group's accuracy; the 
overwhelming concern has been with understatement.

Three such criticisms, suggested by Jim Carroll, Bob Grumman, and a 
couple of offline correspondents are that the samples are too short 
(Carroll and the correspondents), the test too long (Grumman), and the 
Shakespeare passages insufficiently distinctive (Carroll).  We have 
previously discussed the last point.  If you are testing for Shakespeare 
detection, and not for Shakespeare recall, you want passages that not 
everyone knows, and it would hardly be surprising if these were less 
distinctive than, say, "Lay on, Macduff," or "my kingdom for a horse." 
As for lengthening the passages, it is possible that longer passages 
could be easier to identify by intuition, as they surely are by 
stylometrics.  But the costs of using much longer passages seem to us 
prohibitive.  The only practicable choices for a test like this are many 
passages and short or few passages and long.   "Many and long" is not a 
real option because it would make the test much too long to take, even 
for SHAKSPERians.  Judging from just a few e-mails from takers, our 
test, with 31 sonnet-length passages, takes about 15-20 minutes to 
finish with snap judgments, and up to 40 minutes or so if the judgments 
are more studied.  Suppose we tripled or quadrupled our passage length 
to make it comparable to the shortest passages you can reasonably expect 
to test with computers, that is, to 500 words instead of 140.  It's hard 
to imagine such an expanded test taking less than an hour and easy to 
imagine, not just Bob Grumman's, but every test taker's patience and, 
worse, their focus, wearing thin.  And would we still hear complaints 
that the longer passages were also too short?  We can't rule it out.  It 
is not wise to make your test a Marathon if you want your takers to take 
it, finish it, and pay close attention to it all the way through.

It's also possible, both with SHAKSPER and with HLAS, to imagine a few 
dedicated Marathoners whose patience and focus would not wear thin, and 
who would not just eyeball the text, but would spend whatever time it 
took to comb through the passages for tell-tale words like "Dunsinane" 
or "Osric," or, worse, comb it for those deplorable, countable 
stylometric tell-tales that people like us spend our days looking for, 
and that intuition is supposed to make superfluous - feminine endings, 
open lines, contractions, hendiadys, incongruous who's, and the like. 
Worst of all, they might Google the passage; it wouldn't be hard to do. 
  Whatever you may think or say of such rational, left-brain deployment, 
it is not intuition, and long, high-stakes passages invite much more of 
it than short, lower-stakes passages.

Though shortening the test, to, say, eight long passages, instead of 31 
short ones, might at first glance seem to get the test back to a 
reasonable length, we would expect it, if anything, to raise the stakes 
on each passage, and, with it, the temptation to study each passage 
harder -- and much longer - and, again, supplement or supplant the 
right-brain intuition we are trying to test with left-brain deployment, 
which has nothing to do with intuition.  What would our most predictable 
critic, Jim Carroll, say of a test with just three or four long 
Shakespeare passages?  That it was just the ticket and welcome evidence 
of our growing methodological sophistication, or that it, too, had grave 
methodological shortcomings and was not ready for prime time - with far 
too few passages for a fair test of Shakespeare or anyone else, yet 
still was too much of a Marathon to ask of reasonable takers?  There are 
many other serious problems with few-but-long passages - they are less 
broadly representative and, hence, more subject to variability; they are 
much more vulnerable to recognition; it's much harder to find ones that 
aren't; the distortion cost of using one like our unfortunate Twelfth 
Night passage that everyone turns out to know is much higher; and you 
would have a lot of explaining to do if you tried to make 
apples-to-apples comparisons of irreparably short passages like Shall I 
die? with longer ones -- but we think the ones we have discussed should 
be enough to make our case.  We don't pretend to have said the last word 
on this subject, and, as always, we invite others to try different 
tradeoffs than the ones we used, but nothing is either good or bad but 
alternatives make it so, and neither lengthening the passages nor 
lengthening the test strikes us as much of an improvement.

VI. Identification Hits.

As with our student panel, most of our SHAKSPER answers to each 
question, Shakespeare or non-Shakespeare, right or wrong, showed very 
high intra-group agreement as to whether or not the passage was by 
Shakespeare, and also showed high agreement between the whole group and 
the Rated group.  No more than 7% of the aggregated answers look like 
tossups.  The other 93-96% show majorities of 57% or more. If numbers 
like these were reported in a national election, everyone would consider 
it a landslide (Table III). I call this consensus; Valenza calls it 
convergence.

	Table III.  Group Consensus: Very High, but Not Always Correct
All figures net accuracy

				Shakespeare		Non-Shakespeare		All
High consensus, questions answered correctly
	Full panel		9 (68-80% maj)		11 (59-100% maj)	20 (59-100% maj)
	Rated only		11 (64-88% maj)		12 (57-100% maj)	23 (57-100% maj)

High consensus, questions answered incorrectly
	Full panel		3 (64-67% maj)		3 (57-69% maj)		6 (57-69% maj)
	Rated only		2 (73-78% maj)		2 (57-58% maj)		4 (57-78% maj)

Tossups, all incorrect
	Full panel		2 (51% maj)		0			2 (51%)
	Rated only		1 (53% maj)		0			1 (53%)

No tossups were correct

This means that both panels had high consensus on 26 or 27 out of the 28 
questions and were closely divided on only one or two.  Looking at 
high-consensus answers only, the full panel got 20 of 26 (77%) firmly 
right, in gross, and the other six firmly wrong.  The Rated panel got 23 
of 27 firmly right (85%) and the other four firmly wrong.  We'll skip 
the details of the impressive Shakespeare "firmly rights" and go 
straight to the equally impressive Non-Shakespeare "firmly rights."

Table IV shows that neither panel had much trouble with most
non-Shakespeare authors represented.

Table IV.  Eleven Non-Shakespeare Hits

Passage			Percentages who thought it non-Shakespeare (full panel/rated only)
			Listed in declining order of Rated percentages.  All percentages net.
Bacon poems		87/100%
Middleton		89/100%
Chapman			82/100%
Spenser			78/96%
Fletcher		67/91%
Daniel			75/87%
Marlowe II		71/83%
Shall I Die?		65/82%
Earl of Oxford		59/79%
Marlowe I		70/78%
Jonson			60/72%

All of these seem like solid hits to us, both according to what we see 
as the orthodox consensus and according to what our computer evidence 
has done to confirm it. None of these tested passages seem likely to be 
Shakespeare's.  Not everyone agrees with us or the orthodox consensus on 
every passage, but the important point here is that remarkably few of 
our test-takers thought these passages sounded like Shakespeare.  We 
would hardly consider numbers like these a humbling outcome for the 
group that produced them. Shall I Die? was the only one of these widely 
recognized (by 31%/54% of the two panels), but few of those who did not 
recognize it thought it was Shakespeare's.  Would longer passages 
greatly enhance these landslides?  We doubt it; they are already so 
lopsided it's hard to imagine longer passages changing things much, even 
if they should be easier to identify.  Do they signal that the group is 
befuddled by too-short passages?  It doesn't look like it.

VII. Identification Misses.

However, two outcomes, though equally convergent and consensual, were 
not so impressive for accuracy, and a third showed the full group at 
odds with the Rated group (Table V).

Table V. Two and a half Non-Shakespeare misses

Passage			Percentages who thought it non-Shakespeare (Full panel/Rated)
			All percentages net

Oldcastle		31/42%
Drayton			43/43%
Funeral Elegy		41/57%

The Oldcastle and Drayton passages, one recalling a beleaguered-stag 
scene from As You Like It, the other a sonnet from Drayton's Idea, 
suggest that even strong majorities of both groups can be fooled by 
well-turned, vivid, image-rich passages by other writers.  It is also 
possible that Drayton, who Henslowe says co-authored the play Sir John 
Oldcastle (1600) with Anthony Munday, Robert Wilson, and Richard 
Hathway, could have written both of the confounding passages.  A second 
edition of Oldcastle ascribed it to Shakespeare, and it was included in 
the 1664 Folio and Brooke's Apocrypha,  but we know of no one today who 
seriously ascribes it to Shakespeare, and our tests say it's very 
unlikely to be Shakespeare's work (our 2004,  p. 402).  No takers, 
incidentally, recognized the Oldcastle passage and only one recognized 
the one from Idea.

What about the Funeral Elegy?   Donald Foster relied in part on computer 
tests to prove that the Elegy "couldn't not be Shakespeare," and he did 
speak of intuitive "sniff tests" with a hint of disdain. When Brian 
Vickers' crushing counter case, Counterfeiting Shakespeare (2002) 
loomed, and Foster abandoned his Shakespeare ascription, the dull, 
pious, pedestrian Elegy of the eye instantly became Exhibit A for those 
who say you should always trust your gut instincts, never anyone's 
computers.

If so, and if the whole SHAKSPER group's intuitions were to be taken as 
the only valid test, Foster should have stuck to his guns on authorship 
and reconsidered his disdain for sniff tests.  Only 6% of the whole 
panel recognized our Elegy passage, and 59% of those who didn't thought 
it was Shakespeare's!  On the other hand, a net 57% of the Rated Panel 
thought it was not Shakespeare.  Our tests say that Foster did the right 
thing to concede, and that the Elegy is on a different statistical 
planet from Shakespeare, though it could easily be by Ford (our 2001). 
Maybe it's only the best ears we are supposed to listen to, but it's not 
easy to know in advance which ones those are on any given passage, and 
they are not always connected to the best mouths.  On balance, the 
conflict between rated ears and all ears does little to make the Elegy 
look like a reliable success story for detection by gut instinct.

Table VI shows two Shakespeare misses for both panels, and two equivocal 
tossups.

	Table VI.  Two Shakespeare misses and two more divided panels

Passage			Percentages who thought it Shakespeare (Full panel/Rated)
			All percentages net


The Rape of Lucrece 	38/22%
Pericles Act V		33/47%

Love's Labor's Lost	49/64%
Venus and Adonis	49/59%

Only one taker recognized our passage from The Rape of Lucrece, and very 
few of the others thought it was Shakespeare's, fewest of all, oddly, on 
the Rated panel, which was otherwise generally more accurate than the 
whole group.  Seven takers recognized our passage from Pericles, Act. V, 
but two-thirds of the whole panel, and 53% of the Rated panel, thought 
it was not Shakespeare's.  Pericles is generally considered co-authored 
by Shakespeare and George Wilkins; scholarly consensus gives Acts 3-5 to 
Shakespeare, and our tests agree with it.

The passages from Love's Labor's Lost and Venus and Adonis were 
recognized by 19 and three takers, respectively; for the others, the 
whole group was divided half and half, but clear majorities of the Rated 
group correctly ascribed them both to Shakespeare.  It's not clear how 
to score these.  Two misses and two tossups?  Or two misses and two 
half-hits?  Neither seems an unequivocal success story for these 
identifications.

VIII. Three shots in the dark.

Three passages on the test were not scored, since scholarly consensus as 
to who wrote them is not settled.  But we tested them anyway in case we 
found the group's instincts helpful in determining actual ascriptions. 
This is wholly uncharted territory, but, if we had a computer test that 
looked like it might be 80% accurate, we might not bet a thousand pounds 
on it, as we have on some of our computer tests, but we certainly would 
not want to let it sit on the shelf unexplored.  The same may be said 
for ascription by gut instinct.  With tweaking, it can reach 82% group 
net accuracy for passages of known authorship, and we can't imagine 
SHAKSPERians not being curious as to what it says about passages of 
unsettled authorship.  Table VII gives the outcomes:

	Table VII. SHAKSPER's Group Ascriptions for Three Doubtful Passages

Passage			Percentages who thought it Shakespeare (Full panel/Rated)
			All percentages net

1H6				87/89%
A Lover's Complaint 		42/26%
Edward III			68/73%

14% and 25% of the panels recognized another beleaguered stag scene from 
1H6, Talbot before Bordeaux.  87/89% of those who did not recognize it 
thought it was Shakespeare, one of the most lopsided majorities on the 
test.  Gary Taylor assigns the scene, 4.02, to Shakespeare; Paul Vincent 
thinks it is co-authored by Shakespeare and "Author Y."  Marcus Dahl 
could find no hand but Shakespeare's in the whole play.  Our own tests 
are ambivalent on the scene as a whole.  It looks like a Shakespeare 
could-be by all our regular tests, but an improbable by one new test. 
The passage itself is much too short for our tests.  We lean toward 
Vincent's view of the whole scene, but the stylometric evidence is hard 
to judge - much harder, it seems, than the intuitive evidence. 
SHAKSPER's judgment in this case is consistent with all four views of 
1H6, though one could argue that SHAKSPER's judgment on non-Shakespeare 
beleaguered-stag passages is not terribly reliable.

Perhaps surprisingly, only one person recognized the passage from A 
Lover's Complaint.  Could it be more discussed these days than read?  Of 
the many who did not recognize it, 58/74% thought it was not 
Shakespeare.  MacDonald Jackson, Kenneth Muir, and most scholars of the 
late twentieth century have assigned LC to Shakespeare.  Our best guess 
(our 1997 and 2004), and Brian Vickers' (his 2007), and Marina 
Tarlinskaja's is that it is not.  SHAKSPER's judgment favors the 
doubters, but, again, one could argue that SHAKSPER's group judgments on 
Shakespeare poems outside the Sonnets don't look like its most reliable.

Five and eight percent of the two panels recognized the Countess Scene 
passage from Edward III, which we take to be a recent addition to the 
consensus Canon.  Our tests say it could be Shakespeare, and 68/73% of 
the two panels seem to agree.

IX. How SHAKSPER's Round 1 compares with other takers.

If we were to seek a kind of "control group" to compare with SHAKSPER's 
takers, we would turn to several Claremont student groups who have taken 
the test, or a precursor, or we would look to our computer tests.  Four 
student groups have taken this, or a previous test:  Valenza's 
preceptorial class of entering freshmen in 1995; the Claremont Rugby 
Side on a bus in New Zealand in 2002; Ann Meyer's Class on Shakespeare's 
Tragedies in 2002; and a sprinkling of Philosophy, Politics, and 
Economics alumni volunteers in 2007.  We have also, over the years, 
identified a couple of miscellaneous Golden Ears from casual takers, and 
we have gone for advice to a small group of pros, most notably MacDonald 
Jackson and John Farrell, but also Brian Vickers, Lisa Hopkins and Matt 
Steggle, none of whom should be deemed in any way responsible for any of 
our test's shortcomings.  None of our tested groups got the extensive 
recordkeeping and analysis that we have given the SHAKSPER groups, but 
it is safe to say from memory that SHAKSPER outperformed the 
preceptorials, the Rugby Side, and the PPE students.

Whether it outperformed the Claremont Shakespeare class is not so clear. 
  We no longer have the students' individual scores, nor their 
individual averages, but they were the least casual, and the 
most-studied of our pilot groups.  Here is an adaptation of a posting we 
sent to SHAKSPER in 2004:

Table VIII.  A Gross-Score Comparison of SHAKSPER with Claremont Pilot Group

				Claremont Students		SHAKSPER
				2002				2007
Worst individual	54% correct, gross		50% correct, gross
Best individual		79% correct, gross		89% correct, gross
Best combined		84% correct, gross, n=6		93% correct, gross, n=24
All combined		89% correct, gross, n=12	79% correct, gross, n=80

18 of the student group's 25 successful identifications (72%) were by 
lopsided votes, two-to-one or higher. 16 of the whole SHAKSPER group's 
22 successful identifications (73%) were lopsided in this sense, as were 
20 of the Rated group's 24 successful identifications (83%).

A better comparison, since SHAKSPERians, on average, recognized about a 
tenth of the passages and the students next to none, would be between 
the students' gross scores, essentially equivalent to their net scores, 
and SHAKSPER's net scores.  Such a comparison would look like Table IX:

Table IX.  Net-Score Comparison of SHAKSPER with Claremont Pilot Group

				Claremont Students		SHAKSPER
				2002				2007
Worst individual	54% correct, net		44% correct, net
Best individual		79% correct, net		87% correct, net
Best combined		84% correct, net, n=6		82% correct, net, n=24
All combined		89% correct, net, n=12		79% correct, net, n=80

It would not be surprising if a group of 80 got higher-highest and 
lower-lowest scores than a group of 12.  It's mildly surprising that the 
smaller, far less knowledgeable, less motivated group got higher group 
accuracy than the larger, more knowledgeable, more motivated group, but 
the smaller the group, the more likely is its outcome to be a fluke. 
Recall, for example, that the Shakespeare class did much better than the 
other three Claremont student groups tested.  Could that have been a 
fluke?  Perhaps it was also mildly surprising that the whole of the 
student group, aggregated, did better than the top half, aggregated, 
which, in turn, did better than the best individual. For the students, 
aggregation was a bigger boost to accuracy than screening.  If so, the 
reverse SHAKSPER outcome, where the best individual surpassed the best 
group, which surpassed the whole group, would not be a surprise.  For 
SHAKSPER, screening seems to have been a more powerful tweak than 
aggregation. There must be a literature on this, perhaps to be gleaned 
from James Surowiecki's footnotes, but we have not studied it.  For now, 
we would say that the two sets of results are remarkably similar and we 
would guess, where they differ, that the larger SHAKSPER panels give a 
better notion than the student panel of what you can reasonably expect 
to do with intuition and what you cannot.

It would be possible, with more manual counting than we want to do now, 
to compare the net accuracy of Shakespeare pros with that of amateurs, 
and to see whether literary critics did any better or worse than stage 
performers, artists, scientists, and so on.  It's all buried in the data 
and retrievable in principle, but it would not be easy, and it is not at 
the top of our agenda.  We would guess from the similar accuracy levels 
of SHAKSPER and the student group, and especially from the surprisingly 
similar gross accuracy levels of SHAKSPER's own pro and amateur 
respondents, 68% and 66%, respectively, that the difference in pros' and 
amateurs' net accuracy, if any, would be barely detectable, and not 
necessarily favorable to the pros, whom we would expect to recognize 
more passages than amateurs.   Table X also shows a remarkably narrow 
range of average gross accuracy scores for the various subgroups identified:

	Table X.  Gross Accuracy Scores of Identified Subgroups

Subgroup	Number	Gross Average Score	Gross Average Accuracy %

Professionals	31	18.9			67.5
Amateurs	49	18.6			66.4
Critics		26	18.7			66.8
Writers		14	18.9			67.5
Artists		33	18.6			66.4
Other		21	18.0			64.3

Some of the categories overlap.  "Other" is mostly people who declined 
to state a category.

Would net accuracy differ greatly from these gross accuracy scores?  We 
don't know, but it seems improbable.

One might imagine that writers and artists would be more intuitive, and 
critics more analytical (see Simonton, Origins of Genius, 1999), but the 
average accuracy of the three categories is virtually identical.  We did 
not ask anyone to list their IQ's or their verbal and math SAT scores, 
considering it too nosy and off-putting even for us, but we would have 
loved to have had them, and maybe their college majors as well.  Four of 
our best six students in the 2002 pilot study were science majors; the 
two best of all were science majors from Harvard-bright Harvey Mudd 
College; the others were from our college, Claremont McKenna College, 
whose students, on average, are only Columbia-bright.  We have to wonder 
whether any of our SHAKSPER takers were scientists.

Another way of looking at such questions on a smaller scale might be to 
look at the top end only, Golden and Silver Ears, where we would expect 
recognition to be at its greatest. Conveniently, there are seven Golden 
Ears, six of them pros, and six Silver Ears, five of them amateurs.  The 
86%-pro Golden Ears said they recognized, on average, a remarkable 27% 
of all questions; the 83%-amateur Silver Ears claimed to have recognized 
14% of all questions, half the Golden Ears' rate, but more than twice 
the rate of the other five-sixths of the group, which was 6%.  Golden 
and Silver ears are only a sixth of the whole group, but the big 
difference between the two top groups, one mostly pro, the other mostly 
amateur,  is fully consistent with our commonsense guess that pros would 
recognize more passages than amateurs.

If so, we can roughly calculate the two top groups' net accuracy, as 
explained in Section IV above, simply by subtracting recognitions from 
both correct answers and total takes of each question, giving us net 
correct answers as a fraction of net takes.  This cuts average 
Golden-Ear accuracy from 87% gross to 81% net, and Silver-Ear accuracy 
from 79% gross to 75% net.  With recognized passages removed, both 
groups got fewer passages right in fewer questions, with the 
high-recognition Golden Ears, not surprisingly, losing more accuracy 
than the lower-recognition Silver Ears.  Netting for recognition, in 
effect, reduces Golden Ears to Silver, and Silver to Bronze, with the 
top pros, on average, still retaining a 6% detection accuracy edge over 
the top amateurs, with 81% net accuracy, versus 75%.  Their gross edge 
had been 8%, 87% versus 79%.

Suppose we sought an aggregated, majority-rule-on-each-question group 
score for the Golden Ears only?  It's not clear that they would score 
much higher than the rated group as a whole (82%), which, let us recall, 
was only three points higher than the whole group (79%).  All 7 Golden 
Ears recognized four of the 28 passages.  For the remaining 24 passages, 
where at least some Golden Ears did not recognize the passage, there was a net correct 
majority on 20, a net incorrect majority on three, and a 50-50 tossup on 
one.  If you put aside the tossup altogether, that would give the Golden 
Ears 20 right out of 23 (87%).  If you give half-credit for the tossup, 
they get 20.5 right out of 24 (85%).  If you count the tossup, but give 
no credit for it, they get 20 right out of 24 (83%).  Any of these would 
be arguable, but we would consider the most conservative of them, 83%, 
the most defensible.
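
The arithmetic behind the three conventions is simple enough to set out in a 
few lines (Python, added for illustration), using the Golden Ears' 20 right, 
three wrong, and one tossup on unrecognized passages:

right, wrong, tossups = 20, 3, 1    # Golden Ears' unrecognized passages

exclude_tossup = right / (right + wrong)                            # 20 / 23
half_credit = (right + 0.5 * tossups) / (right + wrong + tossups)   # 20.5 / 24
no_credit = right / (right + wrong + tossups)                       # 20 / 24

print(f"tossup set aside: {exclude_tossup:.0%}")   # 87%
print(f"half credit:      {half_credit:.0%}")      # 85%
print(f"no credit:        {no_credit:.0%}")        # 83%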

We would conclude from this that the Golden Ears were far ahead of all 
others in recognition of passages, and 6% better than the Silver Ears in 
net  intuitive detections of unrecognized passages, but horizontal 
aggregation doesn't seem to boost group accuracy as much at the highest 
level as it does for the whole group.  Unlike the Claremont students in 
the pilot study, the best individuals in the SHAKSPER group did better 
than the whole group aggregated, and better than the best of the group 
aggregated.  Several of these individuals also did better than the best 
Claremont student, even after netting out recognized passages.  Net 
group accuracy for Golden Ears could be five out of six (83%), but it 
would take a bit of indulgence to get it to six out of seven (86%). Top 
pros look better than top amateurs at recognizing passages, and it is 
probable that this is also so of all pros, though we haven't tested for 
it.  We have very little evidence that any category of taker surpasses 
the others on average.

X.  How the SHAKSPER group compares in accuracy with stylometric tests.

Here is what we said about the Claremont pilot study in 2004:

How does this accuracy compare with that of our best quantitative tests? 
  "Far higher" would be a persuasive answer for such short, 
Sonnet-length samples.  All of our quantitative tests are sensitive to 
sample length because longer samples average out more variance than 
shorter ones, giving us tighter ranges and higher discrimination for 
long samples than for short.  Most of the samples we used in our Golden 
Ear test have no more than 150 words, far shorter than any for which we 
have dared to validate any of our quantitative tests.  For comparison, 
our current estimated composite accuracy rates for longer, 
single-authored passages look something like this:

Text			Shakespeare 	Non-Shakespeare

Whole plays		100%		100%
Poems, 3000 words	100%		100%
Play Verse, 3000 words	 95%		100%
Poems, 1500 words	100%		100%
Play Verse, 1500 words	 96%		 88%
Poems 750 words	  	 93%		 71%
Play Verse 750 words	 97%		 75%
Poems, 470 words	 92%		 73%

Not much has changed since then.  Our accuracy figures remain the same, 
and the SHAKSPER Golden Ear outcomes are similar to those from our 
student pilot study, only slightly higher for the best individuals, but 
somewhat lower for the group.  SHAKSPER's double-tweaked accuracy from 
intuition alone is not as good as our tests have been on samples of 
1,500 words or more, but it's in the same ball park with our accuracy 
for passages of 470 to 750 words - and it is far higher than we would 
expect any or all of our tests to do on the very short passages tested, 
which averaged about 140 words.  If Golden Ear had been tried as a 
stylometric test, it probably would not quite have met our test criteria 
- of around 95% reliability in saying "could be" to known Shakespeare, 
and at least 20% reliability in saying "couldn't be" to known 
non-Shakespeare, but that is because we rely on negative evidence and 
are far less tolerant of false negatives than of false positives. 
Without question, intuition is far better than all our tests combined on 
the sonnet-length passages we tested.

As we explained in Section V, it is conceivable that using longer 
passages would raise Golden-Ear accuracy, but doubtful that anyone could 
devise practicable intuitive tests for, say, the 1,500-word or 
3,000-word passages for which we consider our stylometric tests to be 
well validated.  If we, or anyone else, could find and offer 28 
3,000-word passages  for identification, the test would be equal in 
length to Hamlet, Macbeth, Romeo and Juliet, and The Comedy of Errors 
combined and would take more than a day just to read, let alone analyze, 
in entirety.  In general, the longer the passages, the fewer can be 
tested without expecting miracles of motivation.  From this perspective, 
Golden-Ear testing may be almost as impractical for wholesale testing of 
long passages as computers are for testing short passages.

XI. Conclusions

In sum, after much tweaking and netting, SHAKSPERians as a group seem 
capable of getting almost four out of five, or five out of six, 
identifications right, and the group's very best individual did a bit 
better than that, with net accuracy reaching 87%.  However, only two of 80 test 
takers got net accuracy higher than 83%.  The average Golden Ear had 80% 
net accuracy, silver 75%; the average individual SHAKSPERian taker had 
63% net accuracy.  SHAKSPER's overall performance roughly matched the 
best of our student pilot groups, with aggregated group accuracy 
slightly lower, and the accuracy of the very top individuals somewhat 
higher. SHAKSPER's recognition rates were much higher than those of the 
Claremont pilot group, and its Golden Ears' rates much higher than the 
rest of SHAKSPER.  No one else has taken the test as seriously as 
SHAKSPER.  None of the subcategories of takers stood out as much better 
or worse than the others, and the differences in gross accuracy between 
pros and amateurs seem remarkably small.  Intuition seems much more 
accurate than stylometrics for very short, sonnet-length passages; 
stylometrics seems more accurate than intuition for longer passages, but 
an actual head-to-head comparison of many long passages seems impracticable.

XII. Golden Ear Round 2

Golden Ear Round 1 has given us what looks like a highly talented panel 
of a dozen rated, identified SHAKSPERians to take Golden Ear Round 2. 
We haven't yet asked them whether any of them want their names or their 
particulars, pro/amateur, writer, player, etc., disclosed to SHAKSPER or 
anyone else.  To these we might add up to eight or nine rated players 
discovered in previous tests, if we can find them and get them to serve. 
  A couple of unrated SHAKSPERians want to take Round 2, and we would 
not grudge them the experience, nor would we be above retroactively 
including some or all of the dozen rated players who did not identify 
themselves on the test, or, for that matter, other unrated players who 
want another shot, but we need enough identification on Round 2 to 
e-mail it to the proper recipients and to relate their Round 2 outcomes 
to their Round 1 outcomes.

Unfortunately, we don't have Round 2 on the web and shall have to find a 
way to mail or e-mail it to takers to do by hand, and to be scored by 
hand.  Like Round 1, Round 2 will have a known, scorable component, both 
to help calibrate the test, which we would guess is more difficult than 
Round 1, and to provide more data points to help see how much, if any, 
of the high accuracy rates found at the top of Round 1 was luck of the 
draw.  It's one thing to find and congratulate the best guesser of the 
number of beans in the Business School jar; it is quite another to 
expect him to do it twice.  Round 2 will also have some of its own shots 
in the dark, passages whose authorship is not settled.

XIII. A last note on methodology

In some ways, it is astonishing, given the frequency and fervency of 
declarations that intuition can outperform stylometry -- or that 
stylometry can outperform "sniff tests" -- that no one has ever tried to 
see whether, and to what extent, either proposition is actually so.  We 
have been trying to remedy that for twelve years and have now, at last, 
thanks to the help of our student programmer, Ryan Wilson, and his 
advisor, Arthur Lee, gotten a Round 1 survey up on the net and gotten an 
excellent response from SHAKSPER, which has permitted us, at last, to 
try a first cut at an answer.  We have used the best methodology we 
could manage, but who are we to proclaim that the tradeoffs we chose in 
our first big outing are the ones that should bind all others?  For an 
exercise as novel as this, it would be surprising if further 
experimentation with different parameters did not produce new insights 
perhaps wiser and more penetrating than ours, and informed by our 
mistakes, as well as by our successes.  Every adventure is a 
reconnaissance for the next, and this seems to us a question begging to 
be explored from more than one perspective. As always, if anyone in or 
out of SHAKSPER would like to try an experiment with different tradeoffs 
than the ones we chose, and SHAKSPERians were willing to take it, we 
would be happy to help them out.  In the meantime, we consider our 
tradeoffs reasonable ones and our evidence the best currently available. 
  We hope it will inspire better.

In the meantime, we would like, again, to thank the SHAKSPERians and 
others who took our survey for giving it their full attention, and 
especially for honoring our request to withhold their online comments 
till the test was over, so as not to wreck the test for others.  We now 
welcome comments, online and off, but we hope the online ones will take 
care not to give away too many specifics of the test, which cost twelve 
years of pretesting and hundreds of dollars for programming to prepare, 
and, now, many hours from SHAKSPERians to take and us to analyze.  We 
would just as soon keep it available for future use with different 
groups, and as a standard against which other versions can be measured. 
  We hope that SHAKSPERians will help us keep it so, as much as possible.


[The author's previous e-mail address] (the address, not the professor) was 
retired July 1, 2007; please use [his current address] instead.

Ward Elliott
Burnet C. Wohlford Professor of American Political Institutions
Claremont McKenna College
Pitzer Hall, 850 Columbia Ave.
Claremont, CA 91711-6420
(909) 607-3649
Fax (909) 621-8419

http://govt.cmc.edu/welliott
"Better grey words with crimson examples than crimson words with grey 
examples."

_______________________________________________________________
S H A K S P E R: The Global Shakespeare Discussion List
Hardy M. Cook
The S H A K S P E R Web Site <http://www.shaksper.net>

DISCLAIMER: Although SHAKSPER is a moderated discussion list, the 
opinions expressed on it are the sole property of the poster, and the 
editor assumes no responsibility for them.
 
