The Shakespeare Conference: SHK 18.0492 Wednesday, 1 August 2007
From: Ward Ward
Date: Wednesday, 01 Aug 2007 00:24:25 -0700
Subject: 18.0473 Shakespeare Golden Ear Test
Comment: RE: SHK 18.0473 Shakespeare Golden Ear Test
Our warmest thanks to the 80 brave SHAKSPERians who took our Golden Ear
Test, Round 1, and to Hardy for making them all reachable by us. We
hope that SHAKSPERians will indulge us in an oversize 20-page posting,
giving a preliminary analysis of the results of the test, which test ran
for ten days, from July 6 through July 15. It's now OK to discuss the
test online, as well as off, but try not to give away too many of the
test's specifics, to preserve some usability for it in the future. The
test is still up, http://goldenear.cmc.edu, but we haven't been
monitoring it since July 15 and would suppose that any results since
then are likely to be retakes to get another look at the test.
Our two-sentence conclusion is this: As individuals, on average, the
whole SHAKSPER group got almost two out of three unrecognized passages
right (63%), and the top 30% got almost three out of four right (74%).
As an aggregated group, the whole group got almost four out of five
right (79%), and the top 30% got almost five out of six right (82%).
For a two-paragraph conclusion, scroll down to Table II, and then to
Section XI, Conclusions, and don't miss the cautions and further thanks
at the very end.
For 20 pages of further detail, read on.
I. Who took the test?
31 of the 80 takers considered themselves Shakespeare pros (39%) and 49
considered themselves amateurs. We would guess that almost all were
from SHAKSPER, though there was also an admixture of takers from HLAS, a
rowdy, free-wheeling sister group of Shakespeare authorship buffs, most
of whom also belong to SHAKSPER but are less likely to be pros than
SHAKSPERians and, on their own turf, are subject to none of Hardy's
constraints. It's possible that a few of these do not belong to
SHAKSPER, but we don't think it matters much. HLAS people are scarcely
less hooked on Shakespeare, and, hence, almost as high on our list of
people who might take a test like ours seriously and give us interesting
results.
26 respondents described themselves as critics, 14 as writers, 33 as
artists, including performing artists, and 8 as "other humanities [than
literature], or social, or natural sciences." 80 is about 6% of
SHAKSPER's membership, a respectable showing. To encourage wider
participation, we did not require test takers to give their names, and
most did not, but 14 takers (18%) did identify themselves as willing,
and, in some cases, eager, to take our Round 2 test. 12 of these were
Rated, that is, they scored Bronze or better on the test. We did not
encounter many, if any, A-List Shakespeare celebrities, no Harold
Blooms or Stephen Greenblatts among the self-identifiers, nor many, if
any, of SHAKSPER's most vocal past advocates of shutting down your
computers and listening to your intuition only -- but we can't exclude
the possibility that some Shakespeare grandees or intuitionists might
have taken the test anonymously. Some A-list people helped us design
the test; we would not expect any of these to have taken it.
II. Gross Accuracy as Found: Two out of three for the group, four out of
five for the top 30%
Of the 80 who took the test, 24 (30%) were rated Bronze or better, with
the following distribution:
Table I. Rated Takers by Category, All Ratings Gross

               Golden   Silver   Bronze   Total
Identified        4        4        4       12
Unidentified      3        2        7       12
Sh. Pro           6        1        5       12
Sh. Amateur       1        5        6       12
Total             7        6       11       24
Rated players got four out of five identifications right, in gross
figures, three out of four right in net figures, and did better on
non-Shakespeare than on Shakespeare. Gross figures count all correct
answers, whether recognized from memory or detected by intuition; net
figures subtract the recognized answers and count intuitive answers
only. Net figures are the interesting ones, but they are harder to
arrive at than gross and are not available for all purposes.
Of the rated takers, neatly, half considered themselves Shakespeare
pros, the other half amateurs. Half identified themselves, half did not.
All of the ratings are slightly inflated because they are based on
gross, not net accuracy. Subtracting recognized passages would reduce
most Golden Ears to Silver, and most Silver to Bronze.
The whole group, on average, got about two out of three identifications
right, both in gross figures and in net, since the whole group
recognized fewer passages, on average, than the Rated players. Like the
Rated players, the whole group did better on non-Shakespeare than on
Shakespeare. The average gross score of all 80 takers was 18.6 of 28
(66%); their net score equivalent would be about a point lower, 17.6
(63%).
The 31 pros in the whole group scored between 14 and 25, averaging 18.9
right of 28 questions (68%, gross). The 49 amateurs scored between 14
and 24, averaging 18.4 right (66%, gross). It is not surprising that
the pros did better, on average, than the amateurs. It is surprising
that the gap is so small, especially considering that these are gross
scores uncorrected for passages recognized by the test-taker, which one
would expect to be more common among pros than among amateurs. In fact,
the average gross accuracy scores of every subgroup we tested -- critics,
writers, artists, others -- fell into an extremely narrow range, none
lower than 18, none as high as 19 (Table X, below).
Should the two-out-of-three or three-out-of-four individual gross
accuracy levels we found be considered high or low? Judging from
post-test lamentations we have heard, scattered mutterings about
"humbling experience" and "stupid test," and low self-identification
rates even of high-scoring players, we would guess that many, perhaps
most, takers were disappointed with their scores, expected to do better,
and didn't want to put their names to their test results.
Our further guess is that they reacted to our test much as law students
were expected to do to an experiment/demonstration routinely performed
in Evidence classes in our day, decades ago. A two-minute dramatic
event is staged, and the students are asked to describe what happened.
Who was chasing whom? What did they say? What color were the first
one's eyes? How tall was the second one, and what did he weigh? What
color were his socks? The answers, when collected, always turn out to
be all over the lot and filled with inaccuracies, and the class is
supposed to learn from the "humbling experience" that eyewitness
testimony is not very reliable.
We followed a different model in our Golden Ear experiment (below). Our
expectations of individual performances were not so high, and we believe
that SHAKSPERians have done quite a bit better as a group than most of
them think. The average individual outcomes as found struck us as par
for the course or better; and those for rated players seem particularly
impressive. Moreover, we now have further confirmation that the scores
as found can be tweaked by screening and aggregation to reach
surprisingly high levels of group accuracy, which have never, to our
knowledge, been demonstrated before on this scale.
Let's start with the outcomes as found. Two-out-of-three average
accuracy is far from perfect, but it is better than chance, better on
non-Shakespeare than most of our individual computer tests on sizable
texts (though not better than all of them combined), and better than all
our computer tests combined on the very short, sonnet-length passages we
used in the test. It is also, we would guess, better than three of our
four past pilot-study groups, mostly of Claremont Colleges students
(below). The mean for the SHAKSPER group is barely a point and a half
short of a Bronze. For reference, our preset range boundaries were:
Golden Ear, 24-28 out of 28; Silver, 22-23; Bronze, 20-21. No one got a
tin ear, 12 or less, because chance tends to pull all scores, high and
low, toward the mean. If you don't recognize anything at all and guess
at random, you still have 50-50 odds of getting each guess right, and
you will much more likely get a 15 or a 16 than a zero. Getting a zero
would be a remarkable feat, implying powers of discrimination comparable
to what is needed to max the test with a 28. With a 35% average failure
rate, about what SHAKSPER's was, you would expect 1.1 Tin Ear (12 or less)
purely by chance; we got none. You would also expect 1.1 Golden Ear
(24+) purely by chance; we got seven, and would conclude that their
success has to be more than pure luck.
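
For readers who want to check chance figures of this kind, here is a minimal Python sketch. It assumes a plain binomial model in which each of the 80 takers answers each of the 28 questions independently, with the same 65% chance of being right. That model and the function names are illustrative assumptions, not necessarily the calculation behind the numbers above.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_count(takers, n, p, lo, hi):
    """Expected number of takers whose score falls in [lo, hi]."""
    tail = sum(binom_pmf(n, k, p) for k in range(lo, hi + 1))
    return takers * tail


# Assumed inputs: 80 takers, 28 scored questions, 35% failure rate.
TAKERS, QUESTIONS, P_RIGHT = 80, 28, 0.65

expected_golden = expected_count(TAKERS, QUESTIONS, P_RIGHT, 24, 28)
expected_tin = expected_count(TAKERS, QUESTIONS, P_RIGHT, 0, 12)

print(f"Expected Golden Ears (24-28): {expected_golden:.2f}")
print(f"Expected Tin Ears (0-12):     {expected_tin:.2f}")
```

Under these assumptions both expected counts come out at roughly one taker in 80.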
Of the 24 rated players, Bronze or better, the pro gross average was
22.7, amateur average, 21.3, all rated combined, 22.0. These amount to
81%, 76%, and 79% gross accuracy, respectively. Stated differently, the
average rated pro player had a Silver Ear and could correctly ascribe
four out of five passages; the average rated amateur could get three out
of four - some of which in both cases, however, were from memory, not
from intuitive detection. Again, we know of no computer test, or
combination of computer tests, which could reach this level of accuracy
in identifying such short passages.
III. Gross accuracy tweaked upwards by aggregation and screening.
Now let's get to the tweaking. We knew from our pilot study of a dozen
Claremont students in a 2002 Shakespeare class that you could bring out
a group's latent accuracy by screening for the best and averaging for
the group. The best student in the group got 79% right; the top half of
the group, horizontally aggregated by majority rule for each question,
got 85% right; and the whole class, so aggregated, got 89% right. These
are both gross and net figures because student recognition of the
passages was close to zero. Could we do the same with a larger, more
experienced panel of SHAKSPERians? The short answer is not quite the
students' remarkable level of nine out of ten, but still an impressive
five out of six, net, after removing all recognized passages.
In our day the classic demonstration of the benefits of aggregation,
since revealingly elaborated by James Surowiecki in The Wisdom of
Crowds, 2004, was performed routinely in business school classes. The
class would be asked to guess the number of beans in a jar, and, despite
their best efforts to get it right, would find that their answers were
all over the lot, just like the law students. But, far from dismissing
it as another "humbling experience," the professor would go on to plot
the guesses. If the class took the task seriously, as they generally do
in business schools, the usual outcome was this: the guesses would form
a bell curve, with the peak of the curve within five or ten percent of
the actual number of beans in the jar. Thanks to the Wisdom of Crowds,
the class as a whole would often equal or surpass the accuracy of its
best member, and with far greater reliability and predictability, since
no one can tell in advance who is going to make the best guess. Chance
is much more of a factor with one considered guess than with many. Our
adaptation of the Wisdom of Crowds to the Golden Ear test was not just
to average takers' individual scores, but also to average their answers
to each question and draw from these "horizontal" averages a collective
group score. For the SHAKSPER takers, the two tweaks combined raised
gross group accuracy dramatically, from an untweaked individual average
accuracy of 66% to a tweaked collective average accuracy of almost 93%.
Here's how we did it:
We have seen that the gross vertical average of final individual scores
of all takers was 18.6, with 66% correct answers. The collective gross
score of all takers, first finding the group's majority horizontally on
each question and then scoring the whole group's averaged answers
vertically, just like a single test, was 22, or 79% correct answers.
Simply aggregating the whole group's guesses on each question raised
their gross accuracy from two out of three to four out of five.
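
The difference between the vertical and horizontal averages can be sketched in a few lines of Python. The three takers, four questions, and answer key below are made-up toy data, and the function names are chosen for illustration only. True stands for "Shakespeare," False for "non-Shakespeare."

```python
def individual_average(answers, key):
    """Vertical average: fraction of all individual answers that are correct."""
    correct = sum(a == k for row in answers for a, k in zip(row, key))
    return correct / (len(answers) * len(key))


def majority_answers(answers):
    """Horizontal aggregation: the majority answer on each question.

    An exact tie (possible only with an even number of takers) counts
    as False here; that tie rule is an arbitrary choice for this sketch.
    """
    n = len(answers)
    return [2 * sum(column) > n for column in zip(*answers)]


def group_accuracy(answers, key):
    """Score the majority answers against the key, like a single test."""
    majority = majority_answers(answers)
    return sum(m == k for m, k in zip(majority, key)) / len(key)


# Hypothetical toy data: each row is one taker, each column one question.
KEY = [True, False, True, False]
ANSWERS = [
    [True, False, True, True],   # misses question 4
    [True, True, True, False],   # misses question 2
    [False, False, True, False], # misses question 1
]

print(individual_average(ANSWERS, KEY))
print(group_accuracy(ANSWERS, KEY))
```

In this toy case each taker gets three of four right (75%), but no two takers miss the same question, so the majority answer is right every time.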
As it happens, the gross average of individual scores of all Rated (that
is, Bronze or better) takers was also 22, exactly equal to the
four-out-of-five accuracy of the whole group. Aggregation raised the
whole group's gross accuracy 13 percentage points (i.e., 79% - 66%) to
the same level as its top 30% unaggregated. Aggregating the top 30%, in
turn, raised its gross accuracy another 13.9 points, to 26 out of 28,
92.9% accuracy, one point better, as it happens, than the anonymous
top-scoring individual. As it
further happens, gross aggregated accuracy was identical for Shakespeare
and for non-Shakespeare. Unaggregated, the top players got a gross
average of four out of five right; aggregated they got 13 out of 14
right. Both of the tweaks combined raised the group's gross collective
score from 18.6 to an astonishing 26, and its gross collective accuracy
from 66% to almost 93%, and it seems that both tweaks contributed about
equally to the improvement. After these two tweaks, the experience
might seem not quite so humbling.
IV. Net Accuracy: After removing recognized passages, still tweakable
to almost Five out of Six.
But we need at least one further tweak, and this one brings the average
back down a bit, to four out of five for the group, and five out of six
for the Top-30% Rated players. All the gross numbers discussed so far
make no allowance for recognition and treat every correct answer as if
it came from intuition. Unless recognition is zero, as it almost was
for our student pilot panels, but surely was not for our SHAKSPER panel,
this is bound to overstate the power of intuition.
We tried to avoid familiar passages, identify them, and exclude them
where found.
"Avoid" means that we tried to pick the least familiar passages,
especially Shakespeare passages, we thought we could find. Jim
Carroll's complaint (8 July) that our "Shakespearean passages have been
chosen to minimize what is distinctive about Shakespeare: his
interesting diction and his constant stretching for metaphorical
expression" probably reflects this precaution, but we thought it far
preferable to exclude archetypical passages like "Friends, Romans,
countrymen" or "Shall I compare thee to a summer's day?" from a test of
intuitive detection than to include them and either miscount them as
true detection or toss them out as obviously remembered, in which case,
you would have to wonder why we bothered to include them in the first
place. We would guess that Shakespeare's most archetypical passages are
also his most familiar, and they are more likely to be tests of memory
than of intuition. If we succeeded in choosing unfamiliar passages, the
ones we chose are no doubt less distinctive than passages randomly
selected from Shakespeare -- but the alternatives, choosing familiar
passages, or even choosing at random without regard to familiarity,
would have been too big a waste of our takers' time and our own.
Steering clear of the most familiar passages alone was enough in our
pilot studies to make sure that only one of the students recognized even
one passage. The students' net accuracy was therefore essentially the
same as their gross accuracy because, for them, almost every passage was
a case of first impression. However, as we shall see, it wasn't enough
for SHAKSPER, especially for Shakespeare passages, and we had to correct
for it.
"Identify" means that we asked takers to tell us outright whether they
recognized each passage. The responses showed us that, with a group as
sophisticated as SHAKSPER, our efforts to avoid familiar Shakespeare
passages were not always successful. Our worst choice from this
standpoint was a passage from Twelfth Night, which was recognized by 45%
of all the takers and 75% of the Rated takers. Two other play passages
and one Shakespeare sonnet got 20-30% recognition from the whole group
and 40-50% recognition from the Rated players. The other Shakespeare
questions averaged maybe 7% recognition for the whole group, 18% for the
rated group. The overall average recognition rate was three or four
times higher for Shakespeare than for non-Shakespeare, and twice as high
for rated players as for the group as a whole. That is, on average, 15%
of the whole group and 29% of the rated group recognized our Shakespeare
passages, and 4% of the whole group, and 8% of the rated group
recognized the non-Shakespeare passages. By the same token, however,
they did not recognize 70-85% of our Shakespeare passages, and 92-96% of
our non-Shakespeare passages, making these fully and properly testable by
our methods.
"Exclude" means we tried, using simplifying assumptions, to find a way
to exclude recognized passages from our accuracy estimates. To get from
gross accuracy, which is inflated by recognition, to net accuracy, which
is not, we assumed that essentially all recognition identifications
would be correct and subtracted them both from the group's total correct
answers and from its total valid takes. Spot-checking the top two
categories, who got a quarter of the recognitions, indicates that the
assumption is not quite true, only 96% true. Four percent of their
supposed recognitions were wrong. But the actuality is close enough to
100% to allow us to use it as a simplifying assumption, which would very
slightly overstate net accuracy. It is preferable to calculating the
impact of each recognition separately and manually for 2,240 answers,
and far preferable to using gross accuracy only.
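
In code, the simplifying assumption amounts to the following sketch. The function name and the sample numbers are hypothetical, chosen only to show the arithmetic; they are not taken from the test data.

```python
def net_accuracy(correct, takes, recognized):
    """Net accuracy, assuming every recognized passage was answered correctly.

    Recognized answers are removed from both the numerator (correct
    answers) and the denominator (valid takes).
    """
    unrecognized = takes - recognized
    if unrecognized <= 0:
        raise ValueError("no unrecognized takes left to score")
    return (correct - recognized) / unrecognized


# Hypothetical taker: 22 right out of 28, of which 4 were recognized.
gross = 22 / 28
net = net_accuracy(correct=22, takes=28, recognized=4)
print(f"gross {gross:.1%}, net {net:.1%}")
```

A hypothetical taker with 22 right out of 28, four of them recognized, drops from about 79% gross to 75% net. Because a few supposed recognitions were in fact wrong, this shortcut slightly overstates net accuracy, as noted above.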
This exclusion process lowered the average of individual percentages by
one to eight percent, with the greatest reductions for Shakespeare and
rated players, where recognition was very high, and the least for
non-Shakespeare and the whole group, where recognition was lower.
Overall reductions from gross averaged individual accuracy to net were
6% for the rated group and 4% for the whole group. Here is a global
summary of all our averages, with averaged individual accuracy
percentages above and aggregated group accuracy below:
Table II. Average Gross and Net Accuracy Rates, Individual and Group

                     All, gross   All, net   Rated, gross   Rated, net
Shakespeare             66%         60%          76%           66%
Non-Shakespeare         66%         66%          81%           80%
All                     67%         63%          79%           74%
Aggregated (Group)      79%         79%          93%           82%
This is our most important, bottom-line table, which gives the vertical
average of individual scores (that is, the sum of all correct answers
divided by the sum of all unrecognized takes), above, and the
horizontally aggregated, majority-rule-for-each-question group score,
below, gross figures to the left, net to the right. The key figures are
now the net ones. What leaps out from it to our eye is (1) that netting
out the recognized answers, unsurprisingly, cuts the Shakespeare
accuracy percentages much more than the non-Shakespeare; (2) it narrows
the gap between the Rated players' averaged individual scores and those
of the whole group slightly, from 12 points to 11 points, thanks mostly
to lower net Shakespeare recognition, and (3) surprisingly, it cuts the
gap between the two groups' aggregated group scores from 14 percentage
points to only three. Netting for recognition made no difference at all
for the whole group's aggregated accuracy score of 79%, but cut the
Rated group's aggregated accuracy score from a dizzying 93% to 82% --
only slightly higher than the whole group's despite the rated group's
much higher individual accuracy.
I haven't fully worked this out with Valenza, but he thinks the basic
forces at work are mathematical. For a fixed error rate the majority
decision gets better as you increase the size of the voting population,
and the convergence benefits should be less pronounced with a smaller
population, even if it is more skilled, especially if you have already
squeezed out most of the group's latent accuracy by aggregation. If you
are innocent, he notes, you would be better off in principle with a jury
of four 60%-accurate jurors and a 1% chance of conviction than with a
jury of two 80%-accurate jurors and a 4% chance of conviction. In
practice, majority rule, different levels of skill among the
test-takers, and different levels of difficulty among the passages
complicate this. Again Valenza: "If students vote on true/false
answers on a math test, and certain questions are out of the reach of
all, aggregation won't help at all on such questions, so their aggregate
hit rate will converge on the number of easy questions and stick there."
See Section VI below for some striking examples of convergence among
SHAKSPER respondents, both in getting right many passages that now look
easy, and in getting wrong a few passages that now look hard.
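
The fixed-error-rate case can be illustrated with a short Python sketch of majority voting. It assumes that every voter is right independently with the same probability on a given question, which, as Valenza's caveat makes clear, is an idealization. The function name and panel sizes are arbitrary choices for illustration.

```python
from math import comb


def majority_accuracy(n, p):
    """Chance that a strict majority of n independent voters is right.

    Each voter is assumed to be right with the same probability p.
    Odd n is used below so that ties cannot occur.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n))  + p**n


# Same 65% individual accuracy, panels of increasing (odd) size.
for size in (1, 3, 25, 81):
    print(size, round(majority_accuracy(size, 0.65), 3))
```

With individual accuracy held at 65%, the majority's accuracy climbs as the panel grows. On questions where individual accuracy falls below 50%, the same arithmetic works in reverse and a larger panel converges on the wrong answer, which is one way to read the firmly wrong answers discussed in Section VI.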
Whatever this says about the value of screening, it suggests that
aggregation can still boost a group's net accuracy significantly, 16
percentage points for the whole group, six for the rated group, and that
aggregated group accuracy is almost four out of five for the whole group
and almost five out of six for the rated group.
V. Other discounts: Honor system, replicability, choice of samples, and
sample size.
There were several important differences between our SHAKSPER panel and
our prior student pilot panels. Though most of the students had read or
seen several high-school Shakespeare favorites like Julius Caesar,
their recognition of our supposedly obscure passages was much, much
lower. So were their stakes in the outcome, their eagerness to take the
test, their expectations of their own performance, and their overall
Shakespeare investments. They had nothing to lose from a low score, no
incentive to pump up their scores, and little opportunity to do so
either, since they all took the same test on paper at more or less the
same time in more or less the same place and didn't get the answers till
the tests were all in.
SHAKSPER was a different matter. Its members are heavily invested in
Shakespeare, often with a conspicuous attachment to one side or the
other of a hot debate. Many of them trust much more to their intuition
than to our stylometrics. Many are pros; a few are A-list Shakespeare
celebrities. They are highly knowledgeable and perfectionist. They
have hard-earned reputations (or at least hopes of them) which they can
trade on and don't want to jeopardize. For many issues, especially
abstract, symbolic ones or obscure, complicated, technical ones like
Stylometry, it is often reputations, more than evidence, that seem to
carry the day. One-upmanship, though less blatant and all-consuming
than on HLAS, is still very much the coin of the realm on SHAKSPER.
Many members have a stake in the outcome and much to lose from getting a
known low score. This means that the incentives not to take the test or,
taking it, not to rest content with a low score -- far less to let the
results be bruited around -- are much stronger than they were for our
students. It's hardly surprising that Shakespeare A-list grandees did
not hasten to put their name to our test; they had much to lose and
little to gain from it. And a web-based test like ours, which doesn't
take your name before the test, and which tells you the correct answers
afterward, is easier to take advantage of than a names-taken paper test
of the same people in the same room at the same time.
Anyone who offers such a test to an audience like SHAKSPER has to deal
with tradeoffs between what you have to do to get people to take the
test at all and what you have to do to keep them from giving biased or
inflated results. Many social scientists would have wanted us to build
in hard controls on bias and inflation: strictly randomize the takers;
have a control group; don't tell anyone the answers; make sure they
can't easily copy or Google the test; give everybody a name or a code
and put cookies in their computers to make sure they can't take it
twice; or, best of all, tell the ones from Canada and the UK not to take
the test and make the others all come and get tested in the same room at
the same time with a timer and a monitor present, just like the College
Board, which has excellent reasons to take such precautions for a
high-stakes test.
Most of these hard controls seem to us inappropriate for a group like
SHAKSPER, too off-putting, too impractical, too pointless, or too easy
to get around. We chose soft controls. We tried to make the test as
inviting, non-threatening, non-onerous, and rewarding as we could,
short, net-based, and with as much anonymity and feedback as anyone
could want. We tried to keep the perceived stakes as low as we could.
We limited the experiment to ten days. We asked people not to retake
the test or discuss the questions online while the test was going on.
In short, we relied heavily on the honor system and soft controls to
keep the test one of first impression.
We think we did the right thing, and believe that test abuse was close
to zero. We found only two obvious retakes, both innocent, and both
later self-identified for us by the takers. The rest (admittedly a
partly qualitative judgment) look legitimate to us. If there are a few
fudged ones, it is extremely unlikely in a test with this many takers
that they would change the outcome by more than a percent or so, or,
perhaps more important, that the change, if any, would overstate the
group's accuracy. None of the comments we have seen so far are
concerned in the least with overstating the group's accuracy; the
overwhelming concern has been with understatement.
Three such criticisms, suggested by Jim Carroll, Bob Grumman, and a
couple of offline correspondents, are that the samples are too short
(Carroll and the correspondents), the test too long (Grumman), and the
Shakespeare passages insufficiently distinctive (Carroll). We have
previously discussed the last point. If you are testing for Shakespeare
detection, and not for Shakespeare recall, you want passages that not
everyone knows, and it would hardly be surprising if these were less
distinctive than, say, "Lay on, Macduff," or "my kingdom for a horse."
As for lengthening the passages, it is possible that longer passages
could be easier to identify by intuition, as they surely are by
stylometrics. But the costs of using much longer passages seem to us
prohibitive. The only practicable choices for a test like this are many
passages and short or few passages and long. "Many and long" is not a
real option because it would make the test much too long to take, even
for SHAKSPERians. Judging from just a few e-mails from takers, our
test, with 31 sonnet-length passages, takes about 15-20 minutes to
finish with snap judgments, and up to 40 minutes or so if the judgments
are more studied. Suppose we tripled or quadrupled our passage length
to make it comparable to the shortest passages you can reasonably expect
to test with computers, that is, to 500 words instead of 140. It's hard
to imagine such an expanded test taking less than an hour and easy to
imagine, not just Bob Grumman's, but every test taker's patience and,
worse, their focus, wearing thin. And would we still hear complaints
that the longer passages were also too short? We can't rule it out. It
is not wise to make your test a Marathon if you want your takers to take
it, finish it, and pay close attention to it all the way through.
It's also possible, both with SHAKSPER and with HLAS, to imagine a few
dedicated Marathoners whose patience and focus would not wear thin, and
who would not just eyeball the text, but would spend whatever time it
took to comb through the passages for tell-tale words like "Dunsinane"
or "Osric," or, worse, comb it for those deplorable, countable
stylometric tell-tales that people like us spend our days looking for,
and that intuition is supposed to make superfluous - feminine endings,
open lines, contractions, hendiadys, incongruous who's, and the like.
Worst of all, they might Google the passage; it wouldn't be hard to do.
Whatever you may think or say of such rational, left-brain deployment,
it is not intuition, and long, high-stakes passages invite much more of
it than short, lower-stakes passages.
Though shortening the test, to, say, eight long passages, instead of 31
short ones, might at first glance seem to get the test back to a
reasonable length, we would expect it, if anything, to raise the stakes
on each passage, and, with it, the temptation to study each passage
harder -- and much longer - and, again, supplement or supplant the
right-brain intuition we are trying to test with left-brain deployment,
which has nothing to do with intuition. What would our most predictable
critic, Jim Carroll, say of a test with just three or four long
Shakespeare passages? That it was just the ticket and welcome evidence
of our growing methodological sophistication, or that it, too, had grave
methodological shortcomings and was not ready for prime time - with far
too few passages for a fair test of Shakespeare or anyone else, yet
still was too much of a Marathon to ask of reasonable takers? There are
many other serious problems with few-but-long passages - they are less
broadly representative and, hence, more subject to variability; they are
much more vulnerable to recognition; it's much harder to find ones that
aren't; the distortion costs of using one like our unfortunate Twelfth
Night passage that everyone turns out to know is much higher; and you
would have a lot of explaining to do if you tried to make
apples-to-apples comparisons of irreparably short passages like Shall I
die? with longer ones -- but we think the ones we have discussed should
be enough to make our case. We don't pretend to have said the last word
on this subject, and, as always, we invite others to try different
tradeoffs than the ones we used, but nothing is either good or bad but
alternatives make it so, and neither lengthening the passages nor
lengthening the test strikes us as much of an improvement.
VI. Identification Hits.
As with our student panel, most of our SHAKSPER answers to each
question, Shakespeare or non-Shakespeare, right or wrong, showed very
high intra-group agreement as to whether or not the passage was by
Shakespeare, and also showed high agreement between the whole group and
the Rated group. No more than 7% of the aggregated answers look like
tossups. The other 93-96% show majorities of 57% or more. If numbers
like these were reported in a national election, everyone would consider
it a landslide (Table III). I call this consensus; Valenza calls it
convergence.
Table III. Group Consensus: Very High, but Not Always Correct
All figures net accuracy

                Shakespeare       Non-Shakespeare     All
High consensus, questions answered correctly
  Full panel    9 (68-80% maj)    11 (59-100% maj)    20 (59-100% maj)
  Rated only    11 (64-88% maj)   12 (57-100% maj)    23 (57-100% maj)
High consensus, questions answered incorrectly
  Full panel    3 (64-67% maj)    3 (57-69% maj)      6 (57-69% maj)
  Rated only    2 (73-78% maj)    2 (57-58% maj)      4 (57-78% maj)
Tossups, all incorrect
  Full panel    2 (51% maj)       0                   2 (51%)
  Rated only    1 (53% maj)       0                   1 (53%)
No tossups were correct
This means that both panels had high consensus on 26 or 27 out of the 28
questions and were closely divided on only one or two. Looking at
high-consensus answers only, the full panel got 20 of 26 (77%) firmly
right, in gross, and the other six firmly wrong. The Rated panel got 23
of 27 firmly right (85%) and the other four firmly wrong. We'll skip
the details of the impressive Shakespeare "firmly rights" and go
straight to the equally impressive Non-Shakespeare "firmly rights."
Table IV shows that neither panel had much trouble with most
non-Shakespeare authors represented.
Table IV. Eleven Non-Shakespeare Hits
Percentages who thought it non-Shakespeare (full panel/rated only)
Listed in declining order of Rated percentages.
All percentages net.

Passage            Full panel/Rated only
Bacon poems        87/100%
Middleton          89/100%
Chapman            82/100%
Spenser            78/96%
Fletcher           67/91%
Daniel             75/87%
Marlowe II         71/83%
Shall I Die?       65/82%
Earl of Oxford     59/79%
Marlowe I          70/78%
Jonson             60/72%
All of these seem like solid hits to us, both according to what we see
as the orthodox consensus and according to what our computer evidence
has done to confirm it. None of these tested passages seem likely to be
Shakespeare's. Not everyone agrees with us or the orthodox consensus on
every passage, but the important point here is that remarkably few of
our test-takers thought these passages sounded like Shakespeare. We
would hardly consider numbers like these a humbling outcome for the
group that produced them. Shall I Die? was the only one of these widely
recognized (by 31%/54% of the two panels), but few of those who did not
recognize it thought it was Shakespeare's. Would longer passages
greatly enhance these landslides? We doubt it; they are already so
lopsided it's hard to imagine longer passages changing things much, even
if they should be easier to identify. Do they signal that the group is
befuddled by too-short passages? It doesn't look like it.
VII. Identification Misses.
However, two outcomes, though equally convergent and consensual, were
not so impressive for accuracy, and a third showed the full group at
odds with the Rated group (Table V).
Table V. Two and a half Non-Shakespeare misses
Passage           Percentages who thought it non-Shakespeare
                  (Full panel/Rated)
All percentages net
Oldcastle         31/42%
Drayton           43/43%
Funeral Elegy     41/57%
The Oldcastle and Drayton passages, one recalling a beleaguered-stag
scene from As You Like It, the other a sonnet from Drayton's Idea,
suggest that even strong majorities of both groups can be fooled by
well-turned, vivid, image-rich passages by other writers. It is also
possible that Drayton, who Henslowe says co-authored the play Sir John
Oldcastle (1600) with Anthony Munday, Robert Wilson, and Richard
Hathway, could have written both of the confounding passages. A second
edition of Oldcastle ascribed it to Shakespeare, and it was included in
the 1664 Folio and Brooke's Apocrypha, but we know of no one today who
seriously ascribes it to Shakespeare, and our tests say it's very
unlikely to be Shakespeare's work (our 2004, p. 402). No takers,
incidentally, recognized the Oldcastle passage and only one recognized
the one from Idea.
What about the Funeral Elegy? Donald Foster relied in part on computer
tests to prove that the Elegy "couldn't not be Shakespeare," and he did
speak of intuitive "sniff tests" with a hint of disdain. When Brian
Vickers' crushing counter case, Counterfeiting Shakespeare (2002)
loomed, and Foster abandoned his Shakespeare ascription, the dull,
pious, pedestrian Elegy of the eye instantly became Exhibit A for those
who say you should always trust your gut instincts, never anyone's
computers.
If so, and if the whole SHAKSPER group's intuitions were to be taken as
the only valid test, Foster should have stuck to his guns on authorship
and reconsidered his disdain for sniff tests. Only 6% of the whole
panel recognized our Elegy passage, and 59% of those who didn't thought
it was Shakespeare's! On the other hand, a net 57% of the Rated Panel
thought it was not Shakespeare. Our tests say that Foster did the right
thing to concede, and that the Elegy is on a different statistical
planet from Shakespeare, though it could easily be by Ford (our 2001).
Maybe it's only the best ears we are supposed to listen to, but it's not
easy to know in advance which ones those are on any given passage, and
they are not always connected to the best mouths. On balance, the
conflict between rated ears and all ears does little to make the Elegy
look like a reliable success story for detection by gut instinct.
Table VI shows two Shakespeare misses for both panels, and two equivocal
tossups.
Table VI. Two Shakespeare misses and two more divided panels
Passage                Percentages who thought it Shakespeare
                       (Full panel/Rated)
All percentages net
The Rape of Lucrece    38/22%
Pericles Act V         33/47%
Love's Labor's Lost    49/64%
Venus and Adonis       49/59%
Only one taker recognized our passage from The Rape of Lucrece, and very
few of the others thought it was Shakespeare's, fewest of all, oddly, on
the Rated panel, which was otherwise generally more accurate than the
whole group. Seven takers recognized our passage from Pericles, Act V,
but two-thirds of the whole panel, and 53% of the Rated panel, thought
it was not Shakespeare's. Pericles is generally considered co-authored
by Shakespeare and George Wilkins; scholarly consensus gives Acts 3-5 to
Shakespeare, and our tests agree with it.
The passages from Love's Labor's Lost and Venus and Adonis were
recognized by 19 and three takers, respectively; for the others, the
whole group was divided half and half, but clear majorities of the Rated
group correctly ascribed them both to Shakespeare. It's not clear how
to score these. Two misses and two tossups? Or two misses and two
half-hits? Neither seems an unequivocal success story for these
identifications.
VIII. Three shots in the dark.
Three passages on the test were not scored, since scholarly consensus as
to who wrote them is not settled. But we tested them anyway in case we
found the group's instincts helpful in determining actual ascriptions.
This is wholly uncharted territory, but, if we had a computer test that
looked like it might be 80% accurate, we might not bet a thousand pounds
on it, as we have on some of our computer tests, but we certainly would
not want to let it sit on the shelf unexplored. The same may be said
for ascription by gut instinct. With tweaking, it can reach 82% group
net accuracy for passages of known authorship, and we can't imagine
SHAKSPERians not being curious as to what it says about passages of
unsettled authorship. Table VII gives the outcomes:
Table VII. SHAKSPER's Group Ascriptions for Three Doubtful Passages
Passage                Percentages who thought it Shakespeare
                       (Full panel/Rated)
All percentages net
1H6                    87/89%
A Lover's Complaint    42/26%
Edward III             68/73%
14% and 25% of the panels recognized another beleaguered stag scene from
1H6, Talbot before Bordeaux. 87/89% of those who did not recognize it
thought it was Shakespeare, one of the most lopsided majorities on the
test. Gary Taylor assigns the scene, 4.02, to Shakespeare; Paul Vincent
thinks it is co-authored by Shakespeare and "Author Y." Marcus Dahl
could find no hand but Shakespeare's in the whole play. Our own tests
are ambivalent on the scene as a whole. It looks like a Shakespeare
could-be by all our regular tests, but an improbable by one new test.
The passage itself is much too short for our tests. We lean toward
Vincent's view of the whole scene, but the stylometric evidence is hard
to judge - much harder, it seems, than the intuitive evidence.
SHAKSPER's judgment in this case is consistent with all four views of
1H6, though one could argue that SHAKSPER's judgment on non-Shakespeare
beleaguered-stag passages is not terribly reliable.
Perhaps surprisingly, only one person recognized the passage from A
Lover's Complaint. Could it be more discussed these days than read? Of
the many who did not recognize it, 58/74% thought it was not
Shakespeare. MacDonald Jackson, Kenneth Muir, and most scholars of the
late twentieth century have assigned LC to Shakespeare. Our best guess
(our 1997 and 2004), and Brian Vickers' (his 2007), and Marina
Tarlinskaja's is that it is not. SHAKSPER's judgment favors the
doubters, but, again, one could argue that SHAKSPER's group judgments on
Shakespeare poems outside the Sonnets don't look like its most reliable.
Five and eight percent of the two panels recognized the Countess Scene
passage from Edward III, which we take to be a recent addition to the
consensus Canon. Our tests say it could be Shakespeare, and 68/73% of
the two panels seem to agree.
IX. How SHAKSPER's Round 1 compares with other takers.
If we were to seek a kind of "control group" to compare with SHAKSPER's
takers, we would turn to several Claremont student groups who have taken
the test, or a precursor, or we would look to our computer tests. Four
student groups have taken this, or a previous test: Valenza's
preceptorial class of entering freshmen in 1995; the Claremont Rugby
Side on a bus in New Zealand in 2002; Ann Meyer's Class on Shakespeare's
Tragedies in 2002; and a sprinkling of Philosophy, Politics, and
Economics alumni volunteers in 2007. We have also, over the years,
identified a couple of miscellaneous Golden Ears from casual takers, and
we have gone for advice to a small group of pros, most notably MacDonald
Jackson and John Farrell, but also Brian Vickers, Lisa Hopkins and Matt
Steggle, none of whom should be deemed in any way responsible for any of
our test's shortcomings. None of our tested groups got the extensive
recordkeeping and analysis that we have given the SHAKSPER groups, but
it is safe to say from memory that SHAKSPER outperformed the
preceptorials, the Rugby Side, and the PPE students.
Whether it outperformed the Claremont Shakespeare class is not so clear.
We no longer have the students' individual scores, nor their
individual averages, but they were the least casual, and the
most-studied of our pilot groups. Here is an adaptation of a posting we
sent to SHAKSPER in 2004:
Table VIII. A Gross-Score Comparison of SHAKSPER with Claremont Pilot Group
                    Claremont Students          SHAKSPER
                    2002                        2007
Worst individual:   54% correct, gross          50% correct, gross
Best individual     79% correct, gross          89% correct, gross
Best combined       84% correct, gross, n=6     93% correct, gross, n=24
All combined        89% correct, gross, n=12    79% correct, gross, n=80
18 of the student group's 25 successful identifications (72%) were by
lopsided votes, two-to-one or higher. 16 of the whole SHAKSPER group's
22 successful identifications (73%) were lopsided in this sense, as were
20 of the Rated group's 24 successful identifications (83%).
A better comparison, since SHAKSPERians, on average, recognized about a
tenth of the passages and the students next to none, would be between
the students' gross scores, essentially equivalent to their net scores,
and SHAKSPER's net scores. Such a comparison would look like Table IX:
Table IX. Net-Score Comparison of SHAKSPER with Claremont Pilot Group
                    Claremont Students          SHAKSPER
                    2002                        2007
Worst individual:   54% correct, net            44% correct, net
Best individual     79% correct, net            87% correct, net
Best combined       84% correct, net, n=6       82% correct, net, n=24
All combined        89% correct, net, n=12      79% correct, net, n=80
It would not be surprising if a group of 80 got higher-highest and
lower-lowest scores than a group of 12. It's mildly surprising that the
smaller, far less knowledgeable, less motivated group got higher group
accuracy than the larger, more knowledgeable, more motivated group, but
the smaller the group, the more likely is its outcome to be a fluke.
Recall, for example, that the Shakespeare class did much better than the
other three Claremont student groups tested. Could that have been a
fluke? Perhaps it was also mildly surprising that the whole of the
student group, aggregated, did better than the top half, aggregated,
which, in turn, did better than the best individual. For the students,
aggregation was a bigger boost to accuracy than screening. If so, the
reverse SHAKSPER outcome, where the best individual surpassed the best
group, which surpassed the whole group, would not be a surprise. For
SHAKSPER, screening seems to have been a more powerful tweak than
aggregation. There must be a literature on this, perhaps to be gleaned
from James Surowiecki's footnotes, but we have not studied it. For now,
we would say that the two sets of results are remarkably similar and we
would guess, where they differ, that the larger SHAKSPER panels give a
better notion than the student panel of what you can reasonably expect
to do with intuition and what you cannot.
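
For readers who want to try the two tweaks on their own numbers, here is a minimal sketch in Python of screening (restricting the vote to a chosen panel) and aggregation (majority rule on each question, counting only takers who did not recognize the passage). The function name, the ballot layout, and the toy data are illustrative assumptions, not the scoring program actually used for the test, and counting ties as misses is only one possible convention.

```python
from collections import Counter


def majority_accuracy(ballots, truth, panel=None):
    """Share of questions that the net majority answered correctly.

    ballots: iterable of (taker, question, vote, recognized) tuples.
    truth:   dict mapping question -> correct vote.
    panel:   optional set of takers to screen for; None means everyone.
    Votes by takers who recognized the passage are dropped (net scoring).
    Ties are counted as misses (an assumption of this sketch).
    """
    tallies = {}
    for taker, question, vote, recognized in ballots:
        if recognized or (panel is not None and taker not in panel):
            continue
        tallies.setdefault(question, Counter())[vote] += 1
    if not tallies:
        raise ValueError("no unrecognized votes to score")
    right = 0
    for question, counts in tallies.items():
        ranked = counts.most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        if not tied and ranked[0][0] == truth[question]:
            right += 1
    return right / len(tallies)


# Toy data, purely hypothetical: "S" = Shakespeare, "N" = non-Shakespeare.
truth = {"q1": "S", "q2": "N", "q3": "S"}
ballots = [
    ("A", "q1", "S", False), ("B", "q1", "S", False), ("C", "q1", "N", False),
    ("A", "q2", "N", False), ("B", "q2", "S", False), ("C", "q2", "S", False),
    ("A", "q3", "S", True),  ("B", "q3", "S", False), ("C", "q3", "N", False),
]

print("whole group:", majority_accuracy(ballots, truth))
print("screened panel:", majority_accuracy(ballots, truth, panel={"A"}))
```

In this toy case the screened one-person panel beats the aggregated whole group, which is the SHAKSPER pattern; different data can just as easily show the student pattern, where aggregation helps more than screening.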
It would be possible, with more manual counting than we want to do now,
to compare the net accuracy of Shakespeare pros with those of amateurs,
and to see whether literary critics did any better or worse than stage
performers, artists, scientists, and so on. It's all buried in the data
and retrievable in principle, but it would not be easy, and it is not at
the top of our agenda. We would guess from the similar accuracy levels
of SHAKSPER and the student group, and especially from the surprisingly
similar gross accuracy levels of SHAKSPER's own pro and amateur
respondents, 68% and 66%, respectively, that the difference in pros' and
amateurs' net accuracy, if any, would be barely detectable, and not
necessarily favorable to the pros, whom we would expect to recognize
more passages than amateurs. Table X also shows a remarkably narrow
range of average gross accuracy scores for the various subgroups identified:
Table X. Gross Accuracy Scores of Identified Subgroups
Subgroup         Number    Gross Average    Gross Average
                           Score            Accuracy %
Professionals    31        18.9             67.5
Amateurs         49        18.6             66.4
Critics          26        18.7             66.8
Writers          14        18.9             67.5
Artists          33        18.6             66.4
Other            21        18.0             64.3
Some of the categories overlap. "Other" is mostly people who declined
to state a category.
Would net accuracy differ greatly from these gross accuracy scores? We
don't know, but it seems improbable.
One might imagine that writers and artists would be more intuitive, and
critics more analytical (see Simonton, Origins of Genius, 1999), but the
average accuracy of the three categories is virtually identical. We did
not ask anyone to list their IQs or their verbal and math SAT scores,
considering it too nosy and off-putting even for us, but we would have
loved to have had them, and maybe their college majors as well. Four of
our best six students in the 2002 pilot study were science majors; the
two best of all were science majors from Harvard-bright Harvey Mudd
College; the others were from our college, Claremont McKenna College,
whose students, on average, are only Columbia-bright. We have to wonder
whether any of our SHAKSPER takers were scientists.
Another way of looking at such questions on a smaller scale might be to
look at the top end only, Golden and Silver Ears, where we would expect
recognition to be at its greatest. Conveniently, there are seven Golden
Ears, six of them pros, and six Silver Ears, five of them amateurs. The
86%-pro Golden Ears said they recognized, on average, a remarkable 27%
of all questions; the 83%-amateur Silver Ears claimed to have recognized
14% of all questions, half the Golden Ears' rate, but more than twice
the rate of the other five-sixths of the group, which was 6%. Golden
and Silver ears are only a sixth of the whole group, but the big
difference between the two top groups, one mostly pro, the other mostly
amateur, is fully consistent with our commonsense guess that pros would
recognize more passages than amateurs.
If so, we can roughly calculate the two top groups' net accuracy, as
explained in Section IV above, simply by subtracting recognitions from
both correct answers and total takes of each question, giving us net
correct answers as a fraction of net takes. This cuts average
Golden-Ear accuracy from 87% gross to 81% net, and Silver-Ear accuracy
from 79% gross to 75% net. With recognized passages removed, both
groups got fewer passages right in fewer questions, with the
high-recognition Golden Ears, not surprisingly, losing more accuracy
than the lower-recognition Silver Ears. Netting for recognition, in
effect, reduces Golden Ears to Silver, and Silver to Bronze, with the
top pros, on average, still retaining a 6% detection accuracy edge over
the top amateurs, with 81% net accuracy, versus 75%. Their gross edge
had been 8%, 87% versus 79%.
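
The netting arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample taker are hypothetical, and the sketch assumes, as the netting method does, that every recognized passage was also answered correctly.

```python
def net_accuracy(correct: int, takes: int, recognized: int) -> float:
    """Net accuracy: recognitions are subtracted from both the correct
    answers and the total takes, leaving net correct over net takes.

    Assumes every recognized passage was answered correctly.
    """
    net_takes = takes - recognized
    if net_takes <= 0:
        raise ValueError("no unrecognized takes left to score")
    return (correct - recognized) / net_takes


# Hypothetical taker: 28 questions, 24 answered correctly, 7 recognized.
gross = 24 / 28
net = net_accuracy(correct=24, takes=28, recognized=7)
print(f"gross {gross:.0%}, net {net:.0%}")
```

Because the same count comes off the top and the bottom of the fraction, netting lowers the score of anyone who got at least one question wrong, and lowers it most for the takers who recognized the most passages.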
Suppose we sought an aggregated group score for the Golden Ears only, by
majority rule on each question? It's not clear that they would score
much higher than the rated group as a whole (82%), which, let us recall,
was only three points higher than the whole group (79%). All 7 Golden
Ears recognized four of the 28 passages. For the remaining 24 passages,
where there was some unrecognition to be had, there was a net correct
majority on 20, a net incorrect majority on three, and a 50-50 tossup on
one. If you put aside the tossup altogether, that would give the Golden
Ears 20 right out of 23 (87%). If you give half-credit for the tossup,
they get 20.5 right out of 24 (85%). If you count the tossup, but give
no credit for it, they get 20 right out of 24 (83%). Any of these would
be arguable, but we would consider the most conservative of them, 83%,
the most defensible.
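
The three ways of handling the tossup can be checked with a short Python sketch. The function name and the rule labels are illustrative; the counts (20 net correct majorities, three incorrect, one tossup) are the ones given above.

```python
def group_scores(right: int, wrong: int, tossups: int) -> dict[str, float]:
    """Group accuracy under three conventions for scoring tossups."""
    decided = right + wrong
    total = decided + tossups
    return {
        "tossup set aside": right / decided,
        "half credit": (right + 0.5 * tossups) / total,
        "no credit": right / total,
    }


# Golden Ears: 24 passages with at least one unrecognized answer.
scores = group_scores(right=20, wrong=3, tossups=1)
for rule, value in scores.items():
    print(f"{rule}: {value:.0%}")
```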
We would conclude from this that the Golden Ears were far ahead of all
others in recognition of passages, and 6% better than the Silver Ears in
net intuitive detections of unrecognized passages, but horizontal
aggregation doesn't seem to boost group accuracy as much at the highest
level as it does for the whole group. Unlike the Claremont students in
the pilot study, the best individuals in the SHAKSPER group did better
than the whole group aggregated, and better than the best of the group
aggregated. Several of these individuals also did better than the best
Claremont student, even after netting out recognized passages. Net
group accuracy for Golden Ears could be five out of six (83%), but it
would take a bit of indulgence to get it to six out of seven (86%). Top
pros look better than top amateurs at recognizing passages, and it is
probable that this is also so of all pros, though we haven't tested for
it. We have very little evidence that any category of taker surpasses
the others on average.
X. How the SHAKSPER group compares in accuracy with stylometric tests.
Here is what we said about the Claremont pilot study in 2004:
How does this accuracy compare with that of our best quantitative tests?
"Far higher" would be a persuasive answer for such short,
Sonnet-length samples. All of our quantitative tests are sensitive to
sample length because longer samples average out more variance than
shorter ones, giving us tighter ranges and higher discrimination for
long samples than for short. Most of the samples we used in our Golden
Ear test have no more than 150 words, far shorter than any for which we
have dared to validate any of our quantitative tests. For comparison,
our current estimated composite accuracy rates for longer,
single-authored passages look something like this:
Text                       Shakespeare    Non-Shakespeare
Whole plays                100%           100%
Poems, 3000 words          100%           100%
Play Verse, 3000 words     95%            100%
Poems, 1500 words          100%           100%
Play Verse, 1500 words     96%            88%
Poems, 750 words           93%            71%
Play Verse, 750 words      97%            75%
Poems, 470 words           92%            73%
Not much has changed since then. Our accuracy figures remain the same,
and the SHAKSPER Golden Ear outcomes are similar to those from our
student pilot study, only slightly higher for the best individuals, but
somewhat lower for the group. SHAKSPER's double-tweaked accuracy from
intuition alone is not as good as our tests have been on samples of
1,500 words or more, but it's in the same ball park with our accuracy
for passages of 470 to 750 words - and it is far higher than we would
expect any or all of our tests to do on the very short passages tested,
which averaged about 140 words. If Golden Ear had been tried as a
stylometric test, it probably would not quite have met our test criteria
- of around 95% reliability in saying "could be" to known Shakespeare,
and at least 20% reliability in saying "couldn't be" to known
non-Shakespeare, but that is because we rely on negative evidence and
are far less tolerant of false negatives than of false positives.
Without question, intuition is far better than all our tests combined on
the sonnet-length passages we tested.
As we explained in Section V, it is conceivable that using longer
passages would raise Golden-Ear accuracy, but doubtful that anyone could
devise practicable intuitive tests for, say, the 1,500-word or
3,000-word passages for which we consider our stylometric tests to be
well validated. If we, or anyone else, could find and offer 28
3,000-word passages for identification, the test would be equal in
length to Hamlet, Macbeth, Romeo and Juliet, and The Comedy of Errors
combined and would take more than a day just to read, let alone analyze,
in entirety. In general, the longer the passages, the fewer can be
tested without expecting miracles of motivation. From this perspective,
Golden-Ear testing may be almost as impractical for wholesale testing of
long passages as computers are for testing short passages.
XI. Conclusions
In sum, after much tweaking and netting, SHAKSPERians as a group seem
capable of getting almost four out of five, or five out of six,
identifications right, and their very best individual did a bit better
than that, with net accuracy reaching 87%. However, only two of 80 test
takers got net accuracy higher than 83%. The average Golden Ear had 80%
net accuracy, silver 75%; the average individual SHAKSPERian taker had
63% net accuracy. SHAKSPER's overall performance roughly matched the
best of our student pilot groups, with aggregated group accuracy
slightly lower, and the accuracy of the very top individuals somewhat
higher. SHAKSPER's recognition rates were much higher than those of the
Claremont pilot group, and its Golden Ears' rates much higher than the
rest of SHAKSPER. No one else has taken the test as seriously as
SHAKSPER. None of the subcategories of takers stood out as much better
or worse than the others, and the differences in gross accuracy between
pros and amateurs seem remarkably small. Intuition seems much more
accurate than stylometrics for very short, sonnet-length passages;
stylometrics seems more accurate than intuition for longer passages, but
an actual head-to-head comparison of many long passages seems impracticable.
XII. Golden Ear Round 2
Golden Ear Round 1 has given us what looks like a highly talented panel
of a dozen rated, identified SHAKSPERians, to take Golden Ear Round 2.
We haven't yet asked them whether any of them want their names or their
particulars, pro/amateur, writer, player, etc., disclosed to SHAKSPER or
anyone else. To these we might add up to eight or nine rated players
discovered in previous tests, if we can find them and get them to serve.
A couple of unrated SHAKSPERians want to take Round 2, and we would
not grudge them the experience, nor would we be above retroactively
including some or all of the dozen rated players who did not identify
themselves on the test, or, for that matter, other unrated players who
want another shot, but we need enough identification on Round 2 to
e-mail it to the proper recipients and to relate their Round 2 outcomes
to their Round 1 outcomes.
Unfortunately, we don't have Round 2 on the web and shall have to find a
way to mail or e-mail it to takers to do by hand, and to be scored by
hand. Like Round 1, Round 2 will have a known, scorable component, both
to help calibrate the test, which we would guess is more difficult than
Round 1, and to provide more data points to help see how much, if any,
of the high accuracy rates found at the top of Round 1 was luck of the
draw. It's one thing to find and congratulate the best guesser of the
number of beans in the Business School jar; it is quite another to
expect him to do it twice. Round 2 will also have some of its own shots
in the dark, passages whose authorship is not settled.
XIII. A last note on methodology
In some ways, it is astonishing, given the frequency and fervency of
declarations that intuition can outperform stylometry - or that
stylometry can outperform "sniff tests" - that no one has ever tried to
see whether, and to what extent, either proposition is actually so. We
have been trying to remedy that for twelve years and have now, at last,
thanks to the help of our student programmer, Ryan Wilson, and his
advisor, Arthur Lee, gotten a Round 1 survey up on the net and gotten an
excellent response from SHAKSPER, which has permitted us, at last, to
try a first cut at an answer. We have used the best methodology we
could manage, but who are we to proclaim that the tradeoffs we chose in
our first big outing are the ones that should bind all others? For an
exercise as novel as this, it would be surprising if further
experimentation with different parameters did not produce new insights
perhaps wiser and more penetrating than ours, and informed by our
mistakes, as well as by our successes. Every adventure is a
reconnaissance for the next, and this seems to us a question begging to
be explored from more than one perspective. As always, if anyone in or
out of SHAKSPER would like to try an experiment with different tradeoffs
than the ones we chose, and SHAKSPERians were willing to take it, we
would be happy to help them out. In the meantime, we consider our
tradeoffs reasonable ones and our evidence the best currently available.
We hope it will inspire better.
In the meantime, we would like, again, to thank the SHAKSPERians and
others who took our survey for giving it their full attention, and
especially for honoring our request to withhold their online comments
till the test was over, so as not to wreck the test for others. We now
welcome comments, online and off, but we hope the online ones will take
care not to give away too many specifics of the test, which cost twelve
years of pretesting and hundreds of dollars for programming to prepare,
and, now, many hours from SHAKSPERians to take and us to analyze. We
would just as soon keep it available for future use with different
groups, and as a standard against which other versions can be measured.
We hope that SHAKSPERians will help us keep it so, as much as possible.
[e-mail address removed] (the address, not the professor) was retired July
1, 2007; please use [e-mail address removed] instead.
Ward Elliott
Burnet C. Wohlford Professor of American Political Institutions
Claremont McKenna College
Pitzer Hall, 850 Columbia Ave.
Claremont, CA 91711-6420
(909) 607-3649
Fax (909) 621-8419
http://govt.cmc.edu/welliott
"Better grey words with crimson examples than crimson words with grey
examples."
_______________________________________________________________
S H A K S P E R: The Global Shakespeare Discussion List
Hardy M. Cook
The S H A K S P E R Web Site <http://www.shaksper.net>
DISCLAIMER: Although SHAKSPER is a moderated discussion list, the
opinions expressed on it are the sole property of the poster, and the
editor assumes no responsibility for them.