The Shakespeare Conference: SHK 29.0421 Tuesday, 4 December 2018
Date: December 4, 2018 at 3:59:15 AM EST
Subject: Re: SHAKSPER: NOS
I’m grateful to Brian Vickers for pointing us to an advance access copy of Pervez Rizvi’s new article critiquing the ‘micro-attribution’ work of Gary Taylor, John Nance, Doug Duhaime, and Keegan Cooper. Unfortunately, the PDF at that address is corrupt and hence unreadable; the journal publisher is looking into correcting this. (For many SHAKSPERians the article will, in addition to this problem, be behind a paywall.)
Meantime, I’d like to respond to some general principles raised by Vickers’s account of Rizvi’s work.
VICKERS: “Macier Eder argued forcefully for 5,000 words” as the minimum sample size for reliable authorship attribution work.
Assuming that Vickers means the essay “Does size matter? Authorship attribution, small samples, big problem” (‘Digital Scholarship in the Humanities’ 30 (2015): 167-182), the scholar’s first name is Maciej, not “Macier”, and the conclusion was that 2,500-5,000 words is the correct range, depending on various factors (including language) that Eder controlled for.
If Vickers thinks Eder is right, then he should acknowledge that the same essay also “argued forcefully” that ‘bags’ of randomly selected words make for more reliable tests than continuous passages do, a view that he (Vickers) has repeatedly rejected. Furthermore, Eder’s point about sample sizes cuts both ways, in two distinct respects.
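Eder’s contrast between the two sampling strategies can be sketched in a few lines. This is my own toy illustration, not Eder’s code; the function names are mine, and a real test would of course draw thousands of words from a whole play, not a one-line toy text:

```python
import random

def continuous_sample(tokens, size, start=0):
    """A contiguous passage of `size` running words from one place in the text."""
    return tokens[start:start + size]

def bag_sample(tokens, size, seed=0):
    """A 'bag' of `size` words drawn at random from across the whole text."""
    rng = random.Random(seed)
    return rng.sample(tokens, size)

# A toy text standing in for a play.
tokens = ("the quick brown fox jumps over the lazy dog and "
          "the dog barks at the fox in the quiet night").split()

passage = continuous_sample(tokens, 5)   # one local stretch of style
bag = bag_sample(tokens, 5)              # words drawn from everywhere in the text
print(passage)
print(bag)
```

The point of the ‘bag’ is that it averages out local quirks (a single scene’s subject matter, one character’s idiom) that a continuous passage of the same length cannot escape.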
First, if the small sample of text on which the ‘micro-attribution’ method works (such as the 63 words from Macbeth 4.1.143-150) is simply too small to test because, as Vickers writes, “bigger sample sizes are more reliable than small ones”, then it becomes impossible to show that Taylor et al. are wrong in their attributions. Vickers’s argument leads to an admission of defeat (nobody can tell who wrote so small a passage), not to the routing of attributions Vickers doesn’t like.
Secondly, the problem of small sample sizes bedevils work on the canon of Thomas Kyd most especially. As well as the size of the sample being tested, one must consider the sizes of the authorial profiles against which it is being tested. Using the Word Adjacency Network method, Segarra et al. in Shakespeare Quarterly in 2016 (that is, the team I’m in) chose to leave Thomas Nashe and Thomas Kyd out of their considerations entirely, because each has a secure dramatic canon of just one play (‘Summer’s Last Will and Testament’ and ‘The Spanish Tragedy’, respectively). Even if we add ‘Soliman and Perseda’ to Kyd’s canon, we still do not have enough text to build a reliable authorial profile using our method.
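To see why a one-play canon is too small, here is a drastically simplified sketch, my own, of the intuition behind a Word Adjacency Network profile: count how often one function word is followed by another within a short window, and normalise the counts into probabilities. (The published method uses a larger function-word list, a weighted window, and relative entropy to compare profiles; none of that is reproduced here.) With too little text, most of these transition counts are zero or near-zero, and the resulting probabilities are unstable:

```python
from collections import defaultdict

# A hypothetical mini-set of function words; the published WAN studies
# use a larger, carefully chosen list.
FUNCTION_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it"}

def adjacency_profile(tokens, window=5):
    """Count how often one function word is followed by another within
    `window` running words, then normalise each row into probabilities."""
    counts = defaultdict(lambda: defaultdict(float))
    pos = [(i, w) for i, w in enumerate(tokens) if w in FUNCTION_WORDS]
    for a in range(len(pos)):
        i, w1 = pos[a]
        for b in range(a + 1, len(pos)):
            j, w2 = pos[b]
            if j - i > window:
                break
            counts[w1][w2] += 1.0
    return {w1: {w2: c / sum(row.values()) for w2, c in row.items()}
            for w1, row in counts.items()}

text = "the quality of mercy is not strained it droppeth as the gentle rain".split()
profile = adjacency_profile(text)
print(profile)
```

On this thirteen-word sample every observed transition gets probability 1.0, which is exactly the problem: the profile mirrors the accidents of one short text rather than an author’s habits.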
Of course, if one simply assumes that Kyd’s canon is much larger than everyone has hitherto thought, this problem goes away. Vickers’s new edition of the Collected Works of Kyd will, he tells us, include not only ‘The Spanish Tragedy’ and ‘Soliman and Perseda’, but also ‘Arden of Faversham’, ‘King Leir’, ‘Fair Em’, ‘1 Henry VI’, and ‘Edward III’.
Pervez Rizvi has followed this assumption about the extended Kyd canon through in a set of essays on his website, using his new dataset and his method of interrogating it. The result is that ‘1 Henry VI’ and ‘King Leir’ in fact most closely match the style of Christopher Marlowe, not Kyd, and ‘Edward III’ matches Marlowe’s style if Rizvi uses his 3-gram method and Kyd’s if he uses his 4-gram method.
One really has to cherry-pick Rizvi’s work to find in it evidence for the extended Kyd canon. And, what is worse, one has to swallow some pretty unpalatable new evidence. Using his 4-gram method, Rizvi finds that ‘A Midsummer Night’s Dream’ is closest in style to George Chapman’s work; by 3-grams ‘Richard III’ is closest to Kyd’s; by 3-grams ‘The Taming of the Shrew’ is closest to Marlowe’s but by 4-grams that switches to Kyd’s; by 3-grams ‘The Two Gentlemen of Verona’ is closest to Kyd’s; and by 4-grams ‘Henry V’ is closest to Kyd’s.
In response to all this, some people will throw up their hands and say that attribution-by-internal-evidence research is simply unreliable. They should not. We should remember that internal evidence alone tells us that ‘Henry VIII’ is by Shakespeare and John Fletcher, that ‘Pericles’ is by George Wilkins and Shakespeare, and that ‘Timon of Athens’ is by Shakespeare and Middleton. These are things worth knowing, and they are the fruit of patient and painstaking work.
VICKERS: “Rizvi not only has superior mathematical knowledge, he can draw on the remarkable database of over 500 early modern plays that he has recently published.”
Rizvi has not, and does not claim to have, “published” a database of plays. His website gives the reader a ZIP file containing 510 plays, from which she has to manually delete 22 plays that are later than the period we are interested in, leaving 488 plays. Then she has to download 38 Shakespeare plays from the Folger website, bringing the total to 526. Rizvi’s work is based on a set of 527 plays, and the 527th, the Additions to ‘The Spanish Tragedy’ (“counted as a separate little play”), is not provided by him.
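For anyone assembling the dataset themselves, the tally described above works out like this (the figures are those given above; the labels are mine):

```python
zip_plays = 510   # plays in the ZIP file on Rizvi's website
too_late = 22     # plays later than the period of interest, to be deleted
folger = 38       # Shakespeare plays downloaded from the Folger website
additions = 1     # the Additions to The Spanish Tragedy, not provided by Rizvi

total = zip_plays - too_late + folger + additions
print(total)  # 527 plays underlying Rizvi's results
```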
Rizvi’s play scripts are in an XML format that records not only the original spelling of each word but also the lemma to which it belongs. To get at these lemmas, Rizvi must have used some software that makes sense of the structure of the XML files. This software and its method he does not provide on his website, and just how they work is a non-trivial part of the puzzle. This aspect of Rizvi’s investigation may be perfectly satisfactory, but we cannot know, because he has not disclosed it.
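We do not know what Rizvi’s XML actually looks like or what software he ran over it; the point is only that getting from markup to a lemma stream involves parsing decisions. A sketch under an assumed TEI-like markup, in which each word element carries its original spelling as text and its lemma as an attribute (the element and attribute names here are my invention, not Rizvi’s):

```python
import xml.etree.ElementTree as ET

# A made-up fragment: <w> holds the original spelling; @lemma the headword.
sample = """<sp who="Hieronimo">
  <w lemma="o">O</w><w lemma="eye">eyes</w><w lemma="no">no</w><w lemma="eye">eyes</w>
</sp>"""

root = ET.fromstring(sample)
lemmas = [w.get("lemma") for w in root.iter("w")]
print(lemmas)  # the lemma stream that any matching software would compare
```

Every choice buried in such code (which elements count as words, what happens to stage directions, how split or ambiguous lemmas are handled) silently shapes the match counts downstream.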
A great deal rests on the detail of the means by which one extracts the verbal matches that are the evidential base for this kind of investigation. Rizvi thinks that it makes no significant difference that in his dataset the Shakespeare plays are represented by texts that were manually edited and modernized by Paul Werstine and Barbara Mowat while his non-Shakespearian texts were all machine-modernized. He may be right, but for his results to be accepted we’d need to know.
Looking at the matches listed in the big spreadsheets on his Collocations and N-grams website, it is clear that Rizvi includes as an n-gram a phrase begun by one speaker and completed by another (that is, he skips over the intervening speech prefix). Other investigators would not consider that an n-gram at all. Does it matter? We don’t know. Rizvi includes as a match an n-gram that is in dialogue in one play and in a stage direction in another. He also treats as matching a pair of n-grams containing the same run of words where one is broken by a full stop and the other is not, even though the grammatical functions of the words then differ. Thus “O, see what thou hast done! In a bad quarrel slain a virtuous son” (‘Titus Andronicus’) counts, in Rizvi’s files, as a match to “ever Usurer did in a bad cause” (‘The Bashful Lover’). Do any of these choices account for his much greater counts of n-gram matches than other investigators have found? We don’t know, but I do not assume that these details are trivial.
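A toy illustration, my own and not Rizvi’s code, of why the speech-prefix decision matters: whether or not one bridges a speaker boundary changes the inventory of n-grams a text yields, and hence the match counts one reports.

```python
def ngrams(words, n):
    """All contiguous n-word phrases in a word list, as a set of strings."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Two consecutive speech turns; in the play a speech prefix stands between them.
turn_a = "what thou hast done".split()
turn_b = "in a bad quarrel".split()

within_turns = ngrams(turn_a, 3) | ngrams(turn_b, 3)   # stop at the prefix
across_turns = ngrams(turn_a + turn_b, 3)              # skip over the prefix

extra = across_turns - within_turns
print(sorted(extra))  # trigrams that exist only if one bridges the speakers
```

Here the bridging policy manufactures two trigrams (“hast done in”, “done in a”) that a stricter policy would never count, and across two long plays such differences compound into very different totals.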
Taylor and the others cited by Vickers use the Literature Online (LION) database both as their source of texts and as their search software. This has the signal merit that almost everyone in academia has access to it and can reproduce the results that the investigators claim. That does not mean that LION is without problems. Because the database is proprietary, investigators cannot see the program code by which it searches for the strings they ask it to find, so they have to trust that the hits returned correctly represent what is present in the plays it is meant to be searching.
We are far from having the right tools even to begin asking the right questions about the authorship problems we’d like to solve. A fundamental prerequisite for progress is the Open Access distribution of the publications in which we present our investigations, the Open Data publication of the datasets on which they are based, and the Open Source publication of the software by which we automate the processing of these texts. Without these desiderata, we cannot even start to understand why we disagree about such inherently simple things as how many phrases are common to the works of two writers. When two groups of investigators approach these problems using different datasets, different methods, and different tools for executing those methods, and when the details of the datasets, methods, and tools are not available for all to see, it is unsound to conclude that one approach has trumped the other.