The Shakespeare Conference: SHK 12.0663  Tuesday, 20 March 2001

From:           Paul Maddox
Date:           Sunday, 18 Mar 2001 15:17:37 +0000 (GMT)
Subject:        Shakespearean Authorship Research

Dear All,

I'm a student at the University of Birmingham in the UK. For my
dissertation I have written a program that uses n-gram statistics as a
method of comparing documents and attributing authorship. I have used my
program on modern English (SEE) examples with some success, and I'm now
turning my attention to Shakespeare authorship.

I should first explain a little about how my program works, then I will
move on to what I'm doing and what I'm trying to achieve.

My program has three stages:

1) Tag document(s) with their syntactic tags.
   There are about 40 tags for different syntactic categories.
   Eg. "dog" is tagged as "NN", 'noun, singular or mass'.

2) Comparing sets of tagged documents.
   There are two columns, allowing many-to-many comparisons.
   Eg. [Every 'Shakespeare' document] compared to [Every Marlowe document]

3) Rank the comparison data in order of similarity.
   The comparison engine collapses data down into a single scalar value.
   Eg. "macbeth.txt" compared to "EdwardII.txt" = 0.0345

And that's about it. I hope I've not lost anyone. The basic outcome is
that my program can rank document pairs on relative similarity.
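
As a rough illustration of the three stages above, here is a minimal Python sketch. It is not the program itself: the toy lexicon, the file names and the placeholder score (a Jaccard distance on the sets of tags used) are all assumptions made for the example, and a real tagger would replace the lookup table.

```python
from itertools import product

# Toy lexicon standing in for a real part-of-speech tagger (hypothetical).
LEXICON = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
           "barks": "VBZ", "sleeps": "VBZ", "quickly": "RB"}

def tag(text):
    """Stage 1: map each word to a syntactic tag ("UNK" if not in the lexicon)."""
    return [LEXICON.get(word, "UNK") for word in text.lower().split()]

def jaccard_distance(tags_a, tags_b):
    """Placeholder score: 0.0 when both documents use the same set of tags."""
    a, b = set(tags_a), set(tags_b)
    return 1.0 - len(a & b) / len(a | b)

def rank(left, right, score=jaccard_distance):
    """Stages 2 and 3: score every (left, right) pair, most similar first."""
    tagged = {name: tag(text) for name, text in {**left, **right}.items()}
    pairs = [(score(tagged[a], tagged[b]), a, b) for a, b in product(left, right)]
    return sorted(pairs)

# Two "columns" of documents, compared many-to-many.
left = {"x.txt": "the dog barks"}
right = {"y.txt": "a cat sleeps", "z.txt": "the dog barks quickly"}
for value, a, b in rank(left, right):
    print(f"{value:.4f}\t{a} >< {b}")
```

Passing the score in as a function keeps the comparison and ranking stages independent of whichever statistic is being tried.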

So, what am I doing? As it stands I am working on experiments to
compare Shakespearean sonnets to those of Edward de Vere and Sir
Francis Bacon. I am using the first 154 Shakespearean sonnets, the first
19 of de Vere's and the only 4 (that I was able to find) of Bacon's. I'm
comparing the sets as follows:

[Shakespeare] compared to [deVere & Bacon]

This produces 3388 unique document pairs for comparison. Below are two
sets of results, using slightly different statistics. I have kept only
the ten most similar pairs from each. You can safely ignore the actual
numbers, as well as the ".ptf" filenames.
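
The number of pairs in a many-to-many comparison is just the product of the two set sizes. The sketch below uses hypothetical file names modelled on the ".ptf" names in the results. Note that 154 x 22 gives the 3388 quoted above, whereas 154 x 23 (19 de Vere plus 4 Bacon documents) would give 3542, so the exact make-up of the comparison set is worth checking against the full results.

```python
from itertools import product

def comparison_pairs(left, right):
    """Every (left, right) document pair; the count is len(left) * len(right)."""
    return list(product(left, right))

# Hypothetical file names, for illustration only.
sonnets = [f"Son{i}.ptf" for i in range(1, 155)]      # 154 sonnets
others_22 = [f"Other{i}.ptf" for i in range(1, 23)]   # 22 comparison documents
others_23 = [f"Other{i}.ptf" for i in range(1, 24)]   # 23 comparison documents

print(len(comparison_pairs(sonnets, others_22)))  # 154 * 22
print(len(comparison_pairs(sonnets, others_23)))  # 154 * 23
```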

I should note that all the files as well as the full results I mention
are available at:  http://paul.calcaria.net/ss/

 EX 1 [Statistics: syntactic tag count]

0.435933        EdV12.ptf >< Son144.ptf                 MOST SIMILAR
0.439965        EdV14.ptf >< Son92.ptf
0.463866        EdV19.ptf >< Son131.ptf
0.467079        EdV14.ptf >< Son150.ptf
0.504633        FB-HelpLord.ptf >< Son110.ptf
0.507671        EdV19.ptf >< Son134.ptf
0.522417        EdV19.ptf >< Son36.ptf
0.534822        EdV14.ptf >< Son39.ptf
0.535092        EdV4.ptf >< Son3.ptf
0.538892        FB-HelpLord.ptf >< Son52.ptf            TENTH SIMILAR

Remember, these are the top ten comparisons out of 3388. The scores
range from 0.436 to 2.021 (4sf) and show only relative similarity; a
lower score means a more similar pair.

According to these results Edward de Vere's 'Love and Antagonism' is
most similar on a linguistic level to Shakespeare's 144th sonnet.
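
For readers who want a concrete picture of a "syntactic tag count" statistic, here is one possible sketch. The exact measure used in EX 1 is not stated here, so the distance below (the sum of absolute differences between relative tag frequencies) is an assumption chosen for illustration, as are the sample tag sequences.

```python
from collections import Counter

def tag_frequencies(tags):
    """Relative frequency of each tag in a (non-empty) tag sequence."""
    counts = Counter(tags)
    total = len(tags)
    return {t: n / total for t, n in counts.items()}

def tag_count_distance(tags_a, tags_b):
    """Assumed measure: sum of absolute frequency differences over all tags.

    0.0 means identical tag proportions; larger means less similar.
    """
    fa, fb = tag_frequencies(tags_a), tag_frequencies(tags_b)
    return sum(abs(fa.get(t, 0.0) - fb.get(t, 0.0)) for t in fa.keys() | fb.keys())

# Hypothetical tag sequences for two short documents.
doc_a = ["DT", "NN", "VBZ", "DT", "NN"]
doc_b = ["DT", "NN", "VBZ", "IN", "NN"]
print(tag_count_distance(doc_a, doc_b))
```

Using relative frequencies rather than raw counts stops a longer poem from looking different simply because it has more words.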

Some thought has to be given to the likelihood of similarity arising
by chance. We are comparing 23 documents (de Vere and Bacon) to 154
documents from 'Shakespeare', hence it is quite likely we will find
similarity just by coincidence. Having said that, I have used my
method with much larger sets of documents, and still achieved reasonable
results.

 EX 2 [Statistics: Probability of jumping from one tag to an adjacent tag]

This experiment uses the exact same set of documents as the previous.

11.622716       EdV18.ptf >< Son11.ptf                  MOST SIMILAR
12.921719       EdV2.ptf >< Son123.ptf
13.185570       EdV18.ptf >< Son103.ptf
13.223267       EdV10.ptf >< Son12.ptf
13.266518       EdV4.ptf >< Son128.ptf
13.325471       EdV6.ptf >< Son86.ptf
13.335306       EdV2.ptf >< Son6.ptf
13.869774       EdV2.ptf >< Son124.ptf
14.077771       EdV14.ptf >< Son11.ptf
14.202584       EdV2.ptf >< Son140.ptf                  TENTH SIMILAR

Again, these are the top 10 comparisons out of 3388. The scores range
between 11.62 and 219.2 (4sf) and again show relative similarity.
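
The statistic in EX 2 is based on the probability of moving from one tag to the adjacent one. The sketch below shows one way such transition probabilities could be estimated from a tag sequence and compared. The distance function is an assumption made for illustration (the measure that actually produced the scores above is not described here), and the sample sequences are made up.

```python
from collections import Counter, defaultdict

def transition_probabilities(tags):
    """Estimate P(next tag | current tag) from adjacent tag pairs."""
    pair_counts = Counter(zip(tags, tags[1:]))
    from_counts = Counter(tags[:-1])  # how often each tag has a successor
    probs = defaultdict(dict)
    for (cur, nxt), n in pair_counts.items():
        probs[cur][nxt] = n / from_counts[cur]
    return dict(probs)

def transition_distance(tags_a, tags_b):
    """Assumed measure: summed absolute difference over all tag transitions."""
    pa, pb = transition_probabilities(tags_a), transition_probabilities(tags_b)
    total = 0.0
    for cur in pa.keys() | pb.keys():
        row_a, row_b = pa.get(cur, {}), pb.get(cur, {})
        for nxt in row_a.keys() | row_b.keys():
            total += abs(row_a.get(nxt, 0.0) - row_b.get(nxt, 0.0))
    return total

# Hypothetical tag sequence: "DT" is followed once by "NN" and once by "JJ".
doc = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
print(transition_probabilities(doc)["DT"])
print(transition_distance(doc, ["DT", "NN", "VBZ"]))
```

Because a measure like this sums over every observed transition, its scores can be much larger than those of a plain tag-count measure.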

As we can see, in this experiment the top ten (in fact, the top
thirty-two) document pairs all involve de Vere and not Bacon. This could
be a slightly unfair test, as there are nineteen de Vere documents and
only four Bacon documents. This brings us back to the earlier point
about coincidental similarity.
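
One rough way to put a number on the "unfair test" worry is to ask how likely it would be for the top thirty-two pairs all to involve de Vere if the ranking were purely random. The sketch below uses a hypergeometric calculation with the document counts given in the text (154, 19 and 4). It assumes every pair is equally likely to land in the top thirty-two, which is not really true since each poem appears in 154 pairs, so treat the result as a crude baseline only.

```python
from math import comb

def prob_top_k_all_from_group(group_pairs, total_pairs, k):
    """P(all top-k pairs come from one group) if the ranking were random.

    Hypergeometric: choose k pairs without replacement from total_pairs,
    of which group_pairs belong to the group of interest.
    """
    return comb(group_pairs, k) / comb(total_pairs, k)

sonnets = 154
de_vere_docs, bacon_docs = 19, 4            # document counts from the text
group = sonnets * de_vere_docs              # pairs involving a de Vere poem
total = sonnets * (de_vere_docs + bacon_docs)
print(prob_top_k_all_from_group(group, total, 32))
```

A small value here would suggest the imbalance in set sizes alone does not explain the result, though it says nothing about whether the similarity reflects authorship.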

I am particularly interested in people's agreement/disagreement with my
results, as well as their own opinions on Shakespearean authorship. I
would also like to get hold of as much material from suspected authors
to compare to the Shakespearean documents as possible (both poetry and
prose), so any links out there would be appreciated.

All the best,
