The Shakespeare Conference: SHK 26.373 Monday, 24 August 2015
Date: August 22, 2015 at 10:53:44 AM EDT
Subject: Re: MV Dialog
Spike Milligan started his book ‘Hitler: My Part in His Downfall’ by writing ‘After Puckoon I swore I would never write another novel. This is it...’ In the same spirit, I hadn’t intended to argue any more with Jim Carroll but, with Hardy’s indulgence, I’ll make one further long post in the hope of bringing some closure to this argument.
I will write just one paragraph (this one) about the randomness point. Typically, researchers say “If the data were random, we would expect it to look like X. But our data looks like Y. There is a statistically significant difference between X and Y. Therefore something interesting is going on.” If someone points out that the difference between X and Y is not statistically significant, they are not saying that the data is random. They are saying that even if it had been random it might still have looked quite like Y. No one is saying that Shakespeare wrote words at random.
But let’s forget that now and accept for a moment that the approach you’ve used is valid. You wrote earlier in this discussion (joining two of your posts):
Assuming the number of lines in the WordCruncher version of the folio is the same as reported by Pervez (109,220), the frequency of “man” per line is 0.01654. The expected frequency of “man” twice in a line is therefore 0.01654 x 0.01654 = 0.0002736, which should result in 0.0002736 x 109220 = 29.9 occurrences of “man” twice in a line. There are in fact 38. Now, I could say that I simulated with a computer program the placing of “man” in 109,220 lines of text at the given frequencies, repeated it 1,000 times, found the average (29.3) and the standard deviation (5.2)..... Shakespeare put “man” in the same line far more than it occurs when adding the word randomly at the same frequency to a text that size. Therefore, the clustering of “man” in the same line is not “random”.
That last sentence may be inadvertently misleading in its wording since it appears to be based on the misapprehension that random data must be evenly distributed. I suppose you meant to write ‘Therefore, the clustering of “man” in the same line is statistically significant.’ If so, then notice that the difference between your expected value (29.9) and the actual value (38) is 8.1, which is 1.56 standard deviations, since you found the standard deviation to be 5.2. Few people would consider that statistically significant. Usually, to be safe, you’d want your data to be at least 2 standard deviations away from the average to call it significant; to be really safe, 3 standard deviations.
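A simulation on that model is easy to reproduce. The following is a minimal sketch (assuming numpy; the variable names are mine, not taken from the original post), treating each of the 109,220 lines as containing a double “man” independently with probability p-squared, repeated 1,000 times:

```python
# Sketch of the kind of simulation described above (not the original program):
# each line independently contains "man" twice with probability p**2.
import numpy as np

N = 109_220          # lines in the Folio (Hinman's count)
p = 0.01654          # per-line frequency of "man"
p2 = p * p           # chance of "man" twice in a line, on this model

rng = np.random.default_rng(0)
doubles = rng.binomial(N, p2, size=1_000)   # double-"man" lines per simulated Folio

mean = doubles.mean()
sd = doubles.std(ddof=1)
z = (38 - mean) / sd   # how many standard deviations the real count lies above
```

The mean comes out near 29.9 and the standard deviation near 5.5, so the observed count of 38 sits only about 1.5 standard deviations above the average, in line with the figures discussed above.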
Incidentally, you don’t need to do a computer simulation for this. If you know the sample size N (here, N=109220) and the probability p (here, p=0.0002736) then the standard deviation is the square root of Np(1-p). This formula gives the standard deviation as 5.47 which is close to what your simulation gave you.
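In code, the closed-form calculation is a couple of lines (same figures as above):

```python
import math

N = 109_220                        # lines in the Folio
p = 0.0002736                      # probability of "man" twice in a line
sd = math.sqrt(N * p * (1 - p))    # binomial standard deviation, approx. 5.47
z = (38 - N * p) / sd              # observed count is about 1.5 sd above expected
```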
As I wrote before, I am not denying that you might possibly be on to something. No one denies the existence of image clusters in Shakespeare, such as the ones Caroline Spurgeon found. No one denies the existence of idea clusters, such as the association of illegitimate birth with the counterfeiting of coins. But I said that you would need to look at other words too, because I suspected that there would be lots of the kind of associations that you think you are seeing with ‘man’. I just did a search of the Folio text [see note 1 below], looking at words that occur between 1500 and 3000 times, and for which the actual number of lines with two or more occurrences is more than 3 standard deviations away from the expected (using the same method to work out the probability that you used). I found 15 such words. For example, ‘come’, ‘their’ and ‘loue’ (i.e. love) which occur twice on a line far more often than we’d expect by using your technique. Or, if I look at all words that occur more than 1500 times and require only 2 standard deviations, the number of hits rises to 56. For how many words are you going to claim that Shakespeare had some ‘association’ in his mind? And what about the negative evidence? Words like ‘enter’, ‘which’ and ‘vpon’ (i.e. upon) occur twice on a line far less often than we’d expect from using your technique. Are you going to claim some negative association which caused Shakespeare to recoil from using them twice in a line?
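A search of this kind can be sketched as follows. This is illustrative Python, not the program actually used: the function name `flag_words`, the naive whitespace tokenisation, and the thresholds are all my assumptions.

```python
import math
from collections import Counter

def flag_words(lines, lo=1500, hi=3000, threshold=3.0):
    """Flag words whose count of two-or-more-per-line lines deviates from
    the squared-frequency expectation by more than `threshold` standard
    deviations. (A sketch of the search described above, not the original.)"""
    N = len(lines)
    tokenised = [line.lower().split() for line in lines]
    totals = Counter(w for toks in tokenised for w in toks)
    flagged = {}
    for word, total in totals.items():
        if not (lo <= total <= hi):
            continue
        p = (total / N) ** 2                 # squared per-line frequency
        expected = N * p
        sd = math.sqrt(N * p * (1 - p))
        actual = sum(1 for toks in tokenised if toks.count(word) >= 2)
        z = (actual - expected) / sd
        if abs(z) > threshold:               # deviation in either direction
            flagged[word] = (actual, expected, z)
    return flagged
```

Note that the absolute value catches both kinds of deviation, so the same pass turns up the words that double up too often (‘come’, ‘their’, ‘loue’) and the ones that double up too rarely (‘enter’, ‘which’, ‘vpon’).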
The reality is that you can always find stuff like this if you know how to play around with the data. It usually doesn’t mean anything. Even if the data you had shown us were statistically significant, it would not be enough. The statistical formulae are blind to context; e.g. a formula doesn’t know that the word ‘bond’ occurs disproportionately often in the trial scene in MV because it is what the trial is about. So even when you find statistically significant data (which is usually easy to do) you still need to provide old-fashioned literary or bibliographical arguments to support your claim.
Note 1. Various caveats apply. Since we are interested in words, not spellings, we should use a modern-spelling text, not the Folio. I happen to have the Folio text already in a database on my laptop so I used it. Asking if two words are on the same line is only a rough way of asking if they are close together. To do it properly, we’d need to ask if they are within X words of each other, regardless of which line they are on. The number of lines varies from edition to edition, because of prose passages. I used N=109220 because that is the official number of lines in the Folio, from Hinman’s Norton facsimile (he ignored play titles and lists of dramatis personae). If you are using the Riverside text, which comes with WordCruncher, then the number of lines will be different, not least because of Pericles and TNK. And of course not everything in ‘Shakespeare’ is by Shakespeare.
Note 2. Squaring the probability of one event to derive the probability of its happening twice, without considering other factors, is statistically naive. In the UK we had the tragic case of Sally Clark a few years ago. Two of her babies had died of so-called ‘cot death’. She was tried and convicted of their murder. An eminent paediatrician called Professor Sir Roy Meadow testified at her trial that the probability of two cot deaths in the same family was 1 in 73 million. The probability he cited had been worked out by squaring the probability of one cot death, rather as you have done with ‘man’. As some people recognised immediately, it was a gross misuse of statistics and possibly helped to send an innocent woman to prison. (She was released a few years later on another ground of appeal and died young of alcoholism.)