**The Shakespeare Conference: SHK 23.052 Monday, 6 February 2012**

[1] From: John Alvord <This email address is being protected from spambots. You need JavaScript enabled to view it.>

Date: February 4, 2012 10:51:20 PM EST

Subject: Re: SHAKSPER: O Rare

[2] From: Pervez Rizvi <This email address is being protected from spambots. You need JavaScript enabled to view it.>

Date: February 6, 2012 8:08:15 AM EST

Subject: Re: SHAKSPER: O Rare

[1]-----------------------------------------------------------------

From: John Alvord <This email address is being protected from spambots. You need JavaScript enabled to view it.>

Date: February 4, 2012 10:51:20 PM EST

Subject: Re: SHAKSPER: O Rare

Gabriel Egan < This email address is being protected from spambots. You need JavaScript enabled to view it.> wrote,

>About 10 years ago SHAKSPERian Steve Roth did some refinements

>to a project I started called SHAXICAN. (The name was a gibe at

>Donald Foster’s supposed SHAXICON database, which was the

>subject of several articles but never appeared.) The idea was to

>count rare words in Shakespeare by play and by actor’s part,

>looking for correlations. Specifically, we wanted to test the

>hypothesis that the rare words in a particular part acted by

>Shakespeare himself would appear disproportionately often in

>the next play he wrote, since those rare words he’d recently

>spoken on stage would be at the forefront of his mind. That

>was Foster’s claim but SHAXICAN was unable to verify it.

>

>The files from SHAXICAN are still available at

>

> http://gabrielegan.com/shaxican

>

>and the one you want is “correlations.txt” in the “Roth’s refinements”

>section.

>

>Save it to your own computer, then open it in a spreadsheet program

> such as Microsoft Excel. (Excel will take you through a ‘Text import

>wizard’ for handling ‘Delimited data’ files: just accept all the defaults.)

[ . . . ]

I am a computer programmer with a deep interest in Shakespeare. Excel doesn’t work on the correlations.txt file because there are 167,961 lines and Excel can only handle about 65,510.

I summarized the correlations file with a simple Perl program. Based on the results, I am puzzled.

There were 9639 different words in the lines. The most common was “gravity” with a count of 178.

There were 2861 words with a count of 2.

There were no words with a count of 1.

But here http://kottke.org/10/04/how-many-words-did-shakespeare-know

the Shakespeare word count is listed as 31,534 and a single word use count of 14,376.

Maybe the results were limited to words that occurred in 2 or more plays, which is what the file title suggests.

If anyone wants the summary file I will be happy to send a copy.

[2]-------------------------------------------------------------

From: Pervez Rizvi <This email address is being protected from spambots. You need JavaScript enabled to view it.>

Date: February 6, 2012 8:08:15 AM EST

Subject: Re: SHAKSPER: O Rare

Gabriel Egan wrote:

>If you want to find words that appear fewer times than 12 in the canon, look for

>words that appear fewer times overall in the table. You can do this by eye (as

>you would a printed concordance) or better still someone good at Excel might

>write you a formula that finds words appearing only once (or any arbitrary

>number of times) in the table.

>

>If there’s a SHAKSPERian who can do that, I’d be interested to share the

>formula.

When I looked at this on Friday, I did not think it was possible in Excel, but a lightbulb just came on in my head. Here's how you do it . . .

Load up the data into Excel according to Gabriel’s instructions. You should have exactly 167,961 rows in your spreadsheet, the first row being the headers. You should have Word in column A, Play in column B, Part in column C, Play Count in column D and Part Count in column E.

To understand the formula I give below, let's first look at an example. Filter the spreadsheet so that it only shows you rows where Word = 'forester' and Play = 'AYL'. You should get the following:

forester,AYL,MND_OBERON,2,1

forester,AYL,LLL_PRINCESS,2,1

forester,AYL,MND_THESEUS,2,2

Add up the numbers in the Part Count column, i.e. 1 + 1 + 2 = 4. This tells you how many times the word occurs in plays other than AYL (it is spoken once by Oberon, once by the Princess of France and twice by Theseus in MND). Then add the number which is common to all the rows in the Play Count column, i.e. 2, being the number of times it is spoken by someone in AYL. This gives a total of 6, which is the number of times this word occurs in all the plays (according to the text Egan and Roth used).

Now for the formula. Remove the filter above if you applied it. Paste the following formula exactly as given below, into cell F2:

=D2+SUMIFS($E$2:$E$167961,$A$2:$A$167961,A2,$B$2:$B$167961,B2)

Cell F2 should now show you the number of times the word in that row occurs in all the plays. All you have to do is copy the formula into the rest of column F, which you should do very carefully as follows.

1. Close every other window on your PC apart from the spreadsheet.

2. Click in cell F2.

3. Press ctrl-C to copy it.

4. Do not click anywhere else but use the scrollbar to scroll right down to the last cell in column F, i.e. F167961.

5. Hold the shift key down and click in this cell.

6. Press ctrl-V.

7. Then go and have a cup of tea. When you come back, Excel will have calculated all the values in column F.

8. Now select the whole of column F and copy it using ctrl-C.

9. Paste it into column G but do NOT use ctrl-V to do it. Instead, use the Paste Special option and paste only the values (not the formulas!).

10. You now have the values you want in the G colummn. Delete the F column. It is essential to do this, otherwise every time you do something to the spreadsheet, Excel will recalculate everything and take forever.

11. Now you can just use the normal Excel filter function on column F to select the words that occur 2, 3, 4, or whatever number of times you want.

The SUMIFS formula works in Excel 2007 onwards. If you are using earlier versions of Excel, you will have to make do with the less powerful SUMIF formula. It takes a bit more work but is certainly possible.