The Shakespeare Conference: SHK 12.0134  Sunday, 21 January 2001

From:           David Kathman
Date:           Saturday, 20 Jan 2001 02:45:09 -0600
Subject: 12.0121 Re: Shaxicon
Comment:        Re: SHK 12.0121 Re: Shaxicon

Gabriel Egan wrote:

>David Kathman wrote
>>Don was and is sincere in his desire to make
>>SHAXICON widely available to all on the web.
>>Unfortunately, he has been prevented from doing
>>so by those twin bugbears of academics everywhere:
>>lack of time and lack of money. Before SHAXICON
>>can be put on the web, it will have to be transferred
>>to a database format, and this would require a lot of
>>money and/or time, both of which have been in short
>>supply lately.  SHAXICON will eventually be made
>>public (and believe me, nobody in the world wants
>>that more than Don Foster), but it's hard to
>>say right now when that will happen.
>
>Let me make sure I understand SHAXICON right:
>1) You make a list of all the characters in all the plays (using unique
>identifiers for the multiple Claudios, Antonios, etc).
>2) You make a list of all the rare words in all the plays (rare in the
>sense that they are words Shakespeare rarely uses).

Specifically, they are words that appear 12 times or fewer in the
canonical Shakespeare plays.  I don't remember off the top of my head
exactly what the boundaries of the "canonical plays" are for this
purpose (e.g. with regard to Pericles, Henry VIII, Two Noble Kinsmen,
etc.), but I do know that appearances in the nondramatic poems do not
count toward the 12-occurrence cutoff.  That is, if a word occurs 13 times in
the canonical plays, it is not indexed in Shaxicon, whereas if it
appears 12 times in the canonical plays and once or more in the
nondramatic poems, it is included.
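
To make the cutoff concrete, here is a minimal Python sketch of the rule as I have just described it.  The word labels and counts are invented for illustration and are not taken from Shaxicon, and the requirement that a word occur at least once in the plays is my own assumption.

```python
# Toy illustration of the 12-occurrence cutoff; every count below is invented.
PLAY_CUTOFF = 12

def is_indexed(play_count, poem_count=0):
    """Return True if a word would fall under the rare-word cutoff.

    Only occurrences in the canonical plays are counted. poem_count is
    accepted purely to show that it is ignored. The lower bound of one
    occurrence in the plays is an assumption, not a documented rule.
    """
    return 0 < play_count <= PLAY_CUTOFF

# word label -> (occurrences in canonical plays, occurrences in nondramatic poems)
toy_counts = {
    "word_a": (13, 0),  # 13 in the plays: over the cutoff, not indexed
    "word_b": (12, 3),  # 12 in the plays, poems ignored: indexed
    "word_c": (1, 0),   # a single occurrence: indexed
}

indexed = sorted(
    word for word, (plays, poems) in toy_counts.items()
    if is_indexed(plays, poems)
)
print(indexed)  # ['word_b', 'word_c']
```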

I should also note that homographs (e.g. "bear" = to carry vs. "bear" =
ursine mammal) count as separate "words" for the purposes of Shaxicon,
as do different parts of speech based on the same semantic root (e.g.
"reading" as a present participle verb -- "I am reading a book" -- vs.
"reading" as an adjective -- "the reading public" -- or a noun -- "This
week's reading for my Shakespeare class is very difficult").  However,
inflected forms belonging to the same part of speech count as the same
word -- e.g. "I walk", "he walks", "he is walking", "he walked".
Basically, a "word" is anything that would require a separate dictionary
entry.  Letters of the alphabet, proper names, and unassimilated foreign
words (e.g. from the French scenes in Henry V) are not indexed.
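
The difference between counting strings and counting words can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the tokens are hand-tagged toy data, the (headword, part of speech) key is a hypothetical stand-in for a dictionary entry, and filing the participle "reading" under "read" is my reading of the inflected-forms rule.

```python
from collections import Counter

# Hypothetical hand-tagged tokens: (string as printed, headword, part of speech).
# The tagging itself is the manual step; nothing here performs it automatically.
tagged_tokens = [
    ("bear",    "bear",    "v."),    # to carry
    ("bears",   "bear",    "v."),    # inflected form of the same verb
    ("bear",    "bear",    "n."),    # the animal
    ("reading", "read",    "v."),    # present participle of "to read"
    ("reading", "reading", "adj."),  # "the reading public"
    ("reading", "reading", "n."),    # "this week's reading"
    ("walked",  "walk",    "v."),
    ("walks",   "walk",    "v."),
]

# Counting by string lumps together everything that is spelled alike.
string_counts = Counter(string for string, _, _ in tagged_tokens)

# Counting by (headword, part of speech) approximates one count per dictionary entry.
word_counts = Counter((head, pos) for _, head, pos in tagged_tokens)

print(string_counts["reading"])       # 3: one string
print(word_counts[("bear", "v.")])    # 2: "bear" and "bears" are the same word
print(word_counts[("bear", "n.")])    # 1: a separate word with the same spelling
```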

>3) You count how many times each of the characters in (1) uses each of
>the rare words in (2).
>4) You take a sample text (one of the plays) and make a list of all its
>words and how frequently they appear.

Well, the way Shaxicon is set up, a computer can do all this
automatically, at least for the words indexed therein.  The core of
Shaxicon consists of a database of 48,000 specific instances (or lemmas)
of the 18,135 different "Shakespearean rare words" (as defined above).
Each lemma is annotated with the play it appears in and the character
who speaks it.  These 48,000 lemmas can be sorted in various ways:  by
word (so that one can see, for example, all the instances of "to
abase"(v.) used by Shakespeare), by character (so that one can see, for
example, all the "rare words" used by Ophelia in Hamlet, and where else
these words occur in Shakespeare's works), and so on.  Armed with a
provisional dating of the plays, one can also call up a list of (for
example) all instances of "to abandon"(v.) used by Shakespeare after
1600, and so on.
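
For readers who think in code, the kind of lookup I am describing might be sketched in Python as follows.  This is not Shaxicon's actual format (which is WordCruncher, not Python); the `Instance` record, the function names, the placeholder play and character labels, and the dates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    """One annotated occurrence of a rare word (hypothetical layout)."""
    word: str       # dictionary-style entry, e.g. "to abandon (v.)"
    play: str
    character: str
    year: int       # provisional date of the play, supplied by the user

# Invented records with placeholder plays, characters, and dates.
records = [
    Instance("to abandon (v.)", "Play A", "Character X", 1595),
    Instance("to abandon (v.)", "Play B", "Character Y", 1603),
    Instance("to abase (v.)",   "Play B", "Character Y", 1603),
    Instance("to abase (v.)",   "Play C", "Character Z", 1598),
]

def by_word(word):
    """All instances of one word, wherever they occur."""
    return [r for r in records if r.word == word]

def by_character(play, character):
    """All rare-word instances spoken by one character in one play."""
    return [r for r in records if r.play == play and r.character == character]

def by_word_after(word, year):
    """Instances of a word in plays provisionally dated after the given year."""
    return [r for r in by_word(word) if r.year > year]

print(len(by_word("to abandon (v.)")))               # 2
print([r.play for r in by_word_after("to abandon (v.)", 1600)])  # ['Play B']
```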

Beyond this core, Shaxicon also indexes 4,700 occurrences of the 18,135
"Shakespeare rare words" in Shakespeare's nondramatic poems, as well as
many thousands of further occurrences of these words in hundreds of
non-Shakespearean early modern texts, which will eventually include
about 2,000 STC texts.  This allows one to take a specific "rare word"
used by Shakespeare and find out where else it occurs in early modern
writings, which in turn allows some interesting explorations of
intertextuality.  For example, one can make a list of all the words
which appear in *Romeo and Juliet* and nowhere else in Shakespeare's
works.  If one then takes each of these words and makes a list of other
early modern works in which it appears, Arthur Brooke's poem *Romeus and
Juliet* (generally acknowledged as Shakespeare's primary source for the
play) will appear disproportionately often in these lists.  If one
repeats the exercise with words which appear in *Romeo and Juliet* and
one other time in Shakespeare, Brooke's poem will again be at or near
the top of the list.  All kinds of permutations are possible.
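
The *Romeo and Juliet* exercise can be sketched roughly as below.  The word labels, "Play B", and the two other texts are invented placeholders, and the raw tally is a simplification of my own: a real comparison would presumably have to allow for the differing lengths of the texts before calling anything "disproportionate".

```python
from collections import Counter

# Toy index: word -> set of Shakespeare plays in which it occurs (invented).
shakespeare = {
    "w1": {"Romeo and Juliet"},
    "w2": {"Romeo and Juliet"},
    "w3": {"Romeo and Juliet", "Play B"},
    "w4": {"Play B"},
}

# Toy index: word -> set of non-Shakespearean texts in which it occurs (invented).
other_texts = {
    "w1": {"Text S", "Text T"},
    "w2": {"Text S"},
    "w3": {"Text T"},
    "w4": {"Text S", "Text T"},
}

def unique_to(play):
    """Words that occur in the given play and in no other Shakespeare play."""
    return {word for word, plays in shakespeare.items() if plays == {play}}

def tally_other_texts(words):
    """Count, for each outside text, how many of the given words it contains."""
    tally = Counter()
    for word in words:
        tally.update(other_texts.get(word, ()))
    return tally

ranking = tally_other_texts(unique_to("Romeo and Juliet")).most_common()
print(ranking)  # [('Text S', 2), ('Text T', 1)]
```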

>5) You check the list in (3) with each of the lists in (2) to see if
>there's a character whose rare words turn up much more often in the
>sample play than they do in the Shakespeare canon as a whole.
>6) If (4) yields a good match, that character was played by Shakespeare
>shortly before he wrote that play. (Hence those rare words were
>over-represented in the sample play: they were in Shakespeare's head
>from his having recently memorized them for his part.)
>Have I got it?

Well, not exactly.  You're talking about the "Shakespeare roles", which
have gotten the most publicity in discussions of Shaxicon, but which Don
Foster considers secondary in importance to the potential textual and
intertextual applications of the database.

Your description in (5) and (6) above is a decent approximation, but not
quite accurate, so let me try to explain it.  Using Shaxicon, it is easy
to take each character in a given play (call it P) and compile a list of
all the occurrences of "Shakespeare rare words" used by that character.
Using this list, it is then easy to compile a list of all occurrences of
that character's rare words elsewhere in Shakespeare.  For most
characters, these occurrences will be distributed roughly equally among
plays written before and after play P, with a greater concentration in
plays which are chronologically close to P (since specific words went in
and out of Shakespeare's usage at different times in his career).
However, for each play there is one character (or two or more smaller
characters) where these occurrences are disproportionately clustered in
plays written after play P -- particularly in plays written immediately
after P, but often persisting for quite a few years.  For whatever
reasons, Shakespeare in his subsequent writings "remembered" the words
spoken by these "Shakespeare roles" much more strongly than the words
spoken by the other characters in the same play.  Foster has
hypothesized that this pattern resulted from Shakespeare memorizing the
"Shakespeare roles" for performance, but he has always said that he's
not wedded to this hypothesis, and is glad to entertain other
suggestions.  However, it's interesting that the roles singled out by
these patterns have a certain consistency across the canon:  many of the
roles tend to be old men or chorus figures, and in virtually every case
the "Shakespeare role" is among the first group of characters on stage
at the beginning of the play, and is the first or second to speak.
Also, when two or more smaller roles are singled out, the characters
never appear on stage together and thus can be doubled.
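
As a rough illustration of the before-and-after comparison, here is a Python sketch.  It is emphatically not Foster's procedure, whose statistical details I have not described here; it computes only the raw share of recurrences that fall in later plays, and the chronology, characters, and counts are invented.

```python
# Toy chronology: plays listed in assumed order of composition.
order = ["Play 1", "Play 2", "Play P", "Play 4", "Play 5"]
position = {play: i for i, play in enumerate(order)}

# Character in Play P -> the play of each recurrence of that character's
# rare words elsewhere in the canon (one list entry per occurrence; invented).
recurrences = {
    "Character X": ["Play 1", "Play 2", "Play 4", "Play 5"],
    "Character Y": ["Play 2", "Play 4", "Play 4", "Play 5", "Play 5", "Play 5"],
}

def later_share(plays, reference="Play P"):
    """Fraction of recurrences that fall in plays written after the reference play."""
    ref = position[reference]
    later = sum(1 for play in plays if position[play] > ref)
    earlier = sum(1 for play in plays if position[play] < ref)
    total = later + earlier
    return later / total if total else 0.0

shares = {char: later_share(plays) for char, plays in recurrences.items()}

# The character whose words cluster most heavily in later plays.
candidate = max(shares, key=shares.get)
print(shares["Character X"])  # 0.5: evenly split before and after Play P
print(candidate)              # Character Y
```

A share near 0.5 corresponds to the ordinary case of a roughly equal distribution; a share well above it is the kind of skew that singles out a candidate role in this toy version.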

>If so, what's the big deal? The following Perl script
>will, for a given etext of a play, throw out an alphabetized word list,
>with the number of occurrences of each word.
>while (<>) {
>  s/-\n//g;
>  tr/A-Z/a-z/;
>  @words = split(/\W*\s+\W*/, $_);
>  foreach $word (@words) {
>    $wordcount{$word}++;
>  }
>}
>foreach $word (sort keys(%wordcount)) {
>  printf "%20s %d\n", $word, $wordcount{$word};
>}
>This script is on page 39 of Larry Wall and Randal L Schwartz
>_Programming Perl_ (Sebastopol CA: O'Reilly, 1991) and it's meant to
>introduce learners to the basics. I've run a Hamlet etext through this
>script and put the results up on the web at
>www.totus.org/scratch/hamfreq.txt (you have to scroll past all the
>act.scene.linenumbers which have risen to the top because the list is

What you've done is count and sort all the character-strings in
*Hamlet*; however, as I've noted above, Shaxicon indexes words (in the
sense of dictionary entries) rather than strings.  Two occurrences of
the same string may represent different words, while several different
strings may all represent variants of the same word.  Sorting the text
into words (as opposed to strings) cannot be done by computer alone,
though a computer can give you a start.

>Surely there's more to SHAXICON since I seem to have done step (4),
>admittedly the easiest step, in under 10 minutes (including publishing
>the results). I'm ready for my close up, Mr Gleason.

Yes, there is more, as I've tried to explain (clearly, I hope).  There's
quite a bit more just to the core of Shaxicon, without even getting into
the hundreds of non-Shakespearean texts that are indexed.

Gabriel Egan also wrote:

>Those who spend a long time on trains -- which in the UK doesn't exclude
>those taking short journeys -- can find themselves wanting to 'work' but
>not on their everyday matters.
>For me tinkering with the programming language Perl fills this need, and
>since Donald Foster's SHAXICON database is, as David Kathman informs us,
>unlikely to be published soon, I've started to dabble in the same area.
>So far I've written scripts which do the following:
>* gather the names of all the characters in all the plays and assign
>each a unique identifier;
>* list alphabetically all the 'rare' words in the Shakespeare canon
>(i.e. words Shakespeare used 12 times or fewer).

I assume that your script lists all the *strings* which appear in
Shakespeare 12 times or fewer, which is similar to what Shaxicon
indexes, but not at all the same.

>The scripts and the resulting listings are available on the web at
>If I've understood SHAXICON correctly, the above represent first-draft
>completion of stages (1) and (2) of the 6 stages which are SHAXICON's
>main operation. A script for stage (4) was included in my last
>SHAKSPER posting which described the 6 stages. SHAKSPERians, especially
>those fluent in Perl, are invited to inspect, criticize, and improve
>upon SHAXICAN in order that this area of study may be progressed.
>Of course, if I've entirely misunderstood SHAXICON, all the above is
>monumental hubris...

The biggest misunderstanding is that you're dealing with strings rather
than words.  According to some old documentation I have, the following
are the first four words in Shaxicon, alphabetically, with the number of
occurrences of each word in the canonical plays:

to abandon (v.) (10)
abandoned (adj.) (1)
abandoner (n.) (1)
to abase (v.) (2)

In your string-based list, the following are the first entries
alphabetically (after those with the prefix a-):

abandon 3
abandoned 5
abase 2

I'm guessing that "abandoner" does not appear in your list because its
only appearance is in The Two Noble Kinsmen, which may not have been
included in your sample.  The numbers here are different enough that I
assume there are considerable differences in the rest of the list, both
in the number of occurrences of a given string/word and in the
strings/words which make the 12-token cutoff.  I'm not sure exactly what
would happen if you made a full version of Shaxicon based entirely on
strings rather than words.  But the word-based approach seems much more
sensible to me, since words (in the sense of dictionary entries), and
not letter-strings, are the raw materials used by writers.  As I said
above, the core of Don Foster's (word-based) Shaxicon has been complete
for years, and he's getting close to finishing the indexing of 2,000 STC
texts.  The problem is that it's in WordCruncher format (which seemed
the most sensible option when he began the project more than a decade
ago), and in order to be put on the web it would have to be converted to
a database format.  This can be done, but it will take time and money
that are unfortunately lacking right now.  However, I do know that Don
Foster is willing to make Shaxicon in its present form available to
seriously interested parties, and that he is extremely open to
suggestions for getting it into a form that can easily be searched on
the web.  If anybody has such suggestions, I'd be glad to pass them
along to Don.

Dave Kathman