About a year ago I mentioned on this list that I'd started work on a
public-domain version of Donald Foster's SHAXICON database of
Shakespearian rare-word usage. Foster, long-time readers will remember,
created a lot of media interest in 1995-6 by convincing journalists he
had 'proved' that Funeral Elegy was a hitherto-neglected Shakespeare
poem. Foster promised that the SHAXICON database evidence on which this
ascription was made would become available for others to check his

Foster's SHAXICON database has still not been made available, so the media
interest (and a not insignificant collection of peer-reviewed journal
publications) effectively rests on an unsubstantiated claim about what
happens inside Foster's computer.

They do things differently in the sciences: an unexpected claim isn't
believed until someone is able to replicate the experimental results. To
this end, I created a website (www.totus.org/SHAXICAN) where I put some
Shakespeare e-texts and Perl scripts which did the initial work of
dividing the canon into actors' parts (documents containing all the
lines spoken by one character) and counting how frequently each word
occurred in order to identify what Foster calls the 'rare' words: those
Shakespeare used 12 times or fewer in his dramatic career.
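
The two steps just described (dividing a text into parts and counting strings to find the 'rare' ones) can be sketched roughly as follows. This is only an illustration in Python, not the Perl scripts on the site. The 'SPEAKER: line' input format, the function names, and the toy sample text are all assumptions made for the example, and the threshold of 12 is taken from the description above.

```python
import re
from collections import Counter, defaultdict

RARE_MAX = 12  # threshold described above: used 12 times or fewer


def split_into_parts(play_text):
    """Group speech lines by speaker.

    Assumes a hypothetical 'SPEAKER: speech' line format; real e-texts
    will need their own parsing rules.
    """
    parts = defaultdict(list)
    for line in play_text.splitlines():
        speaker, sep, speech = line.partition(":")
        if sep and speaker.strip():
            parts[speaker.strip().upper()].append(speech.strip())
    return dict(parts)


def tokenize(text):
    """Lower-case the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def rare_strings(parts, rare_max=RARE_MAX):
    """Count every string across all parts; keep those at or below rare_max."""
    totals = Counter()
    for lines in parts.values():
        for line in lines:
            totals.update(tokenize(line))
    return {word: n for word, n in totals.items() if n <= rare_max}


# Toy input, invented for illustration only (not from any play).
SAMPLE = "FIRST: the boat the oar\nSECOND: the boat\nFIRST: the sail"

parts = split_into_parts(SAMPLE)
rare = rare_strings(parts, rare_max=2)  # tiny threshold to suit the toy text
```

With the toy text and a threshold of 2, 'the' (four occurrences) drops out and 'boat', 'oar' and 'sail' remain. Note that the counts are of strings, not of words in any grammatical sense.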

I didn't get terribly far with SHAXICAN and hoped that someone with
better programming skills would see what I'd done and would be
sufficiently interested to improve upon it. This person has turned up:
SHAKSPERian Steve Roth has greatly improved on my original scripts and
texts and tied them to a FileMaker database which produces tabular
reports on rare-word usage which may be directly compared with Foster's
published tables. This exciting work is at


(it's also linked from the homepage at www.totus.org/SHAXICAN)

Steve has already issued the caveat that his work needs to be checked by
the wider community of Shakespearians, especially those in the field of
computerized lexical analysis. The important point here is that Steve
has made available (on the above site) his Perl scripts, the tables they
produce, and the structure of his FileMaker database: everything is in
the open. The productivity benefits of this approach are best
illustrated by analogy with Linux, an entirely free and open alternative
to the software products of Microsoft and to proprietary Unix. In 1991 a
Scandinavian comp-sci student, Linus Torvalds, gave away the source code
of his first attempt at a free version of Unix, and the rest of the
academic computing community added to it so quickly that it is now
preferred as a platform by many serious academic and business users.

One obvious objection to the approach started by me and continued by
Steve is that we aren't really working with words at all but with
strings. For example, the scripts make no distinction between the
three-letter string 'r-o-w' when it occurs in the verb 'to row' and when
it occurs in the noun 'a row'. This is because the e-texts we begin with
make no such distinction; they just provide the electronic letters and
numbers much as these would appear in a printed text of the plays. One
might instead use a lemmatized e-text in which the strings are tagged to
indicate which category ('part of speech') they belong to: noun,
adjective, pronoun, verb, adverb, preposition, conjunction,
interjection, or article.  Lemmatized e-texts of Shakespeare are not, to
my knowledge, freely available but they must exist. (Foster either had
one or made his own, for example.) I have approached Joachim Neuhaus of
the Shakespeare Database Project (which aims to create lemmatized
e-texts of Shakespeare), but nothing substantive has come of this
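
To make the difference concrete, here is a small Python sketch comparing a count of bare strings with a count of strings paired with a part-of-speech tag. The tagged tokens are hand-made for the example and do not come from any real tagged e-text; the tag names and function names are likewise assumptions.

```python
from collections import Counter

# Hypothetical hand-tagged tokens as (string, part of speech) pairs.
TAGGED = [
    ("row", "verb"),
    ("a", "article"),
    ("row", "noun"),
    ("row", "verb"),
]


def count_strings(tagged):
    """Count by string alone, as the string-based scripts effectively do."""
    return Counter(string for string, _tag in tagged)


def count_tagged(tagged):
    """Count each (string, part of speech) pair separately."""
    return Counter(tagged)


by_string = count_strings(TAGGED)
by_tag = count_tagged(TAGGED)
```

Here the string 'row' is counted three times, whereas the tagged count gives two for the verb and one for the noun. Near the rarity threshold, the same string could therefore be 'rare' under one scheme and not under the other.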

Steve's work with strings could be adapted to work with a lemmatized
e-text, but perhaps this isn't necessary. Should we assume that
Shakespeare thought in terms of parts of speech? As E A J Honigmann
observed (The Stability of Shakespeare's Texts pp. 67-7) Greg and Dover
Wilson thought it absurd to suppose that Shakespeare would substitute
one word for another which was graphically similar (eg hulkes for
bulkes, Indian for Iudean), but Milton can be shown to be doing just
this in the MS of Comus. Indeed, his image clusters sometimes are
organized by graphic relatedness (eg 'lords'-'lads'-'friends').  If this
principle is accepted, use of the verb 'row' in one place might
predispose Shakespeare to use the noun 'row' in another, and thus it
would not be appropriate to distinguish these as different 'words' for
the purpose of analysing Shakespeare's choices.  Obviously this
observation isn't a full-blown defence of using non-lemmatized
e-texts, but I'd be interested to hear SHAKSPERians' comments on its
viability as a reason not to seek a lemmatized text.

I hope SHAKSPERians with an interest in Foster's work and lexical
analysis generally will take a look at Steve's exciting work and freely
comment upon it.

Gabriel Egan

