The Shakespeare Conference: SHK 12.0144  Tuesday, 23 January 2001

From:           Gabriel Egan
Date:           Sunday, 21 Jan 2001 19:38:07 -0000
Subject: 12.0134 Re: Shaxicon
Comment:        Re: SHK 12.0134 Re: Shaxicon

David Kathman writes:

>Shaxicon indexes words (in the sense of dictionary
>entries) rather than strings.  Two occurrences of the
>same string may represent different words, while
>several different strings may all represent variants
>of the same word.  Sorting the text into words (as
>opposed to strings) cannot be done by computer,
>though a computer can give you a start

This extract from David's very full answer is the gist of it: SHAXICAN
needs a lemmatized etext of the complete works to do what SHAXICON
does. I shall go looking for one. (I recall that H. Joachim Neuhaus's
"Shakespeare Database" project was intended to help those of us for whom
this word/string distinction requires hard thinking.)
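
To make the word/string distinction concrete, here is a minimal sketch in Python. The lemma table and the sample tokens are made up for illustration; a real lemmatized etext would supply this mapping, and nothing here is taken from SHAXICON or SHAXICAN themselves.

```python
from collections import Counter

# Hypothetical hand-made table: surface string -> dictionary headword.
# A lemmatized etext would provide this information for the whole canon.
LEMMA_OF = {
    "love": "love",
    "loves": "love",
    "loved": "love",
    "lov'd": "love",
}

def count_strings(tokens):
    """Count occurrences of each distinct (lower-cased) string."""
    return Counter(t.lower() for t in tokens)

def count_words(tokens, lemma_of):
    """Count occurrences of each headword; unknown strings stand for themselves."""
    return Counter(lemma_of.get(t.lower(), t.lower()) for t in tokens)

tokens = ["Love", "loves", "lov'd", "loved"]
string_counts = count_strings(tokens)        # four different strings, once each
word_counts = count_words(tokens, LEMMA_OF)  # one word, "love", four times
```

A table keyed on strings only handles the "several strings, one word" half of David's point. The other half, where one string represents two different words, needs each occurrence tagged in context, which is the part that cannot be done by computer alone.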

In the meantime, SHAXICAN has been updated so that Stage (3), producing
a list of the rare words (or rather strings, as David points out) for
each character, is half-way solved. I've written a script which
separates out the 1047 "parts" (i.e., characters' collections of
speeches) in the Shakespeare canon and writes each out to a separate
file. It still remains to sift out those words which aren't "rare" and
to count the occurrences of the remaining ones. (This shouldn't be too
hard, since Stage (2) produced a list of the rare words.)
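
For anyone curious, the remaining step might look something like the following sketch. It is not the actual SHAXICAN script: the "SPEAKER: speech" line format, the function names, and the tiny rare list are all assumptions made for illustration, and it counts strings rather than words.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenizer: runs of lower-case letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")

def split_parts(lines):
    """Group tokens by speaker, assuming each line reads 'SPEAKER: speech'."""
    parts = defaultdict(list)
    for line in lines:
        speaker, sep, speech = line.partition(":")
        if not sep:
            continue  # no speaker prefix; skipped in this sketch
        parts[speaker.strip()].extend(TOKEN.findall(speech.lower()))
    return parts

def rare_counts(parts, rare):
    """For each part, keep only the rare strings and count their occurrences."""
    return {
        speaker: Counter(t for t in tokens if t in rare)
        for speaker, tokens in parts.items()
    }

# Illustrative input only; the rare list would really come from Stage (2).
sample = [
    "HAMLET: To be, or not to be, that is the question",
    "OPHELIA: O, what a noble mind is here o'erthrown",
    "HAMLET: The undiscovered country",
]
rare = {"undiscovered", "o'erthrown", "question"}
counts = rare_counts(split_parts(sample), rare)
```

Each part's counter could then be written out alongside the file already produced for that character.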

Gabriel Egan

