Shakespeare Electronic Conference, Vol. 3, No. 72. Thursday, 26 Mar 1992.
Date: 		Tue, 24 Mar 1992 10:48:42 EST
From: 		Hardy M. Cook
Subject: 	A Long Discussion of Tagging
WordCruncher, TACT, and Micro OCP are all text-analysis software.
WordCruncher and TACT both bill themselves as text-retrieval programs.
Micro OCP "makes concordances, indexes, and word lists from texts in
a variety of languages and alphabets."  On the simplest level, all
three locate words and word patterns in a text or corpus, but far
more sophisticated analysis is possible. Micro OCP, according to the
*User Manual*, "can be used for many text analysis applications
including the investigation of style, vocabulary distribution,
grammatical forms, rhyme schemes, text editing, and language
acquisition and teaching."
I am most familiar with WordCruncher, and more familiar with TACT than
Micro OCP.  However, I have not prepared texts to be used with these
programs, relying instead on previously prepared texts: *The Riverside
Shakespeare* with WordCruncher, *Hamlet In-TACT* (the three *Hamlet*
texts that Ken Steele prepared and shared with participants in a SAA
seminar on Shakespearean computing several years ago), and the Oxford
Text Archive's collections offered to SHAKSPEReans last fall with Micro
OCP.
I don't mean this posting to be a tutorial in a subject about which I
know very little myself, but some details are necessary to continue our
discussion. Files prepared for WordCruncher seem to support "three levels
of Reference Codes that identify three levels of a standard outline."
TACT, according to its *User's Guide*, employs <angle brackets> to identify
Text References as does Micro OCP: herein lies the issue of tagging our
PD Shakespeare texts.
With the Sonnets, I chose to post two versions: one with minimal tagging
(basically only the <it> tag, which I am now inclined to replace with
{braces}) and another fully tagged version with <angle brackets> that
include T: Title, L: Line, P: Pagination, and S: Sonnet Number.  The
minimally tagged version could easily be reformatted for TeX, while
either can be used with a word processor.  However, I suspect some
of us who want to work with these e-texts will want to use them with
one of the text-analysis programs.
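
As a rough illustration of how the two versions relate, the sketch below
derives a minimally tagged line from a fully tagged one by dropping the
<angle bracket> references and keeping the {braces}. The tag shape (one
capital letter, then an optional value) and the sample line are assumptions
made for the example; the posted Sonnets files may be laid out differently.

```python
import re

# Assumed tag shape: "<", one capital letter, an optional value, ">".
# This is illustrative only and is not taken from the posted files.
REFERENCE_TAG = re.compile(r"<[A-Z](?:\s[^<>]*)?>")

def strip_reference_tags(line):
    """Remove angle-bracket reference tags, leaving text and {italic} braces."""
    return REFERENCE_TAG.sub("", line)

# Hypothetical fully tagged line (S: Sonnet Number, L: Line).
tagged = "<S 1><L 1>From fairest creatures we desire increase,"
print(strip_reference_tags(tagged))
```

Going the other way, from minimal to full tagging, cannot be done by pattern
alone, which is one argument for keeping the fully tagged file as the master
copy.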
Selecting what to tag in the sonnets was relatively easy; selecting
what to tag in the plays and deciding whether we would like to have
untagged as well as tagged versions is another matter.
Several months ago, Ken Steele proposed the following as possibilities
for tagging the plays:
	Play Title e.g. <T Hamlet Q1>
	Act/Scene  e.g. <A 3.2>
	Line [this should probably be added mechanically]
	Direction e.g. <D Enter {Hamlet}.>
	Speech Prefix e.g. <S {Ham.}>
	Font e.g. {these words in italic}
	Language e.g. <L Latin> ergo <L English>
	Some other things which might be added by an editor with sufficient
	resources and information are the following:
	Speaker (not always clear or the same as prefix)
	Verse/Prose (not always the same as the original ed.)
	Compositor Stints (usually a little theoretical)
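
To make the proposal concrete, here is a minimal sketch of how a program
might read tags of this shape. The tag letters and examples come from the
list above; the function name find_tags and the rule that a tag is one
capital letter, a space, and a value are assumptions for illustration, not
part of the proposal.

```python
import re

# Tag letters and meanings as given in the proposal above.
TAG_NAMES = {
    "T": "Play Title",
    "A": "Act/Scene",
    "D": "Direction",
    "S": "Speech Prefix",
    "L": "Language",
}

# Assumed shape: one capital letter, a space, then the value up to ">".
TAG = re.compile(r"<([A-Z]) ([^<>]*)>")

def find_tags(line):
    """Return (tag name, value) pairs for each tag found in a line."""
    return [(TAG_NAMES.get(code, code), value) for code, value in TAG.findall(line)]

for example in ["<T Hamlet Q1>", "<A 3.2>", "<D Enter {Hamlet}.>", "<S {Ham.}>"]:
    print(find_tags(example))
```

Note that L stands for Language here but for Line in the tagged Sonnets, so
a shared scheme would have to settle on one meaning.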
With the sonnets, I tried to reproduce in ASCII as closely as possible
what I saw on the page.  Thus, I added spaces where I saw large gaps and
ran other elements together where I saw minimal spaces.  Because I was
trying to reproduce the "look" of the page, I included everything on the
page, including signatures, pages, and forms.  Thus, I would suggest that
pagination information be included in our tagging.
The Oxford Text Archive describes its encoding choices this way:
This file contains embedded markers for use by Oxford Concordance Program,
delimited by the characters < and >.
The following categories of reference are included:
    T  : Play title
    C  : Compositor identifier
    P  : Signature
    A  : authorial attribution
    Y  : (occasionally) type of copy
    S  : Speaker prefix
    Z  : Act/scene prefix stage direction etc
    D  : embedded stage direction
Lines beginning with either a space or a star are text lines, one for each
line in the original text. Lines beginning with a * are justified lines.
Hinman's lineation for the folio is followed. Lines beginning with a reference
(i.e. <) are not included in the lineation.
The character # is used in some texts to distinguish homographs (e.g. Will and
Will); it is also used with the hyphen to indicate cases where hyphenation
is significant.
The characters { and } (curly braces) are used to enclose material in italics.
     Words containing tildes in the original texts have been expanded.
     Words hyphenated across a line boundary have been joined together and
     included at the end of the first line.  A "%" marks the hyphen in this
     case. If the second part of the hyphenated word was the only thing
     on that line, an underscore "_" on that line is used to indicate that
     it is a non-blank line in the original text.
Turnovers are joined together on the first line, and the character "|" is
used to mark this point.
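
The first-character rules in this description could be applied along the
following lines. This is only a sketch: it assumes the rule is tested on the
very first character of each raw line, with no line-number column in front,
and the function names are hypothetical.

```python
def classify_line(line):
    """Classify one raw line by its first character, per the description above."""
    if line.startswith("<"):
        return "reference"       # not included in the lineation
    if line.startswith("*"):
        return "justified text"  # a text line that is justified
    if line.startswith(" "):
        return "text"
    return "unknown"             # anything the description does not cover

def count_text_lines(lines):
    """Count the lines that take part in the lineation."""
    return sum(1 for line in lines
               if classify_line(line) in ("text", "justified text"))

raw = [
    "<T KL>",
    "*I thought the King had more affected the",
    " Duke of {Albany}, then {Cornwall}.",
]
print([classify_line(line) for line in raw])
print(count_text_lines(raw))
```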
I am not sure if we would like to go into this detail, but here is what
the OTA texts look like:
Sample from OTA *King Lear* F1:
      <T KL><L 1><Y Q><P qq2><C B>
1      <Z {Actus  Primus. Scoena Prima}.>
2      <D {Enter Kent, Gloucester, and Edmond}.>
3      <S {Kent}.>
4     *I thought the King had more affected the
5      Duke of {Albany}, then {Cornwall}.
6     *<S {Glou}.> It did alwayes seeme so to vs: But
7     *now in the diuision of the Kingdome, it ap-peares
8     *not which of the Dukes hee valewes
9     *most, for qualities are so weigh'd, that curiosity in nei-ther,
10     can make choise of eithers moity.
11     <S {Kent}.> Is not this your Son, my Lord?
12    *<S {Glou}.> His breeding Sir, hath bin at my charge. I haue
13    *so often blush'd to acknowledge him, that now I am
14     braz'd too't.
15     <S {Kent}.> I cannot conceiue you.
16    *<S {Glou}.> Sir, this yong Fellowes mother could; where-vpon
17    *she grew round womb'd, and had indeede (Sir) a
18    *Sonne for her Cradle, ere she had a husband for her bed.
19     Do you smell a fault?
20    *<S {Kent}.> I cannot wish the fault vndone, the issue of it,
21     being so proper.
22    *<S {Glou}.> But I haue a Sonne, Sir, by order of Law, some
23    *yeere elder then this; who, yet is no deerer in my ac-count,
24    *though this Knaue came somthing sawcily to the
25    *world before he was sent for: yet was his Mother fayre,
26    *there was good sport at his making, and the horson must
27    *be acknowledged. Doe you know this Noble Gentle-man,
28     {Edmond}?
29     <S {Edm}.> No, my Lord.
30     <S {Glou}.> My Lord of Kent:
31     Remember him heereafter, as my Honourable Friend.
32     <S {Edm}.> My seruices to your Lordship.
33     <S {Kent}.> I must loue you, and sue to know you better.
34     <S {Edm}.> Sir, I shall study deseruing.
35    *<S {Glou}.> He hath bin out nine yeares, and away he shall
36     againe. The King is comming.
37    *<D {Sennet. Enter King Lear, Cornwall, Albany, Gonerill, Re-gan},
38     {Cordelia, and attendants}.>
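
To suggest what such tagging makes possible even without one of the
text-analysis programs, the sketch below tallies speech prefixes in a few
lines copied from the sample above. The pattern assumes every prefix has the
form <S {Name}.> as in the sample; count_speeches is a hypothetical helper,
not a feature of any of the programs discussed.

```python
import re
from collections import Counter

# Assumed prefix form, as seen in the sample: <S {Name}.>
SPEECH_PREFIX = re.compile(r"<S \{([^{}]*)\}\.?>")

def count_speeches(lines):
    """Tally how many speech prefixes each speaker has."""
    counts = Counter()
    for line in lines:
        counts.update(SPEECH_PREFIX.findall(line))
    return counts

# Four lines taken from the sample above, listing numbers included.
excerpt = [
    "3      <S {Kent}.>",
    "6     *<S {Glou}.> It did alwayes seeme so to vs: But",
    "11     <S {Kent}.> Is not this your Son, my Lord?",
    "29     <S {Edm}.> No, my Lord.",
]
print(count_speeches(excerpt))
```

Because the pattern looks only for the tag itself, it works whether or not
the listing numbers at the left are present.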
I apologize for the length of this posting, but I felt it was important
to get the issue of tagging the PD Shakespeare texts clearly in front of
us.  Your responses are sought.
					Hardy M. Cook
					Bowie State University

