Shakespeare Electronic Conference, Vol. 3, No. 72. Thursday, 26 Mar 1992.

Date:    Tue, 24 Mar 1992 10:48:42 EST
From:    Hardy M. Cook
Subject: A Long Discussion of Tagging

WordCruncher, TACT, and Micro OCP are all text-analysis programs. WordCruncher and TACT both bill themselves as text-retrieval programs. Micro OCP "makes concordances, indexes, and word lists from texts in a variety of languages and alphabets." On the simplest level, all three locate words and word patterns in a text or corpus, but far more sophisticated analysis is possible. Micro OCP, according to the *User Manual*, "can be used for many text analysis applications including the investigation of style, vocabulary distribution, grammatical forms, rhyme schemes, text editing, and language acquisition and teaching."

I am most familiar with WordCruncher, and more familiar with TACT than with Micro OCP. However, I have not prepared texts for these programs myself, relying instead on previously prepared texts: *The Riverside Shakespeare* with WordCruncher, *Hamlet In-TACT* (the three *Hamlet* texts that Ken Steele prepared and shared with participants in an SAA seminar on Shakespearean computing several years ago), and the Oxford Text Archive's collections offered to SHAKSPEReans last fall with Micro OCP.

I don't mean this posting to be a tutorial in a subject about which I know very little myself, but some details are necessary to continue our discussion. Files prepared for WordCruncher seem to support "three levels of Reference Codes that identify three levels of a standard outline." TACT, according to its *User's Guide*, employs <angle brackets> to identify Text References, as does Micro OCP: herein lies the issue of tagging our PD Shakespeare texts.

With the Sonnets, I chose to post two versions: one with minimal tagging (basically only the <it> tag, which I am now inclined to replace with {braces}) and another fully tagged version with <angle brackets> that include T: Title, L: Line, P: Pagination, and S: Sonnet Number. The minimally tagged version could easily be reformatted for TeX, while either can be used with a word processor. However, I suspect some of us who want to work with these e-texts will want to use them with one of the text-analysis programs.

Selecting what to tag in the sonnets was relatively easy; selecting what to tag in the plays and deciding whether we would like to have untagged as well as tagged versions are another matter. Several months ago, Ken Steele proposed the following as possibilities for tagging the plays:

    Play Title      e.g. <T Hamlet Q1>
    Act/Scene       e.g. <A 3.2>
    Line            [this should probably be added mechanically]
    Direction       e.g. <D Enter {Hamlet}.>
    Speech Prefix   e.g. <S {Ham.}>
    Font            e.g. {these words in italic}
    Language        e.g. <L Latin> ergo <L English>

Some other things which might be added by an editor with sufficient resources and information are the following:

    Speaker (not always clear or the same as prefix)
    Verse/Prose (not always the same as the original ed.)
    Signatures/Pages/Formes
    Compositor Stints (usually a little theoretical)

With the sonnets, I tried to reproduce in ASCII as closely as possible what I saw on the page. Thus, I added spaces where I saw large gaps and ran other elements together where I saw minimal spaces. Because I was trying to reproduce the "look" of the page, I included everything on the page, including signatures, pages, and forms. Thus, I would suggest that pagination information be included in our tagging.
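
As a rough illustration of how tags like those in Ken Steele's list might be handled mechanically, here is a minimal Python sketch. It assumes that every tag has the form of one capital letter, a space, and a value inside angle brackets, and that braces mark italics. The function names `parse_tags` and `strip_tags` are hypothetical, and the sample string is simply assembled from the examples in the list; none of this describes how WordCruncher, TACT, or Micro OCP actually read their input.

```python
import re

# Assumed tag shape: <X value>, e.g. <S {Ham.}> or <A 3.2>
TAG = re.compile(r"<([A-Z]) ([^<>]*)>")


def parse_tags(text):
    """Return a (letter, value) pair for every tag found in the text."""
    return [(m.group(1), m.group(2)) for m in TAG.finditer(text)]


def strip_tags(text):
    """Derive an untagged version from a tagged one.

    Tags D (direction) and S (speech prefix) wrap words that appear on
    the page, so their contents are kept; all other tags are dropped.
    Italic braces are removed as well.
    """
    def keep_or_drop(match):
        return match.group(2) if match.group(1) in "DS" else ""

    plain = TAG.sub(keep_or_drop, text)
    plain = plain.replace("{", "").replace("}", "")
    # Tidy the spacing left behind by removed tags and skip emptied lines
    lines = [" ".join(line.split()) for line in plain.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical sample built only from the examples in the list above
sample = (
    "<T Hamlet Q1>\n"
    "<A 3.2>\n"
    "<D Enter {Hamlet}.>\n"
    "<S {Ham.}> <L Latin> ergo <L English>\n"
)

if __name__ == "__main__":
    for letter, value in parse_tags(sample):
        print(letter, value)
    print(strip_tags(sample))
```

If something like this works, an untagged version need not be keyed separately: it could be generated from the fully tagged one, which bears on whether we need to post both.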

The Oxford Text Archive describes its encoding choices this way:

**************************************************************************
This file contains embedded markers for use by Oxford Concordance Program, delimited by the characters < and >

The following categories of reference are included:

    T : Play title
    C : Compositor identifier
    P : Signature
    A : authorial attribution
    Y : (occasionally) type of copy
    S : Speaker prefix
    Z : Act/scene prefix stage direction etc
    D : embedded stage direction

Lines beginning with either a space or a star are text lines, one for each line in the original text. Lines beginning with a * are justified lines. Hinman's lineation for the folio is followed. Lines beginning with a reference (i.e. <) are not included in the lineation.

The character # is used in some texts to distinguish homographs (e.g. Will and Will); it is also used with the hyphen to indicate cases where hyphenation is significant.

The characters { and } (curly braces) are used to enclose material in italics.

Words containing tildes in the original texts have been expanded.

Words hyphenated across a line boundary have been joined together and included at the end of the first line. A "%" marks the hyphen in this case. If the second part of the hyphenated word was the only thing on that line, an underscore "_" on that line is used to indicate that it is a non-blank line in the original text.

Turnovers are joined together on the first line, and the character "|" is used to mark this point.
*****************************************************************************

I am not sure if we would like to go into this detail, but here is what the OTA texts look like:

Sample from OTA *King Lear* F1:

    <T KL><L 1><Y Q><P qq2><C B>
 1  <Z {Actus Primus. Scoena Prima}.>
 2  <D {Enter Kent, Gloucester, and Edmond}.>
 3  <S {Kent}.>
 4  *I thought the King had more affected the
 5  Duke of {Albany}, then {Cornwall}.
 6  *<S {Glou}.> It did alwayes seeme so to vs: But
 7  *now in the diuision of the Kingdome, it ap-peares
 8  *not which of the Dukes hee valewes
 9  *most, for qualities are so weigh'd, that curiosity in nei-ther,
10  can make choise of eithers moity.
11  <S {Kent}.> Is not this your Son, my Lord?
12  *<S {Glou}.> His breeding Sir, hath bin at my charge. I haue
13  *so often blush'd to acknowledge him, that now I am
14  braz'd too't.
15  <S {Kent}.> I cannot conceiue you.
16  *<S {Glou}.> Sir, this yong Fellowes mother could; where-vpon
17  *she grew round womb'd, and had indeede (Sir) a
18  *Sonne for her Cradle, ere she had a husband for her bed.
19  Do you smell a fault?
20  *<S {Kent}.> I cannot wish the fault vndone, the issue of it,
21  being so proper.
22  *<S {Glou}.> But I haue a Sonne, Sir, by order of Law, some
23  *yeere elder then this; who, yet is no deerer in my ac-count,
24  *though this Knaue came somthing sawcily to the
25  *world before he was sent for: yet was his Mother fayre,
26  *there was good sport at his making, and the horson must
27  *be acknowledged. Doe you know this Noble Gentle-man,
28  {Edmond}?
29  <S {Edm}.> No, my Lord.
30  <S {Glou}.> My Lord of Kent:
31  Remember him heereafter, as my Honourable Friend.
32  <S {Edm}.> My seruices to your Lordship.
33  <S {Kent}.> I must loue you, and sue to know you better.
34  <S {Edm}.> Sir, I shall study deseruing.
35  *<S {Glou}.> He hath bin out nine yeares, and away he shall
36  againe. The King is comming.
37  *<D {Sennet. Enter King Lear, Cornwall, Albany, Gonerill, Re-gan},
38  {Cordelia, and attendants}.>

-----------------------------------------------------------------------------

I apologize for the length of this posting, but I felt it was important to get the issue of tagging the PD Shakespeare texts clearly in front of us. Your responses are sought.

Hardy M. Cook
Bowie State University