BOX 2.2
Information in the arts and humanities can take many forms, and it can be studied for many different purposes. Humanities primary source texts may take the form of literary works, historical documents, manuscripts, papyri, inscriptions, coins, transcriptions of spoken texts, or dictionaries, and they can be written in any natural language. This information is characterized above all by its complexity. It may include variant readings, variant spellings, marginal notes, annotations of various kinds, cancellations, and interlineations, as well as nonstandard characters.

Literary texts rarely conform to the simple document structure required by most current text retrieval systems. Plays consist of acts, scenes, speeches, stage directions, cast lists, and so forth, and verse takes many different forms. Systems for processing literary texts must also be able to handle different languages and alphabets, both in analyzing the structure of words and in displaying the text in the correct script.

Most textual information in electronic form is intended to be used in a document retrieval system where a typical user wants to find all the documents about a certain topic. For primary source material in textual form, this is less likely to be the major application. A user may want to locate a quotation, compare the vocabulary of the characters in a novel, examine the rhyme and sound patterns in verse, or even find out whether a particular word is used at all. These types of searches require a text that is accurate and that is encoded to make certain features explicit.

Ever since Father Busa began to create the first humanities electronic text in 1949, effort has concentrated on finding ways to represent the characteristics of texts so that they can be manipulated easily. Most early projects attempted to transcribe texts into electronic form while maintaining as accurate a reproduction of the original as possible.
Typographic features were faithfully encoded, and many texts were prepared before it was fully realized how ambiguous typography can be. For example, italic type can represent titles, foreign words, or emphasized words, so a search on italics alone cannot retrieve only the foreign words.

In late 1987 the humanities computing community became one of the earliest groups to adopt the Standard Generalized Markup Language (SGML) when it began a major project called the Text Encoding Initiative (TEI). The TEI has developed an SGML application that can handle many different types of humanities electronic texts. It includes tags not only for the structural features of the text, but also for analysis and interpretation. The TEI includes SGML tag sets for prose, verse, drama, transcriptions of speech, dictionaries, and terminological data, as well as analytical features, transcription of manuscripts, names and dates, language corpora, and a sophisticated method for hypertext linking both within and outside the current document.

The TEI defines about 400 possible tags, but very few are mandatory. The philosophy is that one person can encode a text for the features he or she is interested in. Another person can then take that text and add encoding for other features. The TEI makes it possible to encode multiple and possibly conflicting views in the same document, thus allowing for differences of opinion in interpretations of the material.
SOURCE: Excerpted from position paper by Susan Hockey available on-line at http://www2.nas.edu/CSTBWEB.
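The ambiguity of italics described in the box can be illustrated with a minimal sketch. It is not part of the excerpted paper. The sample sentence is invented, and the markup is written as well-formed XML only so that Python's standard library parser can read it, whereas the TEI application discussed above was defined in SGML. The element names `hi`, `title`, `foreign`, and `emph` follow TEI naming; the helper `spans` is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence, not taken from any real edition.
# Typographic encoding: every italic span carries the same tag.
TYPOGRAPHIC = (
    '<p>She read <hi rend="italic">Emma</hi> with a certain '
    '<hi rend="italic">joie de vivre</hi> and <hi rend="italic">never</hi> '
    'put it down.</p>'
)

# Descriptive encoding: each span is tagged by what it is,
# not by how it was printed.
DESCRIPTIVE = (
    '<p>She read <title>Emma</title> with a certain '
    '<foreign>joie de vivre</foreign> and <emph>never</emph> '
    'put it down.</p>'
)


def spans(markup, tag):
    """Return the text of every element named `tag` in the fragment."""
    root = ET.fromstring(markup)
    return [element.text for element in root.iter(tag)]


# A search on italics returns titles, foreign words, and emphasis alike.
italic_spans = spans(TYPOGRAPHIC, "hi")

# A search on the descriptive tag returns only the foreign phrase.
foreign_spans = spans(DESCRIPTIVE, "foreign")

print(italic_spans)
print(foreign_spans)
```

With the typographic encoding, the query cannot separate the three uses of italic. With the descriptive encoding, the same query isolates the foreign phrase, and a second encoder could later add further tags for other features without disturbing the existing ones.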