FIGURE 11 Blank document.

characters. For a mathematical equation, the presence of subscripts and superscripts generates additional small peaks. A theory of blank space in lines and documents would aid in building tools for accessing text.

Why is it important to analyze text automatically? There are many reasons. As already mentioned, one reason is to allow standard database queries over English text documents that contain tables. Another reason involves automatic indexing. Consider WAIS, a tool created by Apple Computer, Dow Jones, Thinking Machines, and Dun and Bradstreet for locating information from wire services and other text sources (Kahle, 1989). Intuitively, WAIS operates by taking a document, sorting the words, and creating a vector whose components indicate the number of occurrences of various words in the document. One retrieves a document by providing a set of keywords. The keywords are used to form a search vector. The dot product of the search vector with each vector representing a document in the collection is then computed. (This is basically the SMART technology (Salton, 1991).) A short list of the documents with the highest dot products is produced. The documents can then be displayed.
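The retrieval scheme just described can be illustrated with a minimal sketch. It is not the WAIS or SMART implementation; the function names (`word_vector`, `dot`, `rank`), the tokenization rule, and the sample documents are assumptions made only for illustration. Each document is reduced to a vector of word counts, the keywords form a search vector, and documents are listed by decreasing dot product.

```python
import re
from collections import Counter


def word_vector(text):
    """Count occurrences of each lowercase word in the text (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def dot(u, v):
    """Dot product of two sparse word-count vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(count * v[word] for word, count in u.items())


def rank(documents, keywords, top_k=3):
    """Return up to top_k (score, name) pairs with the highest dot products."""
    query = word_vector(" ".join(keywords))
    scored = [(dot(query, word_vector(body)), name)
              for name, body in documents.items()]
    # Highest score first; ties are broken by document name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical three-document collection.
docs = {
    "a": "stock prices rose as stock markets rallied",
    "b": "the weather was mild",
    "c": "prices fell",
}
short_list = rank(docs, ["stock", "prices"], top_k=2)
print(short_list)
```

In this sketch a document that repeats a keyword scores higher than one that mentions it once, which is the behavior the raw occurrence counts imply. A practical system would likely normalize or weight the vectors, but that goes beyond what is described here.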



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.