Actually, one wants to index different types of documents differently. The above method of indexing is very effective for a journal article or a simple document on a coherent topic. However, for an electronic mail file with 5,000 unrelated messages, the method makes no sense at all. A single index for a file of 5,000 unrelated documents may say something about the collection as a whole but is not useful for retrieving an individual document. Thus, a file of electronic mail messages should be indexed by creating a separate index for each message. If different computer files are to be indexed differently, then one needs to be able to identify the type of file automatically.
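As a minimal sketch of indexing each message separately, the Python fragment below splits a mail file into messages and builds one word-count index per message. It assumes an mbox-style file in which every message begins with a line starting with "From "; that separator rule and the names split_messages and index_messages are illustrative assumptions, not part of any system described in this lecture.

```python
import re
from collections import Counter


def split_messages(mail_text):
    """Split mail text into messages.

    Assumption: each message starts with a line beginning "From "
    (mbox-style). Real mail files may need more careful parsing.
    """
    messages, current = [], []
    for line in mail_text.splitlines():
        # A new separator line closes the message collected so far.
        if line.startswith("From ") and current:
            messages.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("\n".join(current))
    return messages


def index_messages(mail_text):
    """Return one word-count index (a Counter) per message."""
    return [
        Counter(re.findall(r"[a-z0-9]+", message.lower()))
        for message in split_messages(mail_text)
    ]


mail = (
    "From alice Mon\n"
    "Subject: mesh\n"
    "\n"
    "mesh generation notes\n"
    "From bob Tue\n"
    "Subject: lunch\n"
    "\n"
    "lunch at noon\n"
)
indexes = index_messages(mail)
```

Here indexes holds two separate counters, so a query about meshes can be matched against the first message alone rather than against the file as a whole. Header words are counted along with the body in this sketch; a practical indexer would probably treat them separately.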
I have not mentioned electronic libraries and the potential of accessing from one's living room an image of any printed document or work of art in the world. The reason for the omission is that by and large the technology already exists for electronic libraries. The main impediments seem to be legal. That is, we need to solve the legal issue of how to protect intellectual property rights and how to charge for the services provided.
CONCLUSION
In this lecture, I have tried to sketch some of the technical issues that need development as we enter the computer and information age. This age promises to be even more beneficial than the agricultural and industrial revolutions preceding it. We live in an exciting time, one in which we need to work together and share our ideas to build and support the emerging technologies of the new age.
REFERENCES
Bern, Marshall, and David Eppstein. 1992. Mesh Generation and Optimal Triangulation, pp. 23-90 in Computing in Euclidean Geometry, F.K. Hwang and D. Du (eds.), World Scientific, River Edge, N.J.
Cremer, James F. 1989. An Architecture for General Purpose Physical System Simulation— Integrating Geometry, Dynamics, and Control, Department of Computer Science Technical Report 89-987, Cornell University, Ithaca, N.Y., April.
Hopcroft, John, and Peter J. Kahn. 1992. A Paradigm for Robust Geometric Algorithms, Algorithmica 7:339-380.
Hopcroft, John, C.M. Hoffmann, and M. Karasick. 1989. Robust Set Operations on Polyhedral Solids, IEEE Computer Graphics and Applications 9(November):50-59.
Kahle, Brewster. 1989. Wide Area Information Server (WAIS) Concepts, Thinking Machines Corporation, Cambridge, Mass., November.
Palmer, Richard S., and James F. Cremer. 1991. SIMLAB: Automatically Creating Physical Systems Simulators, Department of Computer Science Technical Report 91-1246, Cornell University, Ithaca, N.Y., November.
Rus, Daniela. 1991. On Dexterous Rotations of Polygons, Department of Computer Science Technical Report 91-1258, Cornell University, Ithaca, N.Y., December.
Salton, Gerard. 1991. Developments in Automatic Text Retrieval, Science 253(30 August):974-980.
Stewart, Arthur James Kennedy. 1991. The Theory and Practice of Robust Geometric Computation, or, How to Build Robust Solid Modelers, Department of Computer Science Technical Report 91-1229, Cornell University, Ithaca, N.Y., August.
characters. For a mathematical equation, the presence of subscripts and superscripts generates additional small peaks. A theory of blank space in lines and documents would aid in building tools for accessing text.
Why is it important to analyze text automatically? There are many reasons. One, already mentioned, is to allow standard database queries of English text documents that contain tables. Another involves automatic indexing. Consider WAIS, a tool created by Apple Computer, Dow Jones, Thinking Machines, and Dun & Bradstreet for locating information from wire services and other text sources (Kahle, 1989). Intuitively, WAIS operates by taking a document, sorting the words, and creating a vector whose components indicate the number of occurrences of various words in the document. One retrieves a document by providing a set of keywords. The keywords are used to form a search vector, and the dot product of the search vector with the vector representing each document in the collection is computed. (This is basically the SMART technology (Salton, 1991).) A short list of the documents with the highest dot products is produced, and those documents can then be displayed.
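The retrieval scheme just described can be sketched in a few lines of Python. This is only an illustration of word-count vectors and dot products under simple assumptions (lowercased words, raw counts, no weighting or stemming); it is not the actual WAIS or SMART code, and the names term_vector, dot, and rank are hypothetical.

```python
import re
from collections import Counter


def term_vector(text):
    """Map each lowercase word in the text to its number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def dot(query, doc):
    """Dot product of two sparse word-count vectors."""
    # A Counter returns 0 for words it has not seen.
    return sum(count * doc[word] for word, count in query.items())


def rank(keywords, documents, top_k=3):
    """Return the names of the top_k documents with the highest dot products."""
    query = term_vector(" ".join(keywords))
    scored = [
        (dot(query, term_vector(text)), name)
        for name, text in documents.items()
    ]
    # Drop documents that share no word with the query.
    scored = [(score, name) for score, name in scored if score > 0]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


documents = {
    "a": "mesh generation and mesh triangulation",
    "b": "text retrieval and text indexing of text",
    "c": "robot motion",
}
hits = rank(["text", "retrieval"], documents)
```

With these three toy documents, only "b" contains the query words, so it is the only document returned. A production system would typically normalize the vectors or weight rare words more heavily, which raw counts do not do.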