Massive Data Sets: Proceedings of a Workshop (1997)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)

"Information Retrieval: Finding Needles in Massive Haystacks." Massive Data Sets: Proceedings of a Workshop. Washington, DC: The National Academies Press, 1997.


return articles about semiconductors, food of various kinds, small pieces of wood or stone, golf and tennis shots, poker games, people named Chip, etc.

The other side of the problem is that we miss relevant information (and this is much harder to know about!). In controlled experimental tests, searches routinely miss 50-80% of the known relevant materials. There is tremendous diversity in the words that people use to describe the same idea or concept (synonymy). We have found that the probability that two people assign the same main content descriptor to an object is 10-20%, depending somewhat on the task (Furnas et al., 1987). If an author uses one word to describe an idea and a searcher uses another word to describe the same idea, relevant material will be missed. Even a simple concrete object like a "viewgraph" is also called a "transparency", "overhead", "slide", "foil", and so on.

Another way to think about these retrieval problems is that word-matching methods treat words as if they are uncorrelated or independent. A query about "automobiles" is no more likely to retrieve an article about "cars" than one about "elephants" if neither article contains precisely the word "automobile". This property is clearly untrue of human memory and seems undesirable in online information retrieval systems (see also Caid et al., 1995). A concrete example will help illustrate the problem.
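Before turning to that example, the independence assumption can be made concrete with a minimal sketch of exact word matching. The three documents and the helper names `tokenize` and `word_match` are hypothetical, invented here for illustration; they are not taken from the paper.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())


def word_match(query, documents):
    """Return ids of documents sharing at least one exact word with the query."""
    query_words = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if query_words & tokenize(text)]


# Hypothetical documents, invented for illustration only.
documents = {
    "d1": "new automobiles reviewed",
    "d2": "used cars for sale",
    "d3": "elephants in the wild",
}

hits = word_match("automobiles", documents)
print(hits)  # ['d1']
```

Under exact matching, the article about cars is treated exactly like the article about elephants: neither shares a word with the query, so neither is returned.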

2.0 A Small Example

A textual database can be represented by means of a term-by-document matrix. The database in this example consists of the titles of 9 Bellcore Technical Memoranda. There are two classes of documents: 5 about human-computer interaction and 4 about graph theory.

Title Database:

c1: Human machine interface for Lab ABC computer applications

c2: A survey of user opinion of computer system response time

c3: The EPS user interface management system

c4: System and human system engineering testing of EPS

c5: Relation of user-perceived response time to error measurement

m1: The generation of random, binary, unordered trees

m2: The intersection graph of paths in trees

m3: Graph minors IV: Widths of trees and well-quasi-ordering

m4: Graph minors: A survey

The term-by-document matrix corresponding to this database is shown in Table 1 for terms occurring in more than one document. The individual cell entries represent the frequency with which a term occurs in a document. In many information retrieval applications these frequencies are transformed to reflect the ability of words to discriminate among documents. Terms that are very discriminating are given high weights and undiscriminating terms are given low weights. Note also the large number of 0 entries in the matrix: most words do not occur in most documents, and most documents do not contain most words.
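As a rough sketch, a matrix of this kind can be built from the nine titles as follows. Several choices here are assumptions, not taken from the paper: the stop list, the tokenizer (which splits hyphenated words such as "user-perceived"), and the weighting at the end, which uses a simple term frequency times log inverse document frequency as one common transformation. The paper does not say which transformation it has in mind.

```python
import math
import re
from collections import Counter

titles = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# Assumed stop list; the source text does not state one.
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to"}


def tokenize(title):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", title.lower())
            if w not in STOP_WORDS]


# Per-document term frequencies.
counts = {doc_id: Counter(tokenize(t)) for doc_id, t in titles.items()}

# Document frequency: the number of titles each term appears in.
doc_freq = Counter(term for c in counts.values() for term in c)

# Keep only terms that occur in more than one document.
terms = sorted(t for t, df in doc_freq.items() if df > 1)
doc_ids = list(titles)

# Rows are terms, columns are documents, cells are raw frequencies.
matrix = [[counts[d][t] for d in doc_ids] for t in terms]

for term, row in zip(terms, matrix):
    print(f"{term:<10}", *row)

zero_cells = sum(row.count(0) for row in matrix)
total_cells = len(terms) * len(doc_ids)
print(f"{zero_cells} of {total_cells} cells are zero")

# Assumed weighting (tf * log(N / df)): terms found in fewer
# documents receive higher weights.
n_docs = len(doc_ids)
weighted = [[tf * math.log(n_docs / doc_freq[t]) for tf in row]
            for t, row in zip(terms, matrix)]
```

With these assumptions the sketch keeps 12 terms, and 80 of the 108 cells are zero, which illustrates the sparsity noted above even for a very small collection.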
