net models are a subclass of probabilistic models that use network representations of the distribution of words. A variety of other formal and ad hoc statistical methods, including ones based on neural nets and fuzzy logic, have been tried as well.

In IR systems, documents are often represented as vectors of binary or numeric values corresponding directly or indirectly to the words of the document. Several properties of language, such as synonymy, ambiguity, and sheer variety, make these representations far from ideal (but also hard to improve on [13]). A variety of unsupervised learning methods have been applied to IR, in the hope of finding structure in large bodies of text that would improve on straightforward representations. These include clustering of words or documents [10, 20], factor analytic decompositions of term-by-document matrices [1], and various term weighting methods [16].
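
As a concrete illustration of the vector representation described above, the following minimal Python sketch maps two toy documents onto binary, raw-count, and log-damped term vectors. The tokenizer, the function names (`tokenize`, `build_vocabulary`, `to_vector`), the toy documents, and the particular damping formula `1 + log(tf)` are all assumptions made for this example; they are not taken from any system cited here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in docs for word in tokenize(doc)})


def to_vector(doc, vocabulary, weighting="binary"):
    """Map a document to one value per vocabulary word.

    weighting: "binary" (word present or not), "count" (raw occurrences),
    or "log" (1 + natural log of the count, an assumed damping choice).
    """
    counts = Counter(tokenize(doc))
    vector = []
    for word in vocabulary:
        tf = counts[word]  # Counter returns 0 for words not in the document
        if weighting == "binary":
            vector.append(1.0 if tf > 0 else 0.0)
        elif weighting == "count":
            vector.append(float(tf))
        elif weighting == "log":
            vector.append(1.0 + math.log(tf) if tf > 0 else 0.0)
        else:
            raise ValueError("unknown weighting: " + weighting)
    return vector


# Hypothetical two-document collection.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = build_vocabulary(docs)

# One row per document; transposing gives a term-by-document matrix.
binary_vectors = [to_vector(d, vocabulary, "binary") for d in docs]
count_vectors = [to_vector(d, vocabulary, "count") for d in docs]
log_vectors = [to_vector(d, vocabulary, "log") for d in docs]
```

Even this toy case shows the limitations noted above: synonyms occupy unrelated vector positions, and an ambiguous word collapses into a single position regardless of its sense.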

Similarly, the retrieval query, routing profile, or category description provided by an IR system user is often far from ideal. Supervised learning techniques, in which user feedback on relevant documents is used to improve the original user input, have been widely used [6, 15]. Both parametric methods and nonparametric methods (e.g., neural nets, decision trees, nearest neighbor classifiers) have been used. Supervised learning is particularly effective in routing (where a user can supply ongoing feedback as the system is used) [7] and in text categorization (where a large body of manually indexed text may be available) [12, 14].
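
One simple way to picture how feedback on relevant documents can improve a query is a Rocchio-style update, sketched below. This is offered only as an illustration and is not necessarily the method used in the cited work. The update moves the query vector toward the mean of the documents judged relevant and away from the mean of those judged nonrelevant. The function names and the coefficient values `alpha`, `beta`, and `gamma` are assumptions chosen for the example.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a revised query vector from user relevance judgments.

    alpha, beta, gamma are illustrative weights (assumed, not from the text)
    for the original query, the relevant mean, and the nonrelevant mean.
    """
    dim = len(query)
    rel_mean = mean_vector(relevant, dim)
    non_mean = mean_vector(nonrelevant, dim)
    updated = [
        alpha * q + beta * r - gamma * n
        for q, r, n in zip(query, rel_mean, non_mean)
    ]
    # Negative term weights are clipped to zero, a common simplification.
    return [max(0.0, w) for w in updated]


# Hypothetical three-term vocabulary: the user marks two documents relevant
# and one nonrelevant.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
nonrelevant = [[0.0, 0.0, 1.0]]
new_query = rocchio_update(query, relevant, nonrelevant)
```

In this toy case the revised query gains weight on the second term, which the user never typed but which appears in both relevant documents. In a routing setting the same update could be reapplied as further judgments arrive.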

2 The Future

These are exciting times for IR. Statistical IR methods developed over the past 30 years are suddenly being widely applied in everything from shrinkwrapped personal computer software to large online databases (Dialog, Lexis/Nexis, and West Publishing all fielded their first statistical IR systems in the past three years) and search tools for the Internet.

Until recently, IR researchers dealt mostly with relatively small and homogeneous collections of short documents (often titles and abstracts). Comparisons of over 30 IR methods in the recent NIST/ARPA Text Retrieval Conferences (TREC) have resulted in a number of modifications to these methods to deal with large (one million documents or more) collections of diverse full text documents [2, 3, 4]. Much of this tuning has been ad hoc and heavily empirical. Little is known about the relationship between the properties of a text base and the best IR methods to use with it. This is an undesirable situation, given the increasing variety of applications to which IR is applied, and it is perhaps the most important area where better statistical insight would be helpful.

Four observations from the TREC conferences give a sense of the range of problems where better statistical insight is needed:

  1. Term weighting in long documents: Several traditional approaches give a document credit for matching a query word in proportion to the number of times the word occurs in the document. Performance on TREC is improved if the logarithm of the number of occurrences of the word is used instead. Better models of the distribution of word occurrences in documents might provide less ad hoc approaches to this weighting.
  2. Feedback from top ranked documents: Supervised learning methods have worked well in TREC, with some results suggesting that direct user input becomes of relatively little value

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.