Information Retrieval and the Statistics of Large Data Sets
David D. Lewis
AT&T Bell Laboratories
Providing content-based access to large quantities of text is a difficult task, given our poor understanding of the formal semantics of human language. The most successful approaches to retrieval, routing, and categorization of documents have relied heavily on statistical techniques. We briefly review some of those techniques and point out where better statistical insight could lead to further advances.
1 IR and Statistics Today
Information retrieval (IR) is concerned with providing access to data for which we do not have strong semantic models. Text is the most notable example, though voice, images, and video are of interest as well. Examples of IR tasks include retrieving documents from a large database in response to immediate user needs, routing or filtering documents of interest from an ongoing stream over a period of time, and categorizing documents according to their content (e.g. assigning Dewey Decimal numbers to abstracts of articles).
Statistical approaches have been widely applied to these tasks because of the poor fit of text to data models based on formal logics (e.g. relational databases). Rather than requiring that users anticipate exactly the words and combinations of words that will appear in documents of interest, statistical IR approaches let users simply list words that are likely to appear in relevant documents. The system then takes into account the frequency of these words in a collection of text, and in individual documents, to determine which words are likely to be the best clues to relevance. A score is computed for each document based on the words it contains, and the highest scoring documents are retrieved, routed, categorized, etc.
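The scoring scheme just described can be sketched in a few lines. The tiny collection, the query, and the tf-idf style weighting below are illustrative choices for exposition, not a particular method from the text:

```python
# A minimal sketch of statistical document scoring: words that are rare
# in the collection (low document frequency) count for more, and a
# document is credited for each occurrence of a query word.
import math
from collections import Counter

docs = [
    "statistical methods for text retrieval",
    "formal semantics of human language",
    "statistical scoring and statistical retrieval of documents",
]
query = "statistical retrieval"

tokenized = [d.split() for d in docs]
N = len(tokenized)
# document frequency: how many documents contain each word
df = Counter(w for doc in tokenized for w in set(doc))

def score(doc_words, query_words):
    """Sum, over query words, term frequency times inverse document frequency."""
    tf = Counter(doc_words)
    return sum(tf[w] * math.log(N / df[w]) for w in query_words if df[w])

# rank all documents by score, best first
ranked = sorted(range(N), key=lambda i: score(tokenized[i], query.split()),
                reverse=True)
```

Under this weighting the third document ranks first (it matches both query words, one of them twice), while the second, which matches neither, scores zero.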
There are several variations on this approach [5, 17, 18, 19]. Vector space models treat the words suggested by the user as specifying an ideal relevant document in a high-dimensional space; the distance of actual documents from this point is used as a measure of relevance. Probabilistic models attempt to estimate, for instance, the conditional probability of seeing particular words in relevant and nonrelevant documents. These estimates are combined under independence assumptions, and documents are scored by their probability of membership in the class of relevant documents. Inference net models are a subclass of probabilistic models that use network representations of the distribution of words. A variety of other formal and ad hoc statistical methods, including ones based on neural nets and fuzzy logic, have been tried as well.
In IR systems documents are often represented as vectors of binary or numeric values corresponding directly or indirectly to the words of the document. Several properties of language, such as synonymy, ambiguity, and sheer variety, make these representations far from ideal (but also hard to improve on). A variety of unsupervised learning methods have been applied to IR, with the hope of finding structure in large bodies of text that would improve on straightforward representations. These include clustering of words or documents [10, 20], factor analytic decompositions of term-by-document matrices, and various term weighting methods.
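One simple instance of such unsupervised structure-finding is grouping words whose document occurrences overlap, in the spirit of the word-clustering work cited above. The miniature collection, the Jaccard overlap measure, and the merge threshold below are all illustrative assumptions:

```python
# Cluster words by the overlap of the sets of documents they occur in,
# merging any pair whose Jaccard overlap meets a threshold (single-link).
from collections import defaultdict

docs = [
    "car auto engine",
    "car auto wheel",
    "court judge trial",
    "judge trial verdict",
]

# postings: which documents each word appears in
postings = defaultdict(set)
for i, doc in enumerate(docs):
    for w in doc.split():
        postings[w].add(i)

def jaccard(a, b):
    """Overlap of the document sets of two words, from 0.0 to 1.0."""
    return len(postings[a] & postings[b]) / len(postings[a] | postings[b])

words = sorted(postings)
clusters = {w: {w} for w in words}
for a in words:
    for b in words:
        if a < b and jaccard(a, b) >= 0.5:
            merged = clusters[a] | clusters[b]
            for w in merged:
                clusters[w] = merged
```

On this toy data the procedure recovers an automotive cluster and a legal cluster, the sort of structure one would hope scales to large text bodies.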
Similarly, the retrieval query, routing profile, or category description provided by an IR system user is often far from ideal as well. Supervised learning techniques, where user feedback on relevant documents is used to improve the original user input, have been widely used [6, 15]. Both parametric and nonparametric methods (e.g. neural nets, decision trees, nearest neighbor classifiers) have been used. Supervised learning is particularly effective in routing (where a user can supply ongoing feedback as the system is used) and in text categorization (where a large body of manually indexed text may be available) [12, 14].
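One widely used parametric form of such feedback is the Rocchio update, which moves the query vector toward judged-relevant documents and away from judged-nonrelevant ones. The sketch below assumes that formulation; the weights alpha, beta, and gamma are conventional illustrative values, not ones given in the text:

```python
# Rocchio-style relevance feedback over sparse count vectors.
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Reweight the query using averaged relevant and nonrelevant vectors."""
    new = Counter()
    for w, c in query.items():
        new[w] += alpha * c
    for doc in relevant:
        for w, c in doc.items():
            new[w] += beta * c / len(relevant)
    for doc in nonrelevant:
        for w, c in doc.items():
            new[w] -= gamma * c / len(nonrelevant)
    # negative weights are customarily clipped to zero
    return Counter({w: c for w, c in new.items() if c > 0})

query = Counter("statistical retrieval".split())
rel = [Counter("statistical retrieval of text documents".split())]
nonrel = [Counter("retrieval of formal proofs".split())]
expanded = rocchio(query, rel, nonrel)
```

The expanded query keeps the original words at high weight, picks up new words from the relevant document ("text", "documents"), and drops words seen only in the nonrelevant one.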
2 The Future
These are exciting times for IR. Statistical IR methods developed over the past 30 years are suddenly being widely applied in everything from shrink-wrapped personal computer software to large online databases (Dialog, Lexis/Nexis, and West Publishing all fielded their first statistical IR systems in the past three years) and search tools for the Internet.
Until recently, IR researchers dealt mostly with relatively small and homogeneous collections of short documents (often titles and abstracts). Comparisons of over 30 IR methods in the recent NIST/ARPA Text REtrieval Conferences (TREC) have resulted in a number of modifications to these methods to deal with large (one million documents or more) collections of diverse full text documents [2, 3, 4]. Much of this tuning has been ad hoc and heavily empirical. Little is known about the relationship between the properties of a text base and the best IR methods to use with it. This is an undesirable situation, given the increasing variety of applications to which IR is applied, and is perhaps the most important area where better statistical insight would be helpful.
Four observations from the TREC conferences give a sense of the range of problems where better statistical insight is needed:
- Term weighting in long documents: Several traditional approaches give a document credit for matching a query word proportional to the number of times the word occurs in a document. Performance on TREC is improved if the logarithm of the number of occurrences of the word is used instead. Better models of the distribution of word occurrences in documents might provide less ad hoc approaches to this weighting.
- Feedback from top ranked documents: Supervised learning methods have worked well in TREC, with some results suggesting that direct user input becomes of relatively little value when large numbers of training instances are available. More surprisingly, applying supervised learning methods to the top ranked documents from an initial retrieval run, as if they were known to be relevant, has been found to be somewhat useful. This strategy had failed in all attempts prior to TREC. Is the size of the TREC collection the key to its success? Can this idea be better understood and improved on (perhaps using EM methods)?
- Massive query expansion: Supervised learning approaches to IR often augment the original set of words suggested by a user with words from documents they have judged to be relevant. For probabilistic IR methods, adding only a few words from relevant documents has been found to work best. However, for vector space methods, massive expansion (adding most or all words from known relevant documents) seems to be optimal. Reconciling this with the curse of dimensionality is an interesting issue.
- The difficulty of evaluating IR systems: TREC has reminded researchers that we do not have a good understanding of how to decide if one IR system is significantly better than another, much less how to predict in advance the level of effectiveness that an IR system can deliver to a user. Indeed, it is often unclear what reasonable measures of effectiveness are. These issues are of more interest than ever, given the large number of potential new users of IR technology.
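The term-weighting observation in the first bullet above can be illustrated numerically. The sketch below assumes a simple "1 + log(tf)" damping, one common form of logarithmic weighting, with invented counts:

```python
# Compare raw term-frequency credit with logarithmically damped credit.
import math

def raw_tf_score(tf_counts):
    """Credit proportional to the number of occurrences of each query word."""
    return sum(tf_counts)

def log_tf_score(tf_counts):
    """Damped credit: 1 + log(tf) per matched query word."""
    return sum(1 + math.log(tf) for tf in tf_counts if tf > 0)

# a long document repeating one query word 100 times versus a single occurrence
hundred, once = [100], [1]
# raw weighting counts the repetition 100 times as much;
# log weighting counts it less than 6 times as much
```

This is why the logarithm helps on long TREC documents: sheer repetition of one word no longer swamps evidence from other words.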
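The feedback-from-top-ranked-documents idea in the second bullet can also be sketched. The collection, the query, the "top 2" cutoff, and expansion by most frequent words are all illustrative assumptions:

```python
# Pseudo-relevance feedback: run an initial retrieval, treat the top
# ranked documents as if relevant, and expand the query with their
# most common words.
from collections import Counter

docs = [
    "statistical retrieval of text documents",
    "statistical scoring for retrieval systems",
    "formal semantics of language",
]
query = ["retrieval"]
tokenized = [d.split() for d in docs]

def overlap(doc_words, query_words):
    """Initial, crude score: occurrences of query words in the document."""
    tf = Counter(doc_words)
    return sum(tf[w] for w in query_words)

# initial run: take the top 2 documents as pseudo-relevant
ranked = sorted(tokenized, key=lambda d: overlap(d, query), reverse=True)
pseudo_relevant = ranked[:2]

# expand the query with the commonest words of the pseudo-relevant set
counts = Counter(w for d in pseudo_relevant for w in d)
expansion = [w for w, _ in counts.most_common(3) if w not in query]
expanded_query = query + expansion
```

Here the query picks up "statistical" from the two top-ranked documents, while words from the unretrieved document (e.g. "semantics") are never added; whether such expansion helps or hurts plausibly depends on how often the top ranks really are relevant.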
TREC reveals just a few of the IR problems where better statistical insight is crucial. Others include dealing with time-varying streams of documents (and time-varying user needs), drawing conclusions from databases that mix text and formatted data, and choosing what information sources to search in the first place. On the tools side, a range of powerful techniques from statistics have seen relatively little application in IR, including cross-validation, model averaging, graphical models, hierarchical models, and many others. Curiously, highly computational methods have seen particularly little use. The author has been particularly interested in methods for actively selecting training data (a la statistical design of experiments) for supervised learning [9, 11]. Since vast quantities of text are now cheap, while human time is expensive, these methods are of considerable interest.
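The selective-sampling work cited above asks the human to label the documents the current classifier is least certain about, rather than documents drawn at random. A simplified sketch, assuming scores near 0.5 mark the most uncertain documents and using invented scores:

```python
# Uncertainty sampling: from a pool of unlabeled documents with current
# estimated probabilities of relevance, pick the k whose scores lie
# closest to the decision boundary for human labeling.
def select_for_labeling(scored_docs, k=2, boundary=0.5):
    """Return the k (doc, score) pairs with scores nearest the boundary."""
    return sorted(scored_docs, key=lambda d: abs(d[1] - boundary))[:k]

# (document id, current estimated probability of relevance)
pool = [("d1", 0.97), ("d2", 0.52), ("d3", 0.08), ("d4", 0.45), ("d5", 0.99)]
chosen = select_for_labeling(pool)
```

Documents d2 and d4, with scores near 0.5, are selected; confidently classified documents like d1 and d3 would add little information per unit of expensive human effort.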
In summary, the opportunities for, and the need for, more statistical work in IR are as vast as the flood of online text engulfing the world!
References

[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, September 1990.
[2] D. K. Harman, editor. The First Text REtrieval Conference (TREC-1), Gaithersburg, MD 20899, 1993. National Institute of Standards and Technology. Special Publication 500-207.
[3] D. K. Harman, editor. The Second Text REtrieval Conference (TREC-2), Gaithersburg, MD 20899, 1994. National Institute of Standards and Technology. Special Publication 500-215.
[4] D. K. Harman, editor. Overview of the Third Text REtrieval Conference (TREC-3), Gaithersburg, MD 20899-0001, 1995. National Institute of Standards and Technology. Special Publication 500-225.
[5] Donna Harman. Ranking algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 363-392. Prentice Hall, Englewood Cliffs, NJ, 1992.
[6] Donna Harman. Relevance feedback and other query modification techniques. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 241-263. Prentice Hall, Englewood Cliffs, NJ, 1992.
[7] Donna Harman. Overview of the third Text REtrieval Conference (TREC-3). In D. K. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1-27, Gaithersburg, MD, 1995. U. S. Dept. of Commerce, National Institute of Standards and Technology.
[8] David D. Lewis. Learning in intelligent information retrieval. In Eighth International Workshop on Machine Learning, pages 235-239, 1991.
[9] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In William W. Cohen and Haym Hirsh, editors, Machine Learning: Proceedings of the Eleventh International Conference on Machine Learning, pages 148-156, San Francisco, CA, 1994. Morgan Kaufmann.
[10] David D. Lewis and W. Bruce Croft. Term clustering of syntactic phrases. In Thirteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 385-404, 1990.
[11] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In W. Bruce Croft and C. J. van Rijsbergen, editors, SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, London, 1994. Springer-Verlag.
[12] David D. Lewis and Philip J. Hayes. Guest editorial. ACM Transactions on Information Systems, 12(3):231, July 1994.
[13] David D. Lewis and Karen Sparck Jones. Natural language processing for information retrieval. Communications of the ACM, 1996. To appear.
[14] David D. Lewis and Marc Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, Las Vegas, NV, April 11-13 1994. ISRI; Univ. of Nevada, Las Vegas.
[15] Gerard Salton and Chris Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.
[16] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[17] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York, 1983.
[18] H. R. Turtle and W. B. Croft. A comparison of text retrieval models. The Computer Journal, 35(3):279-290, 1992.
[19] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.
[20] Peter Willett. Recent trends in hierarchic document clustering: A critical review. Information Processing and Management, 24(5):577-598, 1988.