net models are a subclass of probabilistic models which use network representations of the distribution of words. A variety of other formal and ad hoc statistical methods, including ones based on neural nets and fuzzy logic have been tried as well.
In IR systems documents are often represented as vectors of binary or numeric values corresponding directly or indirectly to the words of the document. Several properties of language, such as synonymy, ambiguity, and sheer variety make these representation far from ideal (but also hard to improve on [13]). A variety of unsupervised learning methods have been applied to IR, with the hope of finding structure in large bodies of text that would improve on straightforward representations. These include clustering of words or documents [10, 20], factor analytic decompositions of term by document matrices [1], and various term weighting methods [16].
Similarly, the retrieval query, routing profile, or category description provided by an IR system user is often far from ideal as well. Supervised learning techniques, where user feedback on relevant documents is used to improve the original user input, have been widely used [6, 15]. Both parametric and nonparametric (e.g. neural nets, decision trees, nearest neighbor classifiers) have been used. Supervised learning is particularly effective in routing (where a user can supply ongoing feedback as the system is used) [7] and in text categorization (where a large body of manually indexed text may be available) [12, 14].
These are exciting times for IR. Statistical IR methods developed over the past 30 years are suddenly being widely applied in everything from shrinkwrapped personal computer software, up to large online databases (Dialog, Lexis/Nexis, and West Publishing all fielded their first statistical IR systems in the past three years) and search tools for the Internet.
Until recently, IR researchers dealt mostly with relatively small and homogeneous collections of short documents (often titles and abstracts). Comparisons of over 30 IR. methods in the recent NIST/ARPA Text Retrieval Conferences (TREC), have resulted in a number of modifications to these methods to deal with large (one million documents or more) collections of diverse full text documents [2, 3, 4]. Much of this tuning has been ad hoc and heavily empirical. Little is known about the relationship between properties of a text base and the best IR methods to use with it. This is an undesirable situation, given the increasing variety of applications IR is applied to, and is perhaps the most important area where better statistical insight would be helpful.
Four observations from the TREC conferences give a sense of the range of problems where better statistical insight is needed: