Basic Concepts in Information Retrieval
Nicholas Belkin
1.1 DEFINITIONS AND SYSTEM DESIGN
Information retrieval and information filtering are different functions. Information retrieval is intended to support people who are actively seeking or searching for information, as in Internet searching. Information retrieval typically assumes a static or relatively static database against which people search. Search engine companies construct these databases by sending out “spiders” and then indexing the Web pages they find. By contrast, information filtering supports people in passively monitoring for desired information; it is typically understood to operate on an active incoming stream of information objects.
The problem in information retrieval and information filtering is that decisions must be made for every document or information object regarding whether or not to show it to the person who is retrieving the information. Initially, a profile describing the user’s information needs is set up to facilitate such decision making; this profile may be modified over the long term through the use of user models. These models are based on a person’s behavior—decisions, reading behaviors, and so on, which may change the original profile. Both information retrieval and information filtering attempt to maximize the good material that a person sees (that which is likely to be appropriate to the information problem at hand) and minimize the bad material.
When people refer to filtering, they often really mean information retrieval. That is, they are not concerned with dynamic streams of documents but rather with databases that are already constructed and in which there is some way to represent the information objects and relate them to one another. Thus, filtering corresponds to the Boolean filter in information retrieval: a yes/no decision.
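The yes/no character of a Boolean filter can be sketched in a few lines. This is a minimal illustration, not any particular system; the idea of a "profile" as a set of required and excluded terms is an assumption made for the sketch:

```python
def boolean_filter(doc_terms, required, excluded=frozenset()):
    """Yes/no decision: accept a document only if it contains every
    required term and none of the excluded ones."""
    doc = set(doc_terms)
    return required <= doc and not (excluded & doc)

# A hypothetical profile for someone monitoring news about search engines
required = {"search", "engine"}
excluded = {"locomotive"}

print(boolean_filter(["web", "search", "engine", "index"], required, excluded))  # True
print(boolean_filter(["steam", "engine", "locomotive"], required, excluded))     # False
```

Every document is either shown or suppressed; there is no middle ground, which is exactly the limitation the next paragraph's "best match" principle addresses.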
Most search engines designed for the World Wide Web use the principle of “best match,” that is, not making yes/no decisions but, rather, ranking information objects with respect to some representation of the information problem. Thus, the basic processes in information retrieval or information filtering are the representations of information objects and of information needs, or more generally, the problem or goal that the person has in mind. The retrieval techniques themselves then compare needs with objects.
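Best-match ranking can be sketched with a deliberately crude term-overlap score. Real engines use weighted schemes (such as tf-idf), which this toy does not attempt; the documents and query are invented for illustration:

```python
def best_match(query_terms, documents):
    """Rank documents by how many query terms each contains,
    rather than making a yes/no decision."""
    q = set(query_terms)
    scored = [(len(q & set(terms)), doc_id)
              for doc_id, terms in documents.items()]
    # Highest overlap first; every document gets a rank, none is excluded.
    return sorted(scored, reverse=True)

docs = {
    "d1": ["information", "retrieval", "systems"],
    "d2": ["information", "filtering", "streams"],
    "d3": ["cooking", "recipes"],
}
print(best_match(["information", "retrieval"], docs))
# [(2, 'd1'), (1, 'd2'), (0, 'd3')]
```

Note that even the irrelevant document receives a rank; the decision about where to stop reading is left to the person, not the system.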
The interaction of the user with other components of the system is important. In fact, the prevailing view in information retrieval research is that the most effective approach for helping a user obtain the appropriate information is relevance feedback, in which the system takes into account whether a person likes or dislikes a document as it automatically re-represents the user’s query. This leads to performance improvements of as much as 150 percent—much better than any other technique. Thus, the person’s judgment of the information objects is an important part of the process. The user is an actor in the information retrieval system, because many of the processes depend on his or her expression and interpretation of the need. The relevance of a document cannot be determined unless the person is considered a part of the system.
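One standard way relevance feedback is implemented (not named in the text, but a conventional sketch) is Rocchio-style query modification: the query's term weights are pulled toward documents the user judged relevant and pushed away from those judged non-relevant. The weights alpha, beta, and gamma below are common illustrative values, not prescribed anywhere in this chapter:

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Re-represent the query as a weighted term vector, moved toward
    liked documents and away from disliked ones."""
    new_q = Counter()
    for term, w in Counter(query).items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in Counter(doc).items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in Counter(doc).items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Keep only positively weighted terms in the revised query.
    return {t: w for t, w in new_q.items() if w > 0}

q = rocchio(["bank"],
            relevant=[["bank", "river", "erosion"]],
            nonrelevant=[["bank", "loan"]])
# Terms from the liked document ("river", "erosion") now enrich the
# query, while "loan" from the disliked document is suppressed.
```

The user never retypes the query; the system re-represents it automatically from the relevance judgments, which is the mechanism behind the performance improvements described above.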
The second important part of the system is the information resource, a collection of information objects that has been selected, organized, and represented according to some schema. The third component is the intermediary—a device or person that mediates between the information resource and the user and that has knowledge of the user, the user’s problem, and the types of users that exist, as well as the information resource, the way the resource is organized, what it contains, and so on. The intermediary supports the interaction between people and the information objects and knowledge resource, through prediction and other means.
1.2 PROBLEMS
The representation of information problems is inherently uncertain, because people look for that which they do not know, and it is probably inappropriate to ask them to specify what they do not know. The representation of information objects requires interpretations by a human indexer, machine algorithm, or other entity. The problem is that anyone’s interpretation of a particular text is likely to be different from anyone else’s, and even different for the same person at different times. As our state of knowledge or problems change, our understanding of a text changes. Everyone has experienced the situation of finding a document not relevant at some point but highly relevant later on, perhaps for a different problem or perhaps because we, ourselves, are different. The easiest and most effective way to deal with this problem is to support users’ interactions with information objects and let them take control.
Because of these uncertainties, the comparison of needs and information objects, or retrieval process, is also inherently uncertain and probabilistic. The understanding of information objects is subjective, and, therefore, representation is necessarily inconsistent. We do not know how well we are representing either the person’s need or the information object. An extensive literature on interindexer consistency shows that when people are asked to represent an information object, even if they are highly trained in using the same meta-language (indexing language), they typically achieve only about 60 to 70 percent consistency in tasks such as assigning descriptors. We will never achieve “ideal” information retrieval—that is, all the relevant documents and only the relevant documents, or precisely the one thing that a person wants.
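Interindexer consistency is commonly measured as the overlap between two indexers' descriptor sets divided by their union (one common measure among several in the literature; the descriptors below are invented for illustration):

```python
def consistency(descriptors_a, descriptors_b):
    """Interindexer consistency: descriptors both indexers assigned,
    divided by all descriptors either indexer assigned."""
    a, b = set(descriptors_a), set(descriptors_b)
    return len(a & b) / len(a | b)

# Two trained indexers describing the same document
indexer_1 = {"information retrieval", "search engines", "indexing", "world wide web"}
indexer_2 = {"information retrieval", "search engines", "indexing", "databases"}
print(round(consistency(indexer_1, indexer_2), 2))  # 0.6
```

Even with three of five distinct descriptors shared, agreement is only 60 percent, which is in the range the literature reports for trained indexers.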
The implication is that we must think of probabilistic ways of representing information problems. Even if computers were as smart as people, they probably could not do the job. A standard information retrieval result is that automatic indexing—in which algorithms do statistical word counting and indexing—leads to performance that is no worse, and often better, than systems in which people do manual indexing.
There is no reason to suppose that people will do a better job than machines, and neither one will do a perfect job, ever. Making absolute predictions in an inherently probabilistic environment is not a good idea.
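The statistical word counting behind automatic indexing can be sketched very simply: tally term frequencies, discard common function words, and keep the most frequent remaining terms as index terms. The stopword list and the cutoff below are illustrative assumptions, not a standard:

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "on", "is", "that", "it", "for"}

def auto_index(text, n_terms=5):
    """Automatic indexing by statistical word counting: the most
    frequent non-stopwords become the document's index terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n_terms)]

text = ("Information retrieval systems rank documents. "
        "Retrieval quality depends on the representation of documents.")
print(auto_index(text, 3))
```

No human interpretation is involved, yet indexing of this general kind performs at least as well as manual indexing in the standard result cited above.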
Algorithms for representing information objects, or information problems, do give consistent representations. But they give one interpretation of the text, out of a great variety of possible representations, depending on the interpreter. Language is ambiguous in many ways: polysemy, synonymity, and so on. For example, a bank can be either a financial institution or something on the side of a river (polysemy). The context matters a lot in the interpretation.
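A crude illustration of how context bears on polysemy: compare the words surrounding an ambiguous term against hand-built sense profiles and pick the sense with the most overlap. The profiles and the overlap measure are toy assumptions, loosely in the spirit of dictionary-overlap disambiguation, not a method described in the text:

```python
# Hypothetical sense profiles for the ambiguous word "bank"
SENSES = {
    "financial institution": {"money", "loan", "deposit", "account"},
    "river bank": {"river", "water", "erosion", "shore"},
}

def disambiguate(context_words):
    """Pick the sense of 'bank' whose profile shares the most
    words with the surrounding context."""
    context = set(context_words)
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate(["the", "river", "overflowed", "its", "bank"]))  # river bank
print(disambiguate(["open", "a", "bank", "account", "deposit"]))    # financial institution
```

The same word yields different interpretations depending entirely on its neighbors, which is why context matters so much and why any single representation is only one interpretation among many.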
The meta-language used to describe information objects, or linguistic objects, often is construed to be exactly the same as the textual language itself. But they are not the same. The similarity of the two languages has led to some confusion. In information retrieval, it has led to the idea that the words in the text represent the important concepts and, therefore, can be used to represent what the text is about. The confusion extends to image retrieval, because images can be ambiguous in at least as many ways as can language. Furthermore, there is no universal meta-language for describing images. People who are interested in images for advertising purposes have different ways to talk and think about them than do art historians, even though they may be searching for the same images. The lack of a common meta-language for images means that we need to think of special terms for images in special circumstances.
In attempting to prevent children from getting harmful material, it is possible to make approximations and give helpful direction. But in the end, that is the most that we can hope for. It is not a question of preventing someone from getting inappropriate material but, rather, of supporting the person in not getting it. At least part of the public policy concern is kids who are actively trying to get pornography, and it is unreasonable to suppose that information retrieval techniques will be useful in achieving the goal of preventing them from doing so.
There are a variety of users. The user might be a concerned parent or manager who suspects that something bad is going on. But mistakes are inevitable, and we need to figure out some way to deal with that. It is difficult to tell what anything means, and we often get it wrong. Generally, we want to design the tools so that getting it wrong is less of a nuisance than it otherwise would be.