Suggested Citation:"1 Basic Concepts in Information Retrieval." National Research Council and Institute of Medicine. 2002. Technical, Business, and Legal Dimensions of Protecting Children from Pornography on the Internet: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/10324.

1
Basic Concepts in Information Retrieval

Nicholas Belkin

1.1 DEFINITIONS AND SYSTEM DESIGN

Information retrieval and information filtering are different functions. Information retrieval is intended to support people who are actively seeking or searching for information, as in Internet searching. Information retrieval typically assumes a static or relatively static database against which people search. Search engine companies construct these databases by sending out “spiders” and then indexing the Web pages they find. By contrast, information filtering supports people in the passive monitoring for desired information. It is typically understood to be concerned with an active incoming stream of information objects.

The problem in information retrieval and information filtering is that decisions must be made for every document or information object regarding whether or not to show it to the person who is retrieving the information. Initially, a profile describing the user’s information needs is set up to facilitate such decision making; this profile may be modified over the long term through the use of user models. These models are based on a person’s behavior (decisions, reading behaviors, and so on) and may change the original profile. Both information retrieval and information filtering attempt to maximize the good material that a person sees (that which is likely to be appropriate to the information problem at hand) and minimize the bad material.
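The goal of maximizing good material and minimizing bad material is usually quantified in the information retrieval literature with two measures, precision and recall. Those terms do not appear in the text above; the following is a minimal sketch under that assumption, with hypothetical document identifiers.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the shown documents that were actually appropriate."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the appropriate documents that were actually shown."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: four documents shown, three judged relevant overall.
shown = {1, 2, 3, 4}
good = {2, 4, 5}
print(precision(shown, good), recall(shown, good))
```

A system that shows everything has perfect recall and poor precision; one that shows almost nothing tends toward the reverse, which is why both numbers are reported together.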

When people refer to filtering, they often really mean information retrieval. That is, they are not concerned with dynamic streams of documents but rather with databases that are already constructed and in which there is some way to represent the information objects and relate them to one another. Thus, filtering corresponds to the Boolean filter in information retrieval: a yes/no decision.
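A Boolean filter of this kind can be sketched in a few lines. The function name, the whitespace tokenization, and the required/forbidden term sets are illustrative assumptions, not a description of any particular product.

```python
def boolean_filter(document: str, required: set, forbidden: set = frozenset()) -> bool:
    """Return a yes/no decision: True means the document passes the filter.

    A document passes only if it contains every required term and
    none of the forbidden terms. There is no notion of "how well" it matches.
    """
    words = set(document.lower().split())
    return required <= words and not (forbidden & words)


documents = [
    "river bank erosion report",
    "bank loan interest rates",
]
# Keep documents about banks, but reject anything mentioning loans.
passed = [d for d in documents if boolean_filter(d, {"bank"}, {"loan"})]
print(passed)
```

Every document falls on one side of the line or the other, which is what distinguishes this approach from the ranked retrieval described next.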

Most search engines designed for the World Wide Web use the principle of “best match,” that is, not making yes/no decisions but, rather, ranking information objects with respect to some representation of the information problem. Thus, the basic processes in information retrieval or information filtering are the representations of information objects and of information needs, or more generally, the problem or goal that the person has in mind. The retrieval techniques themselves then compare needs with objects.
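As a rough illustration of best-match ranking, the sketch below represents both the query and the documents as weighted word vectors and orders the documents by cosine similarity. The TF-IDF weighting and the cosine measure are one common choice, assumed here for illustration; the text does not say which comparison any given search engine uses.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query, documents):
    """Return (score, document index) pairs, best match first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Number of documents in which each term appears.
    doc_freq = Counter(t for tokens in doc_tokens for t in set(tokens))

    def weigh(tokens):
        # Term frequency times a smoothed inverse document frequency.
        counts = Counter(tokens)
        return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = weigh(tokenize(query))
    scored = [(cosine(q, weigh(tokens)), i) for i, tokens in enumerate(doc_tokens)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "filtering software for schools",
    "river bank erosion",
    "search engines rank web pages",
]
for score, index in rank("web search engines", docs):
    print(round(score, 3), docs[index])
```

Unlike a Boolean filter, nothing is rejected outright: every document receives a score, and the user decides how far down the list to read.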

The interaction of the user with other components of the system is important. In fact, the prevailing view in information retrieval research is that the most effective approach for helping a user obtain the appropriate information is relevance feedback, in which the system takes into account whether a person likes or dislikes a document as it automatically re-represents the user’s query. This leads to performance improvements of as much as 150 percent—much better than any other technique. Thus, the person’s judgment of the information objects is an important part of the process. The user is an actor in the information retrieval system, because many of the processes depend on his or her expression and interpretation of the need. The relevance of a document cannot be determined unless the person is considered a part of the system.
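One classic way to re-represent a query from likes and dislikes is Rocchio's formula, which moves the query vector toward liked documents and away from disliked ones. The text does not name a specific method, so the choice of Rocchio and the weights `alpha`, `beta`, and `gamma` below are illustrative assumptions.

```python
from collections import defaultdict


def rocchio(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector (term -> weight).

    query, and each item in liked/disliked, is a dict of term weights.
    Terms whose weight ends up at or below zero are dropped.
    """
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in liked:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(liked)
    for doc in disliked:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(disliked)
    return {t: w for t, w in updated.items() if w > 0}


# Hypothetical session: the user searched "bank", liked a river document,
# and disliked a loan document.
new_query = rocchio(
    {"bank": 1.0},
    liked=[{"bank": 1.0, "river": 1.0}],
    disliked=[{"bank": 1.0, "loan": 1.0}],
)
print(new_query)
```

The updated query now carries "river" as well as "bank", so the next ranking favors the sense of the word the user actually had in mind.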

The second important part of the system is the information resource, a collection of information objects that has been selected, organized, and represented according to some schema. The third component is the intermediary—a device or person that mediates between the information resource and the user and that has knowledge of the user, the user’s problem, and the types of users that exist, as well as the information resource, the way the resource is organized, what it contains, and so on. The intermediary supports the interaction between people and the information objects and knowledge resource, through prediction and other means.

1.2 PROBLEMS

The representation of information problems is inherently uncertain, because people look for that which they do not know, and it is probably inappropriate to ask them to specify what they do not know. The representation of information objects requires interpretations by a human indexer, machine algorithm, or other entity. The problem is that anyone’s interpretation of a particular text is likely to be different from anyone else’s, and even different for the same person at different times. As our state of knowledge or problems change, our understanding of a text changes. Everyone has experienced the situation of finding a document not relevant at some point but highly relevant later on, perhaps for a different problem or perhaps because we, ourselves, are different. The easiest and most effective way to deal with this problem is to support users’ interactions with information objects and let them take control.

Because of these uncertainties, the comparison of needs and information objects, or retrieval process, is also inherently uncertain and probabilistic. The understanding of information objects is subjective, and, therefore, representation is necessarily inconsistent. We do not know how well we are representing either the person’s need or the information object. An extensive literature on interindexer consistency shows that when people are asked to represent an information object, even if they are highly trained in using the same meta-language (indexing language), they achieve at best only 60 to 70 percent consistency in tasks such as assigning descriptors. We will never achieve “ideal” information retrieval, that is, all the relevant documents and only the relevant documents, or precisely that one thing that a person wants.
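To make the idea of interindexer consistency concrete, the sketch below uses one common overlap measure: the number of descriptors both indexers assigned, divided by the number either of them assigned. The text does not say which measure the cited studies used, so this formula and the sample descriptors are assumptions for illustration.

```python
def indexer_consistency(descriptors_a: set, descriptors_b: set) -> float:
    """Shared descriptors divided by all descriptors used by either indexer."""
    union = descriptors_a | descriptors_b
    if not union:
        return 1.0  # Two empty descriptor sets agree trivially.
    return len(descriptors_a & descriptors_b) / len(union)


# Hypothetical descriptors assigned to the same document by two indexers.
indexer_1 = {"internet", "children", "filtering"}
indexer_2 = {"internet", "filtering", "censorship"}
print(indexer_consistency(indexer_1, indexer_2))  # 2 shared out of 4 distinct
```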

The implication is that we must think of probabilistic ways of representing information problems. Even if computers were as smart as people, they probably could not do the job. A standard information retrieval result is that automatic indexing—in which algorithms do statistical word counting and indexing—leads to performance that is no worse, and often better, than systems in which people do manual indexing.
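Automatic indexing by statistical word counting can be sketched as follows. The stopword list, the punctuation stripping, and the function names are simplifying assumptions; production systems add stemming, weighting, and much larger stopword lists.

```python
from collections import Counter, defaultdict

# A tiny illustrative stopword list; real lists are far longer.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is", "to"}


def index_terms(text, top_k=None):
    """Count content words in a text and return (term, count) pairs."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(top_k)


def build_inverted_index(documents):
    """Map each term to the documents containing it, with occurrence counts."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in index_terms(text):
            index[term][doc_id] = count
    return index


docs = {
    "d1": "The bank of the river. The river floods.",
    "d2": "The bank approved a loan.",
}
index = build_inverted_index(docs)
print(index["bank"])   # documents mentioning "bank"
print(index["river"])  # documents mentioning "river"
```

The index is perfectly consistent (the same text always yields the same terms), yet it files the riverbank and the financial institution under the same word, which is the kind of single, context-free interpretation discussed below.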

There is no reason to suppose that people will do a better job than machines, and neither one will do a perfect job, ever. Making absolute predictions in an inherently probabilistic environment is not a good idea.

Algorithms for representing information objects, or information problems, do give consistent representations. But they give one interpretation of the text, out of a great variety of possible representations, depending on the interpreter. Language is ambiguous in many ways: polysemy, synonymy, and so on. For example, a bank can be either a financial institution or something on the side of a river (polysemy). The context matters a lot in the interpretation.

The meta-language used to describe information objects, or linguistic objects, often is construed to be exactly the same as the textual language itself. But they are not the same. The similarity of the two languages has led to some confusion. In information retrieval, it has led to the idea that the words in the text represent the important concepts and, therefore, can be used to represent what the text is about. The confusion extends to image retrieval, because images can be ambiguous in at least as many ways as can language. Furthermore, there is no universal meta-language for describing images. People who are interested in images for advertising purposes have different ways to talk and think about them than do art historians, even though they may be searching for the same images. The lack of a common meta-language for images means that we need to think of special terms for images in special circumstances.

In attempting to prevent children from getting harmful material, it is possible to make approximations and give helpful direction. But in the end, that is the most that we can hope for. It is not a question of preventing someone from getting inappropriate material but, rather, of supporting the person in not getting it. At least part of the public policy concern is kids who are actively trying to get pornography, and it is unreasonable to suppose that information retrieval techniques will be useful in achieving the goal of preventing them from doing so.

There are a variety of users. The user might be a concerned parent or manager who suspects that something bad is going on. But mistakes are inevitable, and we need to figure out some way to deal with that. It is difficult to tell what anything means, and usually we get it wrong. Generally we want to design the tools so that getting it wrong is not as much of a nuisance as it otherwise might be.

In response to a mandate from Congress in conjunction with the Protection of Children from Sexual Predators Act of 1998, the Computer Science and Telecommunications Board (CSTB) and the Board on Children, Youth, and Families of the National Research Council (NRC) and the Institute of Medicine established the Committee to Study Tools and Strategies for Protecting Kids from Pornography and Their Applicability to Other Inappropriate Internet Content.

To collect input and to disseminate useful information to the nation on this question, the committee held two public workshops. On December 13, 2000, in Washington, D.C., the committee convened a workshop to focus on nontechnical strategies that could be effective in a broad range of settings (e.g., home, school, libraries) in which young people might be online. This workshop brought together researchers, educators, policy makers, and other key stakeholders to consider and discuss these approaches and to identify some of the benefits and limitations of various nontechnical strategies. The December workshop is summarized in Nontechnical Strategies to Reduce Children's Exposure to Inappropriate Material on the Internet: Summary of a Workshop. The second workshop was held on March 7, 2001, in Redwood City, California. This second workshop focused on some of the technical, business, and legal factors that affect how one might choose to protect kids from pornography on the Internet. The present report provides, in the form of edited transcripts, the presentations at that workshop.
