Technical, Business, and Legal Dimensions of Protecting Children from Pornography on the Internet: Proceedings of a Workshop

4 The Technology of Search Engines

Ray Larson

4.1 OVERVIEW

Most search engine companies do not want to reveal what their technology is or does, because they consider that to be a trade secret. Every company claims to do retrieval better than every other company, and they do not want to lose their competitive edge. I will provide a broad overview of how search technology works in current engines, based on the old standard models of information retrieval.

Two players are involved: the information system and the people who want the information stored in the system. The searchers go through a process of formulating a query, that is, describing what they seek in ways that the system can process. The same sort of thing happens on the other end, where the system has to extract information from the documents included in its database. Those documents need to be described in such a way that someone posing a query can find them.

In general, the emphasis in the design and development of search engines has been to make the document-finding process as effective as possible. Today, however, the goal seems to be to exclude some searchers: the idea is to prevent some people from getting things that we think they should not get. This is anathema to someone from a library background, where we tend to think that everyone should have access to everything and that it is up to Mom and Dad to say no.

In between the information system and the searcher are the search engine's processing functions (the "rules of the game"): how the languages are structured, all the information that can be acquired from the
documents that come in, and how that gets mapped to what a searcher wants. The usual outcome is a set of potentially relevant documents. The searcher does not know whether a retrieved document is really relevant until he or she looks at it and says, "Yes, that is what I wanted."

Much of what happens in search engines, which generally use the "bag-of-words" model for handling data, is structure recognition. Search engines often treat titles differently than they do the body of a Web page, because titles indicate the topic of a page. If the system can extract structure from documents, that structure often can be used as an indicator for additional weighting in the retrieval process. Often the search engine normalizes the text, stripping out capitalization and most other orthographic differences among words. Some systems do not throw this information away automatically but rather attempt to identify things such as sequences of capitalized words, which may indicate a place or a person's name.

The search engine then usually removes stop words, a list of words that it chooses not to index. This would be a likely place to put a filter. But this becomes problematic because, under a bag-of-words model, a problematic occurrence of a word cannot be distinguished from nonproblematic occurrences of the same word. If the usual suspect words were placed on the list of stop words, then suddenly the American Kennel Club Web site would no longer be accessible, because of all of the words that refer to the gender of female dogs, and so on. Rarely, the search engine also may apply natural language processing (NLP) to identify known phrases or chunks of text that properly belong together and indicate certain types of content.

4.2 BOOLEAN SEARCH LOGIC

What is left is a collection of words that need to be retrieved in some way. There are many models for doing this.
The simplest and most widely available model, used in virtually every search engine and the first to be used commercially, is the Boolean operator model. Simple Boolean logic says either "this word and that word occur" or "this word or that word occur," and, therefore, the documents that satisfy the condition should be retrieved. Boolean logic is simple and easy to implement. Because of the volume of data on the Internet, almost all search engines today include an automatic default setting that, in effect, applies the AND operator to all terms provided to the search engine. If the searcher turns this function off, then the search engine usually defaults to a ranking algorithm that attempts to do a "best match" for the query. All of these combinations can be characterized in a simple logic model that says that a given word either occurs in a document or it does not. If it does occur, you have certain matches; if not, you have other matches.
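To make the pipeline concrete, here is a minimal sketch of the steps described so far: normalization, stop-word removal, a term-to-documents index, and Boolean AND/OR matching. The documents, the stop list, and the function names are invented for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "at"}  # a tiny illustrative stop list

def bag_of_words(text):
    """Normalize (lowercase, strip punctuation) and drop stop words.

    Note that stop-word removal is context-free: every occurrence of a
    listed word disappears, whatever it meant on that particular page.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Invented toy collection: document ID -> raw text.
docs = {
    1: "The Kennel Club maintains a registry of dogs.",
    2: "Look at the cute dogs and cats!",
    3: "Cats of the world, unite.",
}

# Term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in bag_of_words(text):
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents containing every term (the implicit default on most engines)."""
    return set.intersection(*(index.get(t, set()) for t in terms))

def boolean_or(*terms):
    """Documents containing at least one of the terms."""
    return set.union(*(index.get(t, set()) for t in terms))

print(boolean_and("dogs", "cats"))  # {2}
print(boolean_or("dogs", "cats"))   # {1, 2, 3}
print(boolean_and("dogs", "dog"))   # set(): no match unless the exact indexed form is used
```

The last line illustrates the exact-match limitation discussed below, and the stop list illustrates the earlier point about filtering: adding a word to STOP_WORDS removes every occurrence of it from the index, innocuous or not.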
Any combination of three words, for example, can be specified, such that the document has this word and not the other two, or all three together, or one and not the other of two. You can specify any combination of the words. But if you do not specify a word exactly as it is stored in the index, then you will not get it. It cannot be a synonym (unless you supply that synonym), or an alternative phrasing, or a euphemism.

4.3 THE VECTOR SPACE MODEL

Another approach is the vector space model, which has been developed, over 30 years of intensive research, into a finely honed set of tools. Probabilistic models are also being used much more commonly these days, and many other models combine many of the same aspects, including attempts to automatically recognize structures of information within documents that would indicate relevance.

One could look at all of the documents in a collection and consider each individual word that occurs in any of those documents. Most large collections have tens of thousands of words, even hundreds of thousands. A large proportion of those words are nonsense, misspellings, or other problems that occur once or twice, whereas other words occur constantly (e.g., the, and, of). The vector space model considers each term that occurs in a document as if it were a dimension in Euclidean space. (This is why three terms are customarily used as an example; with more than three dimensions, it becomes difficult for people to visualize.) Each document has a vector that points in a certain direction, depending on which terms it contains, and the documents are differentiated on this basis. In the simplest version, this is a yes/no process: a document either has a term or does not have it.
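The yes/no version of the model can be sketched directly: one dimension per vocabulary term, with a 1 in each dimension whose term the document contains and a 0 otherwise. The three documents below are invented for illustration.

```python
# Invented mini-collection: each document reduced to its set of terms.
docs = {
    "d1": {"dog", "kennel", "club"},
    "d2": {"dog", "show"},
    "d3": {"cat", "show"},
}

# One Euclidean dimension per distinct term in the collection.
vocab = sorted(set().union(*docs.values()))

def binary_vector(terms):
    """A document's yes/no vector: 1 in each dimension whose term it contains."""
    return [1 if t in terms else 0 for t in vocab]

print(vocab)
for name, terms in docs.items():
    print(name, binary_vector(terms))
```

Documents containing different terms point in different directions in this space, which is what differentiates them; term weighting, discussed next, replaces the 1s with graded values.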
You also can consider each term as having a particular weight, which can be measured in a variety of ways, such as how frequently the word occurs in a particular document. In this model, you calculate the cosine of the angle between two vectors in this imaginary space. The smaller the angle between the vectors, the more similar the document is to the query, and you can rank documents based on that closeness or similarity.1

1 Nick Belkin said that similarity in text documents is relatively easy to compute, assuming constant meaning of words, whereas similarity of images is very difficult to compute. David Forsyth gave the example of the Pope kissing a baby versus a picture of a politician kissing a baby; they are the same picture in some ways, but different in others.

Therefore, in most vector space models, you do not need to match all the words. As long as you match
many or even some of the words, you will get closer to a particular document that has those words in it.

This model uses "term frequency/inverse document frequency" (TF-IDF): a measure of how frequently a particular term occurs in a particular document, combined with how often that term occurs in the entire collection of interest. If a term occurs frequently in one document but also occurs frequently in every other document in the collection, then it is not a very important word, and the TF-IDF measure reduces the weight placed on it. A common term is considered less important than rare terms. In a 10,000-document collection, if a term occurs in every document, then the inverse document frequency is zero; if it occurs in half of the documents, it will be about 0.3; and if it occurs in 20 of the 10,000 documents, it will be about 2.7. If a term occurs in just one document, then the IDF measure would be 4, the highest weight possible. Unfortunately, most pornographic words, given the distribution of porn on the Internet, are not rare.

Once you have extracted the words from the documents, you have to put the words somewhere. They usually are placed in an inverted file, which puts the words into a list with an indication of which documents they came from. The list is then sorted to get all the terms in alphabetical order, and duplicates are merged; if there are multiple entries for a particular document and term, then you increment the frequency for that item. This is the simplest form of an inverted file. Many search engines also keep track of where a word occurs in a document, to provide proximity information. They also keep track of many other things, such as how many links there are to the page that a word is on.
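Both ideas in this passage can be checked in a few lines: the sketch below builds the simplest form of an inverted file (term to per-document frequencies, with duplicates merged by incrementing) and then evaluates IDF as log10(N/df), the formula consistent with the figures quoted above for a 10,000-document collection. The toy documents are invented.

```python
import math

# Invented toy collection: document ID -> words extracted from it.
docs = {
    1: ["dog", "show", "dog"],
    2: ["cat", "show"],
    3: ["dog", "kennel"],
}

# Simplest form of an inverted file: term -> {document: frequency},
# built by merging duplicate (term, document) entries and incrementing.
inverted = {}
for doc_id, words in docs.items():
    for w in words:
        postings = inverted.setdefault(w, {})
        postings[doc_id] = postings.get(doc_id, 0) + 1

print(inverted["dog"])  # "dog" occurs twice in document 1, once in document 3

# IDF for a 10,000-document collection, idf = log10(N / df):
N = 10_000
for df in (10_000, 5_000, 20, 1):
    print(f"df={df}: idf={math.log10(N / df):.1f}")
```

Running the IDF loop reproduces the progression in the text: 0.0 for a term in every document, 0.3 for half, 2.7 for 20 documents, and 4.0 for a single document.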
Finally, you differentiate the file to make a unique list of every term that occurs in the entire database, with pointers that say in which documents each term occurred and how frequently. With that information, you can then calculate the magical-looking formulas that provide a ranking for a document.

4.4 SEARCHING THE WORLD WIDE WEB

Most Web search engines use versions of the vector space model and also offer some sort of Boolean ranking. Some search engines use probabilistic techniques as well. Others do little more than coordination-level matching, looking for documents that contain the highest number of the specified terms. Some use natural language processing (Lycos, for example, was based on some NLP work by Michael Mauldin). Excite's concept-based search may be a development of Latent Semantic Indexing (developed at Bell Labs). The Inktomi search engine formerly used a form of retrieval based on logistic regression.
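In the vector space family, those "magical-looking formulas" reduce to weighting each term by term frequency times IDF and ranking documents by the cosine of the angle between document and query vectors. A toy end-to-end ranker (documents and query are invented; log base 10 is assumed for IDF):

```python
import math

# Invented collection: document name -> extracted words.
docs = {
    "d1": ["dog", "kennel", "dog"],
    "d2": ["dog", "show"],
    "d3": ["cat", "show"],
}
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = {}
for words in docs.values():
    for t in set(words):
        df[t] = df.get(t, 0) + 1

def tfidf(words):
    """Weight each term in a document by tf * idf."""
    return {t: words.count(t) * math.log10(N / df[t]) for t in set(words)}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = ["dog", "kennel"]
qvec = {t: math.log10(N / df[t]) for t in query if t in df}
ranked = sorted(docs, key=lambda d: cosine(tfidf(docs[d]), qvec), reverse=True)
print(ranked)  # most similar document first
```

Note that d2 shares only one word with the query yet still receives a positive score, which is the partial-match behavior described in the previous section.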
Virtually all search engines use the bag-of-words model.2 Some use additional page weighting methods, looking not only at the frequency of a word in a document but also at other things, like the number of links to a page. Google uses in-links, for example: if no one links to your page, then you would get a lower rank than someone who had the same words but many in-links. Most search engines also include every string of characters on a page, even if they are total garbage. Therefore, in addition to comparing one word to another, you have to compare all of the numbers, which is difficult. Exact algorithms are not available for most commercial Web search engines.

Most search engines appear to be hybrids of ranked and Boolean searching. They allow you to do a best-match search, exemplified by the vector space model, and also very strict Boolean matching. But most users never click through to the "advanced search" page, which explains how to do all of these things; they usually just type in what they think would be an appropriate search. Most people looking at search logs would say, "That's ridiculous. How are they ever going to find anything?"

The search engine obtains this material by sending out a "spider" to retrieve the pages from Web sites. Spiders retrieve only static pages, not pages that are hiding in databases or are dynamically generated. Most crawlers also obey the robots.txt file on a Web site; if the file says, "Do not index this site," they do not index that site. They can store millions of words and hundreds of sites. There are different methods of crawling. In a depth-first crawl, you go down as deep as you can within any particular site before going on to the next site.
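Crawl order comes down to how the frontier of discovered links is managed. The sketch below implements both the depth-first order just described and the breadth-first alternative discussed next; the link graph and site names are invented, and a real crawler would fetch pages over HTTP, honor robots.txt, and bound its memory.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.example":    ["a.example/p1", "b.example"],
    "a.example/p1": ["a.example/p2"],
    "a.example/p2": [],
    "b.example":    ["b.example/p1"],
    "b.example/p1": [],
}

def crawl(start, breadth_first):
    """Visit every page reachable from start, skipping duplicates.

    The two strategies differ only in which end of the frontier is
    consumed: depth-first pops the newest discovery (a stack, exhausting
    one site before moving on), breadth-first the oldest (a FIFO queue,
    spreading hits across sites).
    """
    seen, order, frontier = {start}, [], deque([start])
    while frontier:
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for out in reversed(links.get(page, [])):  # reversed so a stack pops in page order
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order

print(crawl("a.example", breadth_first=False))  # finishes a.example before b.example
print(crawl("a.example", breadth_first=True))   # alternates across sites
```

The depth-first run hits every page of a.example in a row, which is exactly the behavior that annoys site operators; the breadth-first run interleaves the two sites.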
Another way is a breadth-first search, where you start across many different sites and work your way down slowly.3

2 David Forsyth observed that it might be logical to ask why people use the bag-of-words model, which they know to be bad. The answer is that it is very difficult to use anything else. Most reasonable people know about 60,000 words, and you need a lot of text to count how often each one appears. If you are modeling the probability of seeing a new word given an old word, there are 60,000 choices for the old word and 60,000 choices for the new word; the table would be 60,000 by 60,000, and it would be difficult to collect enough data to fill it. Ray Larson noted that 60,000 words is a very small vocabulary compared to the indexes used by search engines.

3 Nick Belkin noted that a crawler is limited by the size of its own memory; as soon as it finds as much as it can hold, it stops. Milo Medin observed that this is not an ideal approach. Rather, you want to rank order the types of things that you will either archive or not. If you cannot store all the useful things, then, rather than stop, a better approach is to go back and prune out some of the duplicate or irrelevant material. Ray Larson said finding duplicates is a big deal, because many things either have the same name or have different names but are on the same pages. For database storage and efficiency reasons, it is important to find those things.

Part of the reason
for this is that, if a spider comes to your Web site and hits you 50,000 times in a row to get every single page that you have, you will get upset. Instead, breadth-first spiders spread out the hits over time among a number of sites.

The main message here is that the pages have to be connected somehow to the starting points, or else you never will get them; that is, unless someone has sent you a pointer saying, "Here is the new starting point. Here's our site, please index it."4 Some people sell algorithms that ensure that a given page gets ranked higher than others. Search engine companies spend a lot of their time figuring out how to identify and counteract the "spammed" pages from those people. It is an "arms race."5

A paper published in Nature in 1999 estimated the types of material indexed, excluding commercial sites.6 Scientific and educational sites were the largest population. Health sites, personal sites, and the sites of societies (scholarly or other) are all larger than the percentage estimated for pornography.7 No search engine has 100 percent coverage, and they often cover quite different things. There can be overlap as well. There are also issues of numbers of links. If one site indexes something, then

4 Milo Medin said that some sites generate indexes by asking other search engines and indexing what they already have. He also said that catalog inventories do not show up in searches because the inventory is designed for a database query. The exception is when a site has created an index page with a set of stored queries.

5 Winnie Wechsler said that there seems to be a fundamental tension between search engines striving to provide the greatest accuracy to users in terms of retrieval or filtering and Web publishers trying to trick or mislead the search engines to make sure their sites are listed as much and as high in rank as possible.
How does this tension resolve itself? It does not seem resolvable, certainly in the case of pornography. Nick Belkin said one approach is to use more words in a query to make the conditions more restrictive. A query with 10 words will get a much better result than one with only 2 words, because it defines much more context. The difficulty is that, even though the average number of words per query on the Web has been going up, it is still only about 2.3 words, up from 1.7 words a few years ago. With very simple search engine technology, it may help to encourage people to use more words in their queries.

6 Steve Lawrence and C. Lee Giles, "Accessibility and Distribution of Information on the Web," Nature 400(6740): 107-109, July 8, 1999.

7 Milo Medin said that the declining cost of Web serving, generally a good thing, has made it easier for amateur pornographers to get published. Medin's service offers free Web hosting for a certain amount of material. Subscribers are not allowed to post pornography or objectionable material, but there is no cost or punishment if they do, so they take advantage of this situation. The company audits sites based on the amount of traffic to them. When a site attracts a certain amount of traffic, it triggers a red flag and generates a query to the people in charge of investigating abuse. Medin recalled that, when he worked for NASA, data on international links had to be controlled. When someone put up a porn site, the link utilization to that region would rise. A wiretap would reveal where the traffic was going.
another site will index it. Things that are unique tend to stay unique within a particular search engine.8

In looking for images, text retrieval technology looks for text that is associated with images. It looks for an image link tag within the HTML and the sentences that surround it on either side. This can be highly deceptive: the words "Oh, look at the cute bunnies" mean one thing on a children's Web site and something entirely different on Playboy's site. Thus, the words alone may not indicate what those images are about.

8 Milo Medin emphasized the business dynamic, noting that creating the search capability to find an obscure Web page may not be worth the cost in terms of its impact on the subscriber base. Say a search engine fails to find 5 percent of the material on the Internet. To some people whose content is in that 5 percent, this is important. But if the cost of finding that 5 percent is double the cost of finding the other 95 percent, and the bulk of searchers are satisfied with that performance, it may not be worth it. Search engines are not librarians; they exist for a business purpose.
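The surrounding-text approach to image search can be sketched crudely with a regular expression; a real engine parses the HTML properly, but the idea is the same: index the alt text on the image tag and the words adjacent to it. The sample page and the helper name are invented.

```python
import re

# Invented sample page: text near an image, then the image tag itself.
html = '<p>Our new arrivals. Oh, look at the cute bunnies!</p>\n<img src="photo1.jpg" alt="two rabbits in a basket">'

def image_context(page):
    """Collect, for each <img> tag, its alt text and the words just before it."""
    results = []
    for m in re.finditer(r'<img[^>]*alt="([^"]*)"[^>]*>', page):
        before = re.sub(r"<[^>]+>", " ", page[:m.start()])  # strip markup from preceding text
        nearby = " ".join(before.split()[-8:])               # keep the last few words
        results.append({"alt": m.group(1), "nearby": nearby})
    return results

print(image_context(html))
```

As the passage notes, the extracted words describe the page's framing of the image, not the image itself, so identical phrases can label very different pictures.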