4
The Technology of Search Engines
Ray Larson
4.1 OVERVIEW
Most search engine companies do not want to reveal what their technology is or does, because they consider that to be a trade secret. Every company claims to do retrieval better than every other company, and they do not want to lose their competitive edge. I will provide a broad overview of how search technology works in current engines, based on the old standard models of information retrieval.
Two players are involved: the information system and the people who want the information stored in the system. The searchers go through a process of formulating a query, that is, describing what they seek in ways that the system can process. The same sort of thing happens on the other end, where the system has to extract information from the documents included in its database. Those documents need to be described in such a way that someone posing a query can find them.
In general, the emphasis in the design and development of search engines has been to make the document-finding process as effective as possible. Today, however, the goal in some quarters seems to be to exclude some searchers from some material. The idea is to prevent some people from getting things that we think they should not get. This is anathema to someone from a library background, where we tend to think that everyone should have access to everything and that it is up to Mom and Dad to say no.
In between the information system and the searcher are the search engine’s processing functions (the “rules of the game”)—how the languages are structured, all the information that can be acquired from the
documents that come in, and how that gets mapped to what a searcher wants. The usual outcome is a set of potentially relevant documents. The searcher does not know whether a retrieved document is really relevant until he or she looks at it and says, “Yes, that is what I wanted.”
Much of what happens in search engines, which generally use the “bag-of-words” model for handling data, is structure recognition. Search engines often treat titles differently than they do the body of a Web page; titles indicate the topic of a page. If the system can extract structure from documents, it often can be used as an indicator for additionally weighting the retrieval process.
Often the search engine normalizes the text, stripping out capitalization and most other orthographic differences among words. Some systems do not throw this information away automatically but rather attempt to identify things such as sequences of capitalized words, which may indicate a place or a person’s name. The search engine then usually removes stop words, a list of words that it chooses not to index. This would be a likely place to put a filter. But this can become problematic because, under a bag-of-words model, a problematic occurrence of a word cannot be distinguished from nonproblematic occurrences of the same word. If the usual suspect words were placed on the list of stop words, then suddenly the American Kennel Club Web site no longer would be accessible, because of all of the words that refer to the gender of female dogs, and so on. Rarely, the search engine also may apply natural language processing (NLP) to identify known phrases or chunks of text that properly belong together and indicate certain types of content.
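As a rough illustration, the normalization and stop-word steps described above might look like the following sketch. The stop-word list and the tokenizing rule here are hypothetical; real engines use far larger lists and more careful tokenizers.

```python
import re

# A hypothetical stop-word list; real engines use much longer ones.
STOP_WORDS = {"the", "and", "of", "a", "to", "in"}

def normalize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text):
    """Return the normalized tokens with stop words removed."""
    return [t for t in normalize(text) if t not in STOP_WORDS]

print(index_terms("The Technology of Search Engines"))
# ['technology', 'search', 'engines']
```

Note that once capitalization is stripped, the information that “The” began a title is gone; a system that wants to recognize proper names must capture that before this step.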
4.2 BOOLEAN SEARCH LOGIC
What is left is a collection of words that need to be retrieved in some way. There are many models for doing this. The simplest and most widely available—used in virtually every search engine and the initial commercial search model—is the Boolean operator model. Simple Boolean logic says either “this word and that word occur,” or “this word or that word occur,” and, therefore, the documents that have those words should be retrieved. Boolean logic is simple and easy to implement. Almost all search engines today, because of the volume of data on the Internet, include an automatic default setting that, in effect, uses the AND operator with all terms provided to the search engine. If the searcher turns this function off, then the search engine usually defaults to a ranking algorithm that attempts to do a “best match” for the query.
All of these combinations can be characterized in a simple logic model that says that this word either occurs in the document or that it does not. If it does occur, you have certain matches; if not, you have other matches.
Any combination of three words, for example, can be specified, such that the document has this word and not the other two, or all three together, or one and not the other of two. You can specify any combination of the words. But if you do not specify the word exactly as it is stored in the index, then you will not get it. It cannot be a synonym (unless you supply that synonym), or an alternative phrasing, or a euphemism.
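The Boolean combinations described above can be sketched as set operations over a toy index. The terms and document numbers here are invented for illustration; a real engine operates over an inverted file with millions of entries.

```python
# Hypothetical mapping from term to the set of documents containing it.
index = {
    "dog":   {1, 2, 4},
    "cat":   {2, 3},
    "horse": {1, 3, 4},
}

# Boolean operators as set operations over posting sets.
def AND(a, b):
    return index.get(a, set()) & index.get(b, set())

def OR(a, b):
    return index.get(a, set()) | index.get(b, set())

def NOT(a, universe):
    return universe - index.get(a, set())

universe = {1, 2, 3, 4}
print(AND("dog", "cat"))                      # {2}
print(index["dog"] & NOT("horse", universe))  # "dog AND NOT horse" -> {2}
```

A query for a synonym the index does not contain, such as “puppy,” simply matches nothing, which is the exact-match limitation noted above.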
4.3 THE VECTOR SPACE MODEL
Another approach is the vector space model, which more than 30 years of intensive research have refined into a finely honed set of tools. Probabilistic models are also being used much more commonly these days. Many other models combine many of the same aspects, including attempts to automatically recognize structures of information within documents that would indicate relevance. Alternatively, one could look at all of the documents in a collection and consider each individual word that occurs in any of those documents. But most large collections have tens of thousands of words, even hundreds of thousands. A large proportion of those words are nonsense, misspellings, or other problems that occur once or twice, whereas other words occur often (e.g., the, and, of).
The vector space model attempts to consider each term that occurs in a document as if it were a dimension in Euclidean space. (This is why we use three terms as an example; if there are more than three dimensions, it becomes difficult for people to think about.) In a vector space model, each document has a vector that points in a certain direction, depending on whether it contains a term or not. The documents are differentiated on this basis. This example shows a system where there is a simple yes/no process; a document either has the term or does not have it. You also can consider each term as having a particular weight, which can be measured in a variety of ways, such as how frequently the word occurs in a particular document.
In this model, you are calculating the cosine of the angle between two vectors in imaginary space. The smaller the angle between the vectors, the more similar the document is to the query. You can rank documents based on that closeness or similarity.1 Therefore, in most vector space models, you do not need to match all the words. As long as you match
many or even some of the words, you will get closer to a particular document that has those words in it.
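The cosine ranking just described can be sketched in a few lines. The three-term weights below are invented for illustration; a real engine uses TF-IDF weights over vectors with thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# A three-term space, as in the example above; weights are hypothetical.
query = [1, 1, 0]
doc_a = [2, 1, 0]   # shares both query terms
doc_b = [0, 1, 3]   # shares only one query term

# doc_a's vector points in nearly the same direction as the query,
# so it ranks higher, even though neither is an exact match.
print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```

This is the sense in which a partial match still scores: the more query terms a document shares (and the heavier their weights), the smaller the angle and the higher the rank.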
This model uses “term frequency/inverse document frequency” (TF-IDF), a measure of the frequency of occurrence of a particular term in a particular document, as well as how often that term occurs in the entire collection of interest. If a term occurs frequently in one document but also occurs frequently in every other document in the collection, then it is not a very important word, and the TF-IDF measure reduces the weight placed on it. A common term is considered less important than rare terms. If a term occurs in every document, then the inverse document frequency is zero; if it occurs in half of the documents, it will be 0.3; and if it occurs in 20 of 10,000 documents, it will be about 2.7. If a term occurs in just one document, then the IDF measure would be 4—the highest weight possible. Unfortunately, most pornographic words, given the distribution of porn on the Internet, are not rare.
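The IDF figures quoted above are consistent with taking the base-10 logarithm of the collection size divided by the number of documents containing the term. A sketch, assuming a collection of 10,000 documents:

```python
import math

N = 10_000  # documents in the collection, matching the example in the text

def idf(df):
    """Inverse document frequency: log10 of collection size over document frequency."""
    return math.log10(N / df)

print(round(idf(10_000), 1))  # 0.0  (term occurs in every document)
print(round(idf(5_000), 1))   # 0.3  (term occurs in half the documents)
print(round(idf(20), 1))      # 2.7  (term occurs in 20 of 10,000 documents)
print(round(idf(1), 1))       # 4.0  (term occurs in a single document)
```

A term's final weight in a document is then its within-document frequency multiplied by this IDF value, so common terms are discounted and rare terms are boosted.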
Once you have extracted the words from the documents, you have to put the words somewhere. They usually are placed in an inverted file, which puts the words into a list with an indication of which documents they came from. Then the list is sorted to get all the terms in alphabetical order, and duplicates are merged; if there are multiple entries for a particular document or term, then you increment the frequency for that item. This is the simplest form of an inverted file. Many search engines also keep track of where a word occurs in a document, to provide proximity information. They also keep track of many other things, such as how many links there are to the page that a word is on.
Finally, you collapse the file into a unique list of every term that occurs in the entire database, with pointers that say in which documents each term occurred and how frequently. With that information, you can then calculate the magical-looking formulas that provide a ranking for a document.
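A minimal version of the inverted-file construction described above, using an invented two-document collection:

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text (already normalized).
docs = {
    1: "cute bunnies hop",
    2: "bunnies eat carrots",
}

# Build the inverted file: term -> {document id: frequency in that document}.
inverted = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        inverted[term][doc_id] += 1
        # A real engine would also record `position` here, to support
        # proximity searching, and might store link counts for the page.

print(dict(inverted["bunnies"]))  # {1: 1, 2: 1}
```

From this structure, the document frequency of a term is just the length of its posting list, which is exactly what the IDF calculation needs.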
4.4 SEARCHING THE WORLD WIDE WEB
Most Web search engines use versions of the vector space model and also offer some sort of Boolean ranking. Some search engines use probabilistic techniques as well. Others do little more than a coordination-level matching, looking for documents that have the highest number of specified terms. Some use natural language processing (Lycos, for example, was based on some NLP work by Michael Mauldin). Excite’s concept-based search may be a development of Latent Semantic Indexing (developed at Bellcore). The Inktomi search engine formerly used a form of retrieval based on logistic regression.
Virtually all search engines use the bag-of-words model.2 Some use additional page weight methods, looking not only at frequency of a word in a document, but also at other things like the number of links to a page. Google uses in-links, for example. If no one links to your page, then you would get a lower rank than someone who had the same words but many in-links. Most search engines also include every string of characters on a page, even if they are total garbage. Therefore, in addition to comparing one word to another, you have to compare all of the numbers, which is difficult.
Exact algorithms are not available for most commercial Web search engines. Most search engines appear to be hybrids of ranked and Boolean searching. They allow you to do a best match, as exemplified by the vector space model, and also very strict Boolean matching. But most users never click to the “advanced search” page, which explains how to do all of these things; they usually just type in what they think would be an appropriate search. Most people looking at search logs would say, “That’s ridiculous. How are they ever going to find anything?”
The search engine obtains this material by sending out a “spider” to retrieve the pages from Web sites. Spiders retrieve only static pages, not pages that are hidden in databases or are dynamically generated. Most crawlers also obey the robots.txt file on a Web site; if the file says, “Do not index this site,” they do not index that site. They can store millions of words and hundreds of sites.
There are different methods of crawling. In a depth-first crawl, you go down as deep as you can within any particular site before going on to the next site. Another way is a breadth-first search, where you start across many different sites and work your way down slowly.3 Part of the reason
for this is, if a spider comes to your Web site and hits you 50,000 times in a row to get every single page that you have, you will get upset. Instead, breadth-first spiders spread out the hits over time among a number of sites. The main message here is that the pages have to be connected somehow to the starting points or else you never will get them—that is, unless someone has sent you a pointer saying, “Here is the new starting point. Here’s our site, please index it.”4 Some people sell algorithms that ensure that a given page gets ranked higher than others. Search engine companies spend a lot of their time figuring out how to identify and counteract the “spammed” pages from those people. It is an “arms race.”5
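A breadth-first crawl over a hypothetical link graph might be sketched as follows. Note that the “island” page, which nothing links to, is never reached, which illustrates why pages must be connected to a starting point (or submitted directly) to be indexed.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
links = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
    "island": ["start"],  # links out, but nothing links TO it
}

def breadth_first_crawl(seed):
    """Visit pages level by level rather than plunging deep into one site."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(breadth_first_crawl("start"))  # ['start', 'a', 'b', 'c', 'd', 'e']
```

A depth-first crawl would instead exhaust one branch before backtracking; in practice, breadth-first order also helps spread requests across many sites rather than hammering one server.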
A paper published in Nature in 1999 estimated the types of material indexed, excluding commercial sites.6 Scientific and educational sites were the largest population. Health sites, personal sites, and the sites for societies (scholarly or other) are all larger than the percentage estimated for pornography.7 No search engine has 100 percent coverage, and they often cover quite different things. There can be overlap, as well. There are also issues of numbers of links. If one search engine indexes something, then another search engine often will index it, too. Things that are unique tend to stay unique within a particular search engine.8
In looking for images, text retrieval technology looks for text that is associated with images. It looks for an image link tag within the HTML and the sentences that surround it on either side. This can be highly deceptive. The words “Oh, look at the cute bunnies” mean one thing on a children’s Web site and something entirely different on Playboy’s site. Thus, the words alone may not indicate what those images are about.
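A crude sketch of pulling the text associated with an image, using the alt attribute and the sentence just before the image tag. The HTML fragment and the regular expressions here are invented for illustration; a real engine parses the full document tree rather than pattern-matching raw markup.

```python
import re

# Hypothetical page fragment; a real engine parses the whole HTML tree.
html = '<p>Oh, look at the cute bunnies.</p><img src="pic.jpg" alt="two rabbits">'

# Collect the alt text and the sentence immediately preceding the image tag.
alt = re.search(r'<img[^>]*alt="([^"]*)"', html)
before = re.search(r'([^<>]+)</p>\s*<img', html)

print(alt.group(1))     # two rabbits
print(before.group(1))  # Oh, look at the cute bunnies.
```

Both strings would be indexed as evidence of what the image depicts, and, as the example in the text shows, that evidence can be highly misleading depending on the site.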