
4 The Technology of Search Engines
Pages 16-22

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 16...
... I will provide a broad overview of how search technology works in current engines, based on the old standard models of information retrieval. Two players are involved: the information system and the people who want the information stored in the system.
From page 17...
... If the usual suspect words were placed on the list of stop words, then suddenly the American Kennel Club Web site no longer would be accessible, because of all of the words that refer to the gender of female dogs, and so on. Rarely, the search engine also may apply natural language processing (NLP)
From page 18...
... Therefore, in most vector space models, you do not need to match all the words. As long as you match ... Nick Belkin said that similarity in text documents is relatively easy to compute, assuming constant meaning of words, whereas similarity of images is very difficult to compute.
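The point that a vector space model does not require every query word to match can be illustrated with a minimal sketch. It assumes raw term counts as vector weights and cosine similarity as the score; the excerpt does not say which weighting or similarity measure any particular engine uses. The names `cosine_similarity`, `rank`, and `SAMPLE_DOCS` are hypothetical, and the sample documents are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine_similarity(query, document):
    """Cosine of the angle between the term-count vectors of query and document."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(document))
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query, docs):
    """Return names of documents with a nonzero score, best match first."""
    scored = [(cosine_similarity(query, text), name) for name, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


# Invented sample collection (an assumption, not from the source).
SAMPLE_DOCS = {
    "a": "dog breed registry",
    "b": "dog show results",
    "c": "cat care guide",
}

if __name__ == "__main__":
    # Neither "a" nor "b" contains all three query words, yet both are returned.
    print(rank("dog breed standards", SAMPLE_DOCS))
```

In this sketch a document sharing two of three query words outranks one sharing a single word, and a document sharing none is dropped. Like the similarity discussed above, it treats each word as having a constant meaning.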
From page 19...
... 4.4 SEARCHING THE WORLD WIDE WEB
Most Web search engines use versions of the vector space model and also offer some sort of Boolean ranking. Some search engines use probabilistic techniques as well.
From page 20...
... Most search engines appear to be hybrids of rank and Boolean searching. They allow you to do a guess-match symbolized by the vector space model and also very strict Boolean matching.
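One way to picture such a hybrid is a strict Boolean filter followed by a ranked, partial match over the documents that survive. The sketch below rests on that assumption; the excerpt does not describe how any actual engine combines the two. The function `hybrid_search`, its parameters, and the sample collection are hypothetical.

```python
def tokens(text):
    """Return the set of lowercase whitespace-separated words in the text."""
    return set(text.lower().split())


def hybrid_search(required, optional, docs):
    """Boolean AND over `required`, then rank survivors by matched `optional` terms.

    A document missing any required term is excluded outright (strict matching).
    Among the rest, more optional terms matched means a higher rank, and a
    document matching none of them is still returned, just last.
    """
    required = {term.lower() for term in required}
    optional = {term.lower() for term in optional}
    results = []
    for name, text in docs.items():
        words = tokens(text)
        if not required <= words:  # strict Boolean filter
            continue
        score = len(optional & words)  # loose, ranked part
        results.append((score, name))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in results]


# Invented sample collection (an assumption, not from the source).
HYBRID_DOCS = {
    "a": "search engine ranking methods",
    "b": "search engine spam",
    "c": "ranking methods for images",
}

if __name__ == "__main__":
    print(hybrid_search(["search", "engine"], ["ranking", "spam", "methods"], HYBRID_DOCS))
```

With both required terms supplied, document "c" is excluded even though it matches two optional terms. With no required terms, the same call degrades to a pure ranked match over the whole collection.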
From page 21...
... Search engine companies spend a lot of their time figuring out how to identify and counteract the "spammed" pages from those people. It is an "arms race."5 A paper published in Nature in 1999 estimated the types of material indexed, excluding commercial sites.6 Scientific and educational sites were the largest population.
From page 22...
... Milo Medin emphasized the business dynamic, noting that creating the search capability to find an obscure Web page may not be worth the cost in terms of its impact on the subscriber base. Say a search engine fails to find 5 percent of the material on the Internet.

