Signposts in Cyberspace: The Domain Name System and Internet Navigation

6 Internet Navigation: Emergence and Evolution

As the previous chapters show, the Domain Name System has been a foundation for the rapid development of the Internet. Domain names appear on the signposts designating origins and destinations linked by the Internet and in the addresses used by the principal applications traversing the Internet—e-mail and the World Wide Web. And they have been useful for navigating across the Internet: given a domain name, many Web browsers will automatically expand it into the Uniform Resource Locator (URL) of a Web site; from a domain name, many e-mail users can guess the e-mail address of an addressee. For these reasons, memorable domain names may often acquire high value. Their registrants believe that searchers can more readily find and navigate to their offerings.

However, as the Internet developed in size, scope, and complexity, the Domain Name System (DNS) was unable to satisfy many Internet users' needs for navigational assistance. How, for example, can a single Web page be found from among billions when only its subject and not the domain name in its URL is known? To meet such needs, a number of new types of aids and services for Internet navigation were developed.1 While, in the end, these generally rely on the Domain Name System to find specific Internet Protocol (IP) addresses, they greatly expand the range of ways in which searchers can identify the Internet location of the resource they seek.

1 The difference between a navigation aid and a navigation service is one of degree. The offerings of a navigation service, such as a search engine provider, are more elaborate and extensive than those of a navigation aid, such as the bookmark feature of a Web browser.
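The two conveniences just described (expanding a bare domain name into a URL, and relying on the DNS to find an IP address) can be sketched in a few lines of Python. This is a minimal illustration using the operating system's stock resolver; the function names are assumptions of this sketch, not a real browser API.

```python
# Sketch of browser-style name handling: URL expansion plus DNS resolution.
# Illustrative only; function names are not from any real browser codebase.
import socket

def expand_to_url(domain: str) -> str:
    """Mimic a browser's expansion of a bare name like 'example.com' into a URL."""
    if "://" not in domain:
        return "http://" + domain + "/"
    return domain

def resolve(domain: str) -> list:
    """Use the DNS (via the OS resolver) to find IP addresses for a name.

    Requires network access when actually called.
    """
    infos = socket.getaddrinfo(domain, 80, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})

print(expand_to_url("www.example.com"))  # prints http://www.example.com/
```

Navigation services layer on top of exactly this machinery: however a resource is found, the final step is still a DNS lookup of some domain name.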
These navigational aids and services have, in return, relieved some of the pressure on the Domain Name System to serve as a de facto directory of the Internet and have somewhat reduced the importance of the registration of memorable domain names. Because of these tight linkages between the DNS and Internet navigation, this chapter and the next ones address—at a high level—the development of the major types of Internet navigational aids and services. This chapter is concerned with their past development. The next chapter deals with their current state. And the final chapter on Internet navigation, Chapter 8, considers the technological prospects and the institutional issues facing them.

After describing the distinctive nature of Internet navigation, this chapter traces the evolution of a variety of aids and services for Internet navigation. While its primary focus is on navigating the World Wide Web, it does not cover techniques for navigation within Web sites, which is the subject of specialized attention by Web site designers, operators, and researchers.2

6.1 THE NATURE OF INTERNET NAVIGATION

Navigation across the Internet is sometimes compared to the well-studied problem of readers navigating through collections of printed material and other physical artifacts in search of specific documents or specific artifacts. (See the Addendum to this chapter: "Searching the Web Versus Searching Libraries.") That comparison illustrates the differences in the technical and institutional contexts for Internet navigation. Internet navigation for some purposes is similar to searches in library environments and relies on the same tools, whereas navigation for other purposes may be performed quite differently via the Internet.
The multiple purposes and diverse characteristics listed below combine to make navigating to a resource across the Internet a much more varied and complex activity than those previously encountered. The library examples provide a point of reference and a point of departure for discussion in subsequent chapters.

6.1.1 Vast and Varied Resources for Multiple Purposes

First, the Internet connects its users to a vast collection of heterogeneous resources that are used for many purposes, including the dissemination of information; the marketing of products and services; communication with others; and the delivery of art, entertainment, and a wide range of commercial and public services. The kinds of resources connected to the Internet include:

2 See, for example, Merlyn Holmes, Web Usability & Navigation: A Beginner's Guide, McGraw-Hill/Osborne, Berkeley, Calif., 2002; and Louis Rosenfeld and Peter Morville, Information Architecture for the World Wide Web: Designing Large Scale Sites, 2nd edition, O'Reilly & Associates, Sebastopol, Calif., 2002.
- Documents that differ in language (human and programming), vocabulary (including words, product numbers, zip codes, latitudes and longitudes, links, symbols, and images), formats (such as the Hypertext Markup Language (HTML), Portable Document Format (PDF), or Joint Photographic Experts Group (JPEG) format), character sets, and source (human or machine generated).
- Non-textual information, such as audio and video files, and interactive games. The volume of online content (in terms of the number of bytes) in image, sound, and video formats is much greater than that of most library collections and is expanding rapidly.
- Transaction services, such as sales of products or services, auctions, tax return preparation, matchmaking, and travel reservations.
- Dynamic information, such as weather forecasts, stock market information, and news, which can be constantly changing to incorporate the latest developments.
- Scientific data generated by instruments such as sensor networks and satellites, which are contributing to a "data deluge."3 Many of these data are stored in repositories on the Internet and are available for research and educational purposes.
- Custom information constructed from data in a database (such as product descriptions and pricing) in response to a specific query (e.g., price comparisons of a product listed for sale on multiple Web sites).

Consequently, aids or services that support Internet navigation face the daunting problem of finding and assigning descriptive terms to each of these types of resource so that it can be reliably located. Searchers face the complementary problem of selecting the aids or services that will best enable them to locate the information, entertainment, communication link, or service that they are seeking.
6.1.2 Two-sided Process

Second, Internet navigation is two-sided: it must serve the needs both of the searchers who want to reach resources and of the providers that want their resources to be found by potential users.

3 See Tony Hey and Anne Trefethen, "The Data Deluge: An e-Science Perspective," in Grid Computing: Making the Global Infrastructure a Reality, Fran Berman, Geoffrey Fox, and Anthony J.G. Hey, editors, Wiley, 2003.

From the searcher's perspective, navigating the Internet resembles to some extent the use of the information retrieval systems that were developed
over the last several decades within the library and information science4 and computer science communities.5 However, library-oriented retrieval systems, reflecting the well-developed norms of librarians, were designed to describe and organize information so that users could readily find exactly what they were looking for. In many cases, the same people and organizations were responsible both for the design of the retrieval systems and for the processes of indexing, abstracting, and cataloging the information to be retrieved. In this information services world, the provider's goal was to make description and search as neutral as possible, so that every document relevant to a topic would have an equal chance of being retrieved.6

While this goal of retrieval neutrality has carried over to some Internet navigation services and resource providers, it is by no means universal. Indeed, from the perspective of many resource providers, particularly commercial providers, attracting users via the Internet requires applying to Internet navigation the non-neutral marketing approaches of advertising and public relations as developed for newspapers, magazines, radio, television, and yellow-pages directories.7 Research on neutral, community-based technology for describing Internet resources is an active area in information and computer science and is a key element of the Semantic Web (see Box 7.1).8

4 For an overview, see Elaine Svenonius, The Intellectual Foundation of Information Organization, MIT Press, Cambridge, Mass., 2000; and Christine L. Borgman, From Gutenberg to the Global Information Infrastructure: Access to Information in the Networked World, MIT Press, Cambridge, Mass., 2000.
5 For an overview, see Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Boston, 1999; and Karen Sparck Jones and Peter Willett, editors, Readings in Information Retrieval, Morgan Kaufmann, San Francisco, 1997. For typical examples of early work on information retrieval systems, see George Schecter, editor, Information Retrieval—A Critical View, Thompson Book Company, Washington, D.C., 1967. For work on retrieval from large databases, see the proceedings of the annual Text REtrieval Conference (TREC), currently sponsored by the National Institute of Standards and Technology and the Advanced Research and Development Activity, available at <http://trec.nist.gov>.

6 See Svenonius, The Intellectual Foundation of Information Organization, 2000.

7 See, for example, John Caples and Fred E. Hahn, Tested Advertising Methods, 5th edition, Prentice-Hall, New York, 1998.

8 E. Bradley, N. Collins, and W.P. Kegelmeyer, "Feature Characterization in Scientific Datasets," pp. 1-12 in Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (Lecture Notes in Computer Science, Vol. 2189), Springer-Verlag, 2001; V. Brilhante, "Using Formal Metadata Descriptions for Automated Ecological Modeling," pp. 90-95 in Environmental Decision Support Systems and Artificial Intelligence, AAAI Press, Menlo Park, Calif., 1999; E. Hovy, "Using an Ontology to Simplify Data Access," Communications of the ACM 46(1):47-49, 2003; OWL Web Ontology Language Guide, "W3C Recommendation (10 February 2004)," November 24, 2004, available at <http://www.w3.org/TR/owl-guide>; and P. Wariyapola, S.L. Abrams, A.R. Robinson, K. Streitlien, N.M. Patrikalakis, P. Elisseeff, and H. Schmidt, "Ontology and Metadata Creation for the Poseidon Distributed Coastal Zone Management System," Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, IEEE Computer Society, Los Alamitos, Calif., 1999, pp. 180-189.
For commercial providers, therefore, the challenge is how to identify and reach—in the complex, diverse, and global audience accessible via the Internet—potential users who are likely to be interested (or can be made interested) in the provider's materials. That is done in traditional marketing and public relations through the identification of media and places (television or radio programs, magazines, newspapers in specific locations) that an audience with the desired common characteristics (for example, 18- to 24-year-old males) frequents. Similar approaches can be applied on the Internet (see Section 7.2.2), but unlike the traditional media, the Internet also offers providers the distinctive and extremely valuable opportunity to capture their specific audience during the navigation process itself, just when they are searching for what the provider offers—for example, by paying to be listed or featured in a navigation service's response to specific words or phrases. (See "Monetized Search" in Section 7.1.7.)

Marketers have found ways to use the specific characteristics of the Internet,9 just as they have developed methods appropriate for each new medium.10 This has led, for example, to the establishment of companies that are devoted to finding ways to manipulate Internet navigation services to increase the ranking of a client's Web site and, in response, to the development of countermeasures by the services. (See "Search Engine Marketing and Optimization" in Section 7.1.7.)

For non-commercial resources, the situation is somewhat different, since the providers generally have fewer resources and may have less incentive to actively seek users, at least to the extent of paying for Web advertising or search engine marketing. At the same time, the existence of a specific non-commercial resource may be well known to the community of its potential users.
For example, the members of a scholarly community or a non-profit organization are likely to be aware of the Internet resources relevant to their concerns. Those new to a community or outside it are dependent on Internet navigation tools to locate these resources.

9 See, for example, Joe Cappo, The Future of Advertising: New Media, New Clients, New Consumers in the Post-Television Age, McGraw-Hill, New York, 2003; and Barbara Cox and William Koelzer, Internet Marketing, Prentice-Hall, New York, 2004.

10 It should be noted that the Internet, by making the marginal cost of an e-mail message extremely low, has also enabled providers to conduct a non-discriminating search for potential users by broadcasting spam e-mail. Although this might be considered a variant of Internet navigation, in which the provider actively advertises its location to a vast audience whose members may or may not be interested, its one-sided benefits and frequent use for dishonest or illegal purposes disqualify it from inclusion in this report, which focuses on searcher-beneficial navigation aids and services.

Internet navigation is a complex interplay of the interests of both the searchers for and the providers of resources. On the Internet, the
librarian's ideal of neutral information retrieval often confronts the reality of self-interested marketing.

6.1.3 Complexity and Diversity of Uses, Users, and Providers

Third, the complexity and diversity of the uses of resources on the Internet, of their users, and of their providers significantly complicate Internet navigation. It becomes a multidimensional activity that incorporates behaviors ranging from random browsing to highly organized searching and from discovering a new resource to accessing a previously located resource.11

Studies in information science show that navigation in an information system is simplest and most effective when the content is homogeneous, the purposes of searching are consistent and clearly defined, and the searchers have common purposes and similar levels of skill.12 Yet Internet resources often represent the opposite case in all of these respects. Their content is often highly heterogeneous; their diverse users' purposes are often greatly varied; the resources the users are seeking are often poorly described; and the users often have widely varying degrees of skill and knowledge. Thus, as the resources accessible via the Internet expand in quantity and diversity of content, number and diversity of users, and variety of applications, the challenges facing Internet navigation become even more complex.

Indeed, prior to the use of the Internet as a means to access information, many collections of information resources, whether in a library or an online information system, were accessed by a more homogeneous collection of users. It was generally known when compiling the collection whether the content should be organized for specialists or lay people, and whether skill in the use of the resource could be assumed.
Thus, navigation aids, such as indexes or catalogs, were readily optimized for their specific content and for the goals of the people searching them. Health information in a database intended for searching by physicians could be indexed or cataloged using specific and highly detailed terminology that assumed expert knowledge. Similarly, databases of case law and statute law assumed a significant amount of knowledge of the law. In fields such as medicine and law, learning the navigation tools and the vocabularies of the field is an essential part of professional education.

11 See Shan-ju Chang and Ronald E. Rice, "Browsing: A Multidimensional Framework," Annual Review of Information Science and Technology 28:231-276, 1993.

12 See Borgman, From Gutenberg to the Global Information Infrastructure, 2000.

Many such databases are now accessible by specialists via the Internet and continue to assume a skillful and knowledgeable set of users, even though in some
cases they are also accessible by the general public. With the growth in Internet use, however, many more non-specialist users have ready access to the Web and are using it to seek medical, legal, or other specialized information. The user community for such resources is no longer well defined. Few assumptions of purpose, skill level, or prior knowledge can be made about the users of a Web information resource. Consequently, it is general-purpose navigation aids and services and less specialized (and possibly lower-quality) information resources that must serve their needs.

6.1.4 Lack of Human Intermediaries

Fourth, the human intermediaries who traditionally linked searchers with specific bodies of knowledge or services—such as librarians, travel agents, and real estate agents—are often not available to users as they seek information on the Internet. Instead, users generally navigate to the places they seek and assess what they find on their own, relying on the aid of digital intermediaries—the Internet's general navigation aids and services, as well as the specialized sites for shopping, travel, job hunting, and so on. Human intermediaries' insights and assistance are generally absent during the navigation process.

Human search intermediaries help by selecting, collecting, organizing, conserving, and prioritizing information resources so that they are available for access.13 They combine their knowledge of a subject area and of information-seeking behavior with their skills in searching databases to assist people in articulating their needs. For example, travelers have relied on travel agents to find them the best prices, best routes, and best hotels, and to provide services such as negotiating with hotels, airlines, and tour companies when things go wrong.
Intermediaries often ask their clients about the purposes for which they want information (e.g., what kind of trip the seekers desire and how they expect to spend their time; what they value in a home or neighborhood; or what research questions their term paper is trying to address), and elicit additional details concerning the problem. These intermediaries also may help in evaluating content retrieved from databases and other sources by offering counsel on what to trust, what is current, and what is important to consider in the content retrieved.

13 See Chapter 7, "Whither, or Wither, Libraries?," in Borgman, From Gutenberg to the Global Information Infrastructure, 2000.

With the growth of the Internet and the World Wide Web, a profound change in the nature of professional control over information is taking place. Travel agents and real estate agents previously maintained tight control over access to fares and schedules and to listings of homes for
sale. Until recently, many of these resources were considered proprietary, especially in travel and real estate, and consumers were denied direct access to their content. Information seekers had little choice but to delegate their searches to an expert—a medical professional, librarian, paralegal, records analyst, travel agent, real estate agent, and so on. Today, travel reservation and real estate information services are posting their information on the Internet and actively seeking users. Specialized travel sites, such as Expedia.com and Travelocity.com, help the user to search through and evaluate travel options. Similar sites serve the real estate market. Travel agents that remain in business must get their revenue from other value-added services, such as planning customized itineraries and tours and negotiating with brokers. Although house hunters now can do most of their shopping online, in most jurisdictions they still need real estate agents with access to house keys to show them properties and to guide them in executing the legal transactions.

Libraries have responded to the Web by providing "virtual reference services" in addition to traditional on-site reference services.14 Other entities, including the U.S. Department of Education, have supported the creation of non-library-based reference services that use the Internet to connect users with "people who can answer questions and support the development of skills" without going through a library intermediary.15

6.1.5 Democratization of Information Access and Provision

Fifth, the Internet has hugely democratized and extended both the offering of and access to information and services. Barriers to entry, whether cost or credentials, have been substantially reduced. Anyone with modest skill and not much money can provide almost anything online, and anyone can access it from almost anywhere.
Rarely are credentials required for gaining access to content or services that are connected to the public Internet, although paid or free registration may be necessary to gain access to some potentially relevant material. Not only commercial and technical information is openly accessible, but also the full range of political speech, artistic expression, and personal opinion is readily available on the public Internet—even though efforts are continually being made to impose restrictions on access to some materials by various populations in a number of countries.16

14 See, for example, the more than 600 references about such services in Bernie Sloan, "Digital Reference Services Bibliography," Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, November 18, 2003, available at <http://www.lis.uiuc.edu/~b-sloan/digiref.html>.

15 For example, the Virtual Reference Desk is "a project dedicated to the advancement of digital reference." See <http://www.vrd.org/about.shtml>. This service is sponsored by the U.S. Department of Education.

In a great many countries, anyone can set up a Web site—and many people do. The freedom of the press does not belong just to those who own one; now nearly anyone can have the opportunity to publish via a virtual press—the World Wide Web.17 Whether or not what they publish will be read is another matter. That depends on whether they will be found, and, once found, whether they can provide content worthy of perusal—at least by someone. For the most part, in many places, provision of or access to content or services is uncensored and uncontrolled.

On the positive side, the Internet enables access to a global information resource of unprecedented scope and reach. Its potential impact on all aspects of human activity is profound.18 But the institutions that select, edit, and endorse traditionally published information have no role in determining much of what is published via the Internet. The large majority of material reachable via the Internet has never gone through the customary editing or selection processes of professional journals or of newspapers, magazines, and books.19 So in this respect as well, there has been significant disintermediation, leaving Internet users with relatively few solid reference points as they navigate through a vast collection of information of varying accuracy and quality. In response to this widely acknowledged problem, some groups have offered evaluations of materials on the World Wide Web.20 One

16 These range from the efforts of parents to prevent their children from accessing age-inappropriate sites to those made by governments to prevent their citizens from accessing politically sensitive sites.
The current regulations requiring libraries in the United States to filter content are available at <http://hraunfoss.fcc.gov/edocs_public/attachmatch/FCC03-188A1.pdf>. See also information on the increasing sophistication of filtering efforts in China in David Lee, "When the Net Goes Dark and Silent," South China Morning Post, October 2, 2002, available at <http://cyber.law.harvard.edu/people/edelman/pubs/scmp100102-2.pdf>. For additional information, see Jonathan Zittrain and Benjamin Edelman, Empirical Analysis of Internet Filtering in China, Berkman Center for Internet & Society, Harvard University, 2002, available at <http://cyber.law.harvard.edu/filtering/china/>.

17 The rapid growth in the number of blogs (short for Web logs) illustrates this. According to "How Much Information?," there were 2.9 million active Web logs in 2003. See Peter Lyman and Hal R. Varian, "How Much Information?," 2003, retrieved from <http://www.sims.berkeley.edu/research/projects/how-much-info-2003> on April 27, 2005.

18 See Borgman, From Gutenberg to the Global Information Infrastructure, 2000; and Thomas Friedman, "Is Google God?," New York Times, June 29, 2003.

19 However, it must be acknowledged that many "traditional" dissemination outlets (e.g., well-known media companies) operate Web sites that provide material with editorial review.

20 See, for example, The Information Quality WWW Virtual Library at <http://www.ciolek.com/WWWVL-InfoQuality.html> and Evaluating Web Sites: Criteria and Tools at <http://www.library.cornell.edu/okuref/research/webeval.html>.
The incorporation of context is also related to the degree to which the navigation service is itself specialized. When people searched in traditional information sources, the selection of the source usually carried information about the context for a search (e.g., seeking a telephone number in the White Pages for Manhattan in particular, or looking for a home in the multiple listing service for Boston specifically). Furthermore, there were often intermediaries who could obtain contextual information from the information seeker and use it to choose the right source and refine the information request. Although the former approach is also available on the Internet through Internet white pages and Internet real estate search sites, the latter is just developing through the virtual reference services mentioned earlier.

General-purpose navigation services are exploring a variety of mechanisms for incorporating context into search. For example, both Yahoo! and Google now allow local searches, specified by adding a location to the search terms. The site-flavored Google Search customizes searches originating at a Web site to return search results based on a profile of the site's content. Other recent search engines, such as Vivisimo (www.vivisimo.com), return search results in clusters that correspond to different contexts—for example, a search on the keyword "network" returns one cluster of results describing cable networks, another on network games, and so on.24

Still, issues of context complicate Internet navigation because the most widely used navigation services are general purpose—searching the vast array of objects on the Internet with equal attention—and because many users are not experienced or trained and have no access to intermediaries to assist them.

6.1.7 Lack of Persistence

Seventh, there is no guarantee of persistence for material at a particular location on the Internet.
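The clustering idea can be illustrated with a toy sketch. This is a generic illustration, not Vivisimo's actual algorithm: it simply groups result titles for an ambiguous query by a salient co-occurring term, so that each cluster suggests a different context.

```python
# Toy result clustering (an illustrative assumption, not any engine's method):
# group result titles for the ambiguous query "network" by the first
# informative word they contain, yielding one cluster per context.
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "network", "networks"}

def cluster_by_keyword(titles):
    """Assign each title to a cluster named by its first non-stopword term."""
    clusters = defaultdict(list)
    for title in titles:
        words = [w for w in title.lower().split() if w not in STOPWORDS]
        key = words[0] if words else "other"
        clusters[key].append(title)
    return dict(clusters)

results = [
    "Cable network listings",
    "Cable network providers",
    "Network games online",
    "Games network reviews",
]
print(cluster_by_keyword(results))
```

Running this separates the "cable" results from the "games" results, which is the effect the text describes: the searcher, rather than the engine, picks the intended context.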
While there is no reason to believe that everything made accessible through the Internet should persist indefinitely, there are a great many materials whose value is such that many users would want to access them at the same Internet address at indefinite times in the future. For example, throughout this report there are two kinds of references: those to printed materials—books, journals, newspapers—and those to digital resources that are located via Web pages. There is a very high probability that whenever this report is read, even years after its publication, every one of the referenced printed materials will be accessible.

24 See Chris Gaither, "Google Offers Sites Its Services," Los Angeles Times, June 19, 2004, p. C2. Also see <http://www.google.com/services/siteflavored.html>, accessed on June 18, 2004.
Web navigation services took two primary forms: directories and search engines. Directories organized Web resources by popular categories. Search engines indexed Web resources by the words found within them. The sequence of key developments from 1993 through 2004 in both forms of Web navigation system is shown in detail in Box 6.3. A broad overview of the development follows.

Web navigation service development began in 1993, the year the Mosaic browser became available, first to universities and research laboratories, and then more generally on the Internet. The first directory and the first search engines were created in that year. But it was 1994 when the first of the widely used directories—Yahoo!—and the first full-text search engine—WebCrawler—were launched. Over the next few years, technological innovation occurred at a rapid pace, with search engines adding new features and increasing their speed of operation and their coverage of the Web as computing and communication technology and system design advanced. Lycos, launched in 1994, was the first Web search engine to achieve widespread adoption, as it indexed millions of Web pages. It was followed in 1995 by Excite and AltaVista. AltaVista in particular offered speed improvements and innovative search features. With the launch of Google in beta in 1998 and as a full commercial offering in 1999, the general nature of the technology of search engines appeared to reach a plateau, although there is continual innovation in search algorithms and approaches to facilitate ease of use. The commercial evolution of search services continued rapidly, both through additional entries into the market and through an increasingly rapid pace of consolidation of existing entries.

The evolution of directory technology has been less visible and probably less rapid.
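The word-indexing approach that distinguishes search engines from directories can be sketched as a toy inverted index. This is a generic illustration with made-up page URLs; real engines add crawling, ranking, stemming, and enormous scale on top of the same core data structure.

```python
# Toy inverted index: map each word to the set of pages containing it,
# then answer a query with the pages that contain every query word.
# Page URLs below are invented for illustration.
from collections import defaultdict

pages = {
    "a.example/cats": "cats and domestic cats",
    "b.example/dogs": "dogs and working dogs",
    "c.example/pets": "cats dogs pets",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """Return the pages containing every word of the query."""
    sets = [index.get(w, set()) for w in query.split()]
    return sorted(set.intersection(*sets)) if sets else []

print(search("cats dogs"))  # prints ['c.example/pets']
```

A directory, by contrast, would place each page by hand into a category tree; the inverted index requires no human categorization, which is why engines could cover millions of pages so quickly.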
The focus of that evolution appears, rather, to have been on the means of creating and maintaining directories and on the addition of offerings, including search engines, to the basic directory structure. By the first years of this century, the two worlds of search engines and directories had merged, at least commercially. In 2004, Google offered the Google directory (supplied by Open Directory), and Yahoo! offered Yahoo! search (provided by its acquisition Inktomi) with paid ads (provided by its acquisition Overture).

By 2004 most commercial navigation services offered advertisements associated with specific responses. These paid ads are the principal source of funding and profit for commercial navigation services. (Commercial Internet navigation is discussed in further detail in Section 7.2.) Associated with this latter development has been the rapid rise of the businesses of search engine marketing, which helps commercial Web sites decide in which search engines and directories they should pay for ads, and search engine optimization, which helps design Web sites so that search engines will easily find and index them. (Organizations
that provide these services generally also provide assistance with bidding strategies and assistance in advertising design.) Finally, by 2004, many search services had established themselves as portals—sites whose front pages offer access to search; news, weather, stock prices, entertainment listings, and other information; and links to travel, job search, and other services.

The development of Internet navigation aids and services, especially those focused primarily on the Web, stands in interesting contrast to the development of the Domain Name System as described in Chapter 2.

Conclusion: A wide range of reasonably effective, usable, and readily available Internet navigation aids and services have been developed and have evolved rapidly in the years since the World Wide Web came into widespread use in 1993.

Large investments in research and development are currently being made in commercial search and directory services. Still, many of the unexpected innovations in Internet navigation occurred in academic institutions. These are places with strong traditions of information sharing and open inquiry. Research and education are “problem-rich” arenas in which students and faculty nurture innovation.

Conclusion: Computer science and information science graduate students and faculty played a prominent role in the initial development of a great many innovative Internet navigation aids and services, most of which are now run as commercial enterprises. Two of those services have become the industry leaders and have achieved great commercial success—Yahoo! and Google.

Conclusion: Because of the vast scale, broad scope, and ready accessibility of resources on the Internet, the development of navigation aids and services opens access to a much wider array of resources than has heretofore been available to non-specialist searchers.
At the same time, the development of successful Internet navigation aids and services opens access to a much broader potential audience than has heretofore been available to most resource providers. One cannot know if the past is prelude, but it is clear that the number and the variety of resources available on the Internet continue to grow, that uses for it continue to evolve, and that many challenges of Internet navigation remain. Some of the likely directions of technological development are described in Section 8.1.
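One of the algorithmic innovations noted above, and in Box 6.3's 1998 entry, is link-based ranking: a page's importance is estimated from the links pointing to it. The following is a minimal, purely illustrative sketch of that idea as a power iteration; the toy graph, damping factor, and iteration count are assumptions for the example, not a description of any production system.

```python
# Minimal link-based ranking in the spirit of "page rank" (illustrative
# sketch only; parameters and graph are invented for the example).

def page_rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start with uniform rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A dangling page spreads its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# A page that many others link to accumulates the highest rank.
toy_web = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "hub.example": ["a.example"],
}
ranks = page_rank(toy_web)
```

The intuition matches the box's description: rank flows along links, so heavily linked-to pages rise to the top of keyword results.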
BOX 6.3
Key Events in the Development of Navigation Aids and Services for the World Wide Web

1989
Work started on the InQuery engine at the University of Massachusetts that eventually led to the Infoseek engine.

1993
First Web robot or spider.1 The World Wide Web Wanderer was created by MIT student Matthew Gray. It was used to count Web servers and create a database—Wandex—of their URLs.

Work on the Architext search engine, which used statistical clustering, was started by six Stanford University undergraduates; it became the basis for the Excite search engine launched in 1995.

First directory. The WWW Virtual Library was created by Tim Berners-Lee.

ALIWEB, an Archie-like index of the Web based on automatic gathering of information provided by webmasters, was created by Martijn Koster at Nexor Co., United Kingdom.

First robot-based search engines launched. The World Wide Web Worm, JumpStation, and Repository-Based Software Engineering (RBSE) were launched. None indexed the full text of Web pages.

1994
The World Wide Web Worm indexed 110,000 Web pages and Web-accessible documents; it received an average of 1500 queries a day (in March and April).

First searchable directory of the Web. Galaxy, created at the MCC Research Consortium, provided a directory service to support electronic commerce.

First widely used Web directory. Yahoo! was created by two Stanford graduate students, David Filo and Jerry Yang, as a directory of their favorite Web sites. Usage expanded with the growth in entries and addition of categories. Yahoo! became a public company in 1995.

First robot-based search engine to index the full text of Web pages. WebCrawler was created by Brian Pinkerton, a student at the University of Washington.

Lycos was created by Michael Mauldin, a research scientist at Carnegie Mellon University. It quickly became the largest robot-based search engine.
By January 1995 it had indexed 1.5 million documents and by November 1996, over 60 million—more than any other search engine at the time.
Harvest was created by the Internet Research Task Force Research Group on Resource Discovery at the University of Colorado. It featured a scalable, highly customizable architecture and tools for gathering, indexing, caching, replicating, and accessing Internet information.

Infoseek Guide, launched by Infoseek Corporation as a Web directory, was initially fee based, and then free.

The OpenText 4 search engine was launched by Open Text Corporation, based on work on full-text indexing and string search for the Oxford English Dictionary. (In 1996 it launched “Preferred Listings,” enabling sites to pay for listing in top-10 search results. The resultant controversy may have hastened its demise in 1997.)

1995
The Infoseek search engine was launched in February and in December became Netscape’s default search service, displacing Yahoo!

First metasearch service. SavvySearch, created by Daniel Dreilinger, a graduate student at Colorado State University, queried multiple search engines and combined their results.

First commercial metasearch service. MetaCrawler, developed by graduate student Erik Selberg and faculty member Oren Etzioni at the University of Washington, was licensed to go2net.

The Excite commercial search engine, based on the Stanford Architext engine, was launched.

The Magellan directory was launched by the McKinley Group. It was complemented by a book, The McKinley Internet Yellow Pages, which categorized, indexed, and described 15,000 Internet resources and accepted advertising.

Search engine achieved record speed: 3 million pages indexed per day. AltaVista, launched by Digital Equipment Corporation, combined computing power with innovative search features—including Boolean operators, newsgroup search, and users’ addition and removal of their own URLs—to become the most popular search engine.
1996
First search engine to employ parallel computing for indexing: 10 million pages indexed per day. Inktomi Corporation launched the HotBot search engine based on work of faculty member Eric Brewer and graduate student Paul Gauthier of the University of California, Berkeley. HotBot used clusters of inexpensive workstations to achieve supercomputer speeds. It adopted the OEM search model—providing search services through others—and was licensed to Wired magazine’s Web site, HotWired.
First paid listings in an online directory. LookSmart was launched as a directory of Web site listings. Containing both paid commercial listings and non-commercial listings submitted by volunteer editors, it also adopted the OEM model.

First Internet archive. Archive.org was launched by Brewster Kahle as an Internet repository with the goal of archiving snapshots of the Web’s content on a regular basis.

Consolidation begins: Excite acquired WebCrawler and Magellan.

1997
First search engine to incorporate automatic classification and creation of taxonomies of responses and to use multidimensional relevance ranking. The Northern Light search engine also indexed proprietary document collections.

AOL launched AOL NetFind, its own branded version of Excite.

The Mining Company directory service, started by Scott Kurnit and others, used a network of “guides” to prepare directory articles.

First “question-answer” style search engine. Ask Jeeves was launched. The company was founded in 1996 by a software engineer, David Warthen, and a venture capitalist, Garrett Gruener. The service emphasized ease of use, relevance, precision, and the ability to learn.

Alexa.com was launched by Brewster Kahle. It assisted search users by providing additional information—site ownership, related links, and a link to Encyclopedia Britannica—and also provided a copy of all indexed pages to Archive.org.

AltaVista, the largest search engine, indexed 100 million pages total and received 20 million queries per day.

The OpenText search engine ceased operation.

1998
First search engine with paid placement (“pay-per-click”) in responses. Idealab! launched the GoTo search engine. Web sites were listed in an order determined by what they paid to be included in responses to a query term.

First open source Web directory.
Open Directory Project was launched (initially with the name GNUhoo and then NewHoo) with the goal of becoming the Web’s most comprehensive directory through the use of the open source model—contributions by thousands of volunteer editors.
First search engine to use “page rank,” based on the number of links to a page, in prioritizing the results of Web keyword searches. Stanford graduate students Larry Page and Sergey Brin announced Google, which was designed to support research on Web search technology. Google became available as a “beta version” on the Web.

Microsoft launched MSN Search using the Inktomi search engine.

The Direct Hit search engine was introduced. It ranked responses by the popularity of sites among previous searchers using similar keywords.

Yahoo! Web search was powered by Inktomi.

Consolidation heats up: GoTo acquired the WWW Worm; Lycos acquired Wired/HotBot; Netscape acquired the Open Directory; Disney acquired a large stake in Infoseek.

1999
Google, Inc. (formed in 1998) opened a fully operational search service. AOL/Netscape adopted it for search on its portal sites.

The Norwegian company Fast Search & Transfer (FAST) launched the AllTheWeb search engine.

The Mining Company was renamed About.com.

Northern Light became the first engine to index 200 million pages.

The FindWhat pay-for-placement search engine was launched to provide paid listings to other search engines.

Consolidation continued: CMGI acquired AltaVista; At Home acquired Excite.

2000
Yahoo! adopted Google as the default search results provider on its portal site.

Google launched an advertising program to complement its search services, added the Netscape Open Directory to augment its search results, and began 10 non-English-language search services.

Consolidation continued: Ask Jeeves acquired Direct Hit; Terra Networks S.A. acquired Lycos.

By year’s end, Google had become the largest search engine on the Web, with an index of over 1.3 billion pages, answering 60 million searches per day.

2001
Google acquired the Deja.com Usenet archive dating back to 1995.
Overture, the new name for GoTo, became the leading pay-for-placement search engine.

The Teoma search engine, launched in April, was bought by Ask Jeeves in September.

The Wisenut search engine was launched.

Magellan ceased operation.

By the end of the year, Google had indexed over 3 billion Web documents (including a Usenet archive dating back to 1981).

2002
Consolidation continued: LookSmart acquired Wisenut.

The Gigablast search engine was launched.

2003
Consolidation heated up: Yahoo! acquired Inktomi and Overture, which had acquired AltaVista and AllTheWeb. FindWhat acquired Espotting. Google acquired Applied Semantics and Sprinks.

Google indexed over 3 billion Web documents and answered over 200 million searches daily.

2004
Competition became more intense: Yahoo! switched its search from Google to its own Inktomi and Overture services.

6.3 ADDENDUM—SEARCHING THE WEB VERSUS SEARCHING LIBRARIES

Searching on the public Web has no direct analog in searching libraries, a difference that is both an advantage and a disadvantage. Library models provide a familiar and useful comparison for explaining the options and difficulties of categorizing Web resources.

First, locating an item by its URL has no direct equivalent in library models. The URL usually combines the name of a resource with its precise machine location. The URL approach assumes a unique resource at a unique location. Library models assume that documents exist in multiple
Amazon.com entered the market with A9.com, which added Search Inside the Book™ and other user features to the results of a Google search.

Google’s index included (in February) 6 billion items: 4.28 billion Web pages, 880 million images, 845 million Usenet messages, and a test collection of book-related information pages.

Google went public in an initial public offering in August.

Ask Jeeves acquired Interactive Search Holdings, Inc., which owned Excite and iWon.

In November, Google reported that its index included over 8 billion Web pages.2

In December, Google, four university libraries, and the New York Public Library announced an agreement to scan books from the library collections and make them available for online search.3

SOURCES: Search Engine Optimization Consultants, “History of Search Engines and Directories,” June 2003, available at <http://www.seoconsultants.com/search-engines/history.asp>; iProspect, “A Brief History of Search Engine Marketing and Search Engines,” 2003, available at <http://www.iprospect.com/search_engine_placement/seo_history.htm>; Danny Sullivan, “Search Engine Timeline,” SearchEngineWatch.com, available at <http://www.searchenginewatch.com/subscribers/factfiles/article.php/2152951>; and Wes Sonnenreich, “A History of Search Engines,” Wiley.com, available at <http://www.wiley.com/legacy/compbooks/sonnenreich/history.html>.

1 A spider is a program that collects Web pages by traversing the Web, following links from site to site in a systematic way. See Box 7.2.

2 See David A. Vise, “Search Swagger,” Washington Post, November 11, 2004, p. E1.

3 See John Markoff and Edward Wyatt, “Google Is Adding Major Libraries to Its Database,” New York Times, December 14, 2004, available at <http://www.nytimes.com/2004/12/14/technology/14google.html>.
locations, and separate the name of the document (its bibliographic description, usually a catalog record or index entry) from its physical location within a given library. Classification systems (e.g., Dewey decimal or Library of Congress) are used for shelf location only in open-stack libraries, which are prevalent in the United States. Even then, further coding is needed to achieve unique numbering within individual libraries.56 In

56 The shelf location ZA3225.B67 2000 for Borgman, From Gutenberg to the Global Information Infrastructure, 2000, at the University of California at Berkeley library consists of the Library of Congress Call Number (ZA3225) plus a local number to order the author name (B67 for Borgman) and the date (2000) to create a unique shelf placement in this library.
closed-stack libraries where users cannot browse the shelves, such as the Library of Congress, books usually are stored by size and date of acquisition. The uniqueness of name and location of resources that is assumed in a URL leads to multiple problems of description and persistence, as explained in this chapter.

Second, resources on the Web may be located by terms in their pages because search engines attempt to index documents in all the sites they select for indexing, regardless of type of content (e.g., text, images, sound; personal, popular, scholarly, technical), type of hosting organization (e.g., commercial, personal, community, academic, political), country, or language. By comparison, no single index of the contents of the world’s libraries exists. Describing such a vast array of content in a consistent manner is an impossible task, and libraries do not attempt to do so. The resource that comes closest to being a common index is WorldCat,57 which “is a worldwide union catalog created and maintained collectively by more than 9,000 member institutions” of the Online Computer Library Center (OCLC) and in 2004 contained about 54 million items.58 The contents of WorldCat consist of bibliographic descriptions (cataloging records) of books, journals, movies, and other types of documents; the full content of these documents is not indexed. Despite the scope of this database, it represents only a fraction of the world’s libraries (albeit most of the largest and most prestigious ones), and only a fraction of the collections within these libraries (individual articles in journals are not indexed, nor are most maps, archival materials, and other types of documents). The total number of documents in WorldCat (54 million) is small compared with those indexed by Google, Infoseek, AltaVista, or other Internet search engines.

57 See <http://www.oclc.org/worldcat/default.htm>.

58 Accessed May 7, 2004.
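The contrast drawn above can be made concrete: a union catalog such as WorldCat matches queries only against bibliographic descriptions, while a Web search engine matches against the full text of documents. A hypothetical sketch follows; the records, field names, and queries are invented for illustration and do not reflect any real catalog's schema.

```python
# Two retrieval models over the same tiny, invented collection.

documents = {
    "doc1": {"title": "From Gutenberg to the Global Information Infrastructure",
             "author": "Borgman",
             "text": "libraries and digital networks shape access to information"},
    "doc2": {"title": "A History of Search Engines",
             "author": "Sonnenreich",
             "text": "spiders crawl the web following links to index pages"},
}

def catalog_search(term):
    """Catalog model: match only the bibliographic description."""
    term = term.lower()
    return sorted(d for d, r in documents.items()
                  if term in r["title"].lower() or term in r["author"].lower())

def full_text_search(term):
    """Search-engine model: match any word in the full content."""
    term = term.lower()
    return sorted(d for d, r in documents.items()
                  if term in (r["title"] + " " + r["text"]).lower())
```

The word "spiders" appears only in doc2's body text, so the catalog model misses it while the full-text model finds it; this is exactly the gap between WorldCat-style records and Google-style indexes.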
Rather than creating a common index to all the world’s libraries, libraries achieve consistent and effective access to documents by dividing them into manageable collections according to their subject content or audience. Library catalogs generally represent the collections of only one library, or at most a group of libraries participating in a consortium (e.g., the campuses of the University of California59). These catalogs describe books, journals (but not individual journal articles), and other types of documents. Many materials are described only as collections. For example, just one catalog record describes all the maps of Los Angeles made by the U.S. Geological Survey from 1924 to 1963. It has been estimated that individual records in the library catalog represent only about 2 percent of the separate items in a typical academic library collection.60 Thus, library catalogs are far less

59 See <http://melvyl.cdlib.org/>.

60 See David A. Tyckoson, “The 98% Solution: The Failure of the Catalog and the Role of Electronic Databases,” Technicalities 9(2):8-12, 1989.
comprehensive than most library users realize. However, online library catalogs are moving away from the narrower model of card catalogs. Many online catalogs are merging their catalog files with records from journal article databases. Mixing resources from different sources creates a more comprehensive database but introduces the Web searching problem of inconsistent description.

Nor do libraries attempt the international, multilingual indexing that search engines do. The catalogs of libraries may be organized consistently on a country-by-country basis, at best. The descriptive aspects of cataloging (e.g., author, title, date, publisher) are fairly consistent internationally, as most countries use some variation of the Machine Readable Cataloging (MARC) metadata structure. (Metadata are data about data or, more generally, about resources.) However, many variations of the MARC format exist, each tied to a national or multinational set of cataloging rules. The United States and the United Kingdom share the Anglo-American Cataloguing Rules but store their data in USMARC and UKMARC formats, respectively. These formats are finally being merged, after nearly 40 years of use. In 2004, the national libraries of the United States and the United Kingdom implemented a common format, MARC 21, and other countries are following suit.61 Other MARC formats include UNIMARC, HUMARC (Hungary), and FINNMARC (Finland). The OCLC WorldCat database merges these into an OCLC MARC format. Each catalog describes its holdings in its local language and may also include descriptions in the language of the document content. For example, OCLC WorldCat contains records describing resources in about 400 languages; each record has some descriptive entries in English. Thus, libraries achieve interoperability through highly decentralized cataloging activities.
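The decentralized interoperability described above amounts to field-by-field mapping between record structures, with each MARC variant tagging the same concepts differently before a union catalog merges them. A schematic sketch follows; the tag assignments and records below are deliberately simplified illustrations, not complete or authoritative USMARC/UNIMARC definitions.

```python
# Schematic merge of records from two MARC-like variants into one common
# named-field format. Tag-to-field assignments are simplified assumptions.

MARC21_MAP = {"title": "245", "author": "100", "date": "260"}
VARIANT_MAP = {"title": "200", "author": "700", "date": "210"}

def to_common(record, tag_map):
    """Invert a variant's tag map to produce a record with named fields."""
    names = {tag: name for name, tag in tag_map.items()}
    return {names[tag]: value for tag, value in record.items() if tag in names}

# Two hypothetical records, each tagged per its own national convention.
us_record = {"245": "From Gutenberg to the Global Information Infrastructure",
             "100": "Borgman, Christine L.", "260": "2000"}
variant_record = {"200": "Le nom de la rose", "700": "Eco, Umberto",
                  "210": "1980"}

merged = [to_common(us_record, MARC21_MAP),
          to_common(variant_record, VARIANT_MAP)]
```

The point of the sketch is the design choice the text describes: rather than forcing one universal format at creation time, each community catalogs in its own structure and interoperability is achieved by mapping afterward.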
The cataloging enterprise is economically feasible in the United States because most published resources are described by the publishers and the Library of Congress and contributed to OCLC WorldCat and other shared databases. Despite the relative national and international success in establishing cataloging rules and formats among libraries, incompatibilities continue to exist within these communities, and the archives and museum communities employ yet other metadata formats. Given the difficulty of achieving agreement on basic descriptive models among these established institutions run by information management professionals, the likelihood of getting universal agreement on

61 See British Library, MARC 21 and UKMARC, British Library, London, November 24, 2004, available at <http://www.bl.uk/services/bibliographic/nbsils.html>; and MARC 21 Concise Format for Bibliographic Data, Concise Edition, Library of Congress, Washington, D.C., November 24, 2003, available at <http://www.loc.gov/marc/bibliographic/ecbdhome.html>.
descriptive standards for Web documents is low. Decentralized models for data creation, combined with mapping between similar formats, are the most feasible way to achieve interoperability.

Third, while a rich subject index to the Web would certainly be extremely useful, universal subject access is almost impossible to achieve. American libraries attempt general subject access via the Library of Congress Subject Headings (LCSH), but these apply only two or three headings per book. Few other countries appear to use the LCSH, as many of the concepts are specific to U.S. culture. Unified access via classification systems such as the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC) is more common. These, too, are country- and culture-specific. DDC and LCC are little used outside the United States; the Universal Decimal Classification (UDC) system has broader international adoption. Many other country-specific classifications exist.

Consistent subject access in any depth is feasible only within topical areas, due to a number of well-understood linguistic problems.62 These include synonymy (multiple terms or phrases may have the same meaning), polysemy (the same terms and phrases may have multiple meanings), morphological relationships (variations in the structure of words, such as variant endings—e.g., acid and acidic; dog and dogs; mouse and mice), and semantic relationships (conceptual relationships; e.g., two words may have the same meaning in one context and different meanings in other contexts). Both automatic and manual methods to provide consistent retrieval by controlling the meaning of words work best when the subject area is constrained. Controlled sets of terms (e.g., thesauri, ontologies) can be constrained to their meaning within one field, such as computer science, economics, arts, or psychology.
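The constrained-vocabulary approach described above can be sketched as simple query-term normalization against a field-specific thesaurus, collapsing synonyms and morphological variants onto one preferred concept. All entries below are invented for illustration; real systems rely on curated vocabularies maintained by professionals.

```python
# Toy query normalization against a field-specific controlled vocabulary.
# Entries are invented; a real thesaurus would be far larger and curated.

CS_THESAURUS = {
    # Synonymy: several surface terms map to one preferred concept.
    "laptop": "portable computer",
    "notebook": "portable computer",   # polysemous elsewhere (stationery)
    "portable computer": "portable computer",
    # Morphological variants collapsed onto one form.
    "mice": "mouse (computing)",
    "mouse": "mouse (computing)",
}

def normalize(term, thesaurus):
    """Map a query term to its preferred concept; leave it unchanged when
    the vocabulary has no entry (the uncontrolled, open-Web case)."""
    return thesaurus.get(term.lower(), term.lower())
```

Within the constrained field, "laptop" and "notebook" now retrieve the same concept; outside that field, "notebook" would mean something else entirely, which is why such control works best when the subject area is narrow.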
Libraries construct or purchase indexes specific to each field within the scope of their collections.

Fourth, Web directories are more analogous to topic-specific library indexes than to library catalogs. However, Web directories cover only a small portion of the content of the Web, and their descriptions of each item are often less complete and can be less reliable than those created by professional librarians. Consequently, structured and formal characterizations of material can be accomplished most effectively in a library catalog or bookstore database, rather than in a general Web directory. Selecting the proper resource to search remains an important starting point in seeking information, whether online or offline.

62 See Richard K. Belew, Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW, Cambridge University Press, Cambridge, U.K., 2000; Peter Brusilovsky and Carlo Tasso, “Preface to Special Issue on User Modeling for Web Information Retrieval,” User Modeling and User-Adapted Interaction: The Journal of Personalization Research 14(2-3):147-157, 2004; and William A. Woods, “Searching vs. Finding,” ACM Queue 2(4), 2004, available at <http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=137>.