III. Surfing Boards

Digital Libraries: Technologies, Tools, and People

Jamie Callan

University of Massachusetts, Amherst

Callan stated that computer research efforts at the University of Massachusetts are focused on leading-edge document retrieval and filtering technologies. The university's Center for Intelligent Information Retrieval (CIIR) also emphasizes technology transfer, seeding its research in government and industrial environments. The challenge of technology transfer encourages academics to focus on issues of interest in the “real world” and provides a measure of the real-world value of the research.

Callan's talk focused on the three key elements involved in information systems: technology, tools, and people. It ended with lessons learned from deploying CIIR technology in a variety of systems on the WWW. CIIR processes include indexing, transforming information needs into queries, and document retrieval or filtering. Callan gave examples of how each of these tasks can be customized for specific users. For example, the handling of “stop” words and word stems can often be customized. Domain-specific concepts and meta-terms can also be added automatically to the index—for example, indicating that the word “Apple” in a given context refers to the company and not the fruit.
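
To make the indexing step concrete, the following is a minimal Python sketch of the kind of customization Callan describes: a per-user stop list, crude suffix stemming, and automatic injection of domain-specific meta-terms. The stop list, stemmer, and meta-term table are illustrative placeholders, not CIIR's actual resources.

```python
# Minimal sketch of customizable indexing: stop-word removal, crude suffix
# stemming, and automatic meta-term injection. All resources are illustrative.
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}   # customizable per user
META_TERMS = {"apple computer": "COMPANY:apple"}           # domain-specific concepts

def stem(word: str) -> str:
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(doc_id: str, text: str, index: dict) -> None:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = [stem(t) for t in tokens if t not in STOP_WORDS]
    # Inject a meta-term wherever a known domain phrase occurs in the text.
    lowered = text.lower()
    for phrase, meta in META_TERMS.items():
        if phrase in lowered:
            terms.append(meta)
    for term in terms:
        index[term].add(doc_id)

index: dict = defaultdict(set)
index_document("d1", "Apple Computer shipped new machines", index)
print(index["COMPANY:apple"])   # {'d1'}
```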

The transformation of a natural language or form-based information need into a query can also be customized—for example, to recognize phrases, to discard “stop phrases,” to weight different fields differentially, or to include meta-terms.

The technology is augmented by a variety of tools that facilitate its use, including query expansion and relevance feedback. Query expansion tools analyze a query and then suggest additional words or phrases that might be added to the query. For example, a query about illegal immigration might be improved by adding such terms as “undocumented aliens” and “immigration reform.” Relevance feedback analyzes documents identified as “relevant” and then produces a new query that is (presumably) better able to retrieve related documents. The intent is to minimize the effort required to create an effective query.
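
As an illustration of what a relevance-feedback tool does, the sketch below applies the classic Rocchio formulation, nudging a query's term weights toward terms that occur in documents the user marked relevant. This is the textbook method under simplified bag-of-words assumptions, not necessarily the algorithm CIIR deployed; the weights and example terms are illustrative.

```python
# Rocchio-style relevance feedback: the new query vector moves toward terms
# found in documents marked relevant. Alpha/beta weights are illustrative.
from collections import Counter

def rocchio(query: Counter, relevant_docs: list,
            alpha: float = 1.0, beta: float = 0.75) -> Counter:
    expanded = Counter({t: alpha * w for t, w in query.items()})
    for doc in relevant_docs:
        for term, freq in doc.items():
            # Average contribution of relevant documents, scaled by beta.
            expanded[term] += beta * freq / len(relevant_docs)
    return expanded

query = Counter({"illegal": 1.0, "immigration": 1.0})
feedback = [Counter({"undocumented": 2, "aliens": 2, "immigration": 1}),
            Counter({"immigration": 3, "reform": 2})]
print(rocchio(query, feedback).most_common(4))
```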

Customization improves the ability of the system to deliver consistently useful results to inexperienced users. Equally important are tools that enable a user to understand the results obtained by retrieval or to browse a collection of documents. Callan gave a series of examples, largely drawn from well-known WWW systems. Yahoo and InfoSeek (commercial information-search services) provide hierarchical access to data. The Yahoo hierarchy is static, while InfoSeek's is generated dynamically in response to user queries; each approach has its advantages. InfoSeek uses relevance feedback to identify related documents. The Library of Congress system, as another example, explains to the user how its results were obtained.

Search systems are already commercially available; they include Personal Library Software (PLS), Verity, Fulcrum, OpenText, Architext, and Sovereign Hill Software. They use a number of different approaches, including Boolean (the most common and the most limited), vector-space, probabilistic, and concept-based retrieval. Although commercial systems have lagged behind systems developed in research environments, they have improved greatly. Systems should be judged both on traditional measures, such as recall and precision, and on user satisfaction.

CIIR has compiled various observations regarding the WWW. Search engines must be easy to use in order to attract users. Users are often drawn to search engine features that are not necessarily functional but make using search tools more satisfying. Rating search engine output is often not very useful, but users seem to like to do it.

Another CIIR finding is that current WWW interfaces, with their short query boxes, discourage long queries. It is well known by the research community that longer queries improve search effectiveness. People are reluctant to pay to obtain information from the Internet because the value is unknown, and because financial transactions raise security concerns. Callan predicted that location brokering (i.e., telling people where to look for information) and “branded” information (identification of the owner or supplier) will become important in the near future as ways to filter information.

Discussion

Cargill asked how information searches could be made more effective. Callan answered that in the absence of a centralized database, information suppliers and users must learn to use a common vocabulary so that the distributed sources appear to users as a single database; with a shared vocabulary, queries need fewer identifying terms. If information is maintained at a central site, customized retrieval mechanisms are possible.

Sonwalker commented that information-retrieval scoring systems don't appear to be very effective. Callan agreed but noted that the best systems do a good job of ranking documents, even if the scores themselves are difficult to interpret.

LeClair asked how search engines could provide users with better results. Callan replied that search engines could be improved by matching queries to the structure of documents.

Baglin said that his search strategy is a cross between a stepwise search and browsing. He then asked if anybody was working on a search engine utilizing such an approach. Callan replied that the role of human perception is critical in conducting information searches and that search engines need to support it. Browsing aids are being studied, but it is difficult to create effective browsing tools for the Internet.

Rumble asked about full-text searches. Callan answered that many systems exist to search full texts and that in most cases they are effective; the longer the article, the better the search results.

Knowles asked if completely distributed publishing and searching might eliminate the need for information brokers. Callan replied that in such a situation brokers would become more important because each retrieval system would be customized for a different audience.

Baskes asked why the research community had done a better job of developing search tools than the commercial sector. Callan answered that the commercial information providers have sought to distinguish themselves by content (e.g., size of database) rather than effectiveness in retrieving information. He added that while commercial vendors have optimized search tools for databases at a single location, the research community has begun attacking the problem of making searches on a global scale.

Expert Database Support for Distributed Materials Property Databases

Lawrence Kerschberg

George Mason University

Kerschberg delivered his presentation by accessing his WWW page at George Mason University. The presentation focused on how expert database technology can be used to achieve intelligent access, retrieval, and integration of information from multiple databases on materials.

He began by describing the issues outlined in the 1995 National Research Council report “Computer-Aided Materials Selection During Structural Design.” The first issue was the need for standardization in the quality, capture, storage, analysis, and exchange of data; this suggests a need for system flexibility to accommodate varying taxonomies and for a means of linking databases on the properties of materials. A second issue was the need for design knowledge bases. The third was the huge number of sources of materials databases, which makes standardization more difficult.

Kerschberg presented these recommendations from the NRC report:

  • Materials and computer scientists should collaborate to develop knowledge capture systems.

  • They should create design databases on discussions, decisions, rationale, and lessons learned in multimedia format.

  • Specialized knowledge should be categorized, indexed, and filtered into a design knowledge base.

  • A national team of users, suppliers, materials societies, and standards organizations should be created to develop integrated materials qualification programs.

  • Results (independently verified data) should be made available over the National Information Infrastructure (NII) to provide a realistic initial appraisal of the advantages of a material.

Kerschberg then described a federated client-server architecture for supporting and integrating distributed materials property databases (MPDB). The architecture consists of an information web for access to MPDB information, together with knowledge bases comprising a thesaurus for managing materials taxonomies, domain knowledge and rules about materials data, and assessments of provider quality and reliability.

A federated architecture serves the functions listed below (a schematic sketch in code follows the list):

  • reformulates high-level requests into detailed queries

  • locates, selects, and queries individual data sources

  • combines individual results into an integrated response

  • abstracts, filters, and summarizes data into useful information
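
A schematic Python sketch of these four functions follows, assuming hypothetical MPDB sources that each expose a query() method; the class names, request syntax, and property vocabulary are invented for illustration, not taken from the George Mason system.

```python
# Schematic sketch of the four mediation functions listed above.
class MaterialsSource:
    """One constituent materials-property database behind the federation."""
    def __init__(self, name: str, records: list):
        self.name = name
        self.records = records  # dicts: {material, property, value, units}

    def query(self, material: str, prop: str) -> list:
        return [r for r in self.records
                if r["material"] == material and r["property"] == prop]

def federated_query(request: str, sources: list) -> list:
    # 1. Reformulate a high-level request into a detailed query.
    material, prop = request.split(":")        # e.g., "Ti-6Al-4V:yield_strength"
    hits = []
    # 2. Locate, select, and query the individual data sources.
    for src in sources:
        for rec in src.query(material, prop):
            # 3. Combine individual results into an integrated response,
            #    tagging each record with its provider.
            hits.append({**rec, "source": src.name})
    # 4. Abstract and summarize the data into useful information.
    if hits:
        values = [h["value"] for h in hits]
        print(f"{prop} of {material}: {min(values)}-{max(values)} "
              f"{hits[0]['units']} from {len(hits)} records")
    return hits

a = MaterialsSource("LabA", [{"material": "Ti-6Al-4V", "property": "yield_strength",
                              "value": 880, "units": "MPa"}])
b = MaterialsSource("LabB", [{"material": "Ti-6Al-4V", "property": "yield_strength",
                              "value": 910, "units": "MPa"}])
federated_query("Ti-6Al-4V:yield_strength", [a, b])
```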

A description of the federated service developed at George Mason University followed.

  • The Global Thesaurus Services provide an integrated name space for access to information.

  • The Client-Server Services enable construction of each Federation Interface Manager (FIM) for constituent information systems.

  • The Temporal Mediation Service resolves conflicts in time measurements and differing time granularities (see the sketch after this list).

  • The Harmony Agent Service reconciles inconsistent query answers from multiple sources.

  • The Active View Agent maintains consistency between subscribed objects in views and database objects.
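
As a hypothetical illustration of the granularity problem the Temporal Mediation Service addresses, the sketch below compares timestamps recorded at different granularities by projecting both onto the coarser one; the function names and granularity scheme are invented, not drawn from the actual service.

```python
# One source records measurements by day, another by month; answers can only
# be matched at the coarser of the two granularities.
from datetime import date

def coarsen(d: date, granularity: str) -> tuple:
    """Project a date onto a coarser time granularity for comparison."""
    if granularity == "year":
        return (d.year,)
    if granularity == "month":
        return (d.year, d.month)
    return (d.year, d.month, d.day)

def comparable(d1: date, g1: str, d2: date, g2: str) -> bool:
    """Decide whether two timestamped facts refer to the same period."""
    order = ["year", "month", "day"]
    coarsest = min(g1, g2, key=order.index)
    return coarsen(d1, coarsest) == coarsen(d2, coarsest)

# A daily reading and a monthly summary refer to the same period:
print(comparable(date(1996, 5, 14), "day", date(1996, 5, 1), "month"))  # True
```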

The Human Genome Project and the Earth Observing System Data and Information System (EOSDIS) were given as examples of other federated scientific databases. Kerschberg concluded that a federated client-server approach to integrating materials properties databases is feasible. Mediation services permit access to and information integration among heterogeneous distributed systems. The Internet and the WWW can serve as excellent tools for MPDB providers to maintain control over data quality while providing information to engineers and scientists.

He added that an important issue for many scientists at NASA is data pedigree, or data lineage: as data are passed from one group to another, information about their quality and source must be preserved. The people responsible for these tasks are sometimes called “curators of data.” As databases from different sources grow in number, the pedigree issue becomes more urgent.
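
A minimal sketch of the pedigree idea follows: each value carries a provenance trail that is extended, never overwritten, as data pass from one group to another. The record structure and field names are invented for illustration.

```python
# Data pedigree as a structure: every derived value keeps its full history.
from dataclasses import dataclass, field

@dataclass
class PedigreedValue:
    value: float
    units: str
    pedigree: list = field(default_factory=list)  # ordered provenance entries

    def derive(self, new_value: float, curator: str, method: str) -> "PedigreedValue":
        """Produce a derived value whose pedigree includes the full history."""
        entry = {"curator": curator, "method": method}
        return PedigreedValue(new_value, self.units, self.pedigree + [entry])

raw = PedigreedValue(912.0, "MPa", [{"curator": "LabA", "method": "tensile test"}])
cleaned = raw.derive(910.0, "curation team", "outlier screening")
print(cleaned.pedigree)   # both provenance entries survive the hand-off
```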

Discussion

Referring to the Harmony Agent Service, Woodall said that users should know who wrote something and be given complete information about its pedigree; such pedigree information can be used to reconcile inconsistent answers from various sources.

Kerschberg said that large amounts of metadata are needed to fully inform users; for the EOSDIS study, his group worked on data pedigree to allow correct interpretation of the data. Davis added that Woodall's concern about the authenticity of data was well founded.

A participant asked about the meaning of “federated.” Cargill said it is a software layer applied over the existing software used in a materials database application. This overarching layer then allows interoperability with other databases having the same overarching, or federated, layer.

Internet Directory Services

Michael Schwartz

@Home Network

Schwartz began his talk with a description of some of the information directory services that are available on the Internet. He characterized commercial Internet directory services such as Lycos and Yahoo as useful services of first and last resort for searching the World Wide Web but added that they have limitations in perspective and scale. A more recent development is SEARCH.COM, a commercial service which acts as a clearinghouse for topically focused search services. Other directory services include BigBook and Switchboard for phone number information, and Four11 and Netfind for e-mail addresses.

Global services like Yahoo offer broad coverage but typically do not supply comprehensive information. In contrast, specialized services can provide much more comprehensive content, but to smaller audiences. Fact-finding services are useful for locating specific information, while news-clipping services help the user search for information on current events. Most services are centralized, which may lead to scaling problems in the future. As a result, distributed directory services may become necessary, although coordinating such services will be difficult.

Content directories are, ideally, highly visible services that exist either as commercial ventures supported by advertising or, in more specialized venues, as community- or enterprise-supported efforts—for example, clubs, industry, and online courses financed by tuition or user fees. Directory services that contain information about people need to ensure security, and content has to be distributed with local autonomy and control. Users have resisted paying for such services, so in all likelihood they will need to be free to the user (e.g., supported by advertising). User directory services also need to be integrated with e-mail and electronic commerce services.

The @Home Network provides broadband access to consumers by hosting information providers. @Home's directory services will include global Internet directories augmented with distributed directories of @Home's hosted content. @Home will also list its current offerings and will support parental control of access to potentially objectionable Internet content. It will also deploy white pages, user-profile, and news-clipping services. Initially, the @Home white pages will contain information only about @Home subscribers (who will have the option to be unlisted); in the longer term, @Home will work on standards for global white-pages directories, in part through the Lightweight Directory Access Protocol (LDAP).

Schwartz described the Harvest research project, for which he was principal investigator, conducted from November 1993 to November 1995 at the University of Colorado, Transarc, the University of Arizona, and the University of Southern California. The key ideas driving the Harvest project were an efficient, distributed gathering architecture; customizable content summarizers; a standard for structured summary information interchange (SOIF, the Summary Object Interchange Format); interfaces to multiple search engines; and topology-based caching and replication. To date, Harvest-based servers have been deployed at about 5,000 sites, and Harvest is now being commercialized as a technology (e.g., by Netscape) and as a service (e.g., by @Home).

The Harvest architecture consists of gatherers, brokers, caches, and replicators. The gatherer collects information specified as a set of enumerated URLs, with stop lists and other limits, and with document type-specific summarizing capabilities. The stock system supports 45 popular formats, and summarizers are easily customized or added. The Harvest Broker retrieves a compressed stream of information from the gatherer, supporting incremental updates and eliminating duplication. It also allows for customizable query and result processing, and provides integrated basic support for several search engines, including Glimpse, Nebula, Verity, GRASS (geographical resources analysis support system), PLS (personal library software), and free and commercial WAIS (wide area information servers).
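
As an illustration, the sketch below emits a SOIF-style record of the kind a customizable summarizer might hand to a broker; the attribute set, template type, and URL are illustrative, and the serialization is simplified from the full Harvest format.

```python
# Emit a SOIF-style structured summary record (simplified from Harvest's format:
# each attribute line carries the byte length of its value).
def soif_record(template: str, url: str, attributes: dict) -> str:
    lines = [f"@{template} {{ {url}"]
    for name, value in attributes.items():
        data = str(value)
        lines.append(f"{name}{{{len(data)}}}:\t{data}")
    lines.append("}")
    return "\n".join(lines)

summary = {"Title": "Materials and the Information Highway",
           "Type": "HTML",
           "Keywords": "materials, retrieval, federated databases"}
print(soif_record("FILE", "http://www.example.edu/report.html", summary))
```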

Schwartz summarized his presentation by reiterating that directories of people must be handled differently from content directories. WWW directories are sufficient as first and last resorts but need to be augmented with enterprise, community, and topical directories. Furthermore, white pages have not yet become a basic service, in part because of a lack of deployment standards. As an added thought, he stated that multicast services are likely to become increasingly important.

Internet: A Standardization Paradigm under Threat

Carl Cargill

Sun Microsystems

Cargill's presentation traced the history of standards for the Internet and speculated on their future, emphasizing the economic and operational importance of the issue. Although he predicted trouble ahead for standardization, he expressed faith that planning, discipline, and structure would bring about a good result in the end. He pointed out that the costs of setting standards run into the billions of dollars; the cost of producing a PC, for example, includes the costs of all the standards to which the manufacturer and its suppliers decided to adhere. Thus, a company like Sun Microsystems is willing to employ standards experts such as himself.

Standards for the Internet developed gradually after ARPA launched research in 1965 to develop a nuclear-attack-proof communications network for the military. For the first decade or so, the Internet was successful because few of those involved thought about profits or formal standards.

In 1977 a formal industry effort was mounted to write the Open Systems Interconnection (OSI) standards to counter IBM's Systems Network Architecture (SNA), which had become quite popular in Europe and had begun to be seen as a threat by European computer makers. The groups involved were the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU), which produced such standards as the Integrated Services Digital Network (ISDN). The OSI standards were developed by thousands of volunteers over a period of seven years. In Cargill's opinion they are good as tactical, but not as strategic, standards—a fundamental failing.

By 1981, DARPA had become concerned about the large number of people involved and established the Internet Configuration Control Board (ICCB), later renamed the Internet Activities Board (IAB). The IAB spawned the Internet Engineering Task Force (IETF), which in Cargill's opinion is doing good work. However, what was once 40 people working for the ICCB is now 1,200 at the IETF attempting to write “just-in-time” standards. Since these standards determine which equipment and software will be used in given applications, standards makers try to set standards that match their own companies' products.

The OSI group persuaded NIST to create the Corporation for Open Systems as a testing organization but, according to Cargill, “burned” NIST in the process. By 1991, some 300 standards had been defined, but no products had ensued, for lack of demand; no one could articulate the value of the many options OSI had standardized. OSI failed, but the Internet existed as an alternative. After 1992 the IAB created the Internet Society to find substitutes for lost government funding of standards setting, and the IETF successfully mandated acceptance of an addressing scheme over the objections of the more traditionally focused IAB.

But the IETF did not maintain control, according to Cargill, largely because of uncontrolled growth (hundreds of participants per year). Hence, standards can no longer be set in less than 10 months through the IETF. On the other hand, the market itself adopts standards quickly, as exemplified by Netscape, which found a way to deploy its standards and get wide acceptance in less than four months.

Security standards are doomed, Cargill said: security schemes seem to be broken about every 10 months, and since standards cannot be set any faster, standardization is simply too slow to accommodate a fast-moving technology. Keeping information safe from hackers is a constant struggle.

Cargill suggested that the Internet now needs marketers more than it needs technologists. In fact, he stated that vendor marketing people are more attuned to user demands and applications. The standards community needs to plan where the Internet will go and when it will go there. As a corollary, it needs to understand where NOT to go. Universal acceptance of commercial technologies is needed through reform of intellectual property rights and practices. Although the IETF has lost control of the technology, Cargill said, there is some hope that market forces will produce a coherent set of Internet standards if managed properly.

Discussion

Goldie asked whether the IETF has looked at e-mail transfer of document bundles in SGML. Cargill replied that all 1,200 people in the IETF have a voice in deciding which standards are applied to what information, and that no clear choice will emerge on an issue like this.

Baglin asked how Cargill's comments on Internet standards apply to the rest of the world. Cargill replied that Europe and Japan are acting independently by following ITU standards. Baglin then suggested the need for a 20-year roadmap. Cargill agreed, but said that it would not come from the standards community.

Alexander said that the materials community will necessarily follow standards set by others. She asked Cargill which standards he would target, and, “Do you expect any major shifts in Internet standards?” Cargill said he expects some changes in ITU and other standards, but such changes will be market driven.

These thoughts led to the question of whether the planning, discipline, and structure needed to set Internet standards will be forthcoming. Cargill was not optimistic that a deliberate approach will emerge but added that the market will punish violators of existing norms. The “norms” are much more effective than the standards; a standard can be put in place if as few as four companies agree to it.

The discussion turned toward the materials community in particular: What is the materials market share? Will the standards community listen to us? Cargill's response was an emphatic “No... You must command billions to get attention.” Recent public offerings (Netscape $3.7 billion, Yahoo $700 million) are a measure of the commercial value of the Internet and the potential impact of its standards.
