often containing different types of data, but giving the appearance of a single, logical whole. Supporting the appearance of uniform retrieval across data sources requires standard protocol interfaces and transformation programs to change the representation of each type of data into a standard format. Transformations such as those for text, graphics, and image conversions could be generic across all disciplines. Others would be subject-matter specific, such as those for maps and sequences for genome research or for temperature and currents for physical oceanography. Although the external data formats can vary considerably, federation across data types is possible with standard formats for representing the internal data. In standard database technology, the Structured Query Language (SQL) and Open-SQL interfaces provide a fundamental part of this linkage, but other structures for mapping data dictionaries and semantic values must also be developed for a federated database to be constructed.
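The idea of transformation programs feeding a standard internal format can be sketched in a few lines. The following is only an illustrative sketch, not a description of any existing system: the `Record` type, the two toy sources, and all function names are hypothetical, and the data are invented placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """Hypothetical standard internal format shared by every source."""
    source: str
    identifier: str
    text: str


# Two toy sources with different external formats (invented for illustration).
GENOME_ROWS = [("GENE1", "chr1", "example gene description")]
OCEAN_DOCS = [{"id": "cruise-07", "notes": "temperature and currents, region X"}]


def from_genome(rows) -> List[Record]:
    """Subject-specific transformation: tuple rows -> standard records."""
    return [Record("genome", name, f"{location} {desc}")
            for name, location, desc in rows]


def from_ocean(docs) -> List[Record]:
    """Subject-specific transformation: dict documents -> standard records."""
    return [Record("ocean", d["id"], d["notes"]) for d in docs]


def federated_search(term: str) -> List[Record]:
    """Run one query over all sources once they share the internal format."""
    records = from_genome(GENOME_ROWS) + from_ocean(OCEAN_DOCS)
    needle = term.lower()
    return [r for r in records if needle in r.text.lower()]
```

A caller sees one interface, `federated_search("currents")`, regardless of how each source stores its data; adding a source means adding one transformation function rather than changing the query code.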

For sources containing textual (and other) information, some progress has been made in standardizing information search and retrieval protocols. The recently developed American National Standard Z39.50, Information Retrieval Service Definition and Protocol Specifications for Library Applications, provides the means for performing queries on textual information and is being adapted by the International Organization for Standardization as an international standard. However, it is a single standard with a modest number of applications, set against a multitude of data formats to search across. Consequently, while the Z39.50 standard is a good start, much more needs to be done to extend this protocol and to further develop other appropriate standards and protocols for system-independent data search and information interchange.

In addition to conducting broad searches across many databases, scientists may wish to record logical associations they detect between items within a database or across databases. Such associative links may build on previously identified relationships or represent the exploration of new ones among the database elements. For example, genes might be represented by linking various elements in gene map and sequence databases. An ocean voyage might be represented by a set of linked oceanographic database items. Unusual and nonintuitive links among database items recorded by one researcher may well stimulate new insights or approaches to a problem by other scientists.

Implementing logical links between related items in different sources requires a standard format for representing the links and a series of methods for determining semantic relationships.5 These are areas for research and development. The most common method for identifying semantic relationships relies on the use of standard terms or nomenclature, such as the well-defined names that denote particular genes described in the literature, maps, and sequences. When standard nomenclature or terms have been used, it is possible to generate links automatically. For less obvious or novel relationships, a collaboratory system that supports data sharing should be able to support user-specified links.
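As a rough illustration of link generation from standard nomenclature, the sketch below indexes items by the standard terms they mention and links items from different sources that share a term. Everything here is an assumption made for the example: the link format (a term plus two item references), the toy items, and the function names `generate_links` and `add_user_link` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy items: (source, item id, standard terms found in the item).
# "GENE1" and "GENE2" stand in for well-defined standard names.
ITEMS = [
    ("literature", "paper-12", {"GENE1"}),
    ("map", "map-3", {"GENE1", "GENE2"}),
    ("sequence", "seq-88", {"GENE2"}),
]


def generate_links(items):
    """Link items from different sources that share a standard term."""
    by_term = defaultdict(list)
    for source, item_id, terms in items:
        for term in terms:
            by_term[term].append((source, item_id))

    links = set()
    for term, refs in by_term.items():
        for a, b in combinations(sorted(refs), 2):
            if a[0] != b[0]:  # only link across sources
                links.add((term, a, b))
    return sorted(links)


def add_user_link(links, label, a, b):
    """Record a user-specified link that no shared term would reveal."""
    first, second = sorted([a, b])
    return sorted(set(links) | {(label, first, second)})
```

Automatic links cover the case where standard names are used consistently; `add_user_link` stands in for the user-specified links needed for less obvious or novel relationships.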

Sharing and Applying Programs

Analysis of collected data lies at the heart of the scientific process. Data to be analyzed can be numeric or symbolic; some data, for example, may be in the form of literature. Increasingly, scientific analysis involves the use of software. Currently, genome researchers use analysis software to locate related items, such as gene sequences similar to a sequence being studied. In physical oceanography, simulation software is used to predict the results of future experiments, such as projecting ocean currents in a particular region. In space physics, modeling software is used to predict the behavior of observed phenomena, such as effects of the solar wind on the aurora. Community software such as IRAF and AIPS in the astronomy community and ORTEP and X-PLOR in the molecular biology community has proven very useful and has been widely disseminated and shared.

The committee found that sharing of software, application of external (i.e., not local) software to data, and application of local software to external data were three important capabilities sought by scientists contemplating useful collaboratory tools. Workshop participants observed that their research would be facilitated by technology that would allow them to call their specialized programs into action
