The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
4  The Opportunities: The Relationship of Technological Advances to New Data Use and Retention Strategies

Rapid progress in information technology continually alters both the quantity and the quality of scientific information and periodically stimulates fundamental modification of data management and archiving strategies. Recent technological advances have enabled new methods and strategies for data storage and retrieval and have created better ways of connecting users to data resources and to each other. Moreover, the evolving technologies are catalysts for revising organizational structures to manage scientific data archives much more effectively in a distributed manner.

Long and firmly held assumptions about effective management of scientific data are being directly challenged by new information technology. These assumptions have been based on experience with management of paper records, generally in domains outside of science. Some of the outdated assumptions that are rapidly losing their relevance include the following:

Physical possession of the data is essential to their management and archiving. This principle has outlived its usefulness in the context of electronic physical science data and has made access difficult for legitimate users. Electronic information is easily copied and disseminated; this feature removes the constraints imposed by limited physical access. Because most government physical science data are considered to be in the public domain, the constraints of copyright and fee collection on the free movement of data are removed as well.

Cost of an archive increases in proportion to collection size and use. Physical archive cost is a function of space, as well as cataloging, repair, and access efforts. Improved inventory technology has eased some of the cost burden over the last several years, but, fundamentally, archives with large physical holdings operate in traditional ways with linearly scaling costs.
Such costs rise with use because the physical handling of items scales with use, whereas budgets reflect usage only indirectly. In contrast, electronic information storage and management costs have declined as rapidly as the costs of computer technology and processing over the last 30 years. There is no foreseeable end to this process: storing and using the next byte will be cheaper than storing and using the most recent byte for a long time to come.

Only archivists and librarians have the capabilities to manage archived data. While librarians and archivists are important advisors and participants in scientific data management, the dominant management responsibility falls to the scientific community and its designated scientific data managers (who are a blend of scientist, computer scientist, and librarian/archivist). If practicing scientists do not participate in the management of scientific information, such data will fall into obscurity or obsolescence.
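The claim that the next byte will always be cheaper to store can be illustrated with a back-of-envelope calculation. The sketch below is purely illustrative: the two-year halving time for storage cost is an assumed rate, roughly consistent with the historical trend the text cites, not a figure from the report. Under that assumption, the cumulative cost of retaining a fixed data set indefinitely converges to a small multiple of its first-year cost, in sharp contrast to the linearly scaling costs of a physical archive.

```python
# Back-of-envelope sketch: if the annual cost of keeping a fixed data set
# falls by half every two years (an assumed, illustrative rate), then the
# cost of keeping that data set essentially forever is finite and small.

def total_cost(first_year_cost, halving_years=2, years=100):
    """Sum the exponentially declining annual storage cost over many years."""
    return sum(first_year_cost * 0.5 ** (y / halving_years) for y in range(years))

# A century of retention costs only about 3.4 times the first year's cost:
print(round(total_cost(1000.0), 2))
```

The geometric series converges regardless of the starting cost, which is why deferring retention decisions becomes economically defensible.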

The locator information (catalogs) about the managed objects is simple and compact. Finding relevant scientific information often requires searching the full content, and this content generally is not in the conveniently compressed form of text. For example, to search for all data sets where the stratospheric ozone concentration is less than some ad hoc threshold in some region, one would need to execute a complex algorithm on every data sample covering the region in question. Queries such as this become even more complex if the region of interest is determined after retrieval (e.g., how many days in a row was the areal extent of the ozone hole over open ocean greater than 5,000 square kilometers?).

The selection and use of scientific data to solve complex problems can be simplified through browsing information based on content. Browsing often involves examination of large numbers of samples and large data volumes. Specialized "browsing products" can be defined to locate records of interest. For the query examples above, low-resolution ozone maps could be used to find candidate data sets with a high probability of relevance. Information about the processes (including sensor characteristics, computer program capabilities, and calibration points) used to develop a data set is needed for its proper use. Such information increases the size and complexity of the locator service.

The remainder of this chapter describes how advancing information technologies enable the data manager, librarian, and archivist to deal with the challenges of scientific data management in collaborative fashion with the scientific user community.

ENABLING TECHNOLOGIES AND RELATED DEVELOPMENTS

Table 4.1 provides a summary of aspects of scientific data management changed by new technologies and related developments.
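The "browsing products" idea described above can be sketched in a few lines of code. The sketch below is a hypothetical illustration, not any particular archive's interface: a cheap, low-resolution browse product is scanned first to shortlist candidate data sets, and only the shortlist would then be searched at full resolution. The data-set names, grid values, and threshold are all invented.

```python
# Content-based browsing sketch: scan low-resolution "browse products"
# to find candidate data sets, deferring the expensive per-sample search.
# All names and ozone values below are hypothetical.

def browse_candidates(browse_maps, threshold):
    """Return names of data sets whose low-resolution ozone map contains
    any cell below the ad hoc threshold (nominally in Dobson units)."""
    candidates = []
    for name, grid in browse_maps.items():
        if any(cell < threshold for row in grid for cell in row):
            candidates.append(name)
    return candidates

# Hypothetical 2x3 low-resolution ozone maps for three archived data sets.
browse_maps = {
    "1994-10-01": [[310, 295, 300], [288, 180, 240]],  # contains a low cell
    "1994-10-02": [[305, 300, 310], [298, 290, 295]],  # uniformly high
    "1994-10-03": [[290, 150, 220], [260, 175, 230]],  # ozone-hole signature
}

print(browse_candidates(browse_maps, threshold=220))
# Only the shortlisted days need a full-resolution, per-sample search.
```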
These six areas are discussed in more detail below.

High-Performance Computer Networks

The rapid expansion of computer networks and their use for electronic mail and database access have obviated the need for researchers and other users of scientific and technical data to be in physical proximity to colleagues, information resources, and even advanced technical facilities. This has presented a menu of choices about the best means to distribute data and the responsibility for managing them. A worldwide "virtual" library is being created on the Internet. Application programs such as Mosaic are demonstrating the power of free and simple navigation across an ocean of available resources. Improving network capacity, reliability, performance, and security measures are helping to make these resources more widely accessible and useful.

High-performance networks also support movement of information for new applications (e.g., producing safely managed backup copies, "profiling" information for individual users' needs, or staging data through a number of refinement steps in different locations for focused research). Networks support collaborative work and research projects that span traditional research boundaries; such work requires easy access to a variety of data sources at once. High-performance networks enable scientific data resources to be widely distributed and managed by groups of scientists. Users thus are freed to concentrate on the most effective use of the data, rather than on their own data management issues. Networks can provide a vehicle for regularly distributing backup copies of data and metadata to ensure safe storage. Distribution of data to users can be done via the network in addition to, or instead of, via physical media such as tapes and CD-ROMs. Data can be linked together to help users navigate among related items; this kind of linking is at the heart of the World Wide Web concept and is brought to users by Mosaic.
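The linking of related data items can be sketched as a simple graph traversal. The record names and link structure below are invented for illustration; the point is only that once items carry links, a user (or a program) can navigate from any starting record to everything related to it, in the spirit of the World Wide Web model the text describes.

```python
# Minimal sketch of hypertext-style linking among related data items.
# Record names and the link structure are hypothetical.

links = {
    "ozone-maps-1994": ["toms-sensor-notes", "ozone-maps-1993"],
    "toms-sensor-notes": ["calibration-history"],
    "ozone-maps-1993": [],
    "calibration-history": [],
}

def reachable(start, links):
    """Collect every item a user could navigate to from a starting record."""
    seen, stack = set(), [start]
    while stack:
        item = stack.pop()
        if item not in seen:
            seen.add(item)
            stack.extend(links.get(item, []))
    return seen

print(sorted(reachable("ozone-maps-1994", links)))
```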
The population of information providers (e.g., people who can contribute to the knowledge base) has now grown to include all networked members of a user

population. Such contributions can be as simple as an annotation on an existing item, or as complex as a fully processed and peer-reviewed new item. Most profoundly, the evolving network infrastructure enables new concepts for distribution of functions and responsibility in organizations (NRC, 1994).

Preserving Scientific Data on Our Physical Universe

TABLE 4.1 New Technologies and Related Developments That Enable a New Strategy for the Management of Scientific and Technical Data

1. High-performance computer networks
   Key features: Distributed functions; rapid delivery of large data volumes
   What is enabled: Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility

2. Low and declining cost of storage
   Key features: Inexpensive backup; continually declining cost; ease of migration
   What is enabled: Deferral of archiving decisions; trust in distributed management due to safe storage backup

3. Advanced data management
   Key features: Ability to rigorously and formally manage diverse data types
   What is enabled: More complex data structures (other than "flat files") handled in archives, with great potential advantages

4. Changing requirements for information technology professionals
   Key features: Ability of personnel with lower technical skills to succeed in data management roles
   What is enabled: Ability to entrust scientific data management in a distributed environment

5. High reliability of technology components
   Key features: Availability of better components and connections; reduced procurement and operations costs
   What is enabled: Reduced cost and effort in data migration; trusted connections for communication and collaboration

6. Development and acceptance of standards
   Key features: Agreement on terms, interfaces, media, procedures
   What is enabled: Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support
Although networks can provide a quick and easy means to distribute data, it must be noted that CD-ROMs have been used to distribute data for several years and have been very successful. CD-ROMs not only permit users to have a huge local library of data, but they often come with a better set of data access tools than are normally available. Some data sets are large enough that the most cost-effective method to deliver them is on media such as Exabyte tapes (8 mm).

Low and Declining Cost of Storage

As for most aspects of computer hardware, the cost of storage has declined continuously and rapidly over the 30 years of the modern computer age. New storage technology is also increasingly compact and supports ever greater access speeds (Gelsinger et al., 1989). The historical trends are expected to continue for up to 20 years; already, laboratory engineering results confirm this projection for at least the next decade.

The most significant implication is that decisions about sampling or discarding scientific data can generally be deferred, particularly for data sets for which the necessary metadata exist and whose quality has been certified. For relatively smaller data sets, the deliberation regarding long-term retention may well cost more than the recurring acts of migration. The cost of storage is small in relation to overall mission or investigation costs and therefore should not be a decision driver. Experience suggests, however, that the funds to meet these costs need to receive special protection in the annual agency budget cycles. The support for the data management aspects of scientific missions has typically had a lower priority than the data collection aspects.

The low cost of storage also implies that the incremental cost of supporting a remote safe copy of data and metadata will be small, except for the very largest data sets. Therefore, over the next few decades, data received and stored may be expected to be cheaply and quickly migrated to new technologies when storage media reach their nominal limits of reliability or for the convenience of improved access. It is important not to expect a perpetual advantage from this technological discontinuity. The fact that data require significant time periods for their migration must be considered. The cost decay trend will slow at some point in the future, causing the overall cost of storage to return to something closer to a linear relationship to volume. We also must be realistic and expect that funds will not always be available to save and back up every data set; decisions on retention or sampling will have to be made. Nevertheless, the already low and continually declining cost of storage allows a priori decisions to be made in certain circumstances to keep scientific data sets indefinitely.

Backup or safe storage copies of data are becoming more affordable as data migration becomes less expensive with smaller, faster, and cheaper storage devices. Reliability also is improving with new software-based archive systems (including migration and backup features). However, there is an enhanced need for ongoing technology monitoring by an appropriate body for media, standards, and migration automation. Such monitoring should be incorporated in any scientific data management and archiving strategy.
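The migration-with-backup discipline described above can be sketched as a short procedure: copy each holding from old media to new, verify a checksum, and refuse to continue silently if verification fails. The directory names and file contents below are hypothetical stand-ins for two generations of archive media; this is a sketch of the idea, not any agency's actual migration system.

```python
# Sketch of an automated, verified media-migration step.
# Directories and file contents are hypothetical stand-ins for archive media.

import hashlib
import shutil
import tempfile
from pathlib import Path

def checksum(path):
    """SHA-256 digest used to confirm a copy is bit-for-bit identical."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def migrate(old_dir, new_dir):
    """Copy every file from old media to new, verifying each transfer."""
    new_dir = Path(new_dir)
    new_dir.mkdir(exist_ok=True)
    for src in sorted(Path(old_dir).iterdir()):
        dst = new_dir / src.name
        shutil.copy2(src, dst)
        if checksum(src) != checksum(dst):  # never retire unverified data
            raise IOError(f"verification failed for {src.name}")
    return sorted(p.name for p in new_dir.iterdir())

# Demonstration with throwaway directories standing in for two media generations.
base = Path(tempfile.mkdtemp())
old = base / "old_media"
old.mkdir()
(old / "dataset_a.dat").write_bytes(b"ozone 1994")
(old / "dataset_b.dat").write_bytes(b"ozone 1995")
print(migrate(old, base / "new_media"))
```

Only after verification succeeds would the old media be retired, which is the property that makes near-total automation trustworthy.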
The rapid change of storage technologies suggests that efforts to protect today's scientific data legacy must be accelerated. The obsolescence of media types and recorders/players is occurring within shorter and shorter time periods. This implies that "salvage" activities will be increasingly difficult for data left out of migrations to new media. This "join or be left behind" by-product of rapid technological change intensifies short-term budget pressures on archives. In response, it demands a strong management commitment to provide resources and save important data sets.

If digital data are to survive, it is of fundamental importance to manage and constrain the costs of archive maintenance. The problem is that new data will be coming in, old data will need to be migrated to new media, the building will need to be repaired, and there usually will not be a lot of extra money for new equipment or added staff. To avoid problems, the data migration process in the system design must be almost totally automated. This refinement often has not been achieved, and its absence can cause unnecessary budget difficulties. Finally, it is essential for agencies to preserve all the hardware and software necessary to access their data until the data have been successfully migrated or otherwise disposed of.

Advanced Data Management

There are signs that data management technology is beginning to address and, perhaps, to catch up with the complexities of the very large volumes of scientific data. Improvements have occurred in database management systems, hierarchical file systems, data representation standards, query optimizers, data distribution techniques, specialized access methods, and data security tools (Silberschatz et al., 1991). Further, investment in standards and cooperative approaches is accelerating, fueled in part by the demands of medicine, education, entertainment, journalism, financial services, and other commercial applications.
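The kind of rigorous, self-describing data management these improvements point toward can be sketched as structured metadata that carries a data set's provenance along with it. Every field name and value below is invented for illustration; this is not any particular archive's schema, only an example of the richer-than-flat-file structure the chapter discusses.

```python
# Sketch of metadata richer than a "flat file": each data set carries
# structured provenance (sensor, calibration, processing steps) that a
# content-based locator service could query. All field values are invented.

from dataclasses import dataclass, field

@dataclass
class Provenance:
    sensor: str
    calibration_date: str
    processing_steps: list = field(default_factory=list)

@dataclass
class DataSetRecord:
    name: str
    variable: str
    provenance: Provenance

record = DataSetRecord(
    name="ozone-maps-1994",
    variable="total_column_ozone",
    provenance=Provenance(
        sensor="TOMS (hypothetical configuration)",
        calibration_date="1994-09-15",
        processing_steps=["raw counts to radiance", "radiance to ozone"],
    ),
)

# The structured form supports precise queries that a flat file cannot:
print(record.provenance.sensor)
```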
While competing approaches and inconsistent vocabulary create near-term confusion, the attention and investment levels bode well for the longer-term capability to go beyond "flat file" representations of the data that need to be archived. The new tools and techniques are more descriptive of the data, their heritage, the processes that have worked upon the data, and the relationships of data to each other. New data management technology will enable easier representation of more diverse types of scientific data. Because of the rigor that new techniques require (e.g., for self-documentation or for precise definition of access methods), long-term archives will benefit from data structures other than flat files. The new technology also implies that the creation of a richer set of metadata will be easier to implement and that these data will be of high scientific value for content-based retrievals. To realize the potential of this enabled facility with metadata, the scientific community will have to accept and support efforts to develop and apply new metadata requirements.

The Changing Requirements for Information Technology Professionals

Information technology professionals with high skill levels can now be found in all parts of the United States and around the world. But as they bring the information technology industry to higher levels of maturity, the effect is to reduce the complexity of major tasks in managing information. Such tasks previously required skilled use of sophisticated assembly language or job control language (JCL) programming. JCL programming refers to the steps one formerly carried out at the system console to get programs to run, attach the right files, print to the right printer, and perform similar functions. Today, much of this work is masked, made automatic, and controlled through icons and other means. These tasks can now be performed by competent scientists or professionals with lower technical skills, rather than by highly trained specialists. Because more functions can be handled completely by machines, management of the data can be greatly automated and operated by less skilled individuals. The data themselves can be widely distributed without fear of loss, particularly with a backup copy in safe storage. Over the next 5 to 10 years, the costs for information technology professionals at individual scientific data centers and archives can be dramatically reduced.
The reasons for the reduction in costs include more automatic processes for storage management, rudimentary learning capability in systems, services performed by end users based on their preferences, improved systems management, higher component reliability, improved application of standards, and vendor consistency with standards. Although the dominant trend will be toward a smaller, less technically skilled staff to manage the physical aspects of the archive, there will be a pressing demand for a small number of highly skilled people who blend the skills of physical scientist, computer scientist, and archivist. These people must be able to handle the intellectual challenges of bridging these disciplines while providing the coaching and direction to help develop data and operations standards for scientific communities.

High Reliability of Technology Components

Microprocessors, new storage media technologies, mature software, error correction capabilities, improved packaging, and reduced power consumption have all made significant contributions to the reliability of computer systems and networks. What was recently considered unreliable, requiring constant attention and expensive repair, is now regarded as reliable and not worth the effort of repair. Although precautions have always been taken to protect against loss of valuable data, many of these precautions are now built into the base of mature software or are increasingly familiar parts of facilities' operating procedures. High reliability of technology supports a capacity for high levels of trust and the ability to widely distribute functions and databases. Distributed systems can achieve the same levels of quality and trust as centralized archives through the use of the same underlying hardware and software technology, operating procedures, safe storage of copies, and high-quality (error-corrected) telecommunication connections.
High reliability has enabled new applications such as the World Wide Web, in which context switching from one machine to the next, on a worldwide basis, is readily accomplished. Increased reliability also has allowed computing technology to be put into the hands of business managers, consumers, and shop clerks; without such reliability, maintenance effort would outweigh productivity benefits. As a result, powerful organizational and operational frameworks can be built, much as new materials enable new architecture or new machines.

Development and Acceptance of Standards

The development of effective standards has been pivotal to promoting the widespread use of electronic information. Communication protocols such as TCP/IP have fueled the growth of the Internet. Other format standards support the interchange of documents. For example, the Standard Generalized Markup Language (SGML) provides a uniform way of formatting textual documents so that they can be read by different document processing tools. The HyperText Markup Language (HTML) is a standard used to represent and link documents; it describes the pages viewed with Internet viewers such as Mosaic. Hardware and software standards, such as the instruction set architectures of microprocessor-based computers, modem protocols, media formats, and query languages, also have played critical roles.

Standards can simplify many of the traditional data management jobs. For example, the time that would be spent deciphering a tape format is saved, and the job of installing a new application is facilitated. Having effective standards in place reduces the level of tedious, nonproductive effort and frees up the archivist's time for new tasks. Standards determined now will typically remain in effect for long periods, perhaps a decade or more, with some small evolutionary augmentations. This means that a baseline of appropriate standards can be selected for a body of information with a reasonable expectation that it will not be quickly replaced. When it appears that the existing standards baseline needs to be updated, the information can then be migrated to a new one; a deliberate data migration strategy based on standards tracking is thus possible.

The role of standards certainly is not limited to the general computing community.
Scientific teams and discipline groups continuously work to codify best practices, definitions, and algorithms, which are then propagated as community standards. Standards developed by the scientific community are often the most important to promote and apply; if properly promulgated, they can enable improved understanding, broader collaboration, and easier data management and related research. Finally, it should be emphasized that standards and guidelines to support long-term archiving must not inhibit innovation or the evolution of information systems and technology. Often the best standards and guidelines are those that are independent of technology.

OPPORTUNITIES FOR NEW ORGANIZATIONAL STRUCTURES

With rapid technological improvements and newly enabled capabilities, it is sometimes easy to forget the importance of long-term commitment by managers to policy and resource requirements. No technological changes will by themselves replace the basic, unsung efforts of high-quality scientific data management. In fact, although technology itself can improve the availability of data, truly accessible and useful scientific information will be achieved only through such management commitment. This commitment must be based on a coherent strategy for life-cycle management of data, including technology acquisition, data and information management practices, and technology-independent standards to ensure that the minimum levels of data content and consistency for research uses are met. Further, such a comprehensive strategy will be successful only with the active and committed involvement of the scientific community itself. The level of effort and change that may be required to achieve this community involvement should not be underestimated, and fundamental change to the value system of the community may be required. Nevertheless, as discussed above, technological advances allow the creation of new infrastructure, challenging existing organizational assumptions.
Effective organizational designs based on new allocations of responsibility are enabled. For scientific data management, the technological changes support organizations with the following attributes:

Widely distributed responsibility. New telecommunications, data management, and standards technology allows for high levels of trust in distributed data management. Physical possession of data by archivists is no longer essential. The wide availability of information technology professionals and other skilled data managers (along with the lower technical skill levels actually needed) enhances the ability to distribute the data more broadly and increase user participation. Such distribution of data and their ownership (whether actual or implied) by user groups improves the utility of the data and helps create important support for long-term retention.

High-value peer-to-peer communication. With access to data and to people on line, a variety of new collaborative relationships can develop. Information can be broadcast to interested individuals in a timely fashion. Data can be provided directly to field researchers to focus new data collection. Physical proximity and formal lines of communication are no longer vital to effective organizational operation. Indeed, closed, highly structured organizations often will be uncompetitive or will fail to take full advantage of innovation.

Specialized data centers. Distribution of resources implies that some specific locations can specialize and yet still contribute effectively to all. Specialized groups or institutions could be created in a scientific discipline or in some aspect of data management, archives, or standards. Designation of such specialized centers, in addition to those already in existence, is a significant mechanism for achieving economies of scale, reducing overall costs while enhancing the effectiveness of certain functions for the benefit of all.

Explicit long-term (technology) strategies. A long-term technology strategy needs to be developed. The rapidly changing base of technology requires that a deliberate sequence of phases be selected, through which data and data management will migrate. The constant evolution of information technologies demands that an organizational element take on this "technology navigation" function.
Measurement as a vital tool. In a fast-paced and perhaps widely distributed effort, metrics are important to communicate expectations of performance clearly, register results, and help detect weak spots for corrective action. In particular, metrics could be established to determine data set use and to support archiving strategy decisions. Metrics also could be developed to help ensure high-quality service and proper data protection.
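A usage metric of the kind suggested above can be sketched in a few lines: tally retrievals per data set from an access log and flag rarely used holdings for deliberate review. The log entries, data-set names, and the low-use threshold below are invented for illustration; a real archive would choose its own criteria.

```python
# Sketch of usage metrics supporting archiving-strategy decisions.
# Access-log entries and the threshold are hypothetical.

from collections import Counter

access_log = [
    "ozone-maps-1994", "ozone-maps-1994", "ozone-maps-1994",
    "aerosol-index-1993", "ozone-maps-1994", "aerosol-index-1993",
    "surface-uv-1992",
]

def usage_report(log, low_use_threshold=2):
    """Count retrievals per data set; flag holdings below the threshold."""
    counts = Counter(log)
    flagged = sorted(n for n, c in counts.items() if c < low_use_threshold)
    return counts, flagged

counts, review_candidates = usage_report(access_log)
print(counts["ozone-maps-1994"], review_candidates)
```

Flagged holdings are candidates for a retention or sampling decision, not for automatic deletion; the metric informs the strategy rather than replacing it.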