

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




Summary

Scientific data reflect both the organization and the chaos of the natural world. They stimulate us to develop concepts, theories, and models to make sense of the patterns they represent. The resulting abstractions are the formal and systematic ideas that constitute the understanding of relationships between causes and consequences, and perhaps may enable prediction of future sequences of events. Because scientists transform data from the material world into ideas, the observations of objects and processes in the physical world are the stimuli of scientific thought. Data are thus the seeds of scientific ideas.

There are strong motivations for preserving scientific observations:

• Many observations about the natural world are a record of events that will never be repeated exactly. Examples include observations of an atmospheric storm, a deep ocean current, a volcanic eruption, and the energy emitted by a supernova. Once lost, such records can never be replaced.
• Observed data provide a baseline for determining rates of change and for computing the frequency of occurrence of unusual events. They specify the observed envelope of variability. The longer the record, the greater our confidence in the conclusions we draw from it.
• A data record may have more than one life. As scientific ideas advance, new concepts may emerge in the same or entirely different disciplines from study of observations that led earlier to different kinds of insights. New computing technologies for storing and analyzing data enhance the possibilities for finding or verifying new perspectives through reanalysis of existing data records. Thus, the relative importance of data, both current and historical, can change dramatically, often in entirely unanticipated directions.

The substantial investments made to acquire data records justify their preservation. The cost of preservation will almost always be small in comparison with the cost of observation.
Because we cannot predict which data will yield the most scientific benefit in years ahead, the data we discard today may be the data that would have been invaluable tomorrow. The assembled record of observational data thus has dual value: it is simultaneously a history of events in the natural world and a record of human accomplishment. The history of the physical world is an essential part of our accumulating knowledge, and the underlying data form a significant part of that heritage. They also portray a history of our scientific and technological development.

Preserving Scientific Data on Our Physical Universe

There are numerous socioeconomic reasons, in addition to the compelling scientific and historical motivations, for the long-term retention of observational, as well as certain types of experimental, data. For example, historical climate data have had well-documented uses in a broad range of applications in the manufacturing, energy, agriculture, transportation, communications, engineering, construction, insurance, and entertainment sectors. Such applications are common as well for other types of observational data on the Earth's environment. Experimental data in the physical sciences also have many industrial and other practical uses.

Today we can foresee the possibility of using the national resource of scientific data more advantageously than ever before as technological advances open new vistas for managing scientific information. Advances in data storage technologies make the long-term retention of virtually all data both feasible and affordable. The existence of the Internet and of the emerging National Information Infrastructure (NII) enables nationwide sharing and application of data that reside in appropriately configured databases. Our new power to store, distribute, and access data and information is changing the way we work and think.

However, the communities involved in the creation, retention, and use of scientific data about the physical world are not optimally organized. They commonly work toward disparate goals, are not well connected, and do not take full advantage of technological and conceptual advances in data management and communication. An entirely new approach to the long-term preservation of scientific data is now both feasible and essential. It must take advantage of advancing technology and of distributed communications and management structures to empower both the creators and the users of such data.
This study, performed at the request of the National Archives and Records Administration (NARA), and partially supported by the National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA), identifies the major issues regarding efforts to archive and use data in the physical sciences, establishes retention criteria and appraisal guidelines for those data, reviews important technological advances and related opportunities, and proposes a new strategy to help ensure access to the data by future generations.

THE CHALLENGE OF EFFECTIVE PRESERVATION AND USE OF SCIENTIFIC DATA

The results of scientific research are disseminated in this country through a hybrid system that includes professional society and other not-for-profit publishers, the commercial sector, and the government. The formal journals are published largely by the professional society and commercial sectors, while government agencies manage less formal reports (gray literature). Secondary abstracting and indexing services provide access to this literature, increasingly by electronic means. While there are strains in this system because of rising costs, increasing workload, and issues related to the protection of intellectual property, it has served U.S. science well and has been an invaluable link in the process of translating scientific advances into further advances, useful technology, and economic benefits.

The current system, however, is not well suited to handle the scientific and technical electronic databases that are the focus of this study. The cost of maintaining these databases is typically too great to be covered by user fees; instead these databases must be considered part of the national scientific heritage. Some government agencies have accepted responsibility for maintaining and disseminating the data resulting from their research and development.
In some cases, this system is working reasonably well, but in others there are problems even with providing current access. Archiving for the long term raises questions in all cases, however. A general problem prevalent among all scientific disciplines is the low priority attached to data management and preservation by most agencies. Experience indicates that new research projects tend to get much more attention than the handling of data from old ones, even though the payoff from optimal utilization of existing data may be greater.

With regard to laboratory data, government programs have existed since the 1960s to compile results from the world scientific literature, to check the data carefully, and to prepare databases of critically evaluated data. Despite chronic underfunding, these programs have produced databases of lasting value to the nation, and the government investment in creating and maintaining these databases has been repaid many times over.

In the area of observational databases, the situation is mixed. Federal agencies collect large amounts of observational data, which in many cases are continuously added to the available record of Earth and space processes. The data sets resulting from these activities are sometimes well-documented and maintained in readily accessible form; in many other cases, however, while the data are saved, they are exceedingly difficult or impossible to access or use, and thus are effectively unavailable.

The most important deficiencies are in the documentation, access, and long-term preservation of data in usable form. Insufficient documentation is a generic problem that affects, in varying degrees, all the classes of data addressed in this study. Furthermore, few of the federal data centers can give adequate attention to long-term archiving because they are stretched thin by current demands and inadequate resources. Even the data that are archived may become inaccessible because they are not regularly migrated to new storage media as the hardware and software used to access the data become obsolete or inoperable.

Another major problem inhibiting access to data is the lack of directories that describe what data sets exist, where they are located, and how users can access them. In many cases the existence of the data is unknown outside the original scientific groups, and even if known, there frequently is not enough information for a potential user to assess their relevance and usefulness.
The lack of adequate directories adversely affects the exploitation of our national data resources and leads to unnecessary duplication of effort.

A significant fraction of the archived scientific data is held by the federal agencies that collected the data as part of their mission. However, a large amount of valuable scientific data gathered with federal funds is never archived or made accessible to anyone other than the original investigators, many of whom are not government employees. In many instances, the organizations and individuals that receive government contracts or grants for scientific investigations are under no obligation to retain the data collected, or to place them in an accessible archive at the conclusion of the project. Thus, data sets that commonly are gathered at great expense and effort are not broadly available and ultimately may be lost, squandering valuable scientific resources and much of the public investment spent in acquiring them. Clearly, there is a great need for the agencies to get more return on their investment in science by the simple expedient of making the data collected under their auspices accessible to others.

Finally, the holdings of scientific and technical data by NARA in electronic or any other form are very small in comparison with the data holdings of the federal agencies and the organizations supported by them. Moreover, NARA's budget for its Center for Electronic Records, which has the formal responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a budget lower than that of many of the individual agency data centers reviewed by the committee in this study. Given NARA's current and projected level of effort for archiving electronic scientific data, it is obvious that NARA will be unable to take custody of the vast majority of these scientific data sets.
Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and the scientific community is needed to preserve the most valuable data and ensure that they will remain available in usable form indefinitely. The challenge is to develop data management and archiving procedures that can handle the rapid increases in the volumes of scientific data, and at the same time maintain older archived data in an easily accessible, usable form. An important part of this challenge is to persuade policymakers that scientific data and information are indeed a precious national resource that should be preserved and used broadly to advance science and to benefit society.

RETENTION CRITERIA AND THE APPRAISAL PROCESS

The National Archives and Records Administration appraises records on the basis of their informational and evidential value. It is concerned with records of long-term value, those records that will probably have value long after they cease to have immediate, or primary, uses. The value of scientific and technical data is primarily informational and is based on the scientific content of the records, rather than on the evidence they provide concerning the activities of the agency that collected or created them.

Recommendations

The recommendations below regarding the retention criteria and appraisal process should be applied by those responsible for stewardship to all physical science data. Similar criteria and appraisal guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such records.

As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that long-term retention is clearly justified.

For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set.
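As an illustration only, the general rule for observational data and the four metadata facets named above can be sketched in Python. The class and field names (`Metadata`, `ObservationalDataSet`, `retain_permanently`) and the sample data set are hypothetical, and treating "documented well enough for most primary uses" as "all four facets are filled in" is a simplifying assumption, not a procedure defined in this report.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Metadata:
    """The four facets the text says complete metadata should define."""
    content: str = ""
    format: str = ""      # format or representation
    structure: str = ""
    context: str = ""

    def missing_facets(self) -> List[str]:
        """Return the names of facets that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_complete(self) -> bool:
        return not self.missing_facets()


@dataclass
class ObservationalDataSet:
    name: str
    nonredundant: bool
    useful: bool
    metadata: Metadata

    def retain_permanently(self) -> bool:
        # General rule from the text: nonredundant, useful, and documented.
        # Assumption: "documented well enough" is approximated by metadata
        # completeness; a real appraisal would rely on expert judgment.
        return self.nonredundant and self.useful and self.metadata.is_complete()


# Hypothetical data set whose context (who collected it, why, and how) is missing.
storm = ObservationalDataSet(
    name="storm-track-example",
    nonredundant=True,
    useful=True,
    metadata=Metadata(
        content="hourly wind and pressure readings",
        format="ASCII table",
        structure="one row per observation",
        context="",
    ),
)
print(storm.metadata.missing_facets())  # ['context']
print(storm.retain_permanently())       # False
```

The sketch deliberately leaves out the other retention criteria (uniqueness, hardware availability, cost of replacement, peer review), because the text does not say how they should be weighed against one another.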
The appraisal process must apply the established criteria while allowing for the evolution of criteria and priorities and must be able to respond to special events, such as when the survival of data sets is threatened. All stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources professional to assist with issues of long-term retention.

Classified data must be evaluated according to the same retention criteria as unclassified data in anticipation of their long-term value when eventually declassified. Evaluation of the utility of classified data for unclassified uses needs to be done by stakeholders with the requisite clearances to access such data.

OPPORTUNITIES CREATED BY TECHNOLOGICAL ADVANCES FOR NEW DATA USE AND RETENTION STRATEGIES

Rapid progress in information technology continually alters both the quantity and the quality of scientific information and periodically stimulates fundamental modification of data management and archiving strategies. Recent technological advances have enabled new methods and strategies for data storage and retrieval and have created better ways of connecting users to data resources and to each other. Moreover, the evolving technologies are catalysts for revising organizational structures to manage distributed scientific data archives much more effectively.

Table S.1 provides a summary of new technologies and related developments that enable a new strategy for the management of scientific and technical data. These advances in information technologies support the creation of a highly distributed, federated management structure for our nation's scientific information resources.

TABLE S.1 New Technologies and Related Developments That Enable a New Strategy for the Management of Scientific and Technical Data

High-performance computer networks
  Key features: Distributed functions; rapid delivery of large data volumes
  What is enabled: Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility

Low and declining cost of storage
  Key features: Inexpensive backup; continually declining cost; ease of migration
  What is enabled: Deferral of archiving decisions; trust in distributed management due to safe storage backup

Advanced data management
  Key features: Ability to rigorously and formally manage diverse data types
  What is enabled: More complex data structures (other than "flat files") handled in archives with great potential advantages

Changing requirements for information technology professionals
  Key features: Ability of personnel with lower technical skills to succeed in data management roles
  What is enabled: Ability to entrust scientific data management in a distributed environment

High reliability of technology components
  Key features: Availability of better components and connections; reduced procurement and operations costs
  What is enabled: Reduced cost and effort in data migration; trusted connections for communication and collaboration

Development and acceptance of standards
  Key features: Agreement on terms, interfaces, media, procedures
  What is enabled: Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support and data management
A NEW STRATEGY FOR ARCHIVING THE NATION'S SCIENTIFIC AND TECHNICAL DATA

In order to respond adequately to the imperatives for preserving data about the physical universe and to take advantage of the technological advances described above, the federal government should create an integrated and adaptive infrastructure and related processes for providing ready access to the national resource of scientific and technical data and related information. Such an effort must support the needs of data originators, users, and custodians across all phases of the data life cycle, from origin to use by future generations.

The committee believes that the following principles should guide the effort of the government agencies in the long-term retention of scientific and technical data:

• Data are the lifeblood of science and the key to understanding this and other worlds. As such, data acquired in federal or federally funded endeavors, which meet established retention criteria, are a critical national resource and must be protected, preserved, and made accessible to all people for all time.
• The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much attention as acquisition and preservation.
• Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers to use of scientific data.
• A successful archive is affordable, durable, extensible, evolvable, and readily accessible.
• The only effective and affordable archiving strategy is based on distributed archives managed by those most knowledgeable about the data.
• Planning activities at the point of data origin must include long-term data management and archiving.

The Proposed National Scientific Information Resource Federation

The committee believes that the federal government should create a National Scientific Information Resource Federation, an evolutionary and collaborative network of scientific and technical data centers and archives, to take on the challenge of providing effective access to and preservation of important data and related information. Such an initiative would begin to exploit fully our nation's significant investment in the physical (and other) sciences and the data acquired with that investment.

Several critical concepts must govern any federated management structure for it to function properly (Handy, 1992):

• Subsidiarity: the power is assumed to lie with the subordinate units of an organization. Power can be relinquished, but not taken away. The subordinate units typically are best qualified to make operational decisions that directly affect them and that they will be implementing. The central management is allowed only those powers needed to ensure that the subordinates do not damage the organization. It is clear that the strengths of the current system for managing scientific and technical data and information in the United States are distributed among a number of diverse data centers and archives, both within and outside the government. A successful federation of these existing institutions would recognize that they are the locations of expertise on their respective data holdings.
Thus the central organization should be small and should not micromanage the day-to-day operations of the subsidiary organizations.
• Pluralism: the members are interdependent. In a federation, the individual subsidiary organizations recognize the advantages of belonging to the federation, because of products or services that can be obtained from other elements in the federation. The existence of many specialized data centers and archives, as well as the possibility of creating new ones in a networked environment, can offer significant economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of democracy in the federation.
• Standardization: interdependence requires compatible languages, communications, basic rules of conduct, and units of measurement. These elements may be summarized as technical and procedural standardization. Standards that are developed by consensus of the subsidiary elements (e.g., the participating data centers, archives, and researchers) are widely recognized as essential to the successful management of data.
• Separation of powers (responsibilities): a system of checks and balances is necessary to ensure that the central authority does not take on unnecessary power. This principle must be incorporated into the federation's organizational structure.
• Strong leadership: the central coordinating element or executive office must act as the standard bearer, promoting the federation's established goals and objectives while reminding the subsidiary organizations of the importance of carrying out their responsibilities.

A federated data management system would be consistent with the goal of the National Information Infrastructure to distribute information resources broadly throughout our society. The technology is available to make a fully networked but highly distributed system of data centers and archives both feasible and desirable. Such a system would be efficient in providing access to scientific data and information to a large number of potential users and would maximize the government's return on the very large investment that initially went into acquiring those data. From an organizational standpoint, a federated management structure would allow the disparate elements to continue to specialize in what they each do best and to fulfill their individual organizational mandates, while providing some efficiencies of scale and political leverage in addressing the most pressing issues. The committee believes this approach is especially timely and important in an era of federal government budget reductions.

Recommendations

The committee thus recommends that the federal government take the following steps for adequately preserving and providing access to data about our physical universe:

Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral part of the National Information Infrastructure (NII). This concept must encompass not only an electronic network, but also individuals, organizations, communities, data resources, procedures, guidelines, and associated activities of data generation, management, custodianship, and use. The NSIR Federation thus should provide the means for defining a coherent approach to managing the life cycle of scientific data. This approach should be developed and implemented through consensus of collaborating organizations with diverse and autonomous missions. The interagency Global Change Data and Information System is an example of a prototype NSIR Federation, focused on data for a specific set of interdisciplinary science problems.
The NSIR Federation would build on such efforts, providing for better coordination and interaction among them, and would help organize fledgling efforts to preserve and provide broad access to data in other disciplines.

The administration should take the steps necessary to fully define and create the NSIR Federation. There are at least two potential focal points within the administration for planning such an activity. These are the interagency Information Infrastructure Task Force for the NII and the National Science and Technology Council. A convocation of representatives from the scientific, data and information management, and archiving communities would be a good way to help define and inaugurate this initiative.

Following the formal authorization by the federal government for creating the NSIR Federation, the principal parties, including NARA and NOAA, should conclude agreements for the implementation of a distributed archive system. The system should involve all relevant institutions, including nongovernmental entities that are funded by the federal government or that maintain data that were acquired with federal funds. As a general principle, data collected by an agency should remain with that agency indefinitely. The committee recognizes that this recommendation may require significant operational changes for agencies other than NOAA, and even some changes with respect to NOAA's data activities.

Furthermore, the associated agencies in the NSIR Federation must work together, under the lead of a small executive office with the expertise to establish data management guidelines and minimum criteria for adequate metadata that could be applied across the entire Federation. The executive office could be either a high-level interagency coordinating committee or a new office at an appropriate federal agency, such as the National Science Foundation, which has a broad scientific and technical as well as communication mandate.
In any case, the executive office should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, as well as the tendency to consolidate and centralize data holdings. A management council consisting of representatives of the member organizations should be created to help ensure that the executive office function remains fully responsive to all members of the federation.

Data access and preservation services should be implemented on the most cost-effective basis possible for the Federation. For example, one institution should provide a service to one or more other institutions in order to exploit potential economies of scale and focal points of expertise. This measure might increase the cost to the providing institution, but would decrease the overall cost to the federation, the government, and the taxpayer.

The institutions belonging to the NSIR Federation should develop a process for collaborating effectively on specific initiatives. This process should provide a mechanism to define and prioritize data management and preservation initiatives, to establish the required agreements between collaborating organizations, and to secure funding for each initiative. Each participating organization would contribute to the federation according to its particular strengths and in a manner consistent with the founding charter. In addition, an independent advisory board consisting of experts from user groups should be formed in support of each initiative.

The NSIR Federation should develop a national resource of information technology that is consistent with its chartered objectives and that can be effectively distributed to institutions that must manage data. These technologies would include complete products, designs, guidelines, standards, and methodologies. A related long-term technology strategy, or "technology navigation" function, should be developed to help guide these efforts.

The NSIR Federation should institute an independently managed process for awarding NSIR certification to member scientific institutions and their data and information systems on the basis of well-defined criteria and standards.
The certification process should be managed by a nongovernmental, not-for-profit organization, which would receive technical guidance from the participating federal agencies. The certification needs to have credibility in the community, so that nonmember institutions will aspire to attain certification and have it tagged to their products. The certification also should be something that commercial value-added providers seek to increase the credibility of their products.

It also is important for the committee to state what the NSIR Federation should not be. It should not become an expensive bureaucratic entity. The executive office must not impose any standards or information technologies from above that have not been validated through a consensus process of the member organizations. Finally, the executive office must not attempt to micromanage the operations of the participants, nor should it have any direct control over their budgets and funding allocations.

Recommendations Specifically for NARA

Although NARA has a legislative mandate to preserve federal records, it cannot today, nor will it likely ever be able to, act as the custodian of most physical science data. The data volume is too great in relation to the very low funding appropriated to NARA, the NARA staff do not have the specialized scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that which already exists at other agencies would need to be duplicated by NARA.
In addition, the designation of a federal record is sometimes irrelevant to the archival process for scientific and technical data, and many data of long-term interest do not meet the existing definition of a federal record.* Hence, NARA has a special role as a partner in the archiving process for scientific and technical data sets that is different from its traditional role as the nation's archives.

*"'[Federal] records' includes all books, papers, maps, photographs, machine readable materials, or other documentary materials, regardless of physical form or characteristics, made or received by an agency of the United States Government under Federal law or in connection with the transaction of public business and preserved or appropriate for preservation by that agency or its legitimate successor as evidence of the organization, function, policies, decisions, procedures, operations, or other activities of the Government or because of the informational value of the data in them" (44 U.S.C. 3301).

The committee makes the following specific recommendations to NARA in addition to those made elsewhere in this report:

NARA should strengthen its liaison with each federal agency that produces scientific and technical data to ensure that appropriate attention is devoted to their long-term retention in a distributed storage environment.

NARA should form standing advisory committees with managers of scientific data, historians, and scientific researchers to address the retention and appraisal of scientific and technical data collections and related issues.

NARA should collaborate with other agencies that maintain long-term custody of data to develop an effective access mechanism to these distributed archives. The initial step should focus on locator systems and evolve toward a transparent access system.

Finally, NARA should work with the scientific community and potential sources of scientific data to develop adaptable performance criteria for data formats and metadata, rather than mandating narrow and inflexible product standards.

Recommendations Specifically for NOAA

As the largest holder of earth sciences data in the United States, NOAA has a vast amount of scientific data stored at a number of facilities across the country. NOAA thus has an especially important role in the preservation of our nation's observational data on the physical environment. The committee makes the following specific recommendations to NOAA:

NOAA should place a higher priority on documenting and establishing directories of its data holdings.

NOAA, with the active cooperation of NARA, should lead efforts to better define technology-independent standards for archiving, storing, and transmitting the data within its purview.

Finally, NOAA, as well as every other federal science agency, should ensure that:
• all its data are shared and readily available;
• it fulfills its responsibility for quality control, metadata structures, documentation, and creation of data products;
• it participates in electronic networks that enable access, sharing, and transfer of data; and
• it expressly incorporates the long-term view in planning and carrying out its data management responsibilities.

The creation of the committee's proposed NSIR Federation would help provide a collaborative mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of scientific and technical data and information resources to the nation.