

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




The Challenge: Preservation and Use of Scientific Data

We advance our understanding of the physical universe by building on current and past studies in individual disciplines, by collecting and analyzing new types of data, and by using past observations in entirely new ways not envisioned when the data were initially collected. The more complete the record of scientific data and information, the more new understanding and knowledge we can extract from it. Observations of natural phenomena typically represent a record of events that will never be repeated in a dynamic universe that continually changes in time and varies in space. New scientific advances have had significant, sometimes profound, societal and economic impacts and may be expected to be equally important in the future. Scientific data and information are at the heart of these advances and are essential for new discoveries. Therefore, they constitute a precious national resource.

The sections that follow describe briefly the two major types of data that are of critical importance in the physical sciences: experimental laboratory data in physics, chemistry, and materials sciences, and observational data in the earth and space sciences. In each of these broad areas the progress that has been made to date in terms of long-term preservation and accessibility is characterized, and the key issues are identified. More comprehensive descriptions of the status of long-term data retention in the various physical science discipline areas are in the volume of working papers prepared as background for this report (NRC, 1995).

EXPERIMENTAL LABORATORY DATA

The experimental sciences have progressed over the centuries by building on the concepts, theories, and factual information resulting from each generation of scientific inquiry. The observations of Tycho Brahe were used by Kepler to develop his laws of planetary orbits, and Newton's formulation of mechanics drew upon the previous work of Galileo, Kepler, and others.
A century of measurements on properties of the chemical elements provided the raw material needed for Mendeleev to construct his periodic table. The history of science is rich in examples where the introduction of new, often revolutionary, concepts rested on data that had been preserved from previous scientific investigations. Furthermore, the technology of tomorrow is often based on the laboratory science of today or yesterday.

The explosive growth of science in this century provides many other examples of the key role of data from previous experiments. When Townes and Schawlow published their landmark 1958 paper that demonstrated the theoretical possibility of building a laser, intensive efforts were started to find a real

physical system that would meet the necessary requirements. Data on atomic spectra, some of them 60 to 70 years old, provided the key to creation of the first working gas laser. If it had been necessary to make new measurements on every conceivable system in order to select the most promising for trial, the invention of the laser and all the new technology and economic benefits that it has brought would have been delayed for many years.

The crash program to improve rocket propulsion systems following the launch of the first Soviet Sputnik provides another example. Data on the thermodynamic properties of a wide range of substances were essential to the efforts to optimize rocket engine performance. A concerted government program was started to build a database of thermodynamic properties for rocket engine design. Although some new laboratory measurements were required, many of the needed data were in the scientific literature, some published as early as 1880. The availability of these older data significantly aided the rocket engine program.

Data generated by scientists and engineers in the fields of physics, chemistry, and materials science have traditionally been published in research journals, which serve both a current dissemination and an archival function. This journal system has served science well for 300 years. Many scientific libraries throughout the country provide access to these journals. Because back volumes are kept in libraries in many different places, there is little danger of irreparable loss from a natural catastrophe. Many scientific societies also have depository systems that allow authors to submit voluminous data sets that cannot be published in the journals because of lack of space. The societies maintain these archives, generally on microfilm, and supply copies on request. While the growing use of electronic recording and storage techniques is already affecting the traditional journal system, we can expect publishers to take advantage of the new technology to meet new needs. Scientific societies are beginning to implement electronic archives for preserving data that are too voluminous to publish in paper formats. For example, the American Chemical Society recently began to make data from papers in its leading journal (Journal of the American Chemical Society) available on the Internet. It is a natural step from the paper and microfilm archives that such societies now maintain to the electronic archives of the future. Clearly, these private sector archives must be an integral part of the overall concept of a "National Scientific Information Resource."

Electronically recorded data in the laboratory physical sciences are of two forms: original experimental measurements and evaluated compilations of published data. These are examined here in turn.

Original Experimental Measurements

Recent decades have seen significant changes in the form of "original data." A raw experimental result was, in the past, typically a measured value such as a voltage or distance. The investigator read these measurements from instruments, wrote them in a notebook, treated them arithmetically to obtain the desired scientific variable from the raw measurement, and interpreted them. The original measurements were eventually discarded in most cases. Today, many raw data are acquired and processed electronically as soon as they are entered into the computer, so that only the processed data exist long enough for anyone to look at. With rapid, automated data acquisition and manipulation, the option exists to keep electronic data and reanalyze them as required. However, automated data collection often results in large volumes of insignificant data, so that in many experiments the data stream is screened and most of the data are discarded in real time by a computer program or by the experimenter.
For example, spectroscopists used to keep, at least temporarily, the photographic plates or recorder charts from which they had taken measurements. Now the spectral features may be analyzed electronically immediately upon measurement, and only the attributes of relevant features are recorded. The fraction of the raw data that is saved after initial processing may be small, sometimes less than one part in 10,000. In virtually all cases, there is no justification for preserving the raw data, because the experiment can be repeated in those rare instances in which an unanticipated future interest appears.
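The kind of real-time screening just described can be sketched as a simple threshold filter. Everything in the sketch below is invented for illustration (the flat baseline, the two synthetic spectral lines, and the 5-sigma cut); no real instrument or pipeline is modeled.

```python
import statistics

# Sketch of real-time screening of an instrument data stream: samples
# arrive in a stream, and only those exceeding a significance threshold
# are retained; the rest are discarded, as described in the text.

def screen_stream(samples, baseline, n_sigma=5.0):
    """Keep (index, value) pairs more than n_sigma std. devs. above baseline."""
    sigma = statistics.stdev(samples)
    cutoff = baseline + n_sigma * sigma
    return [(i, s) for i, s in enumerate(samples) if s > cutoff]

# A mostly flat signal with two strong synthetic features.
stream = [10.0] * 1000
stream[100] = 500.0   # hypothetical spectral line
stream[750] = 300.0   # hypothetical spectral line

kept = screen_stream(stream, baseline=10.0)
retained_fraction = len(kept) / len(stream)   # most of the stream is discarded
```

With this toy stream, only the two synthetic features survive the cut, so the retained fraction is well under one percent, of the same character as the "one part in 10,000" figure cited above.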

When considering laboratory data of this kind, it is usually best to recognize that no one knows as much about the original data as the original experimenter. If the experimenter does not find the raw data worth preserving (and worth documenting), then the data are probably not going to be of use to anyone else. Because the number of stages of processing (e.g., replication, averaging, coordinate transformations, applying corrections, and so on) differs for every type of measurement and undergoes continual evolution as new techniques are introduced, it would be fruitless to try to formulate generic retention criteria for all types of laboratory data.

However, there are certain classes of laboratory data (where "laboratory" is used in a broad sense) that should be candidates for preservation if properly documented, because it would be impossible or impractical to reproduce the measurements. Some of the data taken in large plasma physics facilities fall in this category, because reproduction of the facilities would be extremely costly. A more striking example is the spectroscopic and other measurements from nuclear tests in the atmosphere, which it is hoped will never be reproduced. On a more mundane level, properties of engineering materials, measured as a part of large government research and development programs, provide many data of possible interest in the future. Such data are acquired as a small step in a larger program and usually are not published in the scientific literature or disseminated by the usual channels. They would be costly to reproduce because many of the materials were specially prepared with unique fabrication technology.
Examples include polymer and sensor data from the Strategic Defense Initiative, engineering data from the National Aeronautics and Space Administration (NASA), and the superconducting materials measurements carried out to develop magnet fabrication techniques for the canceled Superconducting Super Collider. Even though this project will not be completed, the materials measurements should be saved, because they may well be applicable to future engineering projects.

Evaluated Compilations

Compilations resulting from the critical analysis of a large body of data from the scientific literature are a separate area for consideration. Well-known examples include thermodynamic property compilations such as the National Institute of Standards and Technology's Joint Army-Navy-Air Force (JANAF) tables and the thermophysical properties disseminated by the Department of Defense's Center for Information and Numerical Data Analysis and Synthesis at Purdue University (see the Physics, Chemistry, and Materials Sciences Data Panel report in the NRC (1995) report for a detailed discussion of these examples). The Department of Energy operates several data evaluation centers in nuclear physics and chemistry.

In such centers, the data and backup documentation are not impossible to replace; they simply represent so much effort and exercise of specialized scientific judgment that it would be extremely costly to redo the work. The cost of not having the data available, although usually difficult to measure other than anecdotally, can be much higher than the cost of preserving them. In particular, if it becomes necessary in the future to expand or extend the compilation, the full documentation (e.g., data extracted from references, fitting programs, notes on the analysis techniques, and the like) will provide a valuable base for the new work.
A major concern in considering these data collections is how the data and the underlying documentation can be preserved and made accessible if the centers producing them lose their funding or expert personnel. This concern increases as government agencies downsize their activities.

OBSERVATIONAL DATA IN THE PHYSICAL SCIENCES

Over the past two decades, the National Research Council and other groups have issued numerous reports that have addressed data management issues, including long-term retention requirements, for digital observational data in the earth and space sciences (NRC, 1982, 1984, 1986a,b, 1988a,b, 1990, 1992b, 1993; GAO, 1990a,b; Haas et al., 1985; NAPA, 1991). Most of these reports have focused quite narrowly on the data management or archiving problems of specific disciplines or agencies, and

none has addressed comprehensively the issues associated with the long-term retention of observational and experimental data in the physical sciences.

Major Characteristics of Observational Data

Observational data sets, like laboratory data, include digital information (in both written and electronic form), graphical records, and verbal descriptions. The records exist as ink on paper, punched paper, film (including microforms), magnetic tape of many types (including videotape), magnetic disk, and digital optical media (including CD-ROM). Over the past three decades, however, the dominant form of data collection and storage has been electronic.

Observational data can be characterized by the collection and management practices applied throughout the life cycle of their existence. One might characterize two major practices driven by the funding models for conducting the underlying science. The "big science" funding model creates a funding umbrella for multiple individuals and institutions to conduct coordinated data acquisition, investigation, and publication. Often, these large programs adopt a standard approach for life-cycle data management. However, there is usually little standardization among the big science programs. Examples of such programs include the World Ocean Circulation Experiment, the World Climate Research Program, and NASA's Mission to Planet Earth (CENR, 1994). The other funding model, "small science," funds individuals or small groups of individuals to conduct independent data acquisition, analysis, and publication. Typically, these investigators plan, design, and implement their own data management strategy with little interaction with the rest of the scientific community. The data generated under both models have long-term value, both for science and for the broader interests of the nation. Specific subdisciplines also impose different requirements on long-term data management.
For instance, while there is general agreement within the physical oceanography community on the definition of standard observation variables and the processes of measuring those variables, the same cannot be said for biological oceanography. Because of differences in measuring techniques, lack of community agreement on naming standards, and the scientific process by which biology progresses, data management for biological data sets is inherently more complex than in physical oceanography. The data from these two subdisciplines will have to accommodate multiple naming schemes and alternate taxonomies. Therefore, data managers and archivists have to deal with differing approaches and vocabularies among disciplines, evolution of discipline research paradigms over time, and diverging concepts and methods within a discipline.

Scientific research leads to the creation of data that can be processed and interpreted at different levels of complexity. Typically, each level of processing adds value to the original (level-0) data by summarizing the original product, synthesizing a new product, or providing an interpretation of the original data. The processing of data leads to an inherent paradox that may not be readily apparent. The original unprocessed, or minimally processed, data are usually the most difficult to understand or use by anyone other than the expert primary user. With every successive level of processing, the data tend to become more understandable and often better documented for the nonexpert user. One might therefore assume that it is the most highly processed data products that have the greatest value for long-term preservation, because they are more easily understood by a broader spectrum of potential users. In fact, just the opposite is usually the case for observational data, for it is only with the original unprocessed data that it will be possible to recreate all other levels of processed data and data products.
To do so, however, requires preservation of the necessary information about processing steps and ancillary data.

Another important characteristic of observational data is their volume. In this respect, observational data can be divided into two different classes: small-volume and large-volume data sets. The majority of traditional ground-based, in situ observations form small-volume data sets because they are based on individually conducted measurements or sample collections. Satellite and other remotely sensed observations generally form large-volume data sets.
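The level-0 paradox described above can be illustrated with a toy pipeline. The gain, offset, and sample counts below are invented for illustration; the point is that every higher processing level can be recomputed from level-0, while level-0 cannot be recovered from a summary.

```python
# Toy sketch of processing levels: level-0 raw counts, level-1
# calibrated values, level-2 daily means. Higher levels are always
# derivable from level-0; the reverse is not true, since averaging
# discards the individual samples.

GAIN = 0.5      # hypothetical calibration gain
OFFSET = 2.0    # hypothetical calibration offset

def to_level1(level0):
    """Calibrate raw instrument counts into physical units."""
    return [GAIN * c + OFFSET for c in level0]

def to_level2(level1, samples_per_day):
    """Summarize calibrated values into daily means."""
    days = [level1[i:i + samples_per_day]
            for i in range(0, len(level1), samples_per_day)]
    return [sum(day) / len(day) for day in days]

raw = [10, 12, 14, 16, 20, 22, 24, 26]   # level-0: two "days", 4 samples each
calibrated = to_level1(raw)
daily_means = to_level2(calibrated, samples_per_day=4)
# The 8 raw samples cannot be reconstructed from the 2 daily means.
```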

The committee defines small-volume data sets as those with volumes that are small in relation to the capacity of low-cost, widely available storage media and related hardware. The hardware and software to write and produce CD-ROMs are now generally available for less than $10,000, and personal computers capable of reading CD-ROMs are being marketed as home-use, consumer items. For example, the total volume of the small-volume oceanographic data is projected to be less than 50 gigabytes by 1995, and thus the entire historical data set for all observations could be stored on fewer than 100 CD-ROMs. This is fewer discs than many people have in their compact disc music collections.

Issues such as archiving cost, longevity of media, and maintenance of the data holdings are not the dominant considerations with regard to retaining small-volume data sets. Rather, the major issue with respect to this class of data is the completeness of the descriptive information, or metadata. If a data set has been properly prepared and documented, the operations required to migrate the data should be amenable to significant automation and therefore pose only a minor challenge to the long-term maintenance of the archive. Further, these data may be widely distributed with simple replication of the media. For example, the various NOAA and NASA data centers have provided copies of their data sets to many users for a number of years.

A different problem is posed by large-volume data sets. The biggest data sets typically come from Earth observation satellite sensors and space science missions, and are challenging to some contemporary storage devices. However, it is clear that for the data set to exist at all, an adequate storage medium capable of capturing and maintaining the data for some time period must exist when the data are generated.
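The committee's small-volume arithmetic above can be checked directly. The 650-MB disc capacity assumed below is a typical mid-1990s CD-ROM figure, not one given in the text.

```python
import math

# Quick check of the estimate that a 50-gigabyte archive fits on fewer
# than 100 CD-ROMs, assuming 650 MB per disc (a typical capacity of the
# era, assumed here rather than taken from the report).
DATA_SET_GB = 50
CDROM_CAPACITY_MB = 650

discs_needed = math.ceil(DATA_SET_GB * 1024 / CDROM_CAPACITY_MB)
```

Under these assumptions the archive needs 79 discs, consistent with the "fewer than 100 CD-ROMs" estimate.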
Further, the time period for reliable, initial storage should at least cover the lifetime of the data set at the organization acquiring and using the data before the records need to be migrated to new media or transferred to another organization, such as NOAA or NARA. In addition, during the initial storage period, there are likely to be major increases in the density of mass storage accompanied by significant decreases in the cost of storage of the data. Thus, data sets that are challenging today will gradually be transformed to "small-volume" status in the future, as advancing technology increases the capacity and lowers the cost of storage devices. Nevertheless, it is important to note that the largest data sets (e.g., larger than one terabyte) can present significant organizational and management problems that require special analysis of the data flow, volume, access, and timing characteristics.

Observational Data in the Space and Earth Sciences

Astronomy and Astrophysics Data

Astronomy and astrophysics are observational sciences; that is, they are based on what the sky provides and we collect. Therefore, in many astronomical investigations there is no such thing as "repeating an experiment" with the expectation of getting the same results. Many objects have properties that change with time either because of their intrinsic nature (e.g., variable stars), evolution (e.g., stars going supernova), or reasons yet unknown. It happens quite frequently that a highly variable object is found in satellite data and subsequent archival research in optical plates allows its identification as a given type of star. Astronomy and astrophysics data are acquired by both ground-based and space-based observatories.
Ground-based observatories, which are operated by universities or other nonprofit organizations (e.g., Association of Universities for Research in Astronomy, the Smithsonian Institution) and funded by these organizations or by the National Science Foundation (NSF), have traditionally been used to study the sky at visible wavelengths. Since the Second World War, astronomers have used improving technologies to observe at radio and infrared wavelengths. Consortia of universities, including both U.S. and foreign institutions, are constructing new telescopes, which use advanced technology to build larger mirrors that will allow us to look deeper into the universe. Radio observatories range from smaller ones operated by universities to larger national facilities, such as the National Radio Astronomy Observatory, funded by

NSF. Most telescopes are for individual observing programs, but some are dedicated to systematic sky surveys. Data from ground observations have traditionally been the property of the observer; therefore, observatories have no standard policies for data archiving. The exceptions are some big projects, such as the Palomar Sky Survey, where data either are made public and sold or are archived within the university or observatory. Some centers, such as the National Radio Astronomy Observatory, the National Optical Astronomy Observatories, and the Harvard-Smithsonian Center for Astrophysics, have begun to archive most data obtained from major telescopes. These data are valued and used broadly by astronomers. Nevertheless, archival activities remain of generally low priority.

Although the older astronomical data consist of photographic plates and other analog data, virtually all data today are collected digitally. There also have been major efforts to digitize old photographic data to allow their analysis by computer. An example of this is the digitization of a whole-sky survey by the Space Telescope Science Institute, and this survey is now available for sale on CD-ROM from the Astronomical Society of the Pacific. Recently, the astronomical community adopted a standard format for transfers of digital files (FITS). With the advent of digital data, there also has been an evolution from individual data analysis packages to a few widely distributed packages (e.g., IRAF, AIPS, VISTA, XANADU), which provide standard tools for baseline analysis.

Because of the filtering and distortion produced by the Earth's atmosphere, the amount of energy emitted by celestial bodies that can be detected on the ground is limited significantly. Observations from space above the atmosphere remove such limitations.
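The FITS transfer format mentioned above stores its metadata as fixed-width header "cards." The sketch below shows only the basic 80-character card layout; the real standard imposes many more rules (2880-byte blocks, string quoting, a terminating END card) that are omitted here.

```python
# Minimal sketch of FITS-style header cards: each card is exactly 80
# characters, with an 8-character keyword field, "= ", and a
# right-justified value field. This is an illustration of the layout
# only, not a conforming FITS writer.

def fits_card(keyword, value):
    """Format one 80-character FITS-style header card."""
    if isinstance(value, bool):
        text = "T" if value else "F"   # FITS logical values
    else:
        text = str(value)
    return f"{keyword:<8}= {text:>20}".ljust(80)

header = [
    fits_card("SIMPLE", True),
    fits_card("BITPIX", 16),
    fits_card("NAXIS", 2),
]
```

The fixed card width is what makes FITS headers readable by any software, or by eye, decades after the data were written, which is much of the reason the format suits archival transfer.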
From its inception, space astronomy and astrophysics have been mostly under NASA's purview, although some important experiments have been financed by the Department of Defense. The data are collected through telescopes and detectors placed on airborne devices (balloons or planes), rockets, NASA's Space Shuttle, and orbiting satellites. The largest volume of data is collected by satellites, and most of these missions are international collaborations. The U.S. portion has always been handled by NASA.

Within NASA, space astronomy and astrophysics are organized in different wavelength-based disciplines, reflecting the organization in the scientific community. These disciplines include the infrared, whose main data center is the Infrared Processing and Analysis Center in Pasadena, California, where the data from the Infrared Astronomy Satellite mission are archived; the optical and ultraviolet, with data centers at the Space Telescope Science Institute in Baltimore, Maryland, where the Hubble Space Telescope data are archived, and at the NASA Goddard Space Flight Center in Greenbelt, Maryland, where the International Ultraviolet Explorer archive resides; and high-energy astrophysics, which maintains x-ray data at the Einstein Observatory Data Center in Cambridge, Massachusetts. Table 2.1 provides a representative sample of NASA Astrophysics Archives.

The earlier NASA astrophysics projects were so-called "principal investigator" missions, where a contract was awarded to a group of principal investigators, who built the hardware, received the data from the experiments, and analyzed and interpreted them. These principal investigators had no clearly stated guidelines to prepare data for archiving, other than to deliver the reduced data to the NASA data depository at the National Space Science Data Center (NSSDC) at the NASA Goddard Space Flight Center.
Documentation generally was minimal, and the data, which often were not well-documented or well-organized, were difficult to retrieve for scientific use, even if they were adequately physically preserved. It has become fully apparent, however, that the uniqueness and high acquisition cost of these space data make their effective preservation and archiving a high priority. Even after the active operation of a space observatory has ended, the data typically are retrieved and used by scientists for many more years. As a result, the situation has improved considerably at the NSSDC in recent years. Moreover, NASA now funds wavelength-specific scientific data centers to process the data, eliminate anomalies in the data, and provide software for scientific analysis.

TABLE 2.1 A Representative Sample of NASA Astrophysics Archives [table contents not recoverable from the scanned text]

Planetary Science Data

Planetary data also are acquired by both ground-based and space-based observations. Planetary data include observations of the entire physical system and forces affecting a planet or other body, including the geology and geophysics, atmosphere, rings, and fields. The sensors used collect data across much of the electromagnetic spectrum. Currently, most planetary observations are supported by NASA, either as the direct result of planetary missions or as ground-based observations that support a mission.

Over the past three decades, NASA has sent robotic spacecraft to every planet in the solar system except Pluto, to two asteroids, and to a comet. Men have walked on the Moon, performed experiments there, and returned samples. The knowledge we have about the bodies in the solar system, with the exception of our own planet, comes mostly from space missions. In some cases, such as the gas giants Jupiter, Saturn, Uranus, and Neptune, robotic space probes have provided most of our current knowledge. Many of the satellites of the other planets were no more than points of light with minimal spectral and light-curve measurements before the Voyager mission. Now each is recognized as a separate world with highly individual characteristics.

The scientific and historical importance of space-based planetary observations, the realization that additional missions cannot replicate the original observations, and the expense of planetary missions all prompted NASA to create the Planetary Data System (PDS) to improve the acquisition, archiving, and distribution of planetary data. The developers and current staff of the PDS recognize that the data from planetary missions make up the scientific capital of the agency's planetary exploration program and that these data are a national resource.
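The PDS's basic acceptance condition (that a data set arrive with complete documentation and ancillary data) can be sketched as a required-field check. The field names below are a hypothetical subset chosen to mirror the conditions described in this section, not the actual PDS schema.

```python
# Sketch of an archive acceptance check of the kind the PDS applies: a
# submission is accepted only if its metadata names every required
# ancillary item. Field names are illustrative, not the PDS standard.

REQUIRED_FIELDS = {
    "documentation",
    "ephemerides",
    "calibration_tables",
    "experimenter_notes",
}

def accept_for_archive(metadata):
    """Return (accepted, missing_fields) for a candidate submission."""
    missing = sorted(REQUIRED_FIELDS - set(metadata))
    return (not missing, missing)

complete = {"documentation": "volume1.pdf",
            "ephemerides": "ephem.tab",
            "calibration_tables": "cal.tab",
            "experimenter_notes": "notes.txt"}
partial = {"documentation": "volume1.pdf"}

accepted, _ = accept_for_archive(complete)
rejected, missing = accept_for_archive(partial)
```

Reporting the missing fields back to the submitter, rather than silently rejecting, matches the peer-review character of the acceptance process described here.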
The PDS tries to acquire all existing planetary data from NASA's missions and even from international ventures, in order to have a complete archive of our exploration of the solar system. In addition to the space-based measurements, the PDS accepts relevant ground-based observations and laboratory measurements that support planetary missions by providing baseline or calibration data. A basic condition for acceptance is that the data set must be properly documented and include all relevant ancillary data, including planet and spacecraft ephemerides, calibration tables, and experimenter notes about the shortcomings of the data. Members of the PDS scientific staff and scientists in the community who have expertise within the relevant disciplines peer-review each data set.

One of the more important contributions of the PDS, especially with regard to the ongoing preservation of data in a useful form, is the electronic "publication" of the majority of the data from many planetary missions in the form of CD-ROMs. These include not only the data, but also documentation, format specifications, ancillary data, and even, in some cases, display and analysis tools.

Space Physics Data

Space physics involves the study of the largest structures in the solar system: the plasma environments of the planets and other bodies and the solar wind. Those environments consist of plasmas ranging from low energies (the thermal component) to charged particles of high energies, including cosmic rays accelerated by galactic processes. They also consist of the magnetic fields (if they exist) of planets or the Sun, as well as electrostatic and electromagnetic fields generated from natural instabilities in plasmas and charged-particle populations. Furthermore, in many locales, such as comets and the Earth's ionosphere, dust and neutral gases play an important role in mediating the behavior of plasmas and electromagnetic fields.
As a consequence, the field of space physics requires a broad array of sensors and instruments at all levels of complexity. Many instruments make in situ observations, but novel techniques enable remote sensing of various plasma regimes.

Because some of the most apparent manifestations of space physics processes result in the northern lights and in planetary-scale modifications of the terrestrial magnetic field (and subsequent catastrophic effects on power grids and communications), space physics relies heavily on a wide array of ground-based observations, including magnetometers, ionospheric sounders, incoherent scatter radar facilities,

all-sky cameras, and photometers. In addition, a broad range of ground-based and space-based solar monitors has become crucial to study the correlations between various disruptions in the terrestrial plasma environment and solar activity, including sunspots, flares, and prominences.

For many reasons, it is essential to preserve space physics data for long periods of time. The Sun drives solar-terrestrial relationships, and many studies require observations over 22-year solar cycles. During this cycle the Sun reverses its magnetic polarity twice and goes through periods of increased activity with sunspots and associated flares. At solar activity minimum, flare and sunspot activity decreases, but expanded coronal holes appear. Long intervals of records are required because each solar cycle is different from previous ones and because there are long-term deviations, such as the Maunder minimum, from "normal" patterns. From the terrestrial point of view, there are motions of the magnetic dipole and even magnetic field reversals on time scales of thousands of years.

Because many space physics observations are taken in situ, models of the magnetosphere need data collected by many spacecraft, having different kinds of orbits and trajectories. To make sense out of data from one of these missions, it is important to be able to examine what another spacecraft in a different orbit found. Only by preserving the data from numerous missions do we acquire a sufficient archive.

Space physics has generated about 50 gigabytes of data per year over the last 30 years. The field has enjoyed this extraordinary productivity primarily because most missions were in Earth orbit and were tracked continuously for years. Many of these data sets were "archived" by sending the tapes and sometimes the relevant documentation to the NSSDC. Copies of the data on microfilm or on other media were sent there as well.
Unfortunately, for every well-prepared, thoroughly documented space physics data set at the NSSDC, there are several poorly prepared and improperly documented data sets. For the earliest space missions, the archiving techniques were undeveloped, and archiving was not deemed a high priority. Thus, there are many data at the NSSDC that most scientists would find difficult to use with only the information originally supplied. Given the recent emphasis on the proper preservation of data and the importance of archiving, prompted in part by two General Accounting Office reports (1990a,b) and by the community's heightened awareness of and desire for high-quality archives, many recently archived data sets are in better condition than their predecessors. Even though the Space Physics Data System has been in existence only since 1993, the more advanced data activities in other disciplines have influenced the space physics community favorably. Hence, it is becoming more likely that the data now being submitted are of higher quality, have more adequate documentation, and are more complete than earlier data sets. NOAA, NSF, the Department of Defense, private and educational institutions, and foreign organizations typically support the ground-based observations. Most of these data, not managed by NASA, eventually come under the purview of the National Geophysical Data Center, operated by NOAA at Boulder, Colorado. The center's holdings consist of over 300 digital and analog databases, some of which are very large. However, many important data sets still reside solely in the hands of the original investigators, the military, or foreign sources.

Atmospheric Science Data

Atmospheric science data sets are diverse and present a variety of problems for distribution, archiving, and later interpretation. 
Some data sets on the atmosphere stand out as the largest in any scientific discipline, particularly those from remote sensing by satellite or radar; others consist of contributions from thousands of individuals all over the world, and the provenance of those data is sometimes uncertain. Many data sets span decades, and a few span more than a century, with accompanying problems due to lack of homogeneity in measurement techniques and sampling strategies. The largest atmospheric science data holdings in the United States are those of the federal government. However, significant amounts of material are available only from state or private sources.

Not all atmospheric data sets are large and conspicuous; many are small. There are hundreds of data sets of only a few megabytes or less. There are also many medium-sized data sets that range from perhaps 100 megabytes to tens of gigabytes, as well as very large data sets, many terabytes in volume. Table 2.2 provides a sampling of some of the larger data sets. Data volume does not drive the cost of archiving small-sized and medium-sized data sets if proper technical choices are made. Rather, it is the labor-intensive process of readying a data set for indefinite preservation that can be costly. Many atmospheric data sets are dynamic, continually growing or being otherwise modified. Because weather keeps occurring, observational time series from operational meteorological activities are never "complete." In contrast, field programs usually have finite extent, and the resulting data sets have a definite end. However, many recent large, complex field programs have spawned associated monitoring activities that have continued after the initial phases of the project. Despite the frequent usage of the term "experiment" to denote field programs, these intensive efforts are observational, rather than experimental, exercises. Some truly experimental data exist, including a few data sets that include the results from such work as sensor development and tests, fluid dynamics experiments, thermodynamic measurements, and laboratory chemical studies. Nevertheless, the vast majority of atmospheric science data describe observations of ever-changing phenomena, and thus they are unique, valuable, and irreplaceable. For much meteorological and climate research, as well as for many applications, it is essential to have archives of global data. This goal has been largely achieved in the United States, although older data sets still need to be digitized. Collectively, U.S. 
archives have the best sets of global data of any nation, particularly for data since the early 1950s. However, many valuable data stored in other nations are inaccessible to U.S. scientists (and in some cases are inaccessible to those nations' scientists as well). Meteorological and other atmospheric data are used for varying purposes on different time scales. It is convenient to delineate three: (1) real-time or current, (2) recent past or short-term retrospective, and (3) distant past or retrospective. Meteorological data are probably used by a wider segment of the U.S. population than any other scientific data, because they relate directly to practical, daily concerns. There is a large lay audience for weather and climate information. The real-time or current use of most data sets usually motivates decisions on collection strategies and therefore quality. For example, the primary reason for collecting most meteorological data is for operational weather forecasting and warning, including forecasting for aviation operations. These data are perishable, and timeliness and spatial resolution are more important than absolute accuracy and continuity. There are many recent past or short-term retrospective uses of meteorological data that can be of great significance. In this context, short term typically means from yesterday to a few weeks, or occasionally a few months, ago. A good example of such usage of data is in monitoring the development of a drought, a significant function for predicting crop yields. The transportation industry uses past data for verification of weather conditions for delay claims. Most retrospective uses require data from several months old through the traditional (though now suspect) 30-year averaging periods used for climate normals. The National Climatic Data Center handles over 100,000 data requests per year. 
The state climatologists and regional climate centers also process about this many. Legal proceedings and insurance claims often require accurate meteorological records for corroboration of witness testimony, criminal investigations, and validations of weather claims related to accidents and property damage. Farmers and agronomists need data covering months to years for studies of pesticide residue and toxicology, decisions about pesticide spraying, planning of fertilizer usage, and crop selection. Architects and building engineers require site-specific data on heating and cooling needs, wind stresses, snow loads, and solar availability. Airport designers need prevailing wind patterns. Utility planners need aggregate heating and cooling loads for their areas. Long-term retrospective uses of atmospheric data are the primary concern in this study. These uses are highly diverse and difficult to predict, and they make great demands on the data and their associated metadata.

TABLE 2.2 Volume of Selected Data Sets in Atmospheric Sciences

Type of Data Set | Comments | Dates | Years | Volume

Atmospheric In Situ Observations
World upper air | Two times per day, 1,000 stations | 1962-1993 | 32 | 25 GB
World land surface | Every 3 hours, 7,500 stations | 1967-1993 | 27 | 60 GB
World ocean surface | Every 3 hours (~40,000 observations per day) | 1854-1993 | 139 | 15 GB
World observations during First GARP Global Experiment | Surface and aloft, but not satellite | 1978-1979 | 1 | 10 GB
U.S. surface | Daily, now 9,000 stations | 1900-1993 | 94 | 15 GB

Selected Analyses (mostly global)
Main National Meteorological Center analyses | Two times per day, increasing at 4 GB/year | 1945-1993 | 48 | 50 GB
National Meteorological Center advanced analyses | Four times per day, increasing at 19 GB/year | 1990-1993 | 4 | 58 GB
National Center for Atmospheric Research's ocean observations and analyses | Thirty-eight data sets | | | 8 GB
European Center for Medium Range Weather Forecasting advanced analyses | Four times per day, increasing at 8 GB/year | 1985-1993 | 9 | 76 GB

Selected Satellites
NOAA geostationary satellites | Half-hour, visible and infrared | 1978-1993 | 16 | 130 TB
NOAA polar orbiting satellites | | 1978-1993 | |
  Sounders (TIROS Operational Vertical Sounder) | | | 15 | 720 GB
  Advanced Very High Resolution Radiometer (4-km coverage, 5 channel) | | | 15 | 5 TB
NASA Earth Observing Satellite-AM level-1 data | In development, 88 TB/year | 1998- | |
U.S. Radar Data | Domains of 30 to 60 km | 1973-1991 | 19 | 1 GB
Next Generation Radar System (NEXRAD)a | 650 GB per radar each year, 104 TB/year for 160-site system | 1997- | | 100s TB

Notes: Many other atmospheric data sets have volumes of only 1 to 500 MB. 1 MB (megabyte) = 10^6 bytes; 1 GB (gigabyte) = 10^9 bytes; 1 TB (terabyte) = 10^12 bytes.

aFirst radars were deployed in 1993.

Most of the uses discussed above do not need data covering more than a few decades. Several of these applications, however, require the longest time series we can provide. 
When technology advances and alters the method of data collection, there is a strong impetus to scrap the data collected by "obsolete" technology. However, these old data may become critical in the future. A notable example involves upper air wind profiles. These were originally collected by kites and later by radiosondes carried on balloons. With the onset of the space program, there was an urgent need for detailed low-altitude wind data for analysis of stresses on rockets at launch. Appropriate data could not

be obtained from radiosondes, because of their high ascent rate, but older kite-based data, which had been scheduled for disposal, were available. Fortunately, they had not yet been destroyed when they were again needed. There have been dramatic retrospective uses for military purposes (e.g., Jacobs, 1947). Planning for the D-day invasion of France, bombing runs over Japan, and the recent desert war in Iraq all required detailed climatic information, some long thought useless but not yet discarded. Such unexpected uses require the retention of many types of data from many places for a long time. Since the first flights of meteorological satellites in 1959, we already have had several examples of important retrospective uses of satellite data sets. For instance, a combination of reprocessed Nimbus-7 satellite data and old data from the Dobson network helped to confirm the recurring seasonal loss of stratospheric ozone over the Antarctic in the early 1980s. If meteorologists are to study past weather events, such as severe hurricanes, damaging winter storms, or outbreaks of tornadoes, they must have at their disposal all data for the periods of time and geographical areas involved. Hurricane track records spanning more than a century are still regularly used for both research and operational purposes. An increasingly significant use of meteorological data is the monitoring of the climate of the planet. Although barely two decades ago the study of climate was not a very high priority, today climate research issues are prominent; some of the nation's leading scientists specialize in climate studies, and policymakers seek information on likely climatic conditions of the future. The importance of old atmospheric data has become clear, but the reanalysis of these old data in the search for trends has often found them inadequate and poorly documented. 
The growing interest in global climate change, and the difficulties with historical data that it helped uncover, have strongly motivated earth scientists to take a serious interest in the long-term preservation of atmospheric data. Similarly, studies of long-term water and land usage require time series of many decades or more. Such data needs also apply to planning aquifer usage and to studies of deforestation and desertification. Some historians examine connections between environmental conditions and human events. The time scales studied can range from the immediate, such as the influence of weather on battles, to the very long term, such as the rise or decline of a civilization affected by water availability. Workers in this field often search through the oldest existing data and have even provided meteorological information to atmospheric scientists from unconventional sources such as diaries and agricultural records. Contemporary arrangements for the storage and archiving of atmospheric data are diverse and complex, and they present many problems. Some of these arrangements could be improved. Atmospheric data are in many locations, and they have a broad range of life cycles. Difficult problems arise in preparing metadata, packaging data for extended archiving, motivating researchers to prepare their data for use by others, and simply dealing with the large size of some of the atmospheric data sets. Criteria for identifying data sets to save indefinitely are not necessarily obvious. Finally, any proposed solutions must be made in full recognition of their impact on budgets and other resources.

Geoscience Data

Spatially, the domain covered by the geosciences extends from the Earth's core to the surface and into space. Temporally, it covers broad trends from the remote origins of the Earth to possible future scenarios, but it also is concerned with rapidly varying, often short-lived phenomena. Data in the geosciences fall into two broad categories. 
One is the observation and description of unique events, such as earthquakes, volcanic eruptions, and floods. In most cases, such data need to be archived for a long time period, regardless of their quality. The other category consists of observations of quantities continuous in space and time, such as gravity and the Earth's magnetism and structure, seismic sampling, and groundwater distribution.

The volume of geoscience data obtained with public funding has increased dramatically over the past few decades. This increase is the result of several converging factors, including the extremely varied types of observational data collected by the scientific community; the large volumes available through better measurement techniques, more sophisticated instrumentation, and advancing computer technology; and increasing demand from not only the scientific community but also the general public, including engineers, lawyers, and statisticians. Nongovernmental and commercial institutions also are major collectors and sources of pertinent data. Two examples, the Landsat database and the nation's holdings of seismic data, illustrate many of the characteristics and issues inherent in the long-term archiving of geoscience data. Other examples are provided in the working paper of the Geoscience Data Panel (NRC, 1995). The Landsat database consists of multispectral images of the Earth's surface, which have been accumulating since the launch of Landsat 1 in July 1972. The archive includes digital tapes of multispectral image data in several formats, black-and-white film, and false-color composites of synoptic views of the Earth's surface, all from 700 km in space. This database thus constitutes an important record of the evolving characteristics of the Earth's land surface, including that of the United States, its territories, and possessions. The record documents not only the results of various federal government policies and programs, but also those of many state and local governments and private programs and activities. It further provides documentation of the impact of various large-scale episodic events, such as floods, storms, and volcanic eruptions, and is of great value to both current and future public and private activities. 
Landsat data are currently available in either image or digital form from the Earth Resources Observing System (EROS) Data Center in Sioux Falls, South Dakota. The Landsat satellites were originally under the control of NASA. However, in 1980 they became the responsibility of NOAA. The currently operational Landsat 4 and 5 spacecraft were placed under control of the EOSAT Company in 1985. Under EOSAT's control, the data are not in the public domain, are significantly more expensive, and carry proprietary restrictions on their use. Beginning with the launch of Landsat 7, responsibility for the Landsat system will pass back to NASA, which will build and launch the satellite in the late 1990s. NASA will operate the systems and deliver the data to the EROS Data Center for distribution. The data will once again be in the public domain, although the EROS Data Center still plans to charge more than the marginal cost of reproduction in fulfilling user requests. It is now widely recognized that the shift to private control of the Landsat system significantly reduced the access to and use of the data. As of January 1993 the Landsat database contained more than 100,000 tapes of varying density and formats, and over 2,850,000 frames of hard-copy imagery. Digital Landsat data are usually delivered to users as magnetic tapes. Other media, such as CD-ROMs and streaming tapes, also may soon be used. Data requests occur most frequently in reference to a particular geographic location, commonly expressed as latitude and longitude, for a particular time of the year, and meeting certain cloud cover limitations. Landsat data are used widely across the spectrum of geoscience applications in both civilian and military operations and research. These include such applications as the impact of human activities on the environment, land-use planning and resource-allocation decisions, disaster assessment, measurement and assessment of renewable and nonrenewable resources, and many others. 
They are used also by the general public in any context where views of the Earth's surface are needed. Examples include such diverse applications as visual aids in elementary and secondary education, background for highway maps, and illustrations for magazine articles about various regions of the world. The Landsat database is unique because data from any given area may be available at sampled instants over a period of more than 20 years, thus making possible for the first time the study of slowly varying phenomena on Earth. Even though data from the early 1970s may now have a low frequency of use, their potential value remains high and they represent a significant archival record.

In contrast to the Landsat database, seismic data are broadly distributed rather than concentrated in one data center or system. This example focuses primarily on seismic data from earthquakes and explosions, both nuclear and chemical. Some federal agencies, notably the U.S. Geological Survey (USGS) and NOAA's National Geophysical Data Center, collect and archive important seismic exploration data. In addition, the Department of Defense (DOD), Department of Energy (DOE), U.S. Nuclear Regulatory Commission (USNRC), USGS, and NOAA have been and continue to be engaged in the collection and archiving of earthquake and explosion data. These agency programs are carried out independently of one another, with the result that each agency has its own data management and archiving policies and practices. Consequently, these data holdings are greatly distributed among the agencies in fundamentally different forms and formats. Global earthquake data have been acquired systematically since the early 1960s, when the U.S. Coast and Geodetic Survey of the Department of Commerce deployed a global seismic network of about 130 stations, called the World-Wide Standardized Seismographic Network (WWSSN), and produced an archive of photographic film "chips" of the 24-hour/day recordings at all stations. Researchers and other users could obtain copies of these analog data at modest cost. The success of this precursor to today's global digital network cannot be overestimated, because the availability of a global data set in standard format from well-calibrated instruments permitted previously impossible studies of global seismicity patterns, earthquake source mechanisms, and the Earth's structure. 
These studies have led to a vastly improved understanding of the dynamics of the Earth as a whole, including tectonic plate movements, generation of new ocean floor, evolution of the Earth's crust, and occurrences of destructive earthquakes and volcanic eruptions. The USNRC has funded the operation of regional seismic networks over much of the United States, some since the early 1970s, in support of programs for the siting and safety of nuclear power plants. USGS also has co-funded or separately funded regional networks for earthquake hazard assessments in seismogenic areas of the United States. However, changes in the funding priorities of USGS and USNRC in recent years have resulted in the interruption or discontinuation of some of these networks, particularly in the eastern United States. This has adversely affected data flow and seismic research. Seismic data have been archived in a broadly distributed, nonuniform mode by the organizations, mostly universities, that collected the data from the various networks. Many of these data have long-term value for characterizing in detail the tectonic activity of seismogenic areas in the United States. In addition to the federal agencies, several private sector organizations now collect, distribute, and archive seismic data sets of long-term significance. The Incorporated Research Institutions for Seismology (IRIS), a not-for-profit consortium of universities and private research organizations, is engaged in a major development of a global digital seismic network of about 100 continuously recording stations (the Global Seismic Network) in cooperation with USGS. The project also includes a versatile, portable digital seismic array of up to 1,000 stations that can be deployed for various time intervals for special seismological studies. Data sets from the global and portable arrays are being permanently archived at the IRIS Data Management Center (DMC) in Seattle, Washington. 
The DMC also serves as the International Federation of Digital Networks' center for continuous digital data, which adds observations from many additional stations to the archive. IRIS funding for this activity comes primarily from NSF and DOD. Finally, individual universities, such as the California Institute of Technology, the University of California at Berkeley, the University of Alaska, the University of Washington, Columbia University, Memphis State University, and St. Louis University, also maintain archives of the seismic data that they collect. The volume of digital data currently held and anticipated to be acquired by the IRIS DMC is summarized in Table 2.3. Although some data sets have been completed because they are project- or program-specific, most of the current operations continue to add large amounts of new data and implement new technology for recording, storage, retrieval, and distribution, thereby creating a dynamic, highly distributed archive whose holdings and access protocols change with time. For example, the IRIS

TABLE 2.3 Summary of Actual and Projected Data Volumes Archived in the IRIS Data Management Center

Projected Data Volumes (gigabytes/year)

Network | Number of Instrumentsa | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000
GSN | 100 | 1,159 | 2,359 | 3,959 | 6,003 | 8,047 | 10,091 | 12,281
FDSN | 146 | 370 | 670 | 1,070 | 1,530 | 2,050 | 2,670 | 3,416
JSP arrays | 5 | 1,095 | 2,190 | 3,650 | 5,475 | 7,300 | 9,125 | 10,950
OSN | 30 | 0 | 0 | 15 | 58 | 218 | 498 | 936
PASSCAL-BB | 500 | 1,318 | 2,277 | 3,556 | 5,154 | 7,073 | 9,312 | 11,867
PASSCAL-RR | 500 | 542 | 885 | 1,341 | 1,912 | 2,597 | 3,397 | 4,310
Regional-Trig | 500 | 150 | 290 | 490 | 730 | 1,030 | 1,390 | 1,755
Total | 1,781 | 4,634 | 8,671 | 14,081 | 20,862 | 28,315 | 36,483 | 45,515

Note: Abbreviations are as follows:
GSN: Global Seismic Network (IRIS)
FDSN: Federation of Digital Seismic Networks
JSP: Joint Seismic Program (with the former Soviet Union) (IRIS)
OSN: Ocean Seismic Network
PASSCAL-BB: Program for Array Studies of the Continental Lithosphere-Broadband (IRIS)
PASSCAL-RR: Program for Array Studies of the Continental Lithosphere-Regional Recordings (IRIS)
Regional-Trig: Regional Triggered Recordings

aProjected numbers by year 2000.

Source: IRIS Data Management Center, private communication, 1994.

DMC recently began providing both archived and near-real-time data on the Internet, thereby greatly facilitating rapid access. Significant volumes of exploratory seismic data obtained by geophysical contractors are held by the Department of the Interior. These data are used by the federal government and by petroleum companies in preparing for oil and gas exploration activities. There are, however, various proprietary restrictions on access to these data by other users. In summary, the sources of seismic data are diverse, the archiving is highly distributed, and the data are in many different formats with different metadata structures. 
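If the columns of Table 2.3 are read as annual increments, the cumulative archive they imply can be tallied directly. The short sketch below (Python) transcribes the table's "Total" row; treating each entry as the volume added during that year is an interpretive assumption, not something the source states explicitly.

```python
# Cumulative IRIS DMC archive implied by Table 2.3, assuming each
# yearly "Total" entry is the volume (in gigabytes) added that year.
yearly_gb = {
    1994: 4_634, 1995: 8_671, 1996: 14_081, 1997: 20_862,
    1998: 28_315, 1999: 36_483, 2000: 45_515,
}

cumulative_gb = 0
for year in sorted(yearly_gb):
    cumulative_gb += yearly_gb[year]
    print(f"{year}: {cumulative_gb:>8,} GB archived to date")

# Under that reading, roughly 158 TB (158,561 GB) accumulate by end of 2000.
```

The same loop, run over any single network's row, shows the steep year-over-year ramp (GSN alone grows roughly tenfold between 1994 and 2000).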
Moreover, data sets with long-term scientific and historical value reside in both federal and nongovernmental organizations, although in most of the latter cases federal funds have paid at least in part for their acquisition, archiving, and distribution. The users of seismic data are many and diverse as well. They include federal and state government agencies, universities, and private industry, particularly the petroleum industry. Thousands of individuals are direct or indirect users of seismic data. Certainly, the public as a whole is an end user of historical seismic data and information, including the location, magnitude, and damage associated with earthquakes around the world. Most seismic data sets have been or are now used both for operational purposes and for research, although for operational activities the data are used primarily immediately following their collection. Examples of their use for operational activities include tsunami warning and the rapid determination of the magnitude, location, and fault mechanism of destructive earthquakes and their aftershocks, both to inform the public and to assist in emergency response and special monitoring. On a longer time scale the data are used for hazard reduction and seismic safety in seismogenic regions, including local zoning decisions for future development, and siting and safety of critical facilities such as nuclear power plants. Data are obtained and used for continuous global monitoring of earthquake activity and of threshold or comprehensive test bans on underground nuclear explosions. Of course, there also is a broad spectrum of

research that uses historical seismic data, including studies of the physics of earthquake and explosive sources, propagation effects on seismic signals, imaging of the Earth's structures at all scales, seismicity patterns, and earthquake prediction or hazard estimation. Older data are important and are commonly used for most of these types of research. For example, establishing the recurrence rate for larger-magnitude earthquakes requires decades to centuries of observations, even in the most seismically active areas. In conclusion, most of the seismic data have long-term value for scientific research, disaster mitigation, and various socioeconomic uses. The data are archived in a broadly distributed manner. However, only a fraction of the archived data are under the direct control of federal government agencies, and it appears that many of these data sets are not considered official federal records. Except for most commercial exploratory seismic data, federal funds have paid for much of the instrumentation, station operation and maintenance, collection, storage, and distribution of seismic data. These important seismic data sets should be kept indefinitely in a form accessible to both the scientific community and other users.

Ocean Science Data

likely to emerge from the interactive scientific collaboration and value-added services that are becoming increasingly available through electronic networks. The principal federal agency ocean data holdings are at the NOAA National Oceanographic Data Center (NODC), the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC) at the Jet Propulsion Laboratory, and at several Navy centers, which hold mostly classified data sets. In addition, significant amounts of data are held by the universities. Located in Washington, D.C., the NODC archives physical, chemical, and biological oceanographic data collected by other federal agencies, including data collected by principal investigators under grants from the National Science Foundation; state and local government agencies; universities and research institutions; and private industry. The center also obtains foreign data through bilateral exchanges with other nations and through the facilities of World Data Center A for Oceanography, which is operated by the NODC under the auspices of the National Academy of Sciences. The NODC provides a broad range of oceanographic data and information products and services to thousands of users worldwide, and increasingly, these data are being distributed on CD-ROMs and on the Internet. Table 2.4 presents a summary of the NODC's data holdings. The PO.DAAC is a major federally sponsored oceanographic data center, which is operated by the California Institute of Technology's Jet Propulsion Laboratory in Pasadena, California. As one element of the NASA Earth Observing System Data and Information System, the mission of the PO.DAAC is to archive and distribute data on the physical state of the oceans. Unlike the data at the NODC, most of the data sets at the PO.DAAC are derived from satellite observations. 
Data products include sea-surface height, surface-wind vector, surface-wind stress vector, surface-wind speed, integrated water vapor, atmospheric liquid water, sea-surface temperature, sea-ice extent and concentration, heat flux, and in situ data that are related to the satellite data. The satellite missions that have produced these data include the NASA Ocean Topography Experiment (TOPEX/Poseidon, done in cooperation with France), Geos-3, Nimbus-7, and Seasat; the NOAA Polar-Orbiting Operational Environmental Satellite series; and the DOD's Geosat and Defense Meteorological Satellite Program.

SUMMARY OF MAJOR ISSUES

The results of scientific research are disseminated in this country through a hybrid system that includes professional society and other not-for-profit publishers, the commercial sector, and the government. The formal journals are published largely by the professional society and commercial sectors, while government agencies manage less formal reports (gray literature). Secondary services, such as abstracting and indexing, provide access to this literature, increasingly by electronic means. While there are strains in this system because of rising costs, increasing workload, and issues related to the protection of intellectual property, it has served U.S. science well and has been an invaluable link in the process of translating scientific advances into further advances, useful technology, and economic benefits. The current system, however, is not well suited to handle the scientific electronic databases that are the focus of this study. The costs of maintaining these databases are typically too great to be covered by user fees; instead, these databases must be considered part of the national scientific heritage. Some government agencies have accepted responsibility for maintaining and disseminating data resulting from their own research and development. 
In some cases, this system is working reasonably well, but in others there are problems even with providing current access. Archiving for the long term, however, raises questions in all cases. A general problem common to all scientific disciplines is the low priority attached to data management and preservation. Experience indicates that new experiments tend to get much more attention than the handling of data from old ones, even though the payoff from optimal utilization of existing data may be greater. For instance, according to figures supplied by NOAA, NOAA's budget for its National Data Centers in FY 1980 was $24.6 million, and their total data volume was approximately one terabyte. In FY 1994, the budget was only $22.0 million (not adjusted for inflation), while the volume of their combined data holdings was about 220 terabytes! During this same period, the overall NOAA budget increased from $827.5 million to $1.86 billion.

TABLE 2.4 National Oceanographic Data Center Data Holdings (as of October 1994)

Discipline                                           Volume (megabytes)
Physical/Chemical Data
  Master data files
    Buoy data (wind/waves)                                        9,679
    Currents                                                      4,290
    Ocean stations                                                1,645
    Salinity/temperature/depth                                    1,557
    BT temperature profiles                                         872
    Sea level                                                       125
    Marine chemistry/marine pollutants                               89
    Other                                                            68
    Subtotal                                                     18,325
  Individual data sets, for example
    Geosat data sets                                             12,841
    CoastWatch data                                              60,000
    Levitus Ocean Atlas 1994 data sets                            4,743
    Other (estimated)                                            11,000
    Subtotal                                                     88,584
  Total Physical/Chemical                                       106,909
Marine Biological Data
  Master data files
    Fish/shellfish                                                  115
    Benthic organisms                                                69
    Intertidal/subtidal organisms                                    30
    Plankton                                                         32
    Marine mammal sighting/census                                    21
    Primary productivity                                              7
    Subtotal                                                        274
  Individual data sets, for example
    Marine bird data sets                                            52
    Marine mammal data sets                                           4
    Marine pathology data sets                                        4
    Other (estimated)                                               200
    Subtotal                                                        260
  Total Biological                                                  534
Total Data Holdings                                             107,443

Source: NOAA, private communication, 1994.

With regard to laboratory data, government programs have existed since the 1960s to compile results from the world scientific literature, to check the data carefully, and to prepare databases of critically evaluated data. For instance, the National Institute of Standards and Technology operates its Standard Reference Data Program, which covers a broad range of data in physics, chemistry, and materials science. The Department of Energy also supports a number of data centers of this type. Despite chronic underfunding, these programs have produced databases of lasting value to the nation. To cite one example, the Mass Spectral Database managed by the National Institute of Standards and Technology, the National Institutes of Health, and the Environmental Protection Agency contains spectra of over 60,000 compounds. It has been installed in many thousands of mass spectrometers that are being used for monitoring environmental pollution, designing drugs, characterizing new materials, and many other applications. The government investment in creating and maintaining this database has been repaid many times over. In the area of observational databases, the situation is mixed. Federal agencies collect large amounts of observational data, which in many cases are continuously added to the available record of Earth and space processes. The data sets resulting from these activities sometimes are well documented and maintained in readily accessible form; but in many other cases, they are exceedingly difficult or impossible to access or use, and thus are effectively unavailable.
In general, the agencies and other organizations do a good job of making data and information available to the scientists (primary users) during the active stages of projects and for some time afterward. Examples of notable successes include the NASA Planetary Data System, where the premise has been that the data have long-term value and must be accessible indefinitely into the future, and the NOAA National Data Centers, where the policy is to migrate archived data to new media every 10 years. Technological advances have kept pace with the large growth in data volumes in scientific disciplines, such that the long-term retention of all or nearly all of the data collected is feasible. Indeed, in most fields the entire collection of data from the past is not large in comparison with the current and anticipated data volumes that will be collected during only a year or two. However, significant fractions of the older data are difficult or in some cases impossible to access, because they have not been transferred to new storage media. This transfer often has received low priority because many data management and data retention activities are chronically underfunded, and just handling the current data flow uses nearly all of the available resources. Thus, many valuable data sets are stored on low-density round tapes or on specialized magnetic tape media requiring hardware that is now obsolete or inoperable. For example, a large volume of the early Landsat coverage of the Earth resides on tapes that cannot be read by any existing hardware. Recent data-rescue efforts have been successful in getting older data into accessible form, but these efforts are time-consuming and costly. The reason these efforts have been undertaken, particularly in the observational sciences, is the recognition that retrospective data are vital to understanding long-term changes in natural phenomena.
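The media-migration discipline described above can be sketched in modern terms as a fixity-checked copy: a checksum is computed before and after the data move to new storage, so corruption introduced by the transfer is detected immediately rather than discovered decades later. This is a minimal illustrative sketch, not an agency procedure; the file names and directory layout are hypothetical.

```python
# Hypothetical sketch of a fixity-checked migration of an archived data file
# to new storage media. File paths here are illustrative, not a real archive
# layout. The checksum comparison is the point: it proves the copy is
# bit-for-bit identical to the original before the old media are retired.
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum (the 'fixity' value) of a file, in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate(src: Path, dst_dir: Path) -> Path:
    """Copy one data file to new media and verify the copy against the source."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    before = sha256_of(src)
    shutil.copy2(src, dst)  # copy2 preserves timestamps along with content
    after = sha256_of(dst)
    if before != after:
        raise IOError(f"fixity check failed migrating {src} -> {dst}")
    return dst
```

Run on a periodic schedule (the NOAA policy cited above uses a 10-year cadence), such a verified copy is cheap insurance compared with the data-rescue efforts needed once the original media become unreadable.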
Given the extraordinarily rapid advances in computing and storage technology in recent years, planned periodic migration of data to new media will be increasingly important in all scientific disciplines to ensure long-term access to our scientific data resources. It is axiomatic that a database has limited utility unless the auxiliary information required to understand and use it correctly, the metadata, is included in the record. An unambiguous description of the storage format is obviously essential for interpretation of an electronic database. The requirement is even more stringent to support meaningful access to data over the long term, because the hardware, software, and even the language by which formats are described will likely be different decades and centuries from now. The same is true regarding the scientific details of the content of the data. Auxiliary information such as environmental conditions (e.g., temperature and pressure), method of calibrating the
instruments, and data analysis techniques must be given to be able to fully and correctly use the data. Providing this information is time consuming and costly if done retrospectively, but much less so if it is prepared at the time the data are collected. Documentation that is inadequate for understanding and using the data greatly diminishes the value of the data, particularly for secondary and tertiary users. Another major problem inhibiting access to data is the lack of directories that describe what data sets exist, where they are located, and how users can access them. This, too, is especially a problem for potential secondary and tertiary users. In many cases the existence of the data is unknown outside the primary user groups, and even if known, there frequently is not enough information for a potential user to assess their relevance and usefulness. This realization has resulted in an interagency effort, led by NASA, to build a Master Directory of Global Change Data and Information. This Master Directory is intended to inform users of where data sets of potential interest reside and how to access them. Similar directories are needed in other scientific disciplines, as well as across all disciplines. The lack of adequate directories adversely affects the exploitation of our national data resources and commonly leads to unnecessary duplication of effort. A significant fraction of the archived scientific data is held by the federal agencies that collected the data as part of their mission. However, a large amount of valuable scientific data gathered with federal funds is never archived or made accessible to anyone other than the original investigators, many of whom are not government employees.
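The metadata and directory requirements discussed above can be made concrete with a minimal, self-describing dataset record. This is a hypothetical sketch: the field names, values, and JSON serialization are illustrative assumptions, not a standard used by NODC, NASA, or the Master Directory.

```python
# A minimal, hypothetical metadata record of the kind the report argues must
# accompany every archived dataset: storage format, environmental conditions,
# calibration method, analysis technique, and access information. All field
# names and values below are illustrative, not an agency schema.
import json

dataset_record = {
    "title": "Example sea-surface temperature profiles",  # hypothetical dataset
    "archive": "NODC",                                    # holding institution
    "collected": "1994-10",
    "storage_format": "IEEE 754 big-endian float32, one record per observation",
    "environmental_conditions": {
        "air_temperature_C": 18.2,
        "pressure_hPa": 1013.4,
    },
    "calibration": "thermistor calibrated against a reference temperature cell",
    "analysis_technique": "despiked with a 3-sigma filter before archiving",
    "access": {"media": ["CD-ROM", "Internet"], "contact": "data center user services"},
}

# Serializing the record in a plain-text, human-readable form and storing it
# with the data keeps the metadata in the record itself, so future users need
# not reconstruct it retrospectively. The same record can populate a directory.
metadata_text = json.dumps(dataset_record, indent=2, sort_keys=True)
```

Capturing such a record at collection time is cheap; as the text notes, reconstructing the same information retrospectively is time consuming and costly.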
In many instances, the organizations and individuals that receive government contracts or grants for scientific investigations are under no obligation to retain the data collected, or to place them in a publicly accessible archive at the conclusion of the project. At best, scientists in the same field may be able to obtain desired data sets on an ad hoc basis by contacting the original investigators directly; secondary and tertiary users typically are unaware of the existence of the data and have no mechanism (other than personal contact) to access the data. Thus, data sets that commonly are gathered at great expense and effort are not broadly available and ultimately may be lost, squandering valuable scientific resources and much of the public investment spent in acquiring them. Clearly, there is a great need for the agencies to get more return on their investment in science by the simple expedient of making the data collected under their auspices accessible to others. As seen from the discussion in earlier sections and addressed in detail in the individual discipline panel reports (NRC, 1995), there is a large and diverse collection of scientific data and information extant in federal agencies and nonfederal organizations, including state and local agencies, universities, not-for-profit institutions, and the private sector. At a minimum, those data that are acquired with the support of federal funding should be regarded as part of the National Scientific Information Resource. Finally, NARA's holdings of scientific and technical data in electronic or any other form are very small in comparison to the data holdings of these other organizations. Moreover, NARA's budget for its Center for Electronic Records, which has formal responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a budget lower than that of many of the individual agency data centers reviewed by the committee in this study.
Given NARA's current and projected level of effort for archiving electronic scientific data, it is obvious that NARA will be unable to take custody of the vast majority of the scientific data sets that require archiving. Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and the scientific community is needed to preserve the most valuable data and ensure that they will remain available in usable form indefinitely. The challenge is to develop data management and archiving infrastructure and procedures that can handle the rapid increases in the volumes of scientific data, and at the same time maintain older archived data in an easily accessible, usable form. An important part of this challenge is to persuade policymakers that scientific data and information are indeed a precious national resource that should be preserved and used broadly to advance science and to benefit society.