Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government: Working Papers

III
Report of the Atmospheric Sciences Data Panel

Werner Baum,* Marjorie Courain, William Haggard, Roy Jenne, Kelly Redmond, and Thomas Vonder Haar

CONTENTS
1 Introduction, 52
2 Nature of Atmospheric Sciences Data, 54
3 Uses for Atmospheric Data, 62
4 Arrangements for Storage and Archiving, 64
5 Institutional Arrangements and National Responsibilities, 79
6 Summary, 83
Acknowledgments, 85
Bibliography, 85

1 INTRODUCTION

Scientific data and records often have uses that continue long past their original purpose. The purpose of this report is to assist a steering committee providing advice to the National Archives and Records Administration (NARA), the National Oceanic and Atmospheric Administration (NOAA), and the National Aeronautics and Space Administration (NASA) regarding the long-term retention of scientific and technical records of the federal government related to the atmospheric sciences.

This report of the Atmospheric Sciences Data Panel has concentrated on the particular data archiving problems of its disciplines, largely meteorology and climatology. Most of these problems are generic to the earth sciences. However, the atmospheric sciences are both blessed and plagued by perhaps the largest data sets of any scientific discipline. Some of the small, but important, data sets from the atmospheric sciences consist of some of the longest time series in any science of data acquired contemporaneously with the events measured (in contrast to evidence of past events). The panel has concentrated on procedures needed for ensuring the survival of irreplaceable environmental data that may be needed in the future.

* Panel chair.
The authors' affiliations are, respectively, Florida State University; Consultant, East Orange, New Jersey (deceased, January 14, 1994); Climatological Consulting Corporation; National Center for Atmospheric Research; Desert Research Institute; and Colorado State University.
For many years, atmospheric scientists have voiced concern that they will not be able to obtain and use long time series data sets collected by government agencies. However, most members of the scientific community are neither particularly well trained nor well motivated to ensure the long-term preservation and usefulness of the data they themselves collect. Further, scientists in other specialties, engineers, planners, emergency managers, lawyers, historians, and other members of the public all have retrospective uses for atmospheric data and related records.

When one thinks of “federal records,” one usually imagines records associated with transactions of business by federal agencies. However, by law, federal records can include records “made or received by an agency . . . preserved or appropriate for preservation . . . because of the informational value of the data in them.” Nevertheless, agencies do not always consider their scientific data, or the scientific data resulting from research they fund, to be federal records. The panel believes that federal agencies should be more inclusive in designating scientific data, either in their possession or created with their funds, as federal records and in ensuring the long-term survival of those data.

For some, the term “archiving” conjures up images of records gathering dust in a safe place. That is not the image the panel wishes to convey. Preservation without easy access is of little value. Further, it is difficult to justify the effort and cost needed to maintain a set of records if they are never accessed, though such access may not occur until some distant future date. In developing the suggested recommendations, the panel has concentrated on the actions necessary for records to be useful after extended periods of time.
This new look at the long-term preservation of scientific information has been requested by NARA, NOAA, and NASA as they consider taking a more active and responsive role in the protection and management of scientific and technical records, especially machine-readable digital records. As the volume of new data continues to grow rapidly and some of the older records continue to decay or lose supporting information, the panel commends the federal agencies for their increased interest in the issue of long-term archiving of scientific and technical records.

Retaining records for long periods of time, on the order of decades to centuries, is not an effortless process. The physical media on which records are stored—be they paper, magnetic, optical, or some other type—degrade over time. This degradation makes migration of the information necessary. For records to be useful long after they are prepared, and by other than those who prepared them, the full provenance and associated “metadata” must be available to those later users; preparation of this descriptive information requires significant effort. Physical storage space is needed to house the accumulating records; this space must be secure and climate controlled to retard degradation. All of these requirements lay claim to human and financial resources.

The retention problem is being exacerbated by the rapidly increasing collection of atmospheric data by federal agencies—truly a data explosion. Depending on how one defines a single data set, the atmospheric sciences have perhaps 2000 to 6000 identifiable data sets. Throughout the 1960s and 1970s, atmospheric data available for archiving were accumulating at a rate of about 2 terabytes/year. In the 1980s, expanded operational weather satellite systems pushed the rate to nearly 15 terabytes/year. In the 1990s, as new operational weather radar and satellite systems are installed, the atmospheric data-collection rate is expected to reach 120 terabytes/year.
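For a sense of scale, the collection rates quoted above imply roughly the following cumulative archived volume. The period boundaries and rates come from the text; the assumption of a constant rate within each period is our simplification.

```python
# Back-of-the-envelope cumulative data volume implied by the collection
# rates cited in the text, assuming a constant rate within each period.
rates_tb_per_year = [
    (1960, 1980, 2),     # 1960s-1970s: about 2 TB/year
    (1980, 1990, 15),    # 1980s: nearly 15 TB/year
    (1990, 2005, 120),   # 1990s onward: roughly 120 TB/year
]

total_tb = sum((end - start) * rate for start, end, rate in rates_tb_per_year)
print(f"Approximate accumulation, 1960-2005: {total_tb} TB")
```

Under these assumptions the archive approaches 2 petabytes by 2005, which makes clear why the panel treats selection and retention policy as a pressing question.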
Present expectations are that this acceleration of data rates will slow, so that collection rates will be “only” about 150 terabytes/year by 2005.

The claim to resources for records retention leads NARA, NOAA, and all archivists to ask which records are worthy of retention and which might be discarded. The panel has attempted to address this vexing question. It also attempted to address what procedures are necessary for saved records to have value in the distant future and what institutional arrangements are likely to encourage such procedures.

This is not the first study to examine issues in the archiving of scientific data. The panel endorses many assessments and recommendations made by other recent (1976-1992) studies and reports, and concurs with the following results of previous efforts:

The Climatological Data Users Workshop, 27-28 April 1976, Asheville, North Carolina, recommended the establishment of a Scientific Advisory Panel to advise NOAA on data needs, formats, and retention periods. (This was finally done in 1991.)

The Climate Data Management Workshop, 8-11 May 1979, Harpers Ferry, West Virginia, recommended: “undertak[ing] a program to determine and record the metadata of station histories and local geography of observing sites;” developing more complete information on the spatial coverage and resolution of available data sets; and,
while acknowledging the contribution of the National Center for Atmospheric Research (NCAR) in assembling and communicating compacted climatic databases, noted that a long-term effort was still required to realize the full potential of past climate studies.

The NRC report Meeting the Challenge of Climate (NRC, 1982) stressed the need for: specialized and localized information in probabilistic form; easy access to data at several levels of summarization or aggregation; usable composite measures or indices tailored to applications; non-standard data; consistent, long-term data; and attention to the utility of data collected for one purpose to totally different or unexpected applications.

The NRC report Atmospheric Climatic Data: Problems and Promises (NRC, 1986a) noted that climate is an inherently imprecise and open-ended concept; that demand is growing for comparative historic climate data; and that extremely long record sets are needed for analysis of trends and extreme climatic values.

The NRC report The National Climate Program: Early Achievements and Future Directions (NRC, 1986b) found that “there is a continuing need for the traditional long-term climate data archival programs that should incorporate the optimum mix of manuscript, digital, microform storage media” and that “technological change—the development of inexpensive computer and communications systems have [will] greatly increased [increase] the capability to handle data needed for climate services.” That report addressed many (then) future needs, including expanded surface and space data collection networks and data management procedures.

The NRC Committee on Geophysical Data issued a report (NRC, 1988) expressing nearly identical views on geophysical data to those of the NRC panels on climatic data cited above.
Issued in 1988 by NOAA, The National Climate Program, Five-Year Plan, 1989-1993 stressed the need for: “global data collection, monitoring, and analysis activities to provide reliable, useful and readily available information on a continuing basis;” and “systems for the management and active dissemination of climatological data, information and assessments, including mechanisms for consultation with current and potential users.”

Finally, the panel's report is consistent with the U.S. Policy Statements for Global Change Research (US-GCRP, 1991). The panel fully endorses those excerpts from prior study groups selected for mention above, and believes that climatic data are a national resource that will serve many research and strategic needs in future centuries. Decisions made today on long-term archiving of these data will have a significant impact on national policies in the future, both long-term and short-term. The federal government should articulate clearly an integrated national policy covering its obligations and limitations in the retention and archiving of weather, climate, and other atmospheric data. Indeed, the handling of all earth science data must be examined seriously from a broad, long-term perspective to assure sensible data management and retention.

2 NATURE OF ATMOSPHERIC SCIENCES DATA

Atmospheric sciences data sets are diverse. They therefore present many problems for archiving and later interpretation. Some data sets on the atmosphere stand out as being among the largest in any scientific discipline, particularly those that come from remote sensing by satellites or radars. Many of the data sets consist of contributions from thousands of individuals all over the world, and the provenance of those data is not always well known. Many of the data sets span decades, and some span more than a century, with accompanying problems due to lack of homogeneity in measurement techniques and sampling strategies.
Some data sets are derived from more basic ones with the expenditure of great effort. The data sets include digital information (in both written and electronic form), graphical records, and verbal descriptions. The records exist as ink on paper, punched paper, film (including microforms), magnetic tape of many types (including videotape), magnetic disk, and digital optical media (including CD-ROM).
Atmospheric data sets tend to be dynamic, continually growing or being otherwise modified. Since weather keeps occurring, observational time series from operational meteorological activities are never “complete.” On the other hand, field programs usually have finite lifetimes, and the resulting data sets have a definite end. In recent years, however, many large and complex field programs have spawned associated monitoring activities that have continued well after the initial phases of the project. Despite the occasional usage of the term “experiment” to denote field programs, these intensive efforts are still largely observational rather than experimental exercises. Some truly experimental atmospheric data do exist, however, and include the results from such work as sensor development and tests, fluid dynamics experiments, thermodynamic measurements, and laboratory chemical studies. The vast majority of atmospheric science data describe observations of an ever changing earth. As such, they are unique, valuable, and irreplaceable.

Despite the conspicuous, very large atmospheric data sets, most data sets are small. There are hundreds of data sets of only a few megabytes or less. There are also many medium-sized data sets that range from perhaps 100 megabytes to tens of gigabytes (see Table III.1). Data volume does not drive the cost of archiving small-sized and medium-sized data sets, if proper technical choices are made. Rather, it is the labor-intensive documentation, metadata preparation, and packaging that dominate the cost of readying a data set for indefinite preservation.

The sections that follow examine in more detail the nature of the existing atmospheric science data and our expectations for the new data sets of the next decade.
Traditional Meteorological Data

By traditional meteorological data, the panel means those observations that began with the invention of the thermometer and barometer in the seventeenth century. These have since improved in accuracy, expanded in type, and increased in spatial and temporal density, but they are not dependent on sophisticated electronic or optical techniques. The data rates of traditional types of measurements will continue to grow, but only slowly, as more of our planet becomes instrumented in the course of economic development.

Traditional meteorological data consist of measurements in the atmosphere of such quantities as temperature, pressure, water vapor content,1 wind, precipitation, and cloud cover. Historically, the state of the atmosphere has been observed at a distinct set of points. Most traditional atmospheric observations are made at regular intervals in time, regardless of present conditions, and therefore generate data at a more or less constant rate. At a small subset of observing platforms, prevailing conditions determine the rate of data generation. At such stations, “break point” measurements made with event recorders preserve the time at which, for example, another kilometer of wind has blown by the site or another millimeter of precipitation has fallen.

Meteorological data are collected largely for the purposes of weather warning and forecasting, but also are intended for recording climate and climate variations. Data must be collected promptly, internationally, and in standardized format for this purpose—now primarily as input to computer calculations of the future state of the atmosphere. This collection is carried out under procedures and guidelines established by the World Meteorological Organization (WMO), a specialized agency of the United Nations. Table III.2 provides some basic information on the major observing systems for traditional meteorological observations.
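The “break point” records described above store event times rather than fixed-interval values; for comparison with regular observations they are reduced to interval totals. A minimal sketch, using invented timestamps rather than real data:

```python
from collections import Counter

# Hypothetical event-recorder output: each timestamp (decimal hour of
# day) marks the moment another millimeter of precipitation has fallen.
# Values are illustrative only.
tip_times_hours = [2.1, 2.4, 2.5, 2.9, 3.2, 3.3, 9.8]

# Reduce the event-driven record to fixed-interval (hourly) totals, the
# form in which such data are usually compared with regular observations.
hourly_totals_mm = Counter(int(t) for t in tip_times_hours)
```

Note that the reduction discards the exact event times, which is precisely the information the break-point format exists to preserve; an archive would want to keep the original record, not only the derived totals.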
These ongoing operational observations are made by a large number of parties. Most measurements have been made in support of a particular set of applications, and therefore many existing long-term records have been by-products of operational activities. An awareness of the purpose of measurements is usually helpful in understanding potential sources of bias. Data collection for aviation safety has led to the placement of instruments and the collection of observations at thousands of airports around the earth's surface. Once the data are gathered in support of aviation purposes, they are fed into the system for monitoring weather and climate and for initializing forecasts. In fact, it was largely aviation interests that drove the growth of the observing network in the 1930s and 1940s. A few networks have a broad mission to characterize generally the behavior of the atmosphere and climate. For example, the Cooperative Climate Network includes 8000 volunteers who provide daily weather observations. The

1 Water vapor content is measured and expressed using many different quantities. These include relative humidity, specific humidity, dew point, dew-point depression, wet-bulb temperature, and vapor pressure, among others. With knowledge of an air parcel's temperature and pressure, all of these quantities can be calculated from any one of them.
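The footnote's claim that the humidity quantities are interconvertible can be illustrated with one such conversion: relative humidity from temperature and dew point via a saturation vapor pressure formula. The Magnus-type coefficients below are commonly used approximate values, and the function names are ours; this is a sketch, not an authoritative standard.

```python
import math

def saturation_vapor_pressure_hpa(t_celsius):
    """Saturation vapor pressure over water (hPa), Magnus-type
    approximation. Coefficients are widely used approximate values,
    adequate for illustration."""
    return 6.112 * math.exp(17.62 * t_celsius / (243.12 + t_celsius))

def relative_humidity_percent(t_celsius, dewpoint_celsius):
    """Relative humidity (%) from air temperature and dew point: the
    ratio of actual vapor pressure (saturation value at the dew point)
    to the saturation vapor pressure at the air temperature."""
    return 100.0 * (saturation_vapor_pressure_hpa(dewpoint_celsius)
                    / saturation_vapor_pressure_hpa(t_celsius))

# Example: air at 25 C with a 15 C dew point is roughly half-saturated.
rh = relative_humidity_percent(25.0, 15.0)
```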
TABLE III.1 Volume of Selected Data Sets in Atmospheric Sciences

Type of Data Set | Comments | Dates | Years | Volume

Atmospheric In Situ Observations
World upper air | Two times per day, 1,000 stations | 1962-1993 | 32 | 25 GB
World land surface | Every 3 hours, 7,500 stations | 1967-1993 | 27 | 60 GB
World ocean surface | Every 3 hours (~40,000 observations per day) | 1854-1993 | 139 | 15 GB
World observations during First GARP Global Experiment | Surface and aloft, but not satellite | 1978-1979 | 1 | 10 GB
U.S. surface | Daily, now 9,000 stations | 1900-1993 | 94 | 15 GB

Selected Analyses (mostly global)
Main National Meteorological Center analyses | Two times per day, increasing at 4 GB/year | 1945-1993 | 48 | 50 GB
National Meteorological Center advanced analyses | Four times per day, increasing at 19 GB/year | 1990-1993 | 4 | 58 GB
National Center for Atmospheric Research's ocean observations and analyses | Thirty-eight data sets | | | 8 GB
European Center for Medium Range Weather Forecasting advanced analyses | Four times per day, increasing at 8 GB/year | 1985-1993 | 9 | 76 GB

Selected Satellites
NOAA geostationary satellites | Half-hour, visible and infrared | 1978-1993 | 16 | 130 TB
NOAA polar orbiting satellites | | 1978-1993 | 15 |
  Sounders (TIROS Operational Vertical Sounder) | | | 15 | 720 GB
  Advanced Very High Resolution Radiometer | 4-km coverage, 5 channel | | 15 | 5 TB
NASA Earth Observing Satellite-AM | In development, 88 TB/year, level-1 data | 1998- | |

U.S. Radar Data
U.S. radar data | Domains of 30 to 60 km | 1973-1991 | 19 | 1 GB
Next Generation Radar System (NEXRAD) (a) | 650 GB per radar each year, 104 TB/year for 160-site system | 1997- | | 100s TB

Notes: Many other atmospheric data sets have volumes of only 1 to 500 MB. 1 MB (megabyte) = 10^6 bytes; 1 GB (gigabyte) = 10^9 bytes; 1 TB (terabyte) = 10^12 bytes.
(a) First radars were deployed in 1993.
Historical Climate Network (HCN) is a specialized subset of the Cooperative Climate Network with data from selected stations that have a long record from sites where the surrounding environment has remained stable for decades; these data are particularly important for detecting climate trends within the United States.

In addition to national observational programs, numerous efforts take place at the regional and local levels. Some of these are federal, some are state-funded, some are through regional commissions, and a few are private. Meyer and Hubbard (1992) identified nearly 100 such networks in the United States, encompassing over 600 stations. Since that report, another 100-200 sites have been instrumented. These other networks mostly have been deployed in support of a principal operational mission, although they also may have other “accidental” applications.
TABLE III.2 Selected Traditional Meteorological Observations

Observing System | Present Station Count | Quantities Measured | Observing Interval | Record Length (years) | Comments

National Weather Service (NWS) surface observations | 1000 | Pressure, temperature, humidity, precipitation, cloud cover, visibility, wind | 1 hour | 45 | Some quantities are sampled more frequently.
U.S. Cooperative Network | 5200* | Minimum and maximum temperature | 1 day | 10-100 | *All temperature sites have rain gauges, but not all precipitation sites have temperature gauges. There are some river-stage gauges in the network.
 | 7800* | Precipitation | 1 day | |
 | 3000 | Precipitation | 1 hour | |
U.S. upper air | | Temperature, humidity, winds | 12 hours | 50 |
World synoptic | 8000 | See list for NWS surface | 3 hours | 120 |
World upper air | 1000 | Temperature, humidity, winds | 12 hours | 45 | Some stations only once daily.
Aircraft reports | | Temperature, winds | | 45 | Over 1 million observations per year are obtained.
Ships of opportunity | varies | | 3 hours | 140 |
Snowpack Telemetry (SNOTEL) | 560 | Snow water content, precipitation, temperature | 1 day | 15 | Run by the Soil Conservation Service.
Remote Automated Weather Stations | 600 | Temperature, pressure, humidity | 1 hour | 9 | Run by the Bureau of Land Management and the U.S. Forest Service.
Non-federal networks | 700 | Temperature, humidity, wind, solar radiation, soil moisture | 15 minutes to 1 hour | 10 or less | See Meyer and Hubbard 1992.
Automated Surface Observing System (ASOS) | 30 | Temperature, wind, visibility, cloud ceiling, precipitation | 1 minute | 1 | The number of stations will soon grow to more than 400.

Note: All numbers are approximate.
The primary missions include agricultural growth and production, hydrology, fire potential estimation, highway safety, air-quality measurements, forest and range growth forecasts, and research programs of various emphases, among others. Some, such as the Oklahoma mesoscale network, will serve a broad range of interests, but for a state rather than the nation or a region. The funding for almost all of these networks, however, has public origins.

Foreign climate data reach the United States in several formats. Operational surface weather data, often hourly, and upper air measurements (using balloons), taken twice daily, are disseminated over various international communications networks, usually under WMO agreement. Ship data are exchanged by radio and mail under international agreements sponsored by the WMO. Commercial aircraft regularly contribute weather data while en route. All these data also are useful for climate studies. Climate publications, usually of monthly or annual summaries, are prepared for many nations for their major cities. All are of great value to U.S. interests in operational, strategic, and retrospective analyses.

Though the atmospheric sciences rely heavily on data from many nations, international data sharing has become a contentious issue. A few nations are now attempting to recover the expense of acquiring some atmospheric data through the sale of that information at prices significantly greater than copying and dissemination costs. Detailed satellite data have been affected most by this trend. Archived climate data are being added to the list. Attempts have been made to copyright meteorological data. Decreases in international data sharing may make it more important for the United States to carefully archive any foreign data that come into its possession.

For operational meteorological data, immediacy is of the essence.
Thus, time has priority over accuracy. Especially in areas of high data density, inaccuracies can be detected by humans or by computers from comparison with other data points, making it possible to bypass the inaccuracy. However, when these data are used subsequently for research, accuracy becomes more important than timeliness. Therefore, before storing operational data for retrospective use, the National Climatic Data Center (NCDC) and other agencies maintaining data for long-term use apply quality checks to many data sets.

In the past, operational meteorology rarely used data sampled more frequently than every hour, and usually less frequently than that (though higher sampling rates are now being used operationally). Nevertheless, weather data often have been collected on paper strip charts with resolution of about one minute. These data have the operational use of finding extremes in wind or pressure (the most common quantities with frequent sampling). They also have great value in research for calculating certain statistics of atmospheric behavior. In our computer-dominated world, these older paper records are not likely to see much use unless they are digitized; yet digitization is a time-consuming and expensive procedure. The best preservation and sampling scheme for these records is not yet clear. Similar sampling issues are being raised now by new automated systems that are capable of very high temporal resolution. The matter is complicated further by averaging procedures that are built into some of the measuring and recording systems.

As part of the modernization of the National Weather Service, the measurement of many traditional meteorological quantities is becoming automated. Even temperature measurements will shift largely from human reading of mercury-in-glass thermometers to electronic sensors at automated observing stations. These new sensors are capable of very frequent measurements and consequently much higher data capture rates.
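The comparison-with-neighbors quality checks mentioned earlier can be sketched in a few lines. This is a toy illustration under our own assumptions, not NCDC's actual procedure; the median is used so that one bad value does not contaminate the reference against which each observation is judged.

```python
from statistics import median

def flag_suspect(values, threshold):
    """Flag the index of any observation that differs from the median of
    the other observations by more than `threshold`. A toy version of
    neighbor-comparison quality control."""
    suspect = []
    for i, v in enumerate(values):
        others = values[:i] + values[i + 1:]
        if abs(v - median(others)) > threshold:
            suspect.append(i)
    return suspect

# Simultaneous temperatures (deg C) from nearby stations; the third value
# looks like a transmission or transposed-digit error.
observations = [21.3, 20.8, 71.1, 21.0, 20.5]
suspect_indices = flag_suspect(observations, threshold=10.0)
```

In operational use the flagged value can simply be bypassed, as the text notes; for the retrospective archive it would instead be corrected or annotated.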
This change in technology also will create great difficulties for researchers wishing to use long homogeneous time series, and it will provide a severe test of the quality of the metadata that document the changes in instrument response. These changes will require great attention to the preparation of metadata to allow the new data to be used as part of the time series already begun.

Remote Sensing from Space

In situ point measurements are being supplemented increasingly by remotely sensed information. Aside from their wide areal coverage and relatively frequent repetition cycle, the resulting data sets are probably best known for their huge sizes. Each new system, as it is deployed, usually is designed to push the current storage technology to its limits. Several space-based platforms, or series of platforms, are included in Table III.1. For the last 20 years, the volume of satellite data for atmospheric science use has dominated the data volume lists of NOAA, NASA, the U.S. Air Weather Service, and U.S. Navy weather units. Attention to this large volume has resulted in practical technological solutions, so that currently all the satellite data can be recorded and made available to users in large or small blocks.
In addition to acquiring data from the growing domestic sources, U.S. agencies come into possession of a large quantity of data from the satellites of other nations. Much of this is obtained through bilateral agreements. Foreign satellites sometimes obtain measurements over parts of the globe where coverage from U.S. satellites is poor, and these foreign data generally cannot be re-obtained.

Satellite data present several difficulties because the quantities measured must be manipulated extensively to yield the quantities of meteorological interest. Almost all remote sensing technologies, regardless of platform, share this problem. For example, the imaging instruments that provide the familiar satellite pictures of earth usually measure infrared or visible light irradiances in narrow frequency bands. For a particular application of the infrared data, these irradiances are converted first to brightness temperatures, and then into estimates of cloud height. The results of this transformation depend strongly on the algorithm used, which in turn depends strongly on the state of the art at the time of the calculation. Hence, the most useful form of the results for either research or operational purposes has greater uncertainties than the original measurements. Similar considerations apply to other satellite-based observations, such as multichannel infrared radiometry for deriving temperature profiles, microwave radiometry for estimating water vapor, scatterometry for estimating surface winds over the oceans, and radar altimetry for estimating sea-surface height. Improvements in the algorithms that transform the basic measurements of lower-level data into more useful quantities are to be expected.

Like all measuring systems, those on satellites can suffer from calibration drift and sensor degradation.
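The irradiance-to-brightness-temperature step described above is, at its core, an inversion of the Planck function: the brightness temperature is the temperature a blackbody would need in order to emit the observed radiance. The monochromatic sketch below is our own simplification; operational algorithms work with band-averaged, instrument-specific calibrations.

```python
import math

# Physical constants (SI units)
H = 6.626e-34  # Planck constant, J s
C = 2.998e8    # speed of light, m/s
K = 1.381e-23  # Boltzmann constant, J/K

def planck_radiance(temp_k, wavelength_m):
    """Monochromatic blackbody radiance, W m^-2 sr^-1 m^-1."""
    c1 = 2.0 * H * C ** 2
    c2 = H * C / K
    return c1 / (wavelength_m ** 5
                 * (math.exp(c2 / (wavelength_m * temp_k)) - 1.0))

def brightness_temperature(radiance, wavelength_m):
    """Invert the Planck function to recover the temperature that would
    produce the observed monochromatic radiance."""
    c1 = 2.0 * H * C ** 2
    c2 = H * C / K
    return c2 / (wavelength_m
                 * math.log(1.0 + c1 / (radiance * wavelength_m ** 5)))

# Round trip at a typical 11-micrometer infrared window wavelength.
radiance_obs = planck_radiance(280.0, 11e-6)
t_b = brightness_temperature(radiance_obs, 11e-6)
```

The further step from brightness temperature to cloud height then depends on assumed atmospheric profiles, which is where the algorithm dependence the text emphasizes enters.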
However, the remoteness of the sensors can make it more difficult to cope with these problems than for land-based sensors. Furthermore, satellite-based sensors have been undergoing continual major improvements with each new series of platforms. These problems have created great difficulties for researchers attempting to create homogeneous time series of geophysical quantities from satellite-based measurements. Overlapping time series from different platforms sometimes allow cross calibration and provide some continuity. The large data volumes, the continual change in data processing algorithms, and the continual change in instrument behavior all raise important issues relevant to the requirements for the long-term archiving of data obtained from space-based platforms.

Remote Sensing from the Surface

Though some surface-based remote sensing systems have been operational for several decades, there is an ongoing rapid increase in the number of such measuring systems. The modernization of the National Weather Service is greatly increasing the number of remote sensing systems, the temporal frequency of observations, the areal coverage, and the spatial density of remotely sensed observations. The planned increases in data accumulation rates are staggering.

The largest source of remote sensing data from surface stations is weather radars. Processing and storage of older radar data, and even of some current radar data, is not very sophisticated. Much of the data consists of manually drawn patterns based on screen observations that are then encoded in digital form. These are prepared once per hour. Photographs are taken routinely of all network radars (about 80) at intervals of 40 seconds in severe weather, every five minutes when precipitation is within 125 miles of the radar, and every 15 minutes when precipitation is only between 125 and 250 miles from the station. These films are archived at NCDC. A few radar sites save digital data for research purposes.
NCDC has been archiving these data for a handful of sites, and NCAR also has many years of digital radar data for the United States.

As a major component of its modernization, the National Weather Service, along with the Federal Aviation Administration and the Department of Defense, is in the process of deploying a new generation of weather radars (the WSR-88D). This single system is of particular note because the resulting data stream will be the largest from any set of measuring devices in the earth sciences. The system, known as NEXRAD, will consist of about 160 sites when deployment is completed in 1996. The WSR-88D radars will have a far more sophisticated digital retention and recall system than the older systems discussed above.

Data retention decisions for the NEXRAD program are still evolving and uncertain. However, each change in policy has headed in the direction of saving more of the data. Present plans do not include saving the level-1 analog data directly from the receiver, though recording is possible for diagnostic purposes. Originally, the plans called for only a limited number of recorders for the level-2 data, consisting of unprocessed digital reflectivity and Doppler shift data, but efforts are now being made to have recorders at all sites and for the data eventually to reach NCDC for archiving. A selected set of the level-3 products,
such as plots of reflectivity and winds at particular levels, precipitation, and cloud-top heights, will be archived at NCDC. One of the level-3 products will be similar to the older manually digitized maps, for continuity. At this time, there is no intention to save the special-request and combined analyses designated level 4. The data volume from a radar site is weather dependent, but it is expected to average 650 gigabytes/year, filling about 200 Exabyte tapes. Therefore, the 160-site system will deliver over 100 terabytes/year. Workers at NCDC and the National Severe Storms Laboratory at Norman, Oklahoma, are still developing options for data compression at the drive. Their tests show surprisingly good compression factors of 8 to 11.

Other types of surface-based remote sensing are expected to generate greatly increasing amounts of data in the near future. Several wind profilers with high data-capture rates are already in use. Lidar is being used to measure the concentrations of many atmospheric constituents. Furthermore, there are large-dish research radars, such as those at the Arecibo Observatory in Puerto Rico and the Haystack Observatory, used for upper-atmosphere research. Concerted effort is needed to ensure that the data from all these sources are accessible.

Other Data Sets in the Atmospheric and Related Sciences

There are many other data sets that provide information on the atmosphere, and still more data sets on processes that affect the atmosphere. As the time scales under consideration get longer, the boundaries between disciplines become more vague and the number of relevant data sets increases further.

Other Contemporaneous Observations

Contemporaneous observations refer to measurements of some phenomenon as it is occurring.
This is largely to distinguish such observations from indicators of paleoclimate or any other measurement that is made to learn about the past. The following discussion, while not exhaustive, provides some sense of the great diversity of data sets handled by atmospheric scientists. This diversity presents severe management, indexing, and archiving problems. Though the data sets discussed in this section are mostly not large, they have great scientific and economic importance.

Some meteorological data, though collected under routine programs, are obtained only sporadically because the phenomena themselves occur irregularly. Such data include hurricane tracks and intensities, tornado incidents, and lightning-strike data. Other data are collected less regularly than desired because of budgetary constraints, such as rocketsonde soundings of the upper atmosphere and electric and magnetic field measurements.

Of fairly direct relevance to the atmospheric sciences are measurements of the earth's water, snow, and ice. These include river-stage levels, lake levels, soil moisture, storm-surge measurements and tide-gauge levels, wave heights, sea-ice coverage and thickness, glacier extent and thickness, snow cover, ocean-surface data, and ice-buoy data.

Studies of climate have led to increased interest in measurements related to the radiation budget. These include determinations of solar radiation, vegetation coverage, surface albedo, and atmospheric aerosol types and concentrations. Also needed to determine radiation budgets are atmospheric trace-gas concentrations of carbon dioxide, methane, chlorofluorocarbons, ozone, nitrous oxide, and others. Each of these substances has further data pertaining to the anthropogenic sources and biogeochemical cycles that affect it. Other atmospheric constituents are studied for reasons of public health and air quality, including hydrocarbons, various nitrogen oxides, near-surface ozone, and pollens.

Many other historical records provide information useful to climatologists.
Some of these were not made for scientific purposes and some provide information spanning several centuries. These include harvest records, flood records, and personal diaries. Even paintings of a famous glacier have allowed inferences of its growth and retreat. All of these data have the property that if lost, they cannot be replaced. Some quantities vary so slowly that although measurements of them might be considered contemporaneous, changes in the measurements are probably more indicative of changes in measurement skill than changes in the underlying phenomenon. Such quantities include land and water boundaries, topography, and bathymetry. Accurate knowledge of these is necessary for many different types of atmospheric models. Though paleoclimatologists are interested in time series of these earth features on time scales of millions of years, contemporaneous measurements cannot provide such a series. For the slowly varying properties of the earth, the atmospheric science community usually is content with the best current information available.
Analyses, Forecasts, and Simulations

For many decades, interpolation methods of various degrees of sophistication have been used to infer the state of the atmosphere other than at the actual observation points. Until the 1970s, the most common of these were hand-drawn maps. Automated procedures now dominate both numerical and graphical analyses. Dissemination and storage methods have changed from mostly drawings on paper to largely digital form. With the shift of these images to digital formats, electronic archiving is easier and is becoming regular practice.

Blended analyses that serve as the initialization for numerical forecast models are coming into more widespread use as a data source. These analyses combine many different types of data and are usually model dependent. Whether such information should even be considered “data” has sparked considerable debate. There is no agreement even as to the terminology that should be applied to such model-dependent products. Nevertheless, some users are content to treat such information as “observations.” Their rationale is that a wide variety of primary measurements have been integrated and synthesized into a coherent and self-consistent picture with the help of a model, which in this case can be thought of as simply an extremely sophisticated interpolation tool. The convenient (usually gridded) form of such information attracts many researchers. Such analyses are provided regularly by the U.S. National Meteorological Center (NMC) and the European Centre for Medium-Range Weather Forecasts (ECMWF). Most of these analyses are the result of real-time operational procedures. However, there are now significant efforts to reanalyze older data using the current model-initialization schemes to provide a more temporally homogeneous data set.

Forecasts themselves also comprise a data set.
The archiving of forecasts allows study of the forecast process and evaluation of improvements over time. Statistical tools have been developed that improve local forecasts by making adjustments based on a comparison of several years of archived forecasts and observations. Legal proceedings occasionally rely upon archived forecasts and warnings. Atmospheric scientists are among the greatest consumers of computer resources. Most of these resources are devoted to simulations of the atmosphere under various assumptions. Computer simulation output is considered “data” by some scientists, but not by most. Clearly any simulation can be repeated, but sometimes only at great expense. Large volumes of output from global general-circulation-climate models are being maintained, mostly at NCAR. This model output is being used primarily for research on climate change issues. The great expense of running the global models has made it worthwhile for investigators to prepare archives of model runs so that the output can be used in a variety of ways. Field Programs There have been many field programs to collect extensive data for the study of problems in the atmospheric sciences. Programs have emphasized topics such as tropical rain systems, synoptic meteorology of the globe for a whole year, hail formation, the radiative effects of clouds, interannual-climate variations, and many others. Field programs raise special problems for the long-term archiving of data and have also demonstrated solutions to some data access problems. Some of the data collections from field programs will lose their importance after a better experiment has been performed, or after the main scientific questions have been resolved. Some collections are very valuable for at least several decades. For example, the combined data from satellites, ships, and aircraft of several countries that were collected for the 1974 GARP Atlantic Tropical experiment still are being used actively. 
Field programs characteristically acquire many types of data for a limited time and in limited space. Since there usually will be at least several dozen, if not several hundred, users of the data from a field program, it makes sense to package the data for easy use and to extract subsets from other global archives once, on behalf of all users, rather than force the users to acquire such data on their own. For example, the GALE (Genesis of Atlantic Lows Experiment) project developed 126 data streams. The organization and preparation that go into this packaging are often just what is needed to allow easy long-term archiving of the data. More than 6 person-years were required to prepare the large number of GALE data sets and the associated descriptive information (metadata). Many field programs now provide their data sets, well packaged and documented, on CD-ROM. However, the large number and great diversity of relatively small data sets create cataloguing and access problems when decisions need to be made on whether or how these data sets are to be integrated with other sets of atmospheric data containing longer time series.
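The value of this packaging lies largely in the metadata. As a minimal sketch, a self-describing record for one hypothetical data stream might look like the following; all field names and values are illustrative assumptions, not an actual GALE catalog format:

```python
# A minimal, self-describing metadata record for one field-program data stream.
# Every field name and value here is an illustrative assumption, not an actual
# GALE catalog format.
dataset = {
    "name": "surface_temperature_mesonet",
    "program": "GALE",                        # Genesis of Atlantic Lows Experiment
    "period": ("1986-01-15", "1986-03-15"),   # limited time ...
    "region": {"lat": (30.0, 40.0), "lon": (-82.0, -70.0)},  # ... limited space
    "variables": [{"name": "air_temperature", "units": "K", "interval_s": 300}],
    "processing": "calibrated and quality controlled",
    "source": "special measurement",          # vs. subset of an operational archive
}

def catalog_entry(d):
    """Render a one-line catalog entry from a metadata record."""
    start, end = d["period"]
    names = ", ".join(v["name"] for v in d["variables"])
    return f"{d['program']}/{d['name']} ({start} to {end}): {names}"

print(catalog_entry(dataset))
```

A record of this kind, multiplied across each of a program's data streams, is essentially what a documented CD-ROM release or archive inventory provides.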
Some data are specially measured during the experiment, and others are extracted from larger ongoing operational data collection efforts. Packaging these together creates problems of redundancy for data taken from operational efforts. A program to examine cloud radiative processes, called FIRE (First ISCCP Regional Experiment), integrated data sets from satellite, ship, aircraft, and land-based sources. Four experiments in the last eight years each produced about 5 gigabytes of data, a significant fraction of which was operational data. With this quantity of data, redundancy in an archive is not a significant problem. However, the recently initiated Atmospheric Radiation Measurement (ARM) project of the Department of Energy plans to assemble data sets of tens of gigabytes, much of which is part of other, larger data sets.

Paleoclimatology

There are many different types of paleoclimatology data. Such data include tree-ring widths; gas concentrations, isotope ratios, and dust levels in ice cores; sediment-core analyses; paleoflood-stage indicators; fossils; and even the contents of ancient pack rat middens. After measurements are made on the samples collected, the original samples typically are stored for very long periods of time. Most measurements are very labor intensive. As a result, paleoclimate data sets are generally very small.

Paleoclimate data are almost all collected originally for research purposes. However, data on events long past also can have operational value. In vital studies such as the determination of the magnitude of extreme precipitation events for dam licensing or utility siting and design, the value of the limited historical precipitation record can be enhanced greatly by geomorphological indicators of paleoflood stage and sediment deposits.
It is usually possible, in principle, to recreate data similar to existing or lost paleoclimate data, but paleoclimate studies are very expensive. Most of the work in the United States is government funded. However, because samples come from all parts of the world, including very remote regions, many projects are cooperative international efforts. The data sets are usually small, making it possible for most of the data to be contained within papers in the open literature. There is an effort to build a centralized archive at the National Geophysical Data Center in Boulder, Colorado, which is the primary World Data Center for Paleoclimatology Data Sets. However, some of the original data can still be found only in the custody of the investigators who prepared them. A small number of very prominent data sets have been made available from data centers or over computer networks.

Laboratory Data

Laboratory data relevant primarily to the atmospheric sciences are fairly limited. The nature of these data and their archiving problems are similar to those covered by the Physics, Chemistry, and Materials Sciences Data Panel in this volume. The relevant laboratory data arise largely from two types of work: experiments attempting to simulate some atmospheric phenomenon and experiments to develop a measurement technique or sensor. Laboratory thermodynamic data are essential for atmospheric research and operational activities, but these are largely not determined or archived by atmospheric scientists.

Recreation of most laboratory data is possible. However, some data related to the sensor response of operational measuring devices may be nearly impossible to reproduce once the sensors are no longer made and parts have degraded. Sensor-response data are an important part of the metadata for operational measurements. There is no organized archiving of laboratory data for the atmospheric sciences. Summarized forms of these data can usually be found in the open scientific literature.
3 USES FOR ATMOSPHERIC DATA

Meteorological and other atmospheric data are used for various purposes on different time scales. It is convenient to delineate three: (1) real-time or current, (2) recent past or short-term retrospective, and (3) distant past or long-term retrospective. Meteorological data are probably used by a much wider segment of the U.S. population than the data of any other scientific discipline, because they relate directly to many practical concerns of nearly everyone on a daily basis. Other kinds of geophysical data (solid earth, upper atmosphere, oceans, etc.) are more likely to be used by the scientific community or technicians, but there is a large lay audience for weather and climate information.
Metadata for weather observations have been expanded and organized into families of “station histories” for both NWS hourly and cooperative stations. These are now placed in a single secured location, though they do not reside with the primary data. In cooperation with DOE, NCDC has constructed a Global Historical Climate data set of yearly and monthly temperature, precipitation, and pressure data. NCDC is in the process of developing an automated updating system for the U.S. Historical Climate Data Set. These “historical” data sets have the most stringent requirements for data quality and continuity. The cooperative (daily) station history files have been placed into a computer-based file. A number of high-quality data sets have been made available on CD-ROM. NCDC has made available on-line inventories and some data sets through the Internet. Finally, of great importance for the long run, NCDC has established, after many years of misunderstanding, a solid working relationship with NARA.

National Center for Atmospheric Research

One example of data archives located near research efforts is the data bank at NCAR. The NCAR archiving activity was started in 1965 to help NCAR and university scientists obtain access to data for research problems. Researchers use data from the archive and contribute data back to the archive. There are about 1100 users of the supercomputers at NCAR, 400 from NCAR and 700 from universities. These individuals can access the archived data on-line and can use simple access programs that accomplish the task of unpacking the data. Other users obtain the data on storage media (round tapes, cartridges, Exabyte tapes, CD-ROMs) or by network. This data archive at NCAR is a major resource, with over 400 data sets and a volume of over 4 terabytes.
Many types of data are included, especially atmospheric data, ocean data, selected land-surface data, and data from climate models. The staff size was usually 5 to 7 people from 1972 to 1991, but it was more recently expanded to a total of 10 people to assist with a number of new projects. The NCAR archive is devoted to research databases and includes many data development activities. It takes advantage of other data activities in the United States and other countries. Additional work is often needed to make the data easy to access at low cost. NCAR organizes data so that it is practical to use very large amounts of data for large projects, while small projects are still supported. An on-line system provides extensive information about the data sets and is heavily used. The archive activity is located near several research groups. It prepares data for those groups and becomes an advocate for their needs in the wider data-providing community.

NCAR is currently involved in several projects of relevance to archiving. Methods are being developed for low-cost distribution of very large data sets. On-line information systems describing the archived data are being expanded. NCAR scientists are working with NCDC and NOAA's Environmental Research Laboratories to prepare consistent, long-term data sets of surface marine data. A project is underway for the National Meteorological Center to reanalyze the world's atmosphere every six hours from 1958 to 1993 in a consistent, homogeneous manner (Kalnay and Jenne, 1991) using worldwide upper-air rawinsonde data, land-surface and ocean-surface data, aircraft reports, satellite-derived temperature soundings of the atmosphere, and satellite-derived winds from cloud motions.

Carbon Dioxide Information Analysis Center

The Carbon Dioxide Information Analysis Center (CDIAC), though not formally designed as a long-term archive, serves as a good example of how a data archive should acquire and package data sets.
CDIAC's mission is to acquire, compile, quality assure, document, archive, and distribute information on CO2 and climate change. In the process, it actively promotes and supports international data collaboration efforts. CDIAC currently maintains over 200 data sets, but these are mostly small and fill only 5 gigabytes. The collection is quite selective. The data sets are primarily related to the atmosphere, including data on atmospheric chemistry and climate. About 50 of the data sets are unique to CDIAC, largely because they were compiled from more disparate data sources. Some data sets are duplicated elsewhere, sometimes even within government holdings. Part of CDIAC's success can be attributed to consistent funding. The Center's work is totally devoted to making data available to users other than those who initially obtained the data. There has been a deliberate effort to
add value to the original data through packaging with metadata and providing the data in easily usable form. Though CDIAC does not have the force of law to obtain data sets (unlike NARA and NOAA), it has been very successful in obtaining the data it wants, for several notable reasons. No format requirements are imposed on incoming data. The data are not changed without the approval of the investigator(s) who provided them, and data are not distributed in preliminary form. Full authorship credit is given to the investigators who provided the data, and that credit accompanies the data set whenever it is distributed. Lastly, the CDIAC staff actively work to obtain data from researchers.

Users of archived data need clean, consistent data more than immediate data; this greatly informs the decisions made. There is a strong emphasis on data quality and comprehensive metadata preparation. Though a 20-year test of usefulness is used as a target, the quality of the metadata and packaging is sufficiently high that the resulting products are likely to survive much longer. The services and data sets provided by CDIAC are free of charge, greatly encouraging their use. The formats for distribution and metadata are flexible, tailored to the particular data set and user requests. Access is possible over the Internet through anonymous ftp, or by almost any other medium of the recipient's choice.

In many ways CDIAC is a model archive. However, it has several luxuries not shared by NARA or NOAA. It restricts itself to data on a particular limited subject and emphasizes small data sets. Though many of the data are historical, CDIAC is largely concerned with satisfying current users rather than anticipating the needs of users several decades in the future. Nevertheless, much can be learned from the successes at CDIAC.
Pathfinder

In 1989, NASA and NOAA initiated a “Pathfinder” data recovery program, which concentrated on recalibration and reprocessing of older satellite data. Pathfinder reaches back 10-12 years into a vast (greater than 100 terabytes) data set of operational data from polar-orbiting and geostationary satellites. It has been driven by science questions related to awareness of possible global change and by the need to develop and test new data information systems with extensive data sets that emulate those of the upcoming satellite systems.

Retention Criteria

It is clear that NCDC feels pressure from two goals, which are sometimes in conflict. One goal is to preserve needed data for the long term. The other goal is to follow the U.S. Records Control Act, which tries to limit the accumulation of data and the needed floor space. The pressure to discard data frequently has been greater than the pressure to keep it. A considerable amount of early data was purged that should not have been.

It is quite difficult to provide a simple set of practical retention criteria for scientific data and other technical records. A simplistic rule, such as “save everything,” quickly runs into problems with the storage of large amounts of unusable material, difficulties in finding useful material, and cost. Similarly, a rule to “discard any data set that is not accessed for a period of twenty years,” or any other arbitrary length, would surely lead to the destruction of irreplaceable data acquired at great expense, with possible valuable use in the distant future and little cost to maintain. Though saving everything in usable form is not currently practical in the atmospheric sciences, if all data sets were initially prepared with retention in mind, the incremental cost of archiving the data set would be sufficiently small that it would almost surely survive.
Priorities

Although the panel has been unable to develop hard-and-fast rules for retention decisions, it has been able to agree on some priorities for making those decisions. The panel suggests that in determining the priorities for long-term archiving, consideration should rely most heavily on:

- the unusualness or rarity of a data set,
- the level of quality control,
- the availability of metadata,
- the value of the data for long time series,
- a preference for original data over processed data,
- the possibility of recreating similar (or better) data,
- the cost of recreating similar data, if possible, and
- the cost of archiving.

Evaluating any given data set against these criteria can itself be a difficult process. The panel commends NCDC for finally arranging for regular review by an outside scientific advisory group, which is helping set data rescue and archiving priorities. The panel suggests that NARA, whether or not it obtains scientific expertise on its staff, should regularly convene groups of outside scientific experts to assist in decisions on the archiving of scientific data and the priorities for making such decisions. Further, if there are significant costs in the continued archiving of data sets, panels of outside scientists should be asked periodically, perhaps every 20 years, to assist the government in deciding which data sets deserve continued support. These procedures and priorities will not totally eliminate errors and regrets, but they should help to reduce them.

The unusualness or rarity of a data set should weigh heavily in determinations of whether or not to archive it. For example, the present data sets on atmospheric trace gas concentrations, field experiments measuring cloud condensation nuclei, and Atlantic hurricane tracks are all more unusual and should rate a higher priority for archiving than, say, the RAWS network of traditional meteorological data, even though the RAWS data are still quite valuable. Atmospheric data from a maritime location are presently more valuable than similar records from a land-based station. Unusual data sets, such as the kite-based boundary-layer and upper-air measurements, should maintain a high priority for archiving even if they go through very long periods without use, because they are the data sets for which we are least likely to be able to forecast future needs accurately.
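These criteria are qualitative, and the panel proposes no numerical scheme. Purely as a toy illustration of how they might be combined into a single ranking, one could imagine a weighted score; every weight and score below is hypothetical:

```python
# Toy illustration only: the panel proposes no numerical scheme, and every
# weight and score below is hypothetical. The criteria mirror the panel's list.
CRITERIA_WEIGHTS = {
    "rarity": 3.0,             # unusualness weighs heavily
    "quality_control": 2.0,
    "metadata": 2.0,
    "time_series_value": 2.5,  # value as part of a long time series
    "originality": 1.5,        # preference for least-processed data
    "irreplaceability": 3.0,   # possibility and cost of recreating similar data
    "archiving_cost": -1.0,    # high archiving cost counts against retention
}

def retention_priority(scores):
    """Combine 0-to-1 criterion scores into one retention-priority number."""
    return sum(w * scores.get(name, 0.0) for name, w in CRITERIA_WEIGHTS.items())

# Hypothetical scoring of a hurricane-track data set.
hurricane_tracks = {
    "rarity": 0.9, "quality_control": 0.7, "metadata": 0.6,
    "time_series_value": 0.9, "originality": 0.8,
    "irreplaceability": 1.0, "archiving_cost": 0.1,
}
print(f"retention priority: {retention_priority(hurricane_tracks):.2f}")
```

In practice the weighting itself would be the contentious step, which is precisely why the panel recommends review by outside scientific experts rather than a fixed formula.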
The level of quality control for a data set and the availability of metadata must necessarily affect decisions on the archiving of that data set. Although the hope is that all data sets are of high quality and are accompanied by detailed metadata, that is not the case. If, because of mistakes already made, a data set is unlikely to be of great value to anyone interested in data of that type, or if the data are more likely to mislead than to inform, then that data set should be a low priority for archiving, or perhaps not archived at all even if resources are available.

In the atmospheric sciences, the value of a data set as part of a long time series is a very important criterion for archiving decisions. The temperature record for a year from a station operating over a century is much more valuable than a similar record from a nearby station with a shorter lifetime. Studies of climate change and other types of global change find long time series to be essential. Confirmation of the “ozone hole” required reference back to the Dobson column ozone data from the first half of this century, when no long-term trends in ozone were expected. The U.S. Historical Climate Network data are a high priority for archiving because they represent a long time series of high-quality data, with excellent metadata; this combination of attributes in data of a common type makes the overall data set quite unusual.

Almost all atmospheric data go through some amount of processing, recalibration, error correction, or combining with other data before they are made available. The unprocessed data may, in many cases, be virtually unusable. The processing of a data set may provide considerable added value to the users. All that said, the panel recommends that a greater priority for long-term archiving be placed on the retention of least-processed data (i.e., level-1), along with sufficient description of any algorithms needed for making use of the data. This need not be pushed to extremes.
The panel is not suggesting that thermistor voltages be stored along with calibration curves rather than temperatures. However, radar reflectivity data should be archived in preference to derived rainfall rates (though if there are sufficient resources, we have no objections to archiving both). The more likely the case that further research will result in the redetermination of the algorithms used to process a data set, the more appropriate the choice of the lower-level data for archiving. Even for cases where the processing of a data set and the packaging of related data sets for research purposes took considerable cost and effort, if the combined or processed data package is no longer being actively used and the underlying data are already being archived, the combined or processed data should be a low priority for archiving. For example, under NARA's archiving considerations, the archiving of prediction model initializations (which nevertheless may provide the best overall description of the atmosphere at a given point in time) should have a much lower priority for indefinite archiving than the diverse data sets that went into the initialization. On the other hand, there is sufficient value in the data sets of model initializations for NCDC to store these data sets for decades, or until improved algorithms have been run with the original data.

There is a tendency in the atmospheric sciences to think of all data as being potentially part of an eventual long time series and not being subject to recreation. Though there is some truth to this, not all data sets have equally
strong claims. The data from some older process studies have been surpassed by more recent experiments. For example, old data on surface fluxes of heat and moisture now have little value in the wake of studies with better instrumentation. Similar reasoning holds for some cloud process studies. The data from many process-oriented field experiments are not likely to become part of any homogeneous time series and may eventually be made obsolete by better measurements. However, if better data have not yet been obtained, it should be remembered that the cost of obtaining almost any data set was greater than the cost of archiving that set. The panel finds that the great cost of almost all data-gathering efforts, even those that can be repeated, argues strongly for providing sufficient resources for archiving.

Foreign data present a somewhat special case. The panel notes that obtaining in situ foreign data can be considerably more difficult than obtaining similar domestic data. The United States has many very important retrospective uses for foreign data, some of them during hostilities. Data may become unavailable to the United States even though they continue to exist elsewhere in the world. The archiving of foreign data, therefore, should be considered a high priority. The panel recommends that strong efforts should be made to archive indefinitely all in situ atmospheric science data from outside the United States that federal agencies (including the military) obtain in the course of their activities, even if it is possible to construe these data as not otherwise being federal records.

High-volume data sets also present special difficulties. Scientists in discipline areas need to ask what really does make sense to save in the first place, and what needs to be saved beyond 20 years.
The costs associated with such options should be known and discussed along with the value of the data. The panel suggests that periodic review of large data sets, perhaps every 20 years, be made by scientists outside the government to determine the most effective storage strategy, including the possibility of retaining only a sample of a data set. Issues in the saving of sampled data are discussed further below.

Sampling

Considerations of data sampling apply to measurement systems, to archiving strategy, and to providing reasonable user access. Even before archiving decisions are made, many sampling-rate decisions have already been made. Temperature sensors and wind gauges could easily be sampled 100 times per minute, but that would be unnecessary and counterproductive for nearly all uses. It is necessary to ask whether a high sampling rate makes sense for a network where stations are located 20 to 200 km apart. For networks, it usually is desirable to have an averaging time that will average out part of the small-scale variability that cannot be measured by the network anyway. However, archivists need only consider the data that have at least initially been stored, and so are spared some of these questions.

There is some rationale for wanting data sampled at a high temporal rate over a relatively coarse grid. It has been shown, for example, that for extreme value analysis it is much better to use areal averages of the required statistical descriptors (such as coefficient of variation, skewness, and kurtosis) than simply to use station values. For important types of engineering applications, a good knowledge of the statistics, extremes, and return periods of short-duration precipitation values is necessary; e.g., 40 years of 5-minute precipitation extremes can be used to estimate the 100-year return value. Still, the high temporal-sampling rates might not be needed for every station.
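As a sketch of the kind of estimate referred to above, a method-of-moments fit of the Gumbel (extreme-value type I) distribution to a series of annual maxima yields a T-year return value; the 40-year synthetic record here is an assumption for illustration, not real precipitation data:

```python
import math
import random

def gumbel_return_value(annual_maxima, return_period_years):
    """Estimate the T-year return value from a series of annual maxima,
    using a method-of-moments fit of the Gumbel (EV-I) distribution."""
    n = len(annual_maxima)
    mean = sum(annual_maxima) / n
    var = sum((x - mean) ** 2 for x in annual_maxima) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi   # scale parameter
    mu = mean - 0.5772 * beta               # location (Euler-Mascheroni constant)
    p = 1.0 - 1.0 / return_period_years     # annual non-exceedance probability
    return mu - beta * math.log(-math.log(p))

# Hypothetical record: 40 years of annual-maximum 5-minute precipitation (mm).
random.seed(1)
maxima = [random.gauss(12.0, 3.0) for _ in range(40)]
print(f"estimated 100-year value: {gumbel_return_value(maxima, 100):.1f} mm")
```

The point of the illustration is that the estimate depends on the archived extremes themselves: discard the 5-minute data and the 100-year return value can no longer be computed.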
A debate that includes cost and utility tradeoffs should be conducted within the data-collecting agencies to determine whether the high-rate data should be saved. Often the science community is not told of the cost tradeoffs when its advice on such questions is asked; such tradeoffs should always be described. The problem of data sets with rapid sampling rates is not new, nor did it begin with satellites, radars, and ASOS stations. There is still a large archive at NCDC of strip-recorder charts that have pressure, temperature, and wind graphs on paper, where the instrument response times are probably from 10 seconds to 2 minutes. Decisions have been made that the paper charts will no longer be made or archived because of cost. If the equivalent digital data are needed, the national network of about 1,500 planned automated stations must be sampled at a rate of once per minute rather than once per 5 to 20 minutes as planned. Nearly all uses of the data require data only every 10 to 60 minutes, so a once-per-minute archive makes it 10 times harder and more costly to obtain and process the required data. Archiving decisions also must be made on the already-decaying paper records, not all of which have been copied to microforms. Most of these records have only limited use. The panel suggests that only a sample of the old (more than 50 years) paper barograph, wind, and similar strip-chart records need be archived. The archiving of only a sample of our largest data sets raises difficult issues in statistics, data management philosophy, and budgeting. In concept, there may be acceptable procedures for the long-term archiving of only
samples for very large data sets. However, the panel did not have the resources to fully examine these issues, and recommends that a focused study be undertaken to evaluate the tradeoffs and problems of archiving for the long term only a sample of our largest data sets in the atmospheric sciences, with an eye toward finding acceptable sampling techniques.

Effects of Insufficient Budgets on Archiving

The databases at NCDC represent the largest collection of historical weather information in the United States. They have undergone many changes in formats, storage media, data management, and handling and processing procedures, often under inadequate funding. NCDC has pursued various data rescue, archival, and systems modernization efforts with limited success. However, data-set preparation for archiving and care is expensive. In recent years, there have been significant accomplishments in data management, archival procedures, data conversion, and data availability. Much, however, remains to be done. Specifically: there are vast holdings at risk (with neither compression nor backup); many holdings are neither properly inventoried nor prioritized; computer systems at NCDC lag behind the current state of the art, and are overloaded, without an adequate replacement plan; many of the data collection systems are inefficient, antiquated, and incompatible with others; backup tapes, for a portion of the data, are in close proximity to the operations, placing those data at risk; and systems planning for data protection has lagged. At one time the USAF had nearly 15 years of thunderstorm summary data from the Soviet Union. These data were totally lost when NCDC was unable to accept them due to resource constraints.
Most of these problems are a direct result of decades of inadequate funding and have been only partially overcome by the dedication and heroic efforts of highly motivated NCDC staff. Though NCDC is the active archive for NOAA weather data, it does not have all the NOAA atmospheric data. It has a substantial fraction of the operational NOAA data sets, and a much smaller fraction of the research data sets generated by NOAA funding. It is barely keeping up with the quantity of information pouring into the Center. The panel questions whether it is a good idea for NCDC to make efforts to physically obtain yet more databases without funding specifically allocated to the acquisition and maintenance of each specific data set; otherwise, its very limited resources would be diluted even further. However, the panel can think of no better home for most U.S. atmospheric data. A great difficulty in the funding of archives is deciding who pays for long-term retention and access. The users who will benefit most from the archive may not yet anticipate their own need or even yet be born. Along with the initial planning for the collection of data, there should be initial budgeting for data management and preservation. For example, DOE's ARM program allocates about 15 percent of its total program budget to data management and archiving. EOSDIS will claim a similar share of the EOS funds. This amount is significantly more than many projects might need, but the panel supports the concept. Often there is an attitude that if an agency accepts or archives data at all, full support with extensive resources must be made available. NARA and NOAA should instead explore providing a range of levels of support, with different levels of storage-media quality and of access convenience for data sets, decided deliberately rather than by default through short-term budget fluctuations. As has already been noted, the best technology is not always appropriate.
The panel reminds the agencies, as have many prior studies, that maintaining archives should be among agencies' highest priorities because the costs of maintaining and archiving data are small compared to the original data collection and preparation costs. On the whole, putting more money into data management usually yields a better return on our investment than collecting more data. On a broad level, the federal government should examine priorities for additional funding of data management and archiving.

5 INSTITUTIONAL ARRANGEMENTS AND NATIONAL RESPONSIBILITIES

NARA has the general responsibility for overseeing the disposition of all federal records and retaining all federal records worthy of indefinite archiving. This is a daunting responsibility, given the magnitude of the federal
government. However, agencies are required to determine which of their own documents should be considered federal records and to submit disposition schedules for those records to NARA before disposal. Under the status quo, NARA's role is essentially reactive. It relies on agencies submitting requests for records disposition authority. In principle, NARA espouses a “total system” approach, with a review of the totality of an agency's records to determine which ones merit archival retention. In practice, the evaluation is often piecemeal because NARA cannot compel submission of comprehensive schedules and an agency is free to withdraw an item from a schedule (which often happens when NARA and the submitting agency disagree over disposition). Despite NARA's general responsibility, there is a detour that affects records related to the atmospheric sciences. The Federal Records Act of 1950 [2] established the National Weather Records Center (NWRC, since succeeded by NCDC) as the official depository of all U.S. weather records. At that time records already in the custody of the Archives were returned to the Weather Bureau. Other legislation provides mandates for NCDC to “establish and record the climatic conditions of the United States” [3] and to “authorize activities of processing and publishing data” [4] and for the management and active dissemination of climatological data. [5] NCDC has its special statutory authority because of the great importance of weather and climate records, the great length of time for which those records remain important, the particular expertise with respect to those records found within NOAA, and the lack of that expertise within NARA. Nevertheless, NCDC legally has only temporary custody and NARA remains the ultimate archive. Consistent with this, NCDC must provide disposition schedules for all of its records.
These schedules usually specify that NCDC will maintain legal custody for another 50-60 years, with the understanding that this period might be renewed at expiration. The relationship between NARA and NCDC has gone through periods of both easy cooperation and great strain. At one point, in the early 1960s, to ease the strain between NARA and NOAA, it was proposed that NCDC be made the official archive for weather and climate records. Such legislation still would not eliminate the need for NCDC and NARA to cooperate, since the range of data sets considered relevant for climate continues to grow. The panel finds that it is essential that policies on long-term archiving be fully coordinated between NARA and the operating divisions of the government collecting and utilizing large quantities of atmospheric data, including NOAA, DOD, NASA, and DOE. Cooperation from smaller producers of atmospheric data (such as the BLM, USFS, USGS, CoE, and the BR) will also be useful. Beyond the requirements for coordination among the myriad federal agencies, there is also a need for coordination on an international level. Some U.S. agencies run World Data Centers and are thus responsible for serving as the archive for data from other nations. Some U.S. data are archived in World Data Centers of varying degrees of stability in other nations. Joint research missions with other countries establish national obligations for the United States. In determining the archiving arrangements of the federal agencies, these foreign obligations must be considered.

Custody of Data

Following passage of the Federal Records Act of 1950, the National Archives and Congress recognized the National Weather Records Center of the U.S. Weather Bureau as the appropriate center for operational archiving of weather records.
The evolution of agencies has resulted in that center now being the National Climatic Data Center of NOAA's National Environmental Satellite, Data, and Information Service (NESDIS). NCDC is the operational repository for meteorological and climatic data collected by the United States from land-surface and upper-air stations, marine platforms, buoys and ships, satellites, and other remote sensing devices. Agreements have been signed between NCDC and NARA designating NCDC as the operational archive for weather and climate data. The panel recommends that NARA and NOAA continue their Memorandum of Agreement, whereby NCDC will remain the active archive for weather and climatic data. NARA is not equipped to act as the custodian of most atmospheric data. The volume is too great, the specialized knowledge is not resident with the staff there, the interagency linkages are not in place, and a huge infrastructure similar to the ones that already exist at other agencies would need to be built at NARA. The agencies closest to the data sets and best equipped to deal with them are themselves already struggling with these issues. However, NARA does have great expertise in the issues involved in the long-term storage of data and in the packaging requirements for data to be of value to future users. The panel believes that NARA's role should be supervisory or consulting, to make sure that the agencies that are the actual custodians of data at the working level follow all the relevant federal laws and guidelines in taking care of the data. The panel suggests that scientific information should go into NARA's physical possession only as a last resort. As a basic operating principle, the panel believes that scientific data should be maintained by the agency most knowledgeable about those data as long as there is any regular active use, even if that use is not part of the regular activity of the holding agency. Further, the operational agencies should collect, quality control, analyze, summarize, compact, store, retrieve, and make available the maximum feasible amount of weather and climate data as long as there are any significant demands for their post-operational use. The panel suggests that if a lead agency can be determined for a subject matter, it should take responsibility for coordination of scientific data on that subject, no matter which agency has physical ownership or custody of those data. The panel recognizes, however, that some data sets are largely of interest at the boundaries of disciplines or agency interests and that these may not be properly managed or documented. The panel recommends that, in general, when an agency is the lead agency in a discipline, data in its possession related to that discipline should remain with that agency indefinitely.

[2] P.L. 81-754. [3] 33 U.S.C. 313. [4] 33 U.S.C. 883 B. [5] P.L. 95-357, National Climate Program Act.
This recommendation may require legislative changes for agencies other than NOAA, and even some changes with respect to NOAA. The panel recommends that when an agency collects data incidental to its mission, but those data are not central to its discipline, those data should, when feasible, be transferred to the agency with the lead in that discipline. Notwithstanding this general principle, no agency should be required to accept any data that are not of sufficient quality for archiving. Very large data sets of an interdisciplinary nature cause special problems. For these complex situations, no simple rule will take the place of negotiations between the involved agencies. Such situations are particularly evident with the data sets being collected by NASA under its Mission to Planet Earth and by DOE's ARM program. Despite plans by NASA and DOE to collect large amounts of data and manage them for several years, neither agency has indicated a willingness to ensure the long-term retention of those data. The panel recommends that NOAA and NARA take active cognizance of the impending large amounts of atmospheric-science-related data that will be coming from NASA and DOE missions; arrangements should be negotiated for the long-term archiving of these data. The panel recommends that agencies have a positive obligation to keep their holdings of scientific data in usable condition, even if they are not using them, until agreeing on disposition of those data with NARA or another agency. The panel also recommends that NARA should attempt to accept for indefinite retention all environmental science data that an agency wishes or needs to dispose of and no other agency wishes to maintain, unless convinced on the advice of outside experts that the data are truly useless or unreliable.
In addition, the panel recommends that when an agency loses interest in a scientific data set, or its activity related to a data set is being terminated, it should first ask other agencies that may have an interest, especially any clear lead agency for the subject of the data set, to take possession of those data before looking to NARA as a repository. Implementation of this recommendation may require changes to current legislation. Agencies inform NARA of their intentions for their records, including scientific data, through various schedules. Agencies feel pressured into scheduling records when the records reach about 30 years of age. NCDC even provides schedules for data that it plans to hold indefinitely, noting that intention. For most types of records, the pressure to schedule serves the useful function of preventing an agency from simply warehousing continually increasing volumes of unused records without examination. The panel recommends that there be no requirement for data sets to be “scheduled” with NARA unless and until an agency wishes to dispose of those records. As an exception, it is appropriate for NARA to require scheduling of records for which it provides temporary storage (such as it now provides for some paper records belonging to NCDC). This recommendation is not intended to apply to records of scientific activity that would not be construed as scientific data. The panel recognizes that this recommendation also might require legislation for its implementation. For data that an agency does not wish to destroy, but that are not frequently accessed, NARA sometimes provides storage space without taking ownership. Some infrequently used records are physically housed at the Federal Records Center in East Point, Georgia, under a Memorandum of Agreement between NCDC and NARA. These are considered temporary records, mostly manuscripts awaiting conversion to microform.
NCDC must first provide disposition schedules for any material that is to be stored at NARA facilities. If NARA did not provide
some worthiness test for records before agreeing to provide storage for another agency, the federal records centers would likely become inundated with records of little value. The panel finds it reasonable that NARA be allowed to require disposition schedules for scientific records owned by other agencies for which NARA is providing storage space. The nation is moving toward a system of distributed archives for electronic records. Data sets are distributed among various physical locations, and the expertise to interpret these data sets is likewise already distributed and becoming more so. The rapid growth of computer networks within the United States and the rest of the world is beginning to make a qualitative impact on the way people access information. Thus, there is a lessening need for data users and providers to be in physical possession of the data they need or distribute.

Advising and Information Sharing

With the exception of some staff members at the large data centers, many federal employees (including government scientists) and most non-government scientists are unaware of the requirements of the Federal Records Act. Even some of those entrusted with large quantities of valuable data were largely unaware of NARA and their related responsibilities until contacted in the course of this study. This may be partially because scientists, even those within the federal government, often studiously avoid learning the bureaucratic requirements of their own institutions. The panel is encouraged that NARA is working to address this problem. However, NARA needs to become even more active in publicizing itself, its mission, and the Federal Records Act, in particular its relevance to scientific data.
The panel believes that, since NARA staff as a matter of interest and training are more concerned with long-term archiving issues than most agency staff, NARA can serve an essential role in reminding agencies of the very long-term value of the data they hold. NARA should regularly provide advice to agencies that need to keep scientific data on hand for extended periods of time. Furthermore, NARA should actively promote policies and protocols to assure the future availability and validity of data sets retained by agencies and centers. On the other hand, NARA has almost no scientific expertise within its ranks (except as related to physical records preservation). Despite the large amounts of scientific information within some federal records, NARA officials have indicated that they do not believe they could keep a scientist on the staff interested in the work. Therefore, NARA does not plan to acquire any permanent scientific personnel. Nevertheless, NARA will continue to be faced with difficult issues involving the archiving of scientific data. In the interim, the panel suggests that NARA arrange for temporary seconding of, or regular advising by, active scientific staff of the federal government on a frequent, as-needed basis. Because of the great challenges that NARA will face from scientific data, and the proven ability of other agencies to hold scientifically trained personnel in data management positions, NARA should rethink its position and consider adding personnel with scientific expertise to its permanent staff. NARA has the potential ability to see that the lessons learned on difficult archiving issues in one agency are shared with another. Therefore, communication between NARA and agencies collecting scientific data should be greatly increased on all data management issues, and NARA should serve as a clearinghouse for such information.
NARA should make recommendations on data management issues that will assist in the eventual archiving of data. However, agencies should retain the power to make their own rules for data in their possession.

Directories and Access

To make use of a data set, one must first know that it exists and where to locate it. Various efforts are under way within agencies, and jointly between agencies, to create and maintain the necessary indexes. There should not be many indexes that a user needs to search. The panel recommends that NASA, NOAA, NARA, and the other federal agencies seek the resources needed to populate and maintain the Master Directory of environmental science data. To accomplish this, agencies need to prepare and maintain more detailed information on their own holdings. As a further aid to users, as well as to the agencies themselves, the panel recommends that all agencies actively work to include entries for the smaller local and regional databases in the Master Directory. The directory also should assist agencies in providing mutual backup of data sets and in eliminating unnecessary duplication of identical or similar data sets. Care and thought must go into the indexing of data sets. For example, over 100 data sets were collected during the GALE field experiments. A collection like this should be indexed mainly as a collection first, and then the user
can find out about all of the component data sets if interested. Some of the data sets for GALE are small subsets (in time and space) of much larger data sets that extend for many years. A user who needs to find the larger data set should not be hampered by too many catalog “hits” on small component data sets associated with various projects. Archives should not be viewed as data cemeteries, with only rare and dwindling visits after deposition of data. Access needs to be part of archive construction. The panel recommends that NARA, NOAA, NCAR, and other agencies, such as NASA and DOE, coordinate an efficient and effective plan for mutual access to and dissemination of joint and disparate weather, climate, and other atmospheric data holdings to users and researchers within and outside these (and other) agencies. The panel can envisage a future with transparent on-line access to most scientific data holdings of the federal government, regardless of which agency has physical custody.

Motivation and Incentives

Preparing data for archiving and maintaining archives are viewed by most scientists as activities that are neither very exciting nor well rewarded. The benefits of a job well done are largely gained by others after those preparing the archive have moved on to other things, and the difficulties resulting from failures often occur far in the future. The current trends toward use of data by other than those who collected them, and the increased interest in long-term changes in geophysical processes, have been leading to improvements in the status of those who prepare and manage data sets. Nevertheless, it is likely that data gathering always will be more attractive than data preservation (NRC, 1995).
6 SUMMARY

The increased attention that long time series of data are receiving in the earth sciences in general, and in the atmospheric sciences in particular, is a very encouraging trend. This increased attention is creating a consensus on the importance of archiving issues and the long-term retention of scientific records. Nevertheless, most members of the scientific community still are not very experienced in thinking on very long time scales regarding data retention. Scientific data, and the technical records needed for the interpretation of data, are sufficiently different from other federal records that they merit special treatment. Agencies do not always consider their scientific data, or the scientific data resulting from research they fund, to be federal records. The panel suggests that federal agencies be more inclusive in designating scientific data, either in their possession or created with their funds, as federal records, and in ensuring the long-term survival of those data. There is still a need for the federal government to articulate clearly an integrated national policy covering its obligations and limitations in the retention and archiving of weather, climate, and other atmospheric data. For all environmental data networks and scientific projects funded with federal extramural grants, the funding agencies should evaluate, on a case-by-case basis for all grants, whether the resulting data would be useful as a national resource. If so, the funding contract should require the submission of the resulting data to some federal agency for archiving as federal records. A great impediment to the preservation of data is that archiving often requires large amounts of preparation and packaging of the data, rather than just a physical transfer.
To ease the archiving process, for projects that involve the collection of scientific data, all agencies (especially NOAA for projects involving the atmospheric sciences) should integrate into the initial planning considerations of data storage, metadata preparation, data set indexing, information retrieval, and data archiving. For data sets to be useful in the future, it is essential that they be accompanied by full documentation of the instrumental calibration and precision, site exposures, quality control, compaction techniques, and related information; this information is the metadata for a data set. The metadata file accompanying a data set must contain enough information so that the data can be understood and used effectively after all of the experts who created or worked with them are gone. Archiving will be much easier if agencies prepare data initially so that they can be archived internally or passed on to NARA without significant additional processing later. Agencies may even discover that it is easier for them to use the data operationally if these additional preparation efforts are made when the data are first stored. In particular, metadata and associated manuals should travel with the data sets in the same medium. In cases where low-level data are archived not in their most convenient form, additional files containing descriptions of algorithms or samples of computer software may need to be bundled with the data.
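As a minimal sketch of the kind of self-describing metadata record the panel describes, the following shows calibration, site exposure, quality control, and compaction information bundled in a machine-readable file that travels with the data. Every field name here is a hypothetical illustration, not a prescribed NARA or NOAA schema.

```python
import json

# Hypothetical metadata record; all field names and values are
# illustrative assumptions, not a federal standard.
metadata = {
    "dataset_id": "example-precip-5min",
    "instrumentation": {
        "sensor": "tipping-bucket rain gauge (example)",
        "calibration": "0.254 mm per tip, laboratory-checked annually",
        "precision_mm": 0.254,
    },
    "site_exposure": "open field, no obstructions within 30 m (example)",
    "quality_control": "range and persistence checks applied; flags retained",
    "compaction": "values stored as scaled 16-bit integers (example)",
    "format_description": "see accompanying manual, stored on the same medium",
}

# The record would be written alongside the data on the same medium,
# e.g. as a JSON file, and must round-trip without loss.
serialized = json.dumps(metadata, indent=2)
restored = json.loads(serialized)
```

The point of the round-trip check is that the metadata remain usable by software long after the originating experts are gone.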
Data sets archived for extended periods of time will require migration to new storage media. To make this process easy, safe, and inexpensive, a data set should be organized so that it can be thought of as just a sequence of computer bits that must not be changed. Furthermore, the information about the data should be another sequence of bits that might include format descriptions, text, manuals, and pictures. The process of copying the data set to new media is then simply the task of copying these strings of bits, which can be fully automated. If data are worth archiving, they are usually worth backing up. To reduce costs, agencies should maintain lists of backup copies of important data sets held by other entities and should check the status of these holdings regularly. Furthermore, agencies should exchange brief memoranda with organizations providing backup, listing the data sets held in duplicate and agreeing to notify the other party before discarding one's own copy of the data. Lower-level data that have been transformed only to account for measuring-system characteristics should have a much higher priority for long-term archiving than results derived later. In view of the lessons learned from the historical uses of weather-satellite data, the increasing use of environmental data by numerous groups, and the rapidly increasing volume of the “raw” (level-1) data sets, a “staged” or hierarchical data processing approach should be adopted for all very large data sets, such as satellite and radar data, shortly after collection. This allows a good set of options to be exercised at regular intervals (perhaps every 10 years) regarding long-term retention. Such an approach provides two to four stages of reduced resolution, with respect to space and/or time, for each initial data set.
This reduces the volume accordingly for the regularly used data. Consideration should be given to the long-term archiving of reduced data sets, with only a small additional selection of full-resolution data. Attempts to use best-quality technology, even for huge data sets, could have the perverse effect of forcing people to avoid saving the data. It is better to save large high-resolution data sets on low-quality media, and accept a little loss, than not to archive those data at all. It also would be better to save a deliberately selected sample of a data set than to allow an entire set to become unavailable because of insufficient resources for complete archiving. Though the panel has been unable to develop hard-and-fast rules for retention decisions, it has been able to agree on some priorities for making those decisions. In determining the priorities for long-term archiving, consideration should rely most heavily on: the unusualness or rarity of a data set; the level of quality control; the availability of metadata; the value of the data for long time series; a preference for original data over processed data; the possibility of recreating similar (or better) data; the cost of recreating similar data, if possible; and the cost of archiving. These priorities may not be easy to apply. Agencies should regularly convene groups of outside scientific experts to assist in decisions on the archiving of scientific data and the priorities for making such decisions. Furthermore, if there are significant costs in the continued archiving of data sets, panels of outside scientists should be asked periodically, perhaps every 20 years, to assist the government in deciding which data sets deserve continued support.
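The staged, reduced-resolution approach described above can be sketched as simple block averaging in time, with each stage feeding the next. The reduction factors and the data below are illustrative assumptions, not an agency-prescribed scheme.

```python
def reduce_resolution(series, factor):
    """Average non-overlapping blocks of `factor` samples, yielding a
    coarser time series; a trailing partial block is dropped."""
    n = len(series) // factor
    return [sum(series[i * factor:(i + 1) * factor]) / factor
            for i in range(n)]

# Illustrative: one day of once-per-minute values, reduced in two stages.
minute_data = [float(i % 60) for i in range(1440)]
stage1 = reduce_resolution(minute_data, 10)  # 10-minute means, 144 values
stage2 = reduce_resolution(stage1, 6)        # hourly means, 24 values
```

Each stage preserves the mean of the original record while shrinking the volume tenfold and then sixfold, which is the sense in which the panel's two-to-four-stage hierarchy reduces the burden on the regularly used data.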
As a special case, strong efforts should be made to archive indefinitely all in situ atmospheric science data from outside of the United States that federal agencies (including the military) obtain in the course of their activities, even if it is possible to construe these data as not otherwise being federal records. Very large data sets also present special problems. Periodic review of large data sets, perhaps every 20 years, should be made by scientists to determine the most effective storage strategy, including the possibility of retaining only a sample of a data set. The archiving of only a sample of our largest data sets raises difficult issues in statistics, data management philosophy, and budgeting. A study should be undertaken to evaluate the tradeoffs and problems of archiving for the long term only a sample of our largest data sets in the atmospheric sciences, with an eye toward finding acceptable sampling techniques. A great difficulty in the funding of archives is deciding who pays for long-term retention and access. The users who will benefit most from the archive may not yet anticipate their own need or even yet be born. Along with the initial planning for the collection of data, there should be initial budgeting for data management and preservation. Maintaining archives should be among the agencies' highest priorities because the costs of maintaining and archiving data are small compared to the original data collection and preparation costs. Putting more
money into data management frequently would provide a better return on our investment than collecting more data. On a broad level, the federal government should examine priorities for additional funding of data management and archiving. It is essential that policies on long-term archiving be fully coordinated between NARA and the operating divisions of the government collecting and utilizing large quantities of atmospheric data, including NOAA, NASA, DOD, and DOE. Cooperation from smaller producers of atmospheric data also will be useful. NARA's role should be supervisory or consultative, ensuring that the agencies that are the actual custodians of data at the working level follow all the relevant federal laws and guidelines in taking care of the data. Scientific information should pass into NARA's physical possession only as a last resort. As a basic operating principle, scientific data should be maintained by the agency most knowledgeable about those data as long as there is any regular active use, even if that use is not part of the regular activity of the holding agency. Agencies have a positive obligation to keep their holdings of scientific data in usable condition, even if they are not using them, until agreeing on disposition of those data with NARA or another agency. ACKNOWLEDGMENTS The panel has relied heavily on the varied and rather extensive experience of its members in dealing with atmospheric data, as well as on the numerous studies conducted over the last two decades, studies that are referred to in this report and in which some of the members participated. 
However, the panel wishes to acknowledge the diverse and substantial inputs provided by the following individuals, many of whose observations, analyses, and recommendations have been incorporated into this report: Larry Baume of NARA, Thomas Boden of the Carbon Dioxide Information Analysis Center, Dean Bundy of the Naval Research Laboratory, Donald Collins of the National Aeronautics and Space Administration, Richard Davis of the National Climatic Data Center, P. C. Hariharan of Johns Hopkins University, Trudy Peterson of NARA, Gerald Stokes of Pacific Northwest Laboratories, Kenneth Thibodeau of NARA, and Helen Wood of NOAA. The panel also wishes to thank the National Center for Atmospheric Research for hosting its second and final meeting in October 1993. Finally, the panel is especially grateful to Mark Handel of the National Research Council's Board on Atmospheric Sciences and Climate staff for his extensive and substantive contributions to this report.
BIBLIOGRAPHY
Department of Commerce (DOC). 1976. Proceedings of the Climatological Data Users Workshop, 27-28 April 1976, Asheville, North Carolina, 65pp.
Department of Commerce (DOC). 1979. Report of the Climate Data Management Workshop, 8-11 May 1979, Harpers Ferry, West Virginia, 300pp.
Department of Commerce (DOC). 1988. National Climate Program, Five-Year Plan, 1989-1993, National Climate Program Office, Rockville, MD, 48pp.
Dirks, R. A., J. P. Kuettner, and J. A. Moore. 1988. Genesis of Atlantic Lows Experiment (GALE): An overview. Bull. Amer. Meteorol. Soc. 69(2): 148-160.
Jacobs, Woodrow C. 1947. Wartime developments in applied climatology, Meteorological Monographs 1(1), 52pp.
Meyer, Stephen J. and Kenneth G. Hubbard. 1992. Nonfederal automated weather stations and networks in the United States and Canada: A preliminary survey. Bull. Amer. Meteorol. Soc. 73(4): 449-457.
National Research Council (NRC). 1982. Meeting the Challenge of Climate, National Academy Press, Washington, D.C.
National Research Council (NRC). 1986a. Atmospheric Climate Data: Problems and Promises, National Academy Press, Washington, D.C.
National Research Council (NRC). 1986b. The National Climate Program: Early Achievements and Future Directions, National Academy Press, Washington, D.C.
National Research Council (NRC). 1988. Geophysical Data: Policy Issues, National Academy Press, Washington, D.C.
National Research Council (NRC). 1995. Finding the Forest in the Trees: The Challenge of Combining Diverse Environmental Data, National Academy Press, Washington, D.C.
Rossow, William B. and Robert A. Schiffer. 1991. ISCCP cloud data products. Bull. Amer. Meteorol. Soc. 72(1): 2-20.
U.S. Global Change Research Program (USGCRP). 1991. Policy statements on data management for global change research, report DOE/EP-0001P, Washington, D.C., 8pp.