3
Preliminary Principles and Guidelines

As is explicitly requested in its statement of task, the Committee proposes the following preliminary principles and guidelines to help NOAA begin planning specific archiving strategies for the data streams it currently collects and manages. Each proposed principle is followed by some explanatory text to define the terms used and to put the suggestion in the proper context. The Committee developed this guidance based on its collective experience, the materials it has reviewed to date, and its initial deliberations. It is important to note that these ideas, while phrased as recommendations, are not ranked in order of importance, and are intended to serve as a framework for further discussions with NOAA and NOAA’s user community.

Following additional data gathering, including a user forum, and additional Committee deliberations, a final report will be generated. This report will include an expanded set of principles and guidelines, illustrated with examples, that can be used to identify the observational and generated data that should be preserved indefinitely versus those that require only limited storage lifetimes or can be readily regenerated from archived first-stream input, as well as the degree to which a wide variety of data should be made available. A more extensive discussion of the specific scientific requirements for data access and data stewardship for a range of applications, including climate change detection and analysis, will also be included, as will further discussion of funding issues, both in general and in the context of specific archiving and access strategies for individual data streams.


The environmental and geospatial data collected by NOAA and its partners, including model output, are an invaluable resource that should be archived and made accessible in a form that allows researchers and educators to conduct analyses and generate products necessary to accurately describe the Earth System.


The Earth System is a complex, interactive biogeochemical system that requires a large number of environmental variables for an accurate description. Any data stream, data set, or model output array that contributes to the understanding, prediction, or long term description of the Earth System should be considered for permanent archiving by NOAA. Data considered for archiving need to have sufficient and understood accuracy and temporal and spatial resolution to increase our understanding of the System, improve our characterization and predictability of the System, and/or allow required analyses for determining the past, current, and possible future states of the System along with its past variability.


The decision to archive or continue to archive data or model output should be driven by its current or future value to society. The decision will need to take into account the cost to archive versus the cost to regenerate, as well as the costs of providing access to the data.
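
As a purely illustrative sketch of the cost comparison named in this principle, the following Python fragment contrasts the cumulative cost of archiving and providing access with the expected cost of regeneration. The class `CostEstimate`, its fields, the function names, and the simple linear cost model are all assumptions introduced here for illustration; they are not drawn from NOAA practice or from the Committee's deliberations.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    """Hypothetical cost inputs, all expressed in the same currency unit."""
    storage_per_year: float        # cost to keep the data archived for one year
    access_per_year: float         # cost to provide user access for one year
    regeneration: float            # cost to regenerate the data once
    expected_regenerations: float  # assumed number of regenerations needed


def archive_cost(est: CostEstimate, years: int) -> float:
    """Cumulative cost of archiving and providing access over the period."""
    return (est.storage_per_year + est.access_per_year) * years


def regenerate_cost(est: CostEstimate) -> float:
    """Expected cost of regenerating the data on demand instead of archiving."""
    return est.regeneration * est.expected_regenerations


def cheaper_to_archive(est: CostEstimate, years: int) -> bool:
    """True when archiving costs no more than the expected regeneration."""
    return archive_cost(est, years) <= regenerate_cost(est)


# Illustrative numbers only; they do not describe any real data stream.
model_output = CostEstimate(
    storage_per_year=10.0,
    access_per_year=2.0,
    regeneration=50.0,
    expected_regenerations=1.0,
)
print(cheaper_to_archive(model_output, years=3))   # (10 + 2) * 3 = 36 <= 50 -> True
print(cheaper_to_archive(model_output, years=10))  # (10 + 2) * 10 = 120 > 50 -> False
```

Such a comparison addresses cost alone; under the principle above, the current or future value of the data to society remains the driving consideration.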


The maximum benefit to society ultimately defines the rationale for government data collection (and its funding), which is to support broad government and private sector decision-making abilities. The collection, archiving, and accessing of Earth System data need to support the analyses and products necessary to make these decisions. In general, the cost of archiving and providing access to data represents only a small fraction of the total resources invested in collecting or generating data.⁷ In addition, it is extremely difficult to estimate the present and future value of environmental or geospatial data to society. For instance, recent global atmospheric reanalyses of observations beginning about 1950 have proven to be an important climate assessment archive (Kalnay et al., 1996). It is difficult to appreciate the specific contributions that any data stream or data set makes, or might make, to long term climate monitoring or other environmental research requirements. In addition, data perceived to be of little use in the present could become quite valuable in the future as advancing technology increases our ability to make use of data. As discussed in more detail below, it is essential to actively engage the user community to help make these decisions.


Effective stewardship of the nation’s investments requires preservation of what has taken substantial resources to collect. Funding for Earth System measurements should include sufficient resources to archive and provide ready and easy access to these data for extended periods of time.


In particular, at the outset of undertaking an activity which will generate data or model output, end-to-end data management needs to be planned and budgeted. Many in situ data sets and the highest volume data sets (especially radar and satellite data) have been collected and funded in support of NOAA’s operational missions, with little initial provision made for long term preservation to support Earth System research, weather and climate prediction, and other societal benefits.
Also, in the past, many data sets had little use beyond their operational needs, while others were little used because of their spatial and temporal deficiencies. However, with new data assimilation methods and systems, most meteorological and oceanographic data can now be ingested by numerical models and used, for example, for model initialization and verification. These circumstances emphasize the vital importance of establishing an enduring, long term archive of environmental and geospatial data that is supported by the resources necessary to effectively meet these requirements, including resources for hardware and data managers. Assuring adequate and sustained levels of funding to archive and provide access to data remains a major ongoing challenge.

⁷ As an illustration, in NOAA’s FY2007 budget request, $994 million is requested just for satellite acquisition and satellite observing services, while only $51 million is requested for all of NOAA’s Data Centers and information services.


All data that are well documented, are of known quality, and represent systematic collections or characterizations of the state of the environment should be archived in their most primitive useful form.


Several of the considerations noted in the preceding section support this principle. First, observations of the state of the environment are generally expensive to obtain, and these costs far exceed archival costs. Second, it is impossible to anticipate all future applications of a data set, so a data set of uncertain present value may provide the key to some future scientific issue. Third, well calibrated data sets, or at least data sets with well defined error characteristics, are essential to long term climate monitoring and many other Earth System research requirements; this requirement implies a commitment to maintain records of successive improvements, recalibrations, reprocessing, or other changes (including, in the case of model output, records of changes to the model), in addition to maintaining an archive of the original first-stream data. Finally, Earth System observations are unique; it is usually not possible to go back in time and resample, although if proper care is taken to document the data it is often possible to regenerate them from archived first-stream input (especially in the case of model output).

Original Data are one of the seven types of data defined by NOAA (Appendix C), and represent the most obvious data type to consider for long-term archiving based on the principle above. Some of the other data types defined by NOAA are also of long-lasting value and deserving of consideration for permanent archiving (e.g., certain Synthesized Products and Experimental Products). In addition, certain types of at-risk data, such as data on deteriorating or substandard media, are critical to NOAA’s mission. NOAA Data Centers should be encouraged to be proactive and, where appropriate, obtain and archive these data. To make the most judicious data rescue decisions, it would be useful to obtain recommendations from an advisory panel made up of data users, as discussed in more detail below.


Decisions not to archive data permanently should only occur when the original and predicted purpose of the data has been satisfied, or when the cost of storing the data exceeds the cost of regeneration, and should be made in collaboration with the appropriate user communities.


Since estimating the present and future value of data is extremely difficult, it may be of greater use to first define those data that clearly have only short-term uses. For example, derived analyses or products for specific decisions may have little value after the decision has been made, and could be reproduced if necessary.
These data could include long range and intermediate model and forecast output and/or high volume data used to generate specified parameters, such as the sub-second wind or radar returns used to produce several-minute averages for operational reporting purposes. Data collected for specific short term research (e.g., Experimental Data, as described in Appendix C) or near term operational requirements may also fit some of these criteria and therefore not meet NOAA’s mission requirements for archiving. As discussed in more detail below, it is essential to actively engage the user community to help make these difficult decisions.


Metadata that completely document and describe archived data should be created and preserved to ensure the enhancement of knowledge for scientific and societal benefit.


Metadata are all the information necessary for data to be independently understood by users, to ensure proper stewardship of the data, and to allow for future discovery.⁸ Metadata are essential for effective archive management throughout the entire data lifecycle, and are the essential data management component that makes an archive useful for data discovery and integration, through data mining and other techniques. Where possible and practical, it is advisable to use established metadata standards. As necessary, these standards should be further developed, in coordination with other federal agencies and international entities, to take advantage of current and future technologies needed for data search, mining, discovery, and integration capabilities. Effective metadata management will help meet the challenge of the increase in data volumes, enable better integration of information across data sources and disciplines, and improve understanding and usability of the data. As part of this emphasis, data systems should be designed to share metadata and build catalogs to enable data discovery across systems, disciplines, and programs. Further, since data systems will evolve to incorporate new information and to take advantage of technological improvements, the data system philosophy needs to account for the reality that metadata continually evolve, expand, and mature. This necessitates the use of existing standards and the adoption of new ones, when appropriate, to better facilitate services and integration across data sources and disciplines; as such, metadata management will need to expand beyond mandatory requirements. For example, information should include context (relation to other information, appropriate application, and limitations) and information necessary for tools and interoperable services.

⁸ See the information package definition on page 2-5 of the Open Archival Information System reference model (Consultative Committee for Space Data Systems, 2002) for details.


NOAA’s archival process should be designed to allow the integrated exploitation of data from multiple sources to answer environmental questions and support the total life-cycle aspects of individual data sets. This could potentially be accomplished through a distributed but federated archival system facilitated via a single user portal.


NOAA’s mission represents a wide breadth of responsibilities and disciplines related to understanding, describing, and predicting various aspects of the highly complex Earth System. This mission requires the integration and synthesis of disparate data holdings both within and outside NOAA and, often, assimilation of multiple sets of environmental variables. NOAA’s many data streams reflect this diversity, as does the system-of-systems architecture of its observing mission. NOAA’s archival activities should recognize and reflect this same diversity in data sources and needs of its users.
In addition, the data archival process needs to be rooted in an evolutionary framework, and to be flexible, reliable, extensible, and scalable in order to accommodate the increasing volume and complexity of data holdings and the evolving needs of NOAA’s customers. NOAA therefore needs to develop a scalable, extensible, and reliable infrastructure to ensure long-term access to and preservation of digital assets.

Given the widely distributed nature of its data activities and holdings, NOAA could consider its archival process as part of a federation of distributed data sources and archival delivery partners. One particularly promising framework is a decentralized approach to archiving and data provision, enabled by a centralized corporate-level portal that facilitates discovery and access of integrated data sets tailored to user needs. This portal concept does not preclude the existence of other portals, but would allow NOAA to facilitate a “one-stop-shopping” capability and a more recognizable web presence to current and potential users. Within this framework, data could be made available and discoverable to all users in reliable, standardized formats for easy use and integration with other related data. A coordinated integration process would also allow NOAA to easily collaborate with non-NOAA data holders for the benefit of users. Since NOAA’s internal data stewards include many data managers, centers of data, and Data Centers, the integration of these various centers’ data holdings and the migration of data through the chain from individual researchers to centers of data to formal Data Centers need to be recognized and more explicitly formalized.⁹

⁹ According to NOAA Administrative Order 212-15, “Centers of Data transfer their data holdings to the NOAA National Data Centers for permanent archiving when continued storage at the Center of Data is no longer appropriate.”
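
The federated arrangement described above can be sketched, under stated assumptions, as a set of independently maintained catalogs queried through one portal. The classes `MetadataRecord`, `Catalog`, and `Portal`, the keyword-matching search, and the holder names `center_a` and `center_b` are hypothetical and introduced only for illustration; an operational system would rely on established metadata standards and network services rather than in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataRecord:
    """Minimal, hypothetical metadata record; real standards carry far more."""
    dataset_id: str
    title: str
    holder: str  # the archive that physically holds the data
    keywords: frozenset = field(default_factory=frozenset)


class Catalog:
    """One archive's searchable catalog of metadata records."""

    def __init__(self, holder: str):
        self.holder = holder
        self._records = []

    def add(self, dataset_id: str, title: str, keywords) -> None:
        # Keywords are normalized to lower case so searches are case-insensitive.
        normalized = frozenset(k.lower() for k in keywords)
        self._records.append(
            MetadataRecord(dataset_id, title, self.holder, normalized)
        )

    def search(self, keyword: str) -> list:
        key = keyword.lower()
        return [r for r in self._records if key in r.keywords]


class Portal:
    """Single entry point that forwards a query to every federated catalog."""

    def __init__(self, catalogs):
        self.catalogs = list(catalogs)

    def search(self, keyword: str) -> list:
        results = []
        for catalog in self.catalogs:
            results.extend(catalog.search(keyword))
        return results


# Two hypothetical archives, each keeping its own holdings and catalog.
center_a = Catalog("center_a")
center_a.add("a-1", "Radar reflectivity", ["radar", "precipitation"])
center_b = Catalog("center_b")
center_b.add("b-1", "Satellite precipitation estimate", ["satellite", "Precipitation"])

portal = Portal([center_a, center_b])
for record in portal.search("precipitation"):
    print(record.holder, record.dataset_id, record.title)
```

In this sketch the data stay with their holders; only the metadata are shared, which is what lets a single portal discover data sets across archives.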

Broad community representation is essential to establish the process whereby data proposed for archiving can be evaluated and prioritized in terms of scientific and societal benefits.


It is difficult to implement archiving principles that can be applied to all of the diverse data types managed by NOAA. Thus, user input is critical. For example, for data sampled at very high temporal, spatial, and/or spectral resolutions, advice from relevant users could be sought regarding what level and at what resolution the data should be archived. For instance, NEXRAD (Next Generation Weather Radar) Level II data from Alaska, Hawaii, and a few other places are not currently archived. If present funding levels and technology do not support the archiving of all Level II data from the WSR-88D radars in the United States (and definitely not the voluminous Level I data), it is not clear if it will be possible to archive all channels of GOES-NEXT (Geostationary Operational Environmental Satellites-Next Generation) output. Important decisions on what types and amounts of data to archive should be made with the broadest possible cross section of inputs. Data users can also help identify at-risk data streams.

Other federal agencies, non-federal government agencies, and non-government organizations are heavily involved in collection and archiving of data that are relevant to NOAA’s mission. Therefore, NOAA will need to coordinate with these partners to establish the criteria for archiving and providing access to these data at NOAA versus at other organizations. The criteria for archiving by NOAA versus other organizations will need to be agreed to among the governmental agencies supporting the collection, since all data produced by governmental resources should be considered for retention and archival.
NOAA will also need to work with international partners through organizations such as the Group on Earth Observations, Committee on Earth Observation Satellites, International Council for Science, World Meteorological Organization, Intergovernmental Oceanographic Commission, and International Hydrographic Organization to develop internationally agreed-upon standards and protocols to ensure that key data sets can be accessed and exchanged.

The Committee’s final report will discuss in greater depth the factors leading to the decision whether to archive data or not, as well as the potential consequences and tradeoffs of these decisions, in addition to who might be in the best position to make them.


Scientific data stewardship should be applied to all archived information so it is preserved, continually accessible, and can be supplemented with additional data as discoveries build understanding and knowledge.


Stewardship is vital to maintaining a long-term archive. It includes preserving archive integrity, assuring accurate media and format migration, maintaining data access¹⁰ and integrity during technology and software evolution, and enhancing the archive by adding information that is established throughout the data lifecycle. Archived information encompasses individual data sets and all their associated metadata. Ideally, metadata fully describe and document the data as well as the relationships between collections within a NOAA center and other collections at other centers.

¹⁰ Expert stewardship practices enable improved information access. Access functions are not discussed here, but are critical and a topic that will be addressed in the Committee’s final report.

The primary stewardship activities are to:

• Assure data backup strategies that guard against any loss of irreplaceable data (e.g., Original or Synthesized Product data types that cannot be easily and economically regenerated).
• Assure that when the data are migrated to new media or translated to another form (changing the original data format), no information is lost and recovery from errors is always possible.
• Assure that information integrity and access are not compromised during software and technology evolution.
• Assure that data management systems can support the growth of archived information that occurs during the data lifecycle. During a data lifecycle the archived information may be enhanced by reprocessing, by error correction, and by the addition of supplemental data, quality control assessments, and other metadata derived from the scientific research and knowledge building processes.
• Maintain close interactions with the scientific community and evolving user base to capture information about data use and limitations.
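
One common way to check that a media migration has lost no information is to compare checksums computed before and after the copy. The sketch below assumes a file-based archive; the function names and the choice of SHA-256 are illustrative assumptions, not practices described by the Committee. It covers byte-for-byte migration only: translation to a new data format would require comparing decoded content rather than raw bytes.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_with_verification(source: Path, destination: Path) -> str:
    """Copy source to destination and verify that the copy is bit-identical.

    The source file is never modified or deleted, so recovery from a failed
    migration is always possible by repeating the copy.
    """
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        destination.unlink()  # discard the bad copy; the original is intact
        raise OSError(f"checksum mismatch migrating {source} -> {destination}")
    return expected  # worth recording in the metadata for later integrity checks
```

Recording the returned digest alongside the other metadata would allow the same check to be repeated at each later migration.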