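Issues in the Integration of Research and Operational Satellite Systems for Climate Research: II. Implementation

4
Data Systems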

INTRODUCTION

The National Polar-orbiting Operational Environmental Satellite System (NPOESS) offers unique opportunities for the climate research community. The next-generation sensors flying on a continuous series of NPOESS orbiting platforms for some 20 years beginning in approximately 2000 are expected to produce data of unprecedented quality and coverage. However, to capitalize on this opportunity, the research community and government agencies will need to develop data processing and archiving systems that can enable the huge volumes of raw data (approximately 1 terabyte/day) coming from NPOESS to be stored and converted into useful scientific products and information (see NRC, 2000a).

Currently the NPOESS procurement process is focusing on operational needs. The NPOESS system contractor is being asked to provide a data system that will produce raw data records, sensor data records, and environmental data records for all NPOESS sensors (see Table 4.1). This operational data system will be installed at various Department of Defense and National Oceanic and Atmospheric Administration (NOAA) centers. An important requirement for the data system is timeliness (processing the data from one orbit in 20 minutes). Archiving the various products is not a requirement; neither are the many other attributes associated with a research data system. For example, in its phase one report (NRC, 2000b), this committee concluded that the operational environmental data records will not meet all the needs for climate research and that access to the unprocessed sensor-level data will be required.

The current focus of NPOESS on operational needs is certainly prudent from a programmatic standpoint, given the enormity of the space hardware segment and the very high data rates; imposing the many additional requirements associated with climate research on the current NPOESS operational system is probably impractical. Instead, the committee sees the need for an autonomous infrastructure of data systems focusing on climate research rather than operational needs.
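
To give a sense of scale, the figures quoted above (about 1 terabyte/day, some 20 years of operation, one orbit processed in 20 minutes) can be combined in a short calculation. The sketch below is illustrative only: the decimal terabyte, the 365.25-day year, and the value of 14 orbits per day (typical of a polar orbiter with a period near 100 minutes) are assumptions, not numbers taken from the NPOESS requirements.

```python
# Back-of-the-envelope sizing from the figures quoted in the text.
TB = 10**12                          # bytes per terabyte (decimal convention; an assumption)
DAILY_VOLUME_BYTES = 1 * TB          # "approximately 1 terabyte/day" of raw data
MISSION_YEARS = 20                   # "some 20 years" of operation
ORBIT_PROCESSING_SECONDS = 20 * 60   # one orbit processed in 20 minutes
ORBITS_PER_DAY = 14                  # assumption: ~100-minute polar orbit


def archive_volume_bytes(daily_bytes, years, days_per_year=365.25):
    """Total raw volume accumulated over the mission, ignoring compression."""
    return daily_bytes * days_per_year * years


def required_throughput_bytes_per_s(daily_bytes, orbits_per_day, seconds_per_orbit_job):
    """Sustained input rate needed to work through one orbit in the allotted time."""
    bytes_per_orbit = daily_bytes / orbits_per_day
    return bytes_per_orbit / seconds_per_orbit_job


total = archive_volume_bytes(DAILY_VOLUME_BYTES, MISSION_YEARS)
rate = required_throughput_bytes_per_s(
    DAILY_VOLUME_BYTES, ORBITS_PER_DAY, ORBIT_PROCESSING_SECONDS
)

print(f"raw archive after {MISSION_YEARS} years: {total / 10**15:.1f} PB")
print(f"rate needed to process one orbit in 20 min: {rate / 10**6:.0f} MB/s")
```

Under these assumptions the raw archive grows to roughly 7.3 petabytes over 20 years, and the operational system must sustain about 60 MB/s simply to read one orbit of raw data within the 20-minute window.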

The development of an NPOESS climate data system (NCDS) represents a significant challenge requiring planning, revision, hard work, and adequate funding. Care will be needed to ensure that the design and specifications for the data system are given a broad review prior to their implementation. Calibration and validation, data provision, data product continuity, data archiving, archive access, reprocessing, and cost will need to be given special attention for the climate research community. The NPOESS Preparatory Project (NPP) will provide an early test of the instruments and data system. A joint activity of the National Aeronautics and Space Administration (NASA) and the Integrated Program Office (IPO), the NPP will provide an opportunity to benefit from the progress NASA is making in data system development.

In this chapter, the committee first establishes the need for an NCDS separate from the operational system. Some essential elements of an NCDS are then discussed. These elements include a long-term archive for the lowest-level NPOESS data (raw data records or sensor data records) and a system architecture in which science teams are primarily responsible for the development of the algorithms needed to generate geophysical products, which are then archived and distributed by innovative data centers. The need for the NCDS to accommodate algorithm and sensor evolution, reprocessing, and multiple versions of data sets is also described. Finally, the importance of innovation and competition in the emerging NCDS is stressed.

TABLE 4.1 Data Set Processing Levels

| Data Level (NASA/NOAA) | Description |
|---|---|
| Level 1A: Raw data records | Reconstructed unprocessed instrument or payload data at full resolution, time referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters (i.e., platform ephemeris and orientation) computed and appended but not applied to the Level 0 data |
| Level 1B: Sensor data records | Level 1A data that have been processed to sensor units |
| Level 2: Environmental data records | Derived geophysical variables at the same resolution and location as the Level 1 source data |

OPERATIONAL VERSUS RESEARCH NEEDS

Operational processing and research have different requirements. Operational processing, as it is now envisioned for NPOESS, will be done at a number of centralized sites, with each site using a common data processing system provided by the NPOESS prime contractor.
The emphasis will be on generating products very quickly for weather applications. The centralized, no-archiving, one-time-processing architecture of the operation centers is totally different from that required by the research community, in which scientific algorithms are continuously evolving and reprocessing is routine. For research, the requirement of timeliness can be relaxed, thereby allowing for the implementation of complex algorithms using diverse ancillary data. As understanding of sensor calibration issues and radiative transfer from Earth improves, algorithms can be improved, and better products can be generated via reprocessing.

To be more specific, NCDS has the following basic requirements over and above what is needed for operational processing:

• A long-term archiving system that can fully support the needs of the climate research community. This entails easy, affordable, and timely access for a large number of scientists in many different fields. The data must be supported by metadata that carefully document sensor performance history and data processing algorithms.
• The ability to reprocess large data sets as understanding of sensor performance, algorithms, and Earth science improves. Examples of new information that would warrant reprocessing are the detection of sensor calibration drift, the availability of better ancillary data sets or better geophysical models, and the discovery of errors in previous processing.
• Use of standard formats and interfaces, so that researchers, data producers, and archives can be closely linked. Research data systems tend to be less centralized and more distributed than operational processing systems.
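
The requirements above imply that every archived product should carry enough provenance to decide later whether it needs reprocessing. The following is a minimal sketch of that idea, not a design from the report: the `ProductMetadata` record, its field names, the `reprocessing_reasons` helper, and the identifiers in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProductMetadata:
    """Hypothetical provenance record stored alongside one product version."""
    product: str
    version: int
    algorithm: str                       # identifier of the processing algorithm
    calibration: str                     # identifier of the sensor calibration history applied
    ancillary: Tuple[str, ...] = ()      # identifiers of the ancillary data sets used
    known_errors: Tuple[str, ...] = ()   # processing errors reported after release


def reprocessing_reasons(current: ProductMetadata,
                         best_available: ProductMetadata) -> List[str]:
    """Return which of the reprocessing triggers named in the text apply."""
    reasons = []
    if current.calibration != best_available.calibration:
        reasons.append("updated sensor calibration")
    if set(current.ancillary) != set(best_available.ancillary):
        reasons.append("better ancillary data")
    if current.algorithm != best_available.algorithm:
        reasons.append("improved algorithm or geophysical model")
    if current.known_errors:
        reasons.append("errors in previous processing")
    return reasons


# Illustrative values only; none of these identifiers come from NPOESS.
archived = ProductMetadata(
    "sea_surface_temperature", 1, "sst-alg-1.0", "cal-2009-01", ("winds-v1",)
)
candidate = ProductMetadata(
    "sea_surface_temperature", 2, "sst-alg-1.0", "cal-2010-06", ("winds-v2",)
)
print(reprocessing_reasons(archived, candidate))
```

Keeping the record immutable and comparing it field by field is one simple way to make the decision auditable; a real archive would likely need richer descriptions of sensor performance history than a single identifier.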

Another element of an NCDS that is missing from the operational system is the selection of science teams by an open, peer-reviewed process. Good science is the key to generating reliable climate products, and peer-reviewed science teams are essential in this endeavor. In contrast, the operational algorithms for NPOESS are being developed by the sensor contractors. This may ensure that operational algorithms are ready at launch, but there is no assurance that the algorithms will enjoy the consensus of the scientific community.

Daily weather prediction and long-term climate monitoring have different requirements, and from both a cost and research perspective, the committee thinks it would be a mistake to burden the NPOESS operational weather prediction system with all the additional requirements associated with climate research. Rather, it encourages the research community and government agencies to take the initiative and begin planning for an NCDS. To begin this planning exercise, some of the essential elements of an NCDS are pointed out, along with some of the more important issues that need addressing.

A recurring theme in the committee’s phase one report was the need to facilitate comparisons among data sets from different instruments and different time periods, to (1) permit periodic reprocessing of data sets, (2) preserve long-term data collections for climate monitoring, and (3) allow examination of issues not always anticipated at the time the data were acquired (NRC, 2000b).

LONG-TERM ARCHIVING OF RAW DATA RECORDS

It is in the national interest to have a climate data system to improve the likelihood of achieving a scientific understanding of climate change. One essential element of an NCDS is a long-term archive of the raw data records. (A possible alternative would be to archive the sensor data records if the sensor data records are reversible to raw data records.) As mentioned above, data archiving is currently not part of the operational NPOESS data system. Given the extremely high data rates (~1 terabyte per day) anticipated for NPP/NPOESS and the large number of diverse users that will be accessing the data, a raw data records archive is in itself an enormous undertaking. It is essential that archiving the raw data records be addressed and responsibilities assigned as soon as possible.

A U.S. Global Change Research Program report (USGCRP, 1999) discussed the elements required for the long-term archive component of the data system and emphasized the following:

• A long-term archive should be established and operated in the simplest way possible to meet user needs and program goals.
• A long-term archive is not only for today’s generation of users but also for the next generation of scientists and citizens whose needs have yet to be expressed but must be provided for.

Specifically, the report states that the long-term archive must ensure that archived data sets and products are accompanied by complete, comprehensive, and accurate documentation. Its recommendation for simplicity and longevity is particularly pertinent to the NPOESS raw data records archive.

Because the raw data records archive will be a major interface between the operational and research communities, early discussions are needed between NASA, NOAA, and the IPO to ensure that the NPOESS operational data system meets the archiving requirements of the research community. A conceptual study of the raw data records long-term archive should be initiated in the near future. This study should address issues such as the following:

• What government agency (or agencies) will be responsible for funding the long-term archive?
• Which agencies or organizations will maintain the long-term archive (existing NOAA and NASA data centers or industry via a request-for-proposals procurement like NPOESS)?
• Will the long-term archive be located at one central site or will it be distributed (different products at different sites)?
• What data and metadata need to be saved for climate research?
• How long will the data be saved and maintained? What is the procedure for deciding when a data set is no longer useful?
• How will users access the data? Will users be charged for data access? Will the costs impede the development and analysis of long-time-series data sets?
• Should the long-term archive have advanced features such as subsetting and data mining or should it be kept as simple as possible? Should it be designed to accommodate advanced features later on (e.g., by using formats that can later support subsetting and data mining)?
• What innovative software solutions are available to reduce cost and increase functionality (e.g., no-loss compression, geolocation computed on demand, elimination of redundancy, etc.)?
• What innovative hardware solutions are available to reduce cost and increase functionality (e.g., data storage devices in 2010)?

ARCHITECTURE FOR THE NPOESS CLIMATE DATA SYSTEM

The two basic tasks of an NCDS are (1) data production (i.e., converting the raw data records or sensor data records to science products) and (2) archiving and distribution of these science products. Defining an optimum architecture for performing these two tasks for a system as large as NPOESS will be difficult. NASA has been struggling with this problem for a number of years.

Early on, NASA’s Earth Observing System Data and Information System (EOSDIS) concept involved a small number of centralized distributed active archive centers (DAACs) that would in effect be all things to all people. The EOSDIS architecture consisted of a relatively small number of physically distributed sites that were organized and run in a centrally controlled, hierarchical manner. EOSDIS was designed to manage data from NASA’s Earth science research satellites and field measurement programs, providing data archiving, distribution, and information management services.
Through the EOSDIS Core System contractor, EOSDIS provided the necessary hardware and software to the DAACs to capture, process, and distribute data from the EOS satellites. Each DAAC was responsible for archiving and managing data in a given scientific discipline (Table 4.2).

TABLE 4.2 NASA’s Distributed Active Archive Centers

| DAAC | Host Institution | Scientific Specialty | Terra Instruments |
|---|---|---|---|
| ASF | Alaska SAR Facility, University of Alaska | Sea ice, polar processes | none |
| EDC | EROS Data Center, U.S. Geological Survey | Land processes | ASTER, MODIS |
| GSFC | Goddard Space Flight Center, NASA | Upper atmosphere, atmospheric dynamics, global biosphere, hydrologic processes | TRMM, MODIS |
| LaRC | Langley Research Center, NASA | Radiation budget, aerosols, tropospheric chemistry | CERES, TRMM, MISR, MOPITT |
| NSIDC | National Snow and Ice Data Center, University of Colorado | Snow and ice, cryosphere | MODIS |
| ORNL | Oak Ridge National Laboratory, Department of Energy | Biogeochemical fluxes and processes | none |
| PO.DAAC | Jet Propulsion Laboratory, NASA-Caltech | Ocean circulation, air-sea interactions | none |
| SEDAC | CIESIN, Columbia University | Socioeconomic data and applications | none |

Site visit reports of the seven DAACs by National Research Council (NRC) review panels concluded that most DAACs are serving the user community quite well, although several DAACs have management problems and poor records with the user community (NRC, 1998). Although individual DAACs support their discipline scientists, the objective of an overall seamless coordination of the DAACs into the EOSDIS has not been realized. The heavily centralized architecture of EOSDIS proved difficult to manage and built an unnecessarily high wall between the Earth scientists and computer scientists so that operating within the original time lines became impossible.

In reaction to these and other problems coming from EOSDIS (see NRC, 1998), NASA began experimenting with a much more distributed architecture, called the Earth Science Information Partnerships (ESIPs). This was an experiment to develop a federated approach to the provision of data and data services. In the federated concept, data processing responsibilities were given back to the principal investigators (PIs). This PI-driven model involves a large number of loosely related groups. The advantage of this model is that smaller organizations are more easily managed, and the PI has more control over the generation, distribution, and quality of the science products. The disadvantage is that the architecture of many independent groups generating various products may lack the cohesion necessary to provide the Earth science community with an integrated and consistent set of Earth science products (Matt Schwaller, NASA, personal communication, 1999). It is too early to assess the success of the ESIPs. Meanwhile, NASA is currently studying the trade-offs between centralized and PI-driven data systems models and is moving toward a new data system architecture called NewDISS.

An important lesson to be learned from this experience is that the production of research climate data sets from the raw data records is mostly a scientific problem. The scientists best understand the problems associated with the retrieval process. They are the ones who develop the methods, techniques, and algorithms for processing the data, and they should be an integral part of the data production if it is to be successful. Accordingly, the committee recommends that science teams, selected by peer review, play the central role in producing climate data records. A good paradigm for the NPOESS science teams is the NASA science team model that uses NASA research announcements to solicit proposals from the community at large. In this way, climate data sets will be produced in a peer-reviewed, competitive environment. In many cases, the science team will have the capacity to do its own data processing. In addition, the quality and consistency of climate data records can be further ensured by setting up some type of oversight mechanism that routinely reviews the status and progress of the climate research teams. In other scenarios, the data processing can be turned over to a data center, but the science team will still be the responsible entity.

One potential problem with having the science team be responsible for the generation of climate data records is that the team may be reluctant to release the data until they feel the quality is sufficient for general release. In so doing, the team can maintain exclusive use of the data for an extended period. To alleviate such a concern, the raw data records should be made available to any interested investigator, who may then implement his or her own approach to generating climate data records. The committee notes that raw data records are produced on a near-real-time basis (3 hours) as part of the NPOESS operational processing. Therefore, it should be possible to make them available to investigators (through the raw data records long-term archive) within 1 or 2 days after the observation. A further safeguard is the opportunity during a peer review cycle to change the membership of a science team that is not fulfilling its obligations to provide algorithms and produce data. However, the committee doubts that data hoarding will be a problem in this very competitive age. Most scientists are more than eager to have others use their data products, and near-real-time access is becoming the norm.

Archiving and distribution of climate data records are distinctly different from data product generation. Optimum archiving and distribution are related more to computer science than to physical science. Many of the comments above about a raw data records archive apply to archiving the climate data records, except that the climate data records encompass a much more diverse set of data than do the raw data records.
Climate data records will be used throughout the world for a large variety of Earth science applications. In view of this, advanced archival features will be highly desirable, including capabilities for data mining, subsetting, and provision of products on demand. Many of these archival functions are dependent on the type of data sets and applications being considered. This suggests that a distributed set of government, university, and commercial data centers, both large and small, each specializing in a particular type of geophysical product and application, may be advantageous. To facilitate interoperability among the various data centers and users, standards relating to formats, interfaces, and protocols need to be established. In addition, the climate data products have to be provided in a timely manner with no unnecessary delays. Again, innovation will be critical to success. These data centers should be selected in a competitive environment to encourage innovative and cost-effective solutions.

EVOLUTION, REPROCESSING, AND MULTIPLE VERSIONS OF DATA SETS

An NCDS should be designed to fully accommodate our ever-increasing understanding of Earth as well as the satellite sensors used to observe it. Geophysical retrieval algorithms are in a constant state of evolution: incorporating more realistic radiative transfer models, relying on more ancillary data sets (perhaps from non-NOAA or even non-U.S. data centers), and using more complex and accurate retrieval techniques. The ability to detect small sensor drifts and other systematic measurement errors is improving dramatically as researchers learn the subtleties of the sensor response functions and perform on-orbit intersensor comparisons. Numerous reprocessings are unavoidable and in fact desirable in that they improve the quality of the data products. Multiple versions of the same data sets, coming from different investigators using competitive algorithms, are extremely valuable in understanding the errors and uncertainties associated with retrieval.

Incorporation into an NCDS of these requirements for evolution, reprocessing, and multiple versions is another significant challenge. These requirements are completely different from those imposed on an operational system that relies on stability and rigid control of change.

One obvious aspect of an evolving data system is the necessity for ample metadata. All data sets need to be accompanied by comprehensive metadata precisely describing the algorithms used to process the data, the relevant sensor characteristics, and all the ancillary data sets used in successive data generation. Other aspects of an NCDS are not as obvious. If reprocessing is done too frequently or there are too many versions of a particular data set, then the NCDS will be overburdened and the users will be confused. The trade-off between improving data quality and producing too many versions can be a difficult one. There could also be competing interests among the science teams, the data producer (if different from the science team), and the data archive and distribution center.
For example, the science team may want to do a reprocessing but the data producer or archive may not have the resources to support it. Such issues are particularly problematic because it is difficult to predict beforehand the amount of reprocessing that will be required. Nonetheless, they will have to be resolved if an NCDS is to have adequate flexibility and adaptability.

EXISTING NASA AND NOAA DATA CENTERS

The NCDS can be built in part on the existing NASA DAACs and ESIPs and the NOAA data centers (see Tables 4.2 and 4.3). The two approaches taken by NASA and NOAA to data management and archiving are in many ways complementary. The focus of NASA has been on providing data for scientific research. In the NASA nonoperational environment, evolving algorithms and retrospective reprocessing are common, and the data archives often contain multiple versions of the same products coming from different investigators. In contrast, the NOAA data centers have a wider array of responsibilities, ranging from delivery of operational weather data to the National Weather Service, to the analysis and archiving of weather and climate data, physical oceanography data collected by ships and satellites, coastal observations, solar-terrestrial observations, and data related to glaciology, and even to marine geology and geophysics. However, rather than simply expanding and continuing the present modes of operation at the NASA and NOAA data centers, it is necessary to review the strengths and weaknesses of past performance with the objective of developing better approaches for handling the NPOESS data.

CONCLUSION

The planned NPOESS climate data system would benefit from adopting the best elements of the current NASA and NOAA data systems. However, it will not be enough to simply expand existing facilities. A successful NCDS will also require a new vision in which innovation and competition play a central role. Observations of Earth will increase by an order of magnitude when NPOESS begins operation. Realizing the potential increase in scientific understanding will require converting the huge volumes of raw data to usable products and information. The responsibility for doing this should be given to those groups and organizations that demonstrate the vision, innovation, and expertise needed to meet the NPOESS challenge.

TABLE 4.3 NOAA Data Centers

| Data Center | Host Institution and Location | Specialty |
|---|---|---|
| NCDC | National Climatic Data Center, Asheville, NC | Climate of United States; archive of weather data |
| NGDC | National Geophysical Data Center, Boulder, CO | DMSP satellite archive; glaciology; World Data Center-A for marine geology and geophysics; paleoclimatology; solar-terrestrial physics; solid Earth geophysics |
| NODC | National Oceanographic Data Center, Silver Spring, MD | Coastal oceanography; ocean climate; biological oceanography |
| NSIDC | National Snow and Ice Data Center, University of Colorado | Snow and ice, cryosphere; World Data Center-A for glaciology |

RECOMMENDATIONS

The committee recommends meeting the following data-systems requirements in addition to what is planned for operational data processing:

• A long-term archiving system is needed that provides easy and affordable access for a large number of scientists in many different fields.
• Data should be supported by metadata that carefully document sensor performance history and data processing algorithms.
• The system should have the ability to reprocess large data sets as understanding of sensor performance, algorithms, and Earth science improves.
• Science teams responsible for algorithm development, data set continuity, and calibration and validation should be selected via an open, peer-reviewed process (in contrast to the operational integrated data processing system and algorithms, which are being developed by sensor contractors for NPOESS).
• The research community and government agencies should take the initiative and begin planning for a research-oriented NCDS and the associated science participation.

REFERENCES

Integrated Program Office (IPO), National Polar-orbiting Operational Environmental Satellite System (NPOESS). 1996. Integrated Operational Requirements Document (First Version) (IORD-1) 1996. Issued by Office of Primary Responsibility: Joint Agency Requirements Group (JARG) Administrators, March 28. The updated IORD and other documents related to NPOESS are available online at <http://npoesslib.ipo.noaa.gov/ElectLib.htm>.

National Research Council (NRC), Committee on Geophysical and Environmental Data. 1998. Review of NASA’s Distributed Active Archive Centers. National Academy Press, Washington, D.C.

National Research Council (NRC), Space Studies Board. 2000a. Ensuring the Climate Record from the NPP and NPOESS Meteorological Satellites. National Academy Press, Washington, D.C.

National Research Council (NRC), Space Studies Board. 2000b. Issues in the Integration of Research and Operational Satellite Systems for Climate Research: I. Science and Design. National Academy Press, Washington, D.C.

U.S. Global Change Research Program (USGCRP). 1999. Global Change Science Requirements for Long-Term Archiving, Report of the Workshop, Oct. 28-30, 1998. National Center for Atmospheric Research, Boulder, Colo., March.