Data Management Requirements
Early attention to data management and archiving is a critical step in ensuring the success of a long-term Climate Data Record (CDR) program. Datasets, and ancillary information such as metadata, must be preserved for decades and stored in ways that promote (1) access as data needs change; (2) reprocessing as errors are discovered or calibration is improved; (3) integration as new data products, algorithms, and data technologies are developed; and (4) user-friendly access tools. Climate research problems will inevitably require that scientists use combinations of datasets from many sources: satellite, aircraft, in situ, and even socio-economic data.
It will be critical to facilitate the integration of these multiple types of data. To extract the full scientific and societal value, the data must be available in appropriate formats for scientists, public and private sector decision makers, and managers. Each of these user groups requires different types of information from the original data, which adds complexity and cost to the data management system. Satellite-derived CDRs present special problems stemming from the great volumes of data collected, the multiple sensors and channels involved, and the need in many cases to incorporate surface validation information or to integrate in situ and satellite data sources for the CDR.
Because the ultimate legacy of long-term CDR programs is the data left to the next generation, the cost of data management and archiving must be considered as an integral part of every CDR program. For reference, other large science programs with multidecadal data access and preservation requirements can spend as much as 20 percent of their budget on data management (NRC, 2002). Over time, CDR programs will require significant resources for both continued collection and ongoing management of the data. This chapter discusses data management principles and outlines some key elements to help NOAA to maintain a high-quality CDR program.
REQUIREMENTS FOR DATA MANAGEMENT OF CLIMATE DATA RECORDS
The preceding chapter highlighted 14 elements that contribute to a successful CDR generation program. An underlying requirement of many of these elements is a sound data plan for stewardship, management, access, and dissemination of CDRs. Fundamental CDR (FCDR) and thematic CDR (TCDR) data obtained from satellites will involve huge volumes of data, but those who need the CDRs will generally not utilize large datasets. A balanced suite of TCDR products will meet the needs of most users, although there will be occasions when portions of FCDR time series, or even raw data, will be needed for independent research. Users also will require other information, such as guides to the data, explanatory metadata based on community standards, fact sheets, frequently asked questions (FAQ) lists, browse images, and searchable archives (by location, time, and phenomenon). To preserve the integrity of the data series and the flexibility needed to constitute new CDRs from the same underlying data, the original data must be stored and available for scientific reanalysis over time. This requires full documentation, including instrument documentation (e.g., CDR information, hardware documentation, firmware documentation, engineering models, and computer models), platform documentation (e.g., overview), and algorithm documentation (e.g., Algorithm Theoretical Basis Documents, “gray” books). Long-term success for the CDRs also will depend critically on sufficient metadata, in standard formats, including metadata fully describing the product line and metadata to discuss CDR limitations and to aid in data management (dataset lineage, version control, and unique identification parameters). The committee cautions that the cost of metadata generation and maintenance can be a significant part of the overall data management costs.
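The data management metadata named above (dataset lineage, version control, and unique identification) can be illustrated with a minimal sketch. The class, field names, and product names below are hypothetical and do not follow any particular metadata standard; the identifier scheme (a truncated SHA-256 hash) is likewise an assumption made only for illustration.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class CdrMetadata:
    """Hypothetical minimal record; field names are illustrative only."""
    product: str             # product name, e.g. a TCDR
    version: str             # processing version label
    parents: tuple = ()      # identifiers of input datasets (lineage)
    algorithm_doc: str = ""  # pointer to the algorithm documentation

    @property
    def identifier(self) -> str:
        # Derive a stable identifier from product, version, and lineage,
        # so that any reprocessing yields a new, distinct identifier.
        key = "|".join((self.product, self.version, *sorted(self.parents)))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Illustrative product names, not real CDRs.
fcdr = CdrMetadata(product="example-fcdr-radiance", version="v1")
tcdr = CdrMetadata(
    product="example-tcdr-cloud-fraction",
    version="v2",
    parents=(fcdr.identifier,),
    algorithm_doc="ATBD-example",
)
```

Because the identifier depends on the version and on the identifiers of the inputs, a reprocessed product cannot silently reuse the identifier of the version it replaces.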
A carefully designed, efficient data system is fundamental for ensuring success of the CDR program. Since CDRs will be stored, analyzed, and reprocessed in an environment of changing technology and user requirements, the system design should focus on simplicity and endurance. The more complex the system, the more difficult, time-consuming, and costly system upgrades will be. The lessons from Earth Observing System Data and Information System (EOSDIS) and reports from the Standish Group (1999) also suggest that large, complex systems are more prone to failure than smaller, more specialized data systems (Box 4-1). The need to maintain the CDR data systems over long periods will require either large complex systems, medium-size and intermediate-complexity systems, or numerous small technically simple but organizationally complex systems. The institutional structure that is used to manage the data will play a critical role in determining who will use CDRs and how they will be used.
The value of standard data management practices cannot be over-emphasized because data quality, ease of access, accuracy of documentation, easy problem reporting procedures, and other elements of data management will either promote or hinder current and future utility of CDRs. For instance, the success of the National Centers for Environmental Prediction and the National Center for Atmospheric Research (NCEP/NCAR) reanalysis effort compared with the European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis effort highlights the need for accessible data; the NCEP/NCAR reanalysis effort resulted in the second most cited paper in the Earth sciences (E. Kalnay, personal communication). The NCEP/NCAR success is not a result of having superior data but of having data that are more available (Box 4-2).
Advances in computing and networking capabilities are creating new methods of resource sharing, data access, and scientific collaboration. Systems should be designed to permit analysis of multiple or merged CDRs and data-mining strategies.
The following sections elaborate on the general aspects of data management related to data quality, formats, access, policy, and security, and note some features specific to satellite-derived CDRs.
To ensure that the FCDRs and TCDRs are of the best scientific quality, research scientists who understand the data and the meaning of changes in the data and use the data for their own research should play a major role in data management through active involvement in the development of data products and associated documentation. This practical engagement of scientists also allows for the infusion of their scientific perspectives as actual and prospective data users into the data production process.
The TCDR science teams introduced in the previous chapter must work under a well-defined protocol to implement operating procedures that cover all phases of product development from start to finish, providing standards that all CDRs must meet. The science team approach ensures the communication of all essential production information among the data producers, data operations staff, data analysts, documentation writers, user services, and archive and distribution team members.
NOAA should also determine the relevance of the Data Quality Act (67 FR 8452, http://www.noaanews.noaa.gov/stories/iq.htm) to satellite data and resolve issues related to watermarking, provenance, reproducibility of data, peer review, integrity of data, and supporting information.
Box 4-1
The NASA Earth Observing System (EOS), and in particular the Aqua satellite and its system of instruments, carries a collection of instruments highly similar to those being planned for the NPOESS Preparatory Project (NPP). Moreover, the scientific products produced by the EOSDIS are forerunners of the environmental data records expected from NPP. Thus, the lessons learned from the development and conduct of the EOSDIS offer the National Environmental Satellite, Data, and Information Service (NESDIS) an unparalleled opportunity to benefit from that experience in planning the production of the National Polar-orbiting Operational Environmental Satellite System (NPOESS)/NPP CDRs. The following lessons learned from the EOSDIS illustrate some overarching management considerations for meeting evolving customer needs over decades as technology, scientific requirements, and budgetary constraints change:
Science Investigator Processing Teams. A programmatic change in early 1998 transferred the responsibility for most EOS data processing from the Distributed Active Archive Centers (DAACs) (and the EOSDIS Core System) to EOS science instrument teams and their facilities. These teams included both scientists who generated data and scientists who used TCDRs. This transfer, accomplished through a call for proposals, was a major reason for the success and timely delivery of the EOS standard data products and accounted for the high degree of scientific community acceptance of these products.
Planned Evolutionary Upgrades. The EOSDIS has changed significantly in terms of architecture design and implementation since its original planned configuration. Planning for the infusion of evolving information technologies over the course of the development of EOSDIS has made it possible to support the scope of data products and services without compromising the functionality under the ensuing budgetary constraints over the years. If anything, the functionality has expanded to support a much larger community than originally envisaged. It is this expanded functional evolution that has led to the recognition that EOSDIS is a more open and distributed architecture both in terms of science processing and user applications. Adopting the EOSDIS Clearing House (ECHO) will avoid, for many products, costly system revisions or the scrapping of systems and restarting from a clean slate. ECHO allowed for a limited open source architecture concept to address the current needs and capabilities as a natural evolution. ECHO supports various searches of the metadata so that individual communities can tailor the user interface to their own needs and access methods.
Program and Project Management. Creating widely acceptable CDRs from NOAA operational satellites will be as difficult a science challenge as managing a data information system as complex as the EOSDIS. Garnering the full support of a diverse and broad representation of the science community from the initial proposed concept, plan, scope, and implementation is critical to the success of this NOAA undertaking. Unfortunately, in developing the EOSDIS the science community was not completely supportive from the start, and was unsatisfied with the centralized design approach of an EOSDIS core system, with the DAAC selection process, with its role in the scientific processing of higher-level data products, and with the one-size-fits-all approach. Allowing users to gain ownership of requirements through sponsored workshops to reach community consensus and initiating processes to enable users to prioritize requirements allowed stakeholders to take an active role in the design and thus improve their level of comfort with the EOSDIS core system.
Box 4-2
The Reanalysis Project is a cooperative effort of the National Centers for Environmental Prediction (NCEP) and National Center for Atmospheric Research (NCAR) to produce a 50-year (1948-1997) record of global analyses of atmospheric fields. This effort, which started in 1989, grew out of the need in the research and climate monitoring community for a climate data assimilation system (CDAS) that would be unaffected by changes in numerical weather prediction operational systems. The CDAS Advisory Panel recommended that a long-term reanalysis be carried out in conjunction with the development of the CDAS (Kistler et al., 2001).
The design, development, and implementation of the reanalysis project occurred during 1990-1994. It involved the recovery of land surface, ship, rawinsonde, pibal, aircraft, satellite, and other data, and quality controlling and assimilating these data. Data collection, a major task that was performed mainly at NCAR, required the cooperation of international agencies including the U.K. Meteorological Office, Japanese Meteorological Agency, and the European Centre for Medium-Range Weather Forecasts (ECMWF). The data assimilation system is kept unchanged over the reanalysis period, although it is still affected by changes in the observing systems.
The main outputs from the NCEP/NCAR reanalyses are four-dimensional gridded fields and observations. Gridded output variables are classified into four classes, based on the degree to which they are influenced by the observations or the model. An archive of five decades of observations has been encoded into a common format (denoted Binary Universal Form for the Representation of Meteorological Data [BUFR]), including metadata.
NCEP conducted two major reanalyses, one from 1948 to the present (Kalnay et al., 1996; Kistler et al., 2001) and a second from 1979 to the present (Kanamitsu et al., 2000). The long, consistent datasets from reanalysis have been extremely valuable to an impressive range of scientific studies and applications, including climate monitoring, climate prediction, applied climatology such as prediction and monitoring of climate related health problems, stratospheric transports and chemistry, and boundary conditions for regional models. It is estimated that 5,000-15,000 papers and studies have been carried out just in the last few years using the reanalyses, and their use is growing exponentially (E. Kalnay, personal communication).
Two other major global reanalyses have been undertaken: the ECMWF 15-year reanalysis covering 1979-1993 (Gibson et al., 1997) and the NASA/Data Assimilation Office 17-year reanalysis covering 1980-1996 (Schubert et al., 1993). The University of Maryland has performed a preliminary 50-year reanalysis of the oceans.
Different categories of users will require different data formats, and these will change over the decades. If the CDRs are available in multiple, flexible, and well-documented formats or in a form that permits the use of alternate formats, NOAA will be able to meet the needs of future users, particularly several decades hence when the data are still valuable. The committee does not believe that this report should recommend which data format to use for CDRs, because technology development will change the available options. It is, however, important that CDRs be made available in interoperable formats with certain common standards, such as being self-describing, and that there be periodic reconsideration of CDR formats as the underlying data technologies change. Many current format tools have been written by data users. This is an inefficient use of research funds, as the same problems must be solved over and over again by individual researchers. It would be better for the data provider, in consultation with data users, to promote standard formats for CDRs or to provide the tools necessary to use the data. In the latter case, simple visualization is not enough; extraction and format conversion tools are essential.
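As an illustration of what "self-describing" and "format conversion" can mean in practice, the following sketch carries variable names and units inside the record itself and converts the record to CSV without any outside documentation. The record layout, the variable names, and the function `to_csv` are assumptions made for this example; they do not represent any format the committee recommends.

```python
import csv
import io


def to_csv(record: dict) -> str:
    """Convert a self-describing record to CSV, carrying units in the header."""
    variables = record["variables"]
    names = list(variables)
    # Units travel with the data, so the header can be built from the record.
    header = [f"{name} [{variables[name]['units']}]" for name in names]
    rows = zip(*(variables[name]["values"] for name in names))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical record: every variable declares its own units.
record = {
    "title": "example product",
    "variables": {
        "time": {"units": "days since 2000-01-01", "values": [0, 1]},
        "sst": {"units": "K", "values": [290.1, 290.4]},
    },
}

print(to_csv(record))
```

The point of the design is that the converter needs no product-specific knowledge: a tool written once by the data provider works for any record that follows the same layout.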
If NOAA is to fulfill its goal of increasing the understanding of climate and climate impacts, integration of satellite data with other types of data, such as in situ, geospatial, and socio-economic data, must be simple and easy to perform. A currently effective means of accomplishing this is through Geographic Information Systems (GIS). Common geospatial standards for data that permit interoperable use of various types of software and hardware are critical. Specifications from the Open GIS Consortium (http://www.opengis.org/) should be used in CDRs to promote interoperability across geospatial data types. While GIS might be considered as a means for distribution, calibrated, geolocated, and time-stamped observations likely will remain the primary data source.
The key to data access is the ability to provide data to the scientists and other users that is as practical and cost-effective as possible. With the increase in satellite data resolution, and the corresponding increase in data volume, providing users with no more than what they want has become increasingly important. Two primary ways of reducing the amount of unwanted data delivered to the users are (1) to increase the accuracy of the search and (2) to provide subsetting services. Mechanisms for providing each will help NOAA to ensure that the CDR generation program is successful.
Subsetting. NOAA should ensure rapid access to meet user subsetting needs. Specifically, it should provide the capability to subset the CDR in multiple formats, including row/column bounding box, similar to the EOSDIS data pool concept (http://nsidc.org/data/data_pool/index.html). In addition to subsetting by time and space, subsetting by parameter (e.g., cloud fraction or channel radiance) should also be available.
Temporal Search. Users should have the capacity to search by day or year, time of orbit, and temporal subsampling (data from every nth day), separately or in combination.
Spatial Search. Users should have the capacity to search by a variety of spatial specifications (e.g., latitude and longitude sectors and spherical polygons).
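The three capabilities above (subsetting by parameter, temporal search with every-nth-day subsampling, and spatial search) can be combined in a single request. The sketch below is a minimal illustration only: the `Granule` class, the `subset` function, and the sample values are hypothetical, and the spatial test is a simple latitude/longitude box that does not handle sectors crossing the antimeridian or spherical polygons.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Granule:
    """Hypothetical unit of CDR data: one day, one location, several parameters."""
    day: date
    lat: float
    lon: float
    values: dict  # parameter name -> value


def subset(granules, start, end, bbox, parameters, every_nth_day=1):
    """Select granules by time window, temporal subsampling, bounding box, and parameter."""
    south, north, west, east = bbox
    selected = []
    for g in granules:
        if not (start <= g.day <= end):
            continue  # outside the requested time window
        if (g.day - start).days % every_nth_day != 0:
            continue  # temporal subsampling: keep every nth day from the start
        if not (south <= g.lat <= north and west <= g.lon <= east):
            continue  # outside the latitude/longitude sector
        # Deliver only the requested parameters, nothing more.
        kept = {p: g.values[p] for p in parameters if p in g.values}
        selected.append(Granule(g.day, g.lat, g.lon, kept))
    return selected


# Six synthetic daily granules at increasing latitude (illustrative values only).
granules = [
    Granule(date(2003, 1, d), 10.0 * d, 20.0,
            {"cloud_fraction": 0.1 * d, "radiance": 100.0 + d})
    for d in range(1, 7)
]

result = subset(
    granules,
    start=date(2003, 1, 1),
    end=date(2003, 1, 6),
    bbox=(0.0, 45.0, 0.0, 30.0),      # south, north, west, east
    parameters=["cloud_fraction"],
    every_nth_day=2,
)
```

In this example only the granules for 1 and 3 January are returned, each carrying the cloud fraction alone, which is the sense in which subsetting delivers "no more than what they want."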
Given recent and expected future advances in networking, storage capabilities, and technologies for data access (e.g., mobile devices and wireless networks), NOAA should endeavor to make CDRs available online through user-friendly, automated procedures. NOAA should take advantage of developments in other agencies with regard to data distribution initiatives, grid computing, and online collaborative tools (e.g., NSF cyber-infrastructure and middleware initiatives).
Web service infrastructures are rapidly evolving to permit data access through such digital libraries as the Digital Library for Earth Science Education (DLESE) and the National Science Digital Library (NSDL). Such software infrastructure as Unidata’s Thematic Real-time Environmental Distributed Data Services (THREDDS) is a tool for accessing archived environmental datasets from distributed server sites. NOAA can benefit from these technologies by designing a system to accommodate data mining and data discovery.
To ensure that the CDRs are used in multiple fields of science, CDR products should be promoted and distributed by multiple channels and mechanisms: electronically through online media, person to person at scientific meetings, and through scientific publications and presentations. To meet the needs of differently equipped users located around the world, data products should be distributed by multiple electronic paths, as well as made available on a variety of media (e.g., CD-ROM, DVD, DLT, flash drives). As data distribution technologies evolve, the means of disseminating CDRs must also evolve.
The ability to promote and distribute the data depends greatly on metadata standards. Although data may be held in various formats, broad data discovery and access will depend on metadata standards, or metadata that can be automatically mapped to a standard. NOAA should monitor and take an active role in groups working on this problem so that its needs for the CDRs are represented in the standards.
Usually implemented through the data management process, the CDR data policy will have to build upon longstanding policies and add new elements as the program develops, probably with the applicable policy determined for each CDR (e.g., some CDRs will involve multiagency or international sources [see Chapter 5] to which different policies apply). NOAA already has significant experience with these types of policy issues. The primary principle should be open and unrestricted access to all data, in compliance with the recommendations of World Meteorological Organization (WMO) Resolution 40 and all applicable federal regulations. The irreplaceable primary or FCDR data must be preserved in perpetuity, and data policies for superseded TCDR versions should be established so that they may be decommissioned and deleted and the storage resources recovered.
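The retention principle in the last sentence can be written as a simple rule. The sketch below is one hypothetical reading of it, not an established NOAA policy: the function name, the integer version numbers, and the assumption that a TCDR version becomes deletable as soon as a newer version exists are all illustrative. An actual policy would likely add conditions such as a holding period or confirmation that no published work depends on the old version.

```python
def may_delete(kind: str, version: int, latest_version: int) -> bool:
    """Return True if this dataset version may be decommissioned and deleted.

    Hypothetical rule: FCDR data are never deleted; a TCDR version may be
    deleted only after a newer version has superseded it.
    """
    if kind == "FCDR":
        return False  # irreplaceable primary data, preserved in perpetuity
    if kind == "TCDR":
        return version < latest_version  # only superseded versions qualify
    raise ValueError(f"unknown dataset kind: {kind!r}")


# The current TCDR version and every FCDR version are kept.
assert may_delete("TCDR", version=2, latest_version=3)
assert not may_delete("TCDR", version=3, latest_version=3)
assert not may_delete("FCDR", version=1, latest_version=3)
```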
Data management systems must ensure the security of stored data. The primary means of data security currently involves having authenticated system backups. Redundancy is essential, and backup copies must be regularly placed in widely separated geographical locations.
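One routine check that supports redundant backups is to compare a checksum of each backup copy against the master copy. The following is a minimal sketch under stated assumptions: the function names are hypothetical, SHA-256 is chosen only as a common hash, and a checksum verifies integrity rather than authenticity, so a production system would also need signed or otherwise authenticated checksums.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_replicas(master: Path, replicas: list[Path]) -> list[Path]:
    """Return the replicas that are missing or differ from the master copy."""
    expected = sha256_of(master)
    return [r for r in replicas if not r.exists() or sha256_of(r) != expected]
```

A typical use would be to run `verify_replicas` on a schedule against the copies held at each remote site and to restore any replica that the function reports.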
DATA STEWARDSHIP AND LONG-TERM ARCHIVE
Various scientific and policy-making groups have reviewed and defined the requirements for essential data systems and services needed to ensure a long-term satellite data record in support of climate research (NRC, 2000b; GCOS, 2003). Recently the Earth Observing System Science Working Group on Data offered recommendations relevant to the Earth Science Data Lifecycle that have been modified appropriately here.
NOAA, in conjunction with NASA and the DOD, should determine the nature of “stewardship.” How does it work (at the various stages in the life of the CDR)? Who is responsible for it? Who funds it?
NOAA has initiated a Comprehensive Large Array-data Stewardship System (CLASS), which is an electronic library of environmental satellite data. The web site provides capabilities for finding and obtaining those data. CLASS is an operational component of NOAA’s Office of Satellite Data Processing and Distribution (OSDPD) and NOAA’s National Climatic Data Center (NCDC). Its success is dependent on provision of adequate resources. It may provide useful lessons and capabilities for CDR data management.
Planning for CDR data management must take place within a framework that considers the full range of data management issues over multiple decades. The data lifecycle approach provides a broad view of data stewardship that represents a fundamentally new concept. The transfer of FCDRs and TCDRs to a long-term archive (LTA) should begin when there is community agreement on the validity of a CDR, although with reprocessing as a hallmark of the TCDRs they may not be suitable for an LTA. The current concept for EOS that the transition occurs following the end of a mission is not valid for the planned multidecadal life span of NOAA’s CDR program. The primary distinction between an active archive and an LTA designation is the level of user support provided to a dataset and ease of access (e.g., rapid ftp versus copied from physical media). In particular, as long as reprocessing is likely, data should remain in the active archives. Coordinated schedules and goals should be set up for working with the other agencies to effect initial CDR agreements, planning, and eventual transfer. The proposed advisory council and the science thematic teams should participate in advisory panels and committees within NOAA to specify and administer the CDRs.
Each TCDR team should develop guidelines to manage the data stream throughout the data lifecycle. These guidelines will provide the NPP and NPOESS mission science teams with a roadmap for the orderly transition of the data from production to an active archive and ultimately on to an LTA facility. New operational satellite missions must plan for an orderly process that addresses data archiving, metadata collection, data access, and data delivery as the data progress through their full lifecycle.
The permanent preservation of the CDR must be assured for at least a century; therefore, policies and strategies should be in place to facilitate the long-term viability of the CDRs. This requires methodologies that address
migration (copying and reprogramming applications to new hardware systems);
encapsulation (an e-document explaining how to recreate software and hardware systems to decode the bits);
emulation (software running on new platforms that mimic the hardware processing, prior software and applications systems, and virtual computers);
standards (an ISO standard reference model that provides a conceptual framework and defines a consistent set of standards for all major archive functions and services); and
peer-to-peer file sharing (distributed, ubiquitous computer servers networked in a dependable infrastructure that can support nomadic data access and retrieval).
Institutional options for structuring data archive and dissemination functions include both central and distributed archives, located anywhere from totally within NOAA to completely within a nongovernmental center. There are advantages, disadvantages, and risks to each approach that must be considered when planning the institutional structure for managing CDRs. There must be carefully defined and documented agreements to ensure continuity of data preservation and provision for transfer of data to other archiving centers if this continuity cannot be assured. Each of the approaches discussed here has different institutional and financial requirements. CDRs need a system that is cost-effective, provides the flexibility required by the disparate CDR user groups, and has the stability required for permanent data services and preservation. There are at least four ways this could be done:
A single archive within a NOAA/NESDIS center would give NOAA complete responsibility and control over the data storage system, including maintenance and upgrades. The disadvantages of a single system are the volume of data that would need to be managed and the potentially serious impact of a single point of failure. Large systems also tend to be more inflexible over time and more difficult to adjust, which also threatens success (Standish Group, 1999). The diversity of the CDRs that will be managed in a single archive may affect the quality of the data, and it is possible that, like EOSDIS, a one-stop shop will not satisfy user needs.
A second approach, storing data in distributed archives across NESDIS disciplinary centers, namely the National Climatic Data Center (NCDC), National Geophysical Data Center (NGDC) and its linked National Snow and Ice Data Center (NSIDC), and National Oceanographic Data Center (NODC), also retains responsibility for data management entirely within NOAA. The advantage of this structure is that each topical data center would focus on a more limited number of TCDRs. Another strength of this approach is that the in situ data are usually close at hand, along with experts from the associated fields. Somewhat smaller data storage units may also be more adaptable to technology development. One disadvantage of this scheme is that it requires strong coordination across NOAA data centers and additional oversight to ensure that goals are met on schedule. CDRs in several disciplinary areas (biosphere, cryosphere, hydrosphere) are not formally represented in the three NESDIS centers. Because some CDRs may fit equally well in two or more data centers, a Web portal across the data centers would permit users to identify what they want and could supply the software to locate data from the appropriate center.
A distributed archive that spans both NESDIS and external centers, such as NASA DAACs, the Federation of Earth Science Information Partners, International Council of Scientific Unions (ICSU) World Data Centers (WDCs), or university data centers, is another option. Such an archive could encompass multiple agencies and government and nongovernmental data centers. Many of the NASA DAACs, for example, not only archive data generated by external groups (including products derived from NOAA polar orbiters), but they also create their own CDRs and have well-organized user communities. NOAA could utilize these functioning structures for CDR creation and data management, and develop specific partnerships, rather than create a new organization. The advantage of this approach is a further reduction in the number of CDRs a location manages, which may result in more informed data stewardship. As with option two, the potential problems incurred are the more complicated organizational structure and the need for a transparent data portal. In addition, continuity over time in terms of both data management and funding for data maintenance may be more difficult to accomplish with this more diverse institutional structure. Effective short-term (possibly for mission duration only) archive centers could be located at facilities with appropriate experts. The additional condition would be that these centers have a long-term archive agreement to transfer all data, metadata, and associated documentation to a permanent archive center.
A fourth option is a central archive that is subcontracted outside the government. The advantage of this scheme is that proposals can be solicited for the project, which may lead to innovative new data archival and dissemination schemes. The winning proposal would also have full stewardship of the system, which could help ensure success. However, NOAA would have less control over the maintenance and upkeep of the system and costs could rise significantly over time. User services might also be a bigger problem with a fully external central archive.
There are compelling reasons to avoid options 1 and 4. In the case of option 2, not all NOAA centers have experience with satellite data streams. Option 3 has the advantage of entraining a wider range of expertise. Regardless of which of the four options is selected, there will be a need for strong oversight, periodic reconsideration of scientific advances and user needs, and frequent assessments of the adequacy of data management procedures and responsiveness to technological changes.
Levels of Service
An important step in data dissemination is the decision about the levels of service for each CDR. These levels should be assigned for different functional activities: ingest, processing, documentation, archiving, access and distribution, search and order, and user services (see Appendix C for more details based on EOSDIS). For data ingest there are two primary alternative modes: operational (time-critical) ingest with immediate verification of data integrity and quality, and routine ingest and verification of data quality and integrity without tight time constraints. Data processing options include such alternatives as operational products generated within two, seven, or thirty days of ingest or availability of required inputs. Since users have markedly different acceptable processing times, NOAA should survey user communities to determine appropriate time delays.
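The processing alternatives mentioned above (products generated within two, seven, or thirty days of ingest) lend themselves to a small configuration table. The sketch below is illustrative only: the tier names, the function names, and the dates are hypothetical, and the appropriate delays would come from the user surveys recommended in the text.

```python
from datetime import date, timedelta

# Hypothetical tier names mapped to the processing windows named in the text.
PROCESSING_WINDOWS_DAYS = {"two_day": 2, "seven_day": 7, "thirty_day": 30}


def product_due(ingest_day: date, tier: str) -> date:
    """Return the latest day on which a product meets its level of service."""
    return ingest_day + timedelta(days=PROCESSING_WINDOWS_DAYS[tier])


def is_late(ingest_day: date, delivered_day: date, tier: str) -> bool:
    """Return True if delivery missed the window for the given tier."""
    return delivered_day > product_due(ingest_day, tier)


# A product ingested on 1 January and delivered on 4 January misses the
# two-day window but satisfies the seven-day and thirty-day windows.
ingested = date(2004, 1, 1)
delivered = date(2004, 1, 4)
late_by_tier = {tier: is_late(ingested, delivered, tier)
                for tier in PROCESSING_WINDOWS_DAYS}
```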
As noted in Chapter 3 and earlier in this chapter, documentation of the CDR generation process (metadata) is critical for future reprocessing efforts and for using the data appropriately. Data and product holdings (including multiple versions of products and corresponding documentation as needed) should be documented to the adopted standard for long-term archiving, including details of processing algorithms and processing history; documentation should be sufficient for current use (e.g., product type descriptions, product instance [e.g., granule] descriptions including version information, FAQs, “readme” directions, Web pages with links to metadata, user guides, and references to journal articles describing the production or use of the data or product).
The User Community
Society’s need for climate data has grown rapidly, along with better computing capabilities in user communities, better access to data, and an increased appreciation for the impact of weather and climate on daily activities (e.g., NRC, 2003a). At NASA DAACs total requests for products increased from under 41,000 in fiscal year (FY) 1996, prior to the launch of Terra, to over 208,000 in FY 2002 (F. Fetterer, personal communication). The NOAA data centers have witnessed a similar increase in data requests, volume delivered, and products stored. Since 1996, NOAA data centers have received nearly an order-of-magnitude increase in data requests (Figure 4-1), with marked percentage increases in NODC requests, although NCDC continues to receive the most requests. The volume of data delivered to users has increased at an even higher rate (Figure 4-2). The increase in user requests and data volume accompanies an increase in the number of products stored at the NOAA data centers, from roughly 800 in 1998 to nearly 1,600 in 2003 (Figure 4-3). Most product requests come from the private sector (Figure 4-4).
Long-term archives of FCDRs, derived products, and complete documentation must be preserved. This will facilitate reprocessing and user access to create new TCDRs over the entire record, including the archiving of the required ancillary data, instrument, project and dataset documentation, and the science production software.
The institutional structure chosen for data management must meet the criteria set out for CDR generation, archiving, access, and distribution. The overall system design should be flexible and enduring. An archive should be identified for each data product, including both dissemination responsibilities and long-term archiving.
A clear policy is needed from the beginning to ensure continuity in the data record as well as full and open data exchange and access. Distribution must encompass multiple electronic paths and a variety of media. Data must be available in formats appropriate for a variety of uses, including geospatial and socio-economic applications.
Life cycle data management, from initial planning through development and implementation, is needed. This must involve cooperation among researchers, data and archive managers, data collectors, and primary users. To assist in making decisions on data stewardship in a resource-constrained environment, a process should be established for the science assessment of the long-term potential of data and data products.
Given the large satellite data volumes, it is critical that the NOAA infrastructure provide tools to enable the user to do spatial and temporal searches and arbitrary subsetting. Levels of service must be determined and implemented in the design of the system infrastructure. Preserving complete documentation along with the data is of absolute importance for successful reprocessing of archived data to produce improved or new geophysical products. The use of CDRs by policy makers, resource managers, educators, and planners will require the NOAA CDR system to provide them with the capability for deriving high-level information products from the CDRs.