The National Aeronautics and Space Administration (NASA) has become a knowledge agency. Long after the Mars Surveyor has gone silent, Hubble has met the same fate as Mir, and the Moderate Resolution Imaging Spectroradiometer has produced its final set of images, what will endure are the volumes of valuable data that these instruments and many others have collected over their lifetimes. NASA data sets are revolutionizing the fields of astrophysics, solar system exploration, space plasma physics, and earth science. As this impressive collection of observations has grown, NASA’s mission has also expanded—evolving from an emphasis on mission planning and execution to include the collection, preservation, and dissemination of earth and space data.
Spacecraft that will be launched during the next decade will increase the data volume returned by NASA missions a hundredfold. These rich data sets will open new eras in precision cosmology and in understanding of the complex linkages in the forces that shape the Earth’s environment. Addressing the increasingly complex questions that can now be asked—and answered—through the use of NASA data will require the capability to compare and combine observations of different types and to discover patterns and relationships through sophisticated querying tools. The user community will need still-to-be-developed tools and methodologies for accessing, analyzing, and mining data; recognizing patterns; and performing cross-correlations that are scalable to a billion or more objects. Developing the necessary tools will present new challenges to space scientists, to the information-technology community, and to NASA. Investments in scientific analysis and in packaging data in formats useful to other potential users, including educators, those in industry, state and local government officials, and policy makers, will be needed in order to exploit the full potential of existing data sets. The end product of each mission—knowledge—must be the key factor in determining mission design and budget allocations.
AVAILABILITY AND USEFULNESS OF NASA’S SPACE MISSION DATA
The Task Group on the Usefulness and Availability of NASA’s Space Mission Data was charged by NASA’s associate administrators for earth science and space science to evaluate the availability, accessibility, and usefulness of data from earth and space science missions, and to assess whether the balance between attention to mission planning and implementation versus data analysis and utilization is appropriate. Based on input from various sources (recent National Research Council (NRC) and other advisory committee reports; interviews with the chairs of relevant NASA advisory committees and discipline committees within the NRC; information gathered from NASA headquarters; and the task group’s survey of the archives, data centers, and data services and use of their Web sites), the task group’s answers to the charge (see Appendix A) are summarized below:
Charge 1. How available and accessible are data from science missions (after expiration of processing and proprietary analysis periods, if any) from the point of view of both scientists in the larger U.S. research community, as well as U.S. education, public outreach and policy specialists, and private industry? What, if anything, should be changed to improve accessibility?
As few as 10 years ago, NASA’s data collections were accessible mainly to researchers involved with specific missions. With the advent of a NASA network of active archives, data centers, and data services, most newer data sets have become widely available, especially to researchers. Enhancements in bandwidth and planned increases in the number of online data sets available through publicly accessible data facilities will improve the accessibility of NASA’s earth and space science data still further over the next decade. However, much of the older data (e.g., in the fields of solar and space physics and planetary science) is still in the hands of principal investigators (PIs) or is not available in formats that users need. Other data or information products (e.g., education and nonscientific applications products) are available on project Web sites but may require extensive searching to find, and their long-term availability is not assured. Further improvements in cataloging and documentation will be required to help users find data.
Charge 2. How useful are current data collections and archives from NASA’s science missions as resources in support of high priority scientific studies in each Enterprise [i.e., NASA’s Earth Science Enterprise and Space Science Enterprise]? How well are areas such as data preservation, documentation, validation, and quality control being addressed? Are there significant obstacles to appropriately broad scientific use of the data? Are there impediments to distribution of derived data sets? Are there any changes in data handling and data dissemination that would improve usefulness?
The use of archival data has contributed to a number of scientific advances in the earth and space sciences (e.g., confirmation of the Antarctic ozone hole and the accelerating expansion of the universe). The large and growing number of users—coupled with the positive results of user surveys, external reviews, and the task group’s own experience with the data facilities—attests to the usefulness of the data in a wide variety of investigations.
Many data sets will grow in value as the time period covered by the measurements lengthens. However, getting the most out of existing data sets will require the development of software tools for handling the data (e.g., for changing formats, subsetting large data sets, and querying and visualizing data sets) and improvements in documentation, user interfaces, and technical and scientific support. These improvements will be even more important for dealing with the projected growth in the volume of data (one to two orders of magnitude over the next 5 years) and the increasing need to integrate disparate data sets for both research and applications purposes. Maintaining accessibility and compatibility with changing standards for storage media, software tools, and so forth in the long term will present substantial challenges in terms of both cost and management. Although issues of validation and quality control of individual data sets were not directly addressed in this study, the task group’s generally positive findings about data usefulness suggest that these issues do not now pose either major or widespread obstacles to data use. However, they will require heightened attention in the future as demands on the active archives increase.
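To put the projected growth of one to two orders of magnitude over 5 years in annual terms, the short calculation below converts a total growth factor into the constant yearly multiplier that compounds to it. This is only an illustrative sketch: the function name is hypothetical, and the assumption of steady compound growth is a simplification that the report itself does not make.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over the given years.

    Assumes steady compound growth, which is a simplification for illustration.
    """
    return total_factor ** (1.0 / years)


# One and two orders of magnitude over 5 years, as projected in the text.
for orders in (1, 2):
    factor = annual_growth_factor(10 ** orders, 5)
    print(
        f"{orders} order(s) of magnitude in 5 years: "
        f"x{factor:.2f} per year (about {100 * (factor - 1):.0f}% annual growth)"
    )
```

Under this assumption, the projection corresponds to data holdings growing by roughly 58 to 151 percent each year, which gives a sense of the pace that storage, tools, and user support would have to match.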
NASA data have the potential to benefit society in many ways, but in order to exploit this potential it is necessary to provide support for the translation of scientific data into data products that are tailored for specific applications. These data products must be easily accessed and interpreted by people who are experts in the fields to which the data are being applied, but who will very likely have limited or no training in fields for which the data were originally collected. The work of Earth Science Information Partners, Regional Earth Science Application Centers, Infomarts, and similar applications programs is an important step in increasing the usefulness of NASA data. However, meeting the needs of the broader community would require a very substantial additional investment of resources, and such investments should be preceded by an assessment of the market for NASA information and a prioritization of investments according to cost-effectiveness and likely impact.
Charge 3. Keeping in mind that NASA receives appropriated funds for both mission development as well as analysis of data from earlier or currently operating missions, is the balance between attention to mission planning and implementation versus data utilization appropriate in terms of achieving the objective of the Enterprises? Should the fraction of a mission’s life-cycle cost devoted to data analysis, processing, storage and accessibility be changed?
Declines in funding for analysis of space science data in the 1990s have been reversed in recent years, although funding remains insufficient for analyzing data during extended missions or after missions have been completed. The major exception to this generalization is for long-lived astrophysics missions, where funding for data analysis, including analysis of archival data, is made available for a decade or more after launch. Despite changes in the way budgets are reported, the fragmented budget structure of both enterprises makes it difficult to quantify the adequacy or inadequacy of funding.
Rigid guidelines for the balance between support for mission planning and implementation on the one hand and data utilization on the other are inappropriate. However, in view of the expected growth and diversification in the data products from future missions, NASA should address more explicitly the issues of balance in its planning and management of missions and programs and it should do so utilizing mechanisms that involve the user communities. Trade-offs within the life-cycle budget should be made in such a way as to optimize the overall scientific return, even if that means reducing mission capabilities for data acquisition.
Specific recommendations related to the task group’s charge are presented in the sections that follow.
MANAGEMENT OF DATA WITHIN NASA
Concerns about the management of NASA data sets have been identified in several earlier NRC and General Accounting Office reports. The task group concludes that the management of science data and information has become a function of sufficient scope and importance that its successful execution requires leadership with the expertise to carry out these tasks:
Provide strategic planning, oversight, and advice concerning the collection, processing, archiving, and dissemination of data and information collected by NASA’s space missions;
Be the advocate for the appropriate balance of investment in data analysis;
Ensure the preservation and accessibility of valuable space mission data and information;
Require a data management plan for each mission and monitor its implementation;
Provide oversight for the design and implementation of software, hardware, and database systems for processing and storing NASA’s massive data sets;
Develop a long-term software plan for NASA’s Earth Science and Space Science Enterprises;
Require interenterprise communication and sharing of successful methods and systems for data management;
Work out the memorandums of understanding governing access to data from those missions that are carried out cooperatively with other countries; and
Determine how information generated by the space programs of other countries can be accessed and effectively used by U.S. scientists and institutions.
The person(s) charged with the tasks listed above should also create and draw on the experience of an advisory panel composed of instrument scientists, computer scientists, chief information officers (CIOs) from major corporations and government organizations, and an electronic-records expert from the National Archives and Records Administration. Analogous to the position of CIO in a major corporation, the NASA person(s) in charge of the information-management function should have budgetary responsibility for the collection, analysis, and long-term maintenance of all earth and space science data sets. This responsibility could consist of either holding the budget for designing the data collection, analysis, dissemination, and archiving function for each mission or having the right of refusal for projects and programs that do not handle it adequately, or both. In parallel with the title of CIO in industry, this person might appropriately be called the chief science information officer(s) (CSIO; this title distinguishes the functions addressed here from those of the chief information officer at NASA, who is primarily responsible for NASA business systems and security). The CSIO(s) would have responsibility for the data acquisition and utilization component of every mission and would advocate investment in data management at a level that optimizes the overall scientific return of a mission when trade-offs between hardware and data must be made.
Some of the responsibilities outlined above relate to cross-NASA issues, while others are more specific to individual program offices. Accordingly, they could be carried out either by a single individual or by individuals assigned to each of the enterprises. However, whatever administrative structure is selected, it should be one that supports cross-enterprise communication and cooperation and provides the support and authority needed to ensure that the CSIO is effective in carrying out the functions identified here.
The recommendation to consolidate the information-management function does not imply that NASA should centralize all data aspects of all missions. The task group believes that a combination of distributed and centralized activities is necessary. For example, analysis and production of data products should probably continue to be performed in a distributed manner by scientists, while long-term maintenance of data is probably best handled centrally. The NASA CSIO(s) would be responsible for overseeing the development of the overall architecture of the data and information “production line,” while leaving much of the actual design, implementation, and operation to the scientists and engineers directly responsible for each mission.
Recommendation. NASA should assign the overall responsibility for oversight and coordination of NASA’s data assets to a chief science information officer (CSIO) (or alternatively to multiple science officers). The CSIO(s) would provide leadership; long-term strategic planning; and advice on the collection, processing, archiving, and dissemination of data and information collected by NASA’s space missions to ensure the preservation and accessibility of these valuable resources. If a single CSIO is named, then this individual should report to the NASA administrator. Alternatively, CSIOs might be appointed for each of the enterprises and report to the heads of the enterprises, but in this case a mechanism should be established to ensure cross-enterprise coordination and communication of best practices.
INVESTMENTS IN SOFTWARE AND DATA ANALYSIS
The scientific productivity of a space mission depends as much on the readiness of software and data flow pipelines as on the readiness of the sensor and spacecraft hardware. Therefore, NASA science missions should be viewed as integrated systems of hardware and software. The trade-offs among capabilities that are inevitable in missions and programs with fixed budgets must include not only the funding for new missions, the development of new capabilities, and the fabrication of spacecraft instrumentation, but also the funding for software development for mission operations, data distribution, and data analysis. In cases where hardware cost overruns occur, maintaining an adequate investment in software and scientific analysis may well require reducing the capabilities of the mission itself. Ground and flight systems should be designed in conjunction in order to achieve cost-effective data acquisition and analysis.
Recent program solicitations from both the Earth Science and Space Science Enterprises require the PIs to prepare budgets for the total mission cycle cost—from mission definition to data processing, publication, and archiving. The task group encourages the continuation of this practice.
Recommendation. Budgets for mission operations and data analysis should be included as an integral part of mission and/or program funding. Reviews, including NASA’s nonadvocate review, which is required to authorize project funding, should include assessment of the data analysis elements, including archiving and timely provision of data to users. While reviews of some projects already follow this recommendation, its implementation is not uniform across all NASA programs. The appropriate balance between hardware and software investment is best determined jointly by NASA managers and the user communities involved in the mission.
The prime mission phase includes the development, launch, data collection, and analysis for a fixed period of time that is estimated to be sufficient to answer the minimum set of scientific questions that must be addressed in order for the mission to be judged a success. However, for many missions and many scientific problems, the value of data extends well beyond the termination of the prime mission phase. Missions are extended, calibrations are improved, novel uses of the data are made that were neither foreseen nor planned by the original mission investigation team, and many significant discoveries occur only after a variety of heterogeneous data sets are integrated and studied. The peak publication rate for a mission often occurs 4 to 5 years after launch. All of these factors argue for continuation of support for scientific analysis after the prime mission phase is completed. Mechanisms (e.g., proposal pressure and advisory committees) exist for setting priorities within a discipline. However, NASA, in consultation with the scientific community, will have to develop mechanisms for addressing issues of balance across disciplines or between new missions, extended missions, and postmission data analysis within or between programs. Whatever mechanism it chooses should be carried out on a regular and systematic basis.
LONG-TERM MAINTENANCE OF DATA
NASA currently provides a data center—the National Space Science Data Center (NSSDC)—for long-term maintenance of space science data. However, the NSSDC faces tremendous challenges in serving current users as well as future generations of scientists. Many scientifically valuable data sets are not archived in the center, and those that are may not be sufficiently well documented or formatted to be readily accessible. Declining budgets and rapidly growing volumes of holdings will only exacerbate these problems.
A permanent storage facility is not even available for most of NASA’s earth science data. Instead, these data are to be transferred to the U.S. Geological Survey and the National Oceanic and Atmospheric Administration 15 years after collection. Even if adequate resources can be found, transferring petabytes of data from those familiar with them to organizations with little knowledge of the data entails a risk. Because NASA data sets are a national resource and because the value of many of them increases in direct proportion to the time interval covered by them, it is important to preserve the data indefinitely. The care of the data must be accomplished so as to maximize their knowledge-enhancement possibilities, scientific impact, and discovery potential.
Recommendation. NASA should assume formal responsibility for maintaining its data sets and ensuring long-term access to them to permit new investigations that will continue to add to our scientific understanding. In some cases, it may be appropriate to transfer this responsibility to other federal agencies, but NASA must continue to maintain the data until adequate resources for preservation and access are available at the agency scheduled to receive the data from NASA.
FEDERATED DATA SYSTEMS
Many of the important scientific problems of the 21st century in both space and earth science will require the ability to explore and integrate data obtained from different spacecraft and different instruments. Rather than creating a single information system to meet the evolving needs of a wide range of users, it is now possible, and may even prove to be more cost-effective, to create a federation of distributed databases with universal standards for archiving and to provide common and easily used visualization tools. Federations capitalize on bottom-up decision making and local, custom solutions to specific user needs. A prototype federation of Earth Science Information Partners, which has been operating for 3 years, has demonstrated the ability of different NASA-funded organizations to cooperate, provide system operability at the catalog level, and produce specialized data products. The astrophysics community has developed a plan called the National Virtual Observatory (NVO), which would provide common access tools for their multiwavelength databases; development of the overall architecture and establishment of metadata standards have been funded at a level of $10 million over the next 5 years by the National Science Foundation (NSF). These and other grass-roots efforts to establish multimission data sets and data products in support of interdisciplinary or cross-cutting approaches should be nurtured, although they may not be the best solution in every case. A challenge for the future will be to develop methods for making complex queries of these federated databases.
Recommendation. NASA should encourage efforts by the scientific community to develop plans for federations of data centers and services that would enable complex querying, mining, and merging of data from different instruments and missions in order to answer complex, large-scale scientific questions.
The National Virtual Observatory, an astrophysics project funded recently by the National Science Foundation (NSF), will develop the architecture, standards, and so forth for creating a distributed system of data centers that can be cross-accessed and queried in a transparent manner by users. NASA should coordinate with the NSF-funded work on the NVO, which is predicated on seamless joint access to ground- and space-based data, to ensure that space data are compliant with NVO standards.
NASA should encourage close communications among the groups operating or developing federated systems in order to transfer best practices among its various scientific programs.
The successful implementation of methods for making complex queries of multiple databases is likely to be technically challenging and costly. The level of appropriate investment by NASA in federated data systems should be evaluated at regular intervals and should be based on (1) the importance of the scientific questions that can be addressed through the simultaneous mining of multiple databases, (2) demonstrated scientific return from past investments, and (3) the readiness of computational and communications technology to support data mining.
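The idea of a federation with shared metadata standards can be illustrated with a toy example. The sketch below is not drawn from the NVO, the Earth Science Information Partners, or any NASA system; every name, schema, and coordinate in it is hypothetical. It assumes two independently run archives that keep their own native formats, each supplying a small adapter that maps its holdings onto a common record type, so that a single positional cross-match query can run across both.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical common metadata schema agreed on by all federation members."""
    archive: str
    object_id: str
    ra_deg: float   # right ascension, degrees
    dec_deg: float  # declination, degrees


# Two archives with different native layouts (made-up sample data).
XRAY_ARCHIVE = [
    {"id": "X1", "ra": 10.0000, "dec": -5.0000},
    {"id": "X2", "ra": 150.2000, "dec": 2.1000},
]
OPTICAL_ARCHIVE = [
    ("O1", 10.0002, -5.0001),
    ("O2", 200.0000, 45.0000),
]


def xray_adapter() -> list[Record]:
    """Map the dictionary-based archive onto the common schema."""
    return [Record("xray", r["id"], r["ra"], r["dec"]) for r in XRAY_ARCHIVE]


def optical_adapter() -> list[Record]:
    """Map the tuple-based archive onto the common schema."""
    return [Record("optical", name, ra, dec) for name, ra, dec in OPTICAL_ARCHIVE]


def angular_separation_deg(a: Record, b: Record) -> float:
    """Great-circle separation between two sky positions (haversine formula)."""
    ra1, dec1 = math.radians(a.ra_deg), math.radians(a.dec_deg)
    ra2, dec2 = math.radians(b.ra_deg), math.radians(b.dec_deg)
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(left: list[Record], right: list[Record],
                radius_deg: float) -> list[tuple[Record, Record]]:
    """Brute-force pairing of records that lie within radius_deg of each other."""
    return [(l, r) for l in left for r in right
            if angular_separation_deg(l, r) <= radius_deg]


# Query both archives through the common schema with a 2-arcsecond radius.
matches = cross_match(xray_adapter(), optical_adapter(), radius_deg=2 / 3600)
for l, r in matches:
    print(f"{l.archive}:{l.object_id} matches {r.archive}:{r.object_id}")
```

The design point is that each archive retains control of its own storage while agreeing only on the shared record type. The brute-force comparison here checks every pair and would not scale to the catalog sizes discussed in this report; a real federation would need spatial indexing and distributed query planning, which is part of what makes such systems technically challenging and costly.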
EARTH SCIENCES DATA SYSTEM
The earth science community has a particular need to generate and access data within a unified framework that integrates data sets and data centers in a seamless way. The Earth Observing System (EOS) Data and Information System (EOSDIS) Core System (ECS) software was intended to provide “one-stop shopping” access to multidisciplinary data in a timely manner. This goal was not, and probably could not have been, achieved with the technology available at the time the ECS was designed. A restructured ECS with fewer capabilities will be used for a subset of EOS missions, and data processing and distribution for the remainder will be handled by active archives or PI facilities.
NASA recognizes the problems associated with EOSDIS and is developing a strategy for the evolution of the network of data systems and service providers that support the Earth Science Enterprise. The next-generation system is called SEEDS (Strategic Evolution of ESE Data Systems). SEEDS is intended to support all phases of the data management life cycle: (1) acquisition of sensor, ancillary, and ground validation products necessary for processing; (2) processing of data; (3) generation of value-added products via subsetting, format translation, and data mining; (4) archiving and distribution of products; and (5) search, visualization, subsetting, translation, and order services to assist users in identifying, selecting, and acquiring products of interest. Study teams drawn from the user community will be engaged to identify options, define scope, and establish schedule requirements. SEEDS is intended to be managed and implemented as an open and distributed information system architecture under a unifying framework of standards, core interfaces, and levels of service. SEEDS is a work in progress; details about the implementation plan were not available at the time this task group concluded the current report.
Recommendation. The ECS (the EOSDIS Core System) software should be placed in a maintenance mode with no (or very limited) further development until a concrete plan for the follow-on system, SEEDS (Strategic Evolution of ESE Data Systems), has been formulated, its relationship to ECS defined, and the plan reviewed by an external advisory group. This plan should be measured against the lessons learned from EOSDIS and from the experience in other disciplines, and should include provisions for rapid prototyping and an evolutionary and distributed approach to implementing new capabilities, with priorities established by the scientific and other user communities.
USERS OF NASA DATA
NASA currently regards scientists as the end users of data from its missions. While scientists are a major user segment, there are many others, including project and program managers, engineers, educators, the general public, and decision makers. These users need information, rather than data, in order to design and operate missions and to make policy decisions.
Recommendation. NASA planning and project funding should continue to include provisions for the timely generation and synthesis of data into information and the dissemination of this information to the diverse communities of users. This plan should take into account the needs—and the contribution to information generation—of end users, including other federal and state agencies, educational organizations, and commercial enterprises. The plan should include provisions for ongoing assessment of the effectiveness of data transfer and its educational value.
STRATEGIES FOR MEETING THE REQUIREMENTS OF THE RESEARCH COMMUNITY
The task group has identified several elements that appear to be common to those overall data management systems that best meet the requirements of the science communities that they serve. These elements are listed below and should be included in planning for future missions and facilities:
Archives and data centers should have (1) scientists on staff with a strong background in the scientific discipline being supported and (2) scientific working groups to help set priorities for acquiring, managing, and discarding data.
Prelaunch funding should be provided for software development to ensure the timely development of pipelines for processing newly acquired data.
Multiyear funding should be provided for research, including research using archived data, on the basis of the quality of the proposals received. A recent senior review (the highest level of peer review within the Space Science Enterprise) of extended planetary missions, for example, noted the success of the archival research programs maintained in astrophysics and suggested that these programs might profitably be emulated by the Planetary Data System.
Guest investigator programs should be established to allow the community to conduct research not planned by the initial project teams.
Early and open access to data should be provided to permit follow-on proposals to take advantage of new discoveries.
A mechanism should be established (such as the senior reviews in space science) for making trade-offs among operations of long-lived missions and operations of active archives and data centers in a way that reflects the scientific merit of the range of possible investments.
The importance of managing data and information from NASA’s space missions will only continue to grow in the coming years. Maintaining the increasing volumes of data in forms that are readily accessible and that meet the needs of very diverse user communities presents intellectual challenges that are at least the equal of the challenges of building and launching hardware into space. NASA is well positioned to become a leader in developing the techniques and tools for querying and mining large nonproprietary data sets. However, doing so will require a new emphasis on software management; rigorous review of the balance between investments in software and hardware to optimize the science return from both individual missions and suites of missions; and development of new techniques for exploring and intercomparing data contained in a distributed system of active archives, data centers, and data services located both in the United States and abroad.