4
Strategies for Managing Earth and Space Science Data

As shown in Chapters 1 through 3, NASA’s space missions are a primary source of data and information for a wide range of earth and space research and applications. However, the following trends pose challenges for managing the data effectively:

  1. The user community is growing in size and diversity. The increase in users is a measure of success of NASA’s active archives, but it also poses a challenge because new user groups commonly require data sets tailored for specific applications. Providing calibrated data is no longer sufficient for meeting the needs of NASA’s customer base.

  2. The volume of data is increasing rapidly, with increases of one to two orders of magnitude expected over the next five years (see Figures 4.1 and 4.2). The challenges and the costs of preserving and enabling access to these growing data sets will also grow with time. The larger data volumes will place increasing demands on developing tools for finding data and for extracting small subsets.

  3. Research questions and practical applications increasingly require the integration of a wide variety of data sets. These data sets are commonly stored in different data models and formats, lack agreed metadata, and differ in resolution, data quality requirements, and location. The holdings must be well documented, with standard formats and metadata standards, and available through a common set of querying tools so that users can integrate data across centers and combine data from the active archives with data from long-term data centers in order to identify patterns (e.g., environmental influences on galaxy evolution) and monitor long-term variations and trends (e.g., in land use or climate).

  4. Data relevant to NASA-supported research programs may be held by other federal agencies or by other countries. It will be necessary to establish agreements to ensure that these data are also properly curated and made accessible and that the formats and metadata standards are compatible.
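
To put the growth quoted in item 2 into annual terms, the short calculation below converts a total increase over a period into the equivalent compound yearly rate. It is a back-of-the-envelope sketch that assumes smooth exponential growth, which the report does not claim; the function names are illustrative only.

```python
import math


def annual_growth_factor(total_factor: float, years: int) -> float:
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_years(yearly_factor: float) -> float:
    """Years needed for holdings to double at a constant yearly multiplier."""
    return math.log(2.0) / math.log(yearly_factor)


# One and two orders of magnitude over five years, as quoted in item 2.
for total in (10.0, 100.0):
    factor = annual_growth_factor(total, 5)
    print(f"x{total:.0f} in 5 years: {100 * (factor - 1):.0f}% per year, "
          f"doubling every {doubling_time_years(factor):.2f} years")
```

Under that assumption, tenfold growth over five years corresponds to roughly 58 percent per year, and hundredfold growth to roughly 151 percent per year.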

Dealing with these management challenges will require more than simply increasing funding to the active archives, although providing increased funding to the centers is reasonable in many cases. To get the most out of its holdings, NASA will have to reexamine its overall strategy for collecting and managing data. This chapter focuses on the need for a comprehensive strategy for managing earth and space science data; the balance between acquiring, analyzing, and archiving data; and the usefulness of federated approaches to managing databases.




Assessment of the Usefulness and Availability of NASA’s Earth and Space Science Mission Data

FIGURE 4.1 Projected growth in the volume of data at all active archives and data centers evaluated in this report (see Table 2.1), FY 2000 to FY 2005. Most of the data are held in earth science centers, particularly the Goddard Space Flight Center DAAC (GSFC), EROS Data Center DAAC (EDC), Alaska SAR Facility DAAC (ASF), and Langley Research Center DAAC (LaRC). SOURCE: Data provided by managers of the active archives and data centers.

FIGURE 4.2 Projected growth in the volume of data at all active archives and data centers evaluated in this report except the four centers with the largest holdings (LaRC, GSFC, EDC, and ASF DAACs), FY 2000 to FY 2005. SOURCE: Data provided by managers of the active archives and data centers.

A COMPREHENSIVE APPROACH TO INFORMATION MANAGEMENT

Corporate America has long recognized the importance of its data to both daily operations and long-term corporate viability. Every major corporation has a chief information officer (CIO) who is responsible for all of the corporation’s data sets and who frequently has substantial power and budgetary authority. Although the collection and exploitation of data are critical to the operation of any modern business, they are not the primary focus of most. The task group contends, however, that the collection and exploitation of data are NASA’s main business.

Although NASA has a CIO, that person’s primary responsibility is for the business systems maintained by NASA.[1] The enterprises are responsible for overall program planning, including scientific data management, in their disciplines. No NASA-wide mechanism exists for (1) advocating appropriate investment in data management, (2) ensuring that best practices are communicated across the scientific enterprises,[2] (3) overseeing and evaluating the development of strategic plans for major data initiatives such as the National Virtual Observatory (NVO) and Strategic Evolution of ESE Data Systems (SEEDS), and (4) ensuring the preservation and accessibility of NASA’s valuable information resources.

The task group believes that this set of responsibilities would be most effectively carried out through the leadership of a single individual with a very high level of training both in a science field related to NASA missions and in information science. For convenience, the person(s) assigned to manage these responsibilities is referred to here as the chief science information officer, or CSIO. This CSIO would carry out the following tasks:

  1. Provide strategic planning, oversight, and advice concerning the collection, processing, archiving, and dissemination of data and information collected by NASA’s space missions.

  2. Be the advocate for the appropriate balance of investment in data analysis.

  3. Ensure the preservation and accessibility of valuable space mission data and information.

  4. Require a data management plan for each mission and monitor its implementation.

  5. Provide oversight for design and implementation of software, hardware, and database systems for processing and storing NASA’s massive data sets.

  6. Develop a long-term software plan.

  7. Require interenterprise communication and sharing of successful methods and systems for data management.

  8. Work out the memorandums of understanding governing access to data from those missions that are carried out cooperatively with other countries.

  9. Determine how information generated by the space programs of other countries can be accessed and effectively used by U.S. scientists and institutions.

[1] The general responsibilities of an agency’s CIO are (1) to ensure that information technology is acquired and information resources are managed effectively; (2) to develop, maintain, and facilitate sound and integrated information-technology architecture; and (3) to promote the effective and efficient design and operation of all major information-resources management processes for the agency, including improvements to work processes. See <http://www.hq.nasa.gov/office/cio/>.
[2] NASA has developed procedures and guidelines for reviewing and applying lessons learned from previous missions to avoid the recurrence of mistakes and to share best practices. According to a recent General Accounting Office report, however, NASA managers do not routinely identify, collect, or share lessons. See General Accounting Office, 2002, Better Mechanisms Needed for Sharing Lessons Learned, GAO-02-195, Washington, D.C., 51 pp.

Some of these responsibilities relate to cross-NASA issues, while others are more specific to individual program offices. While a single CSIO is referred to here, it may be that each of the enterprises requires its own CSIO. However, regardless of the administrative structure that is selected, it should be one that supports cross-enterprise communication and cooperation. If there is a single CSIO, that person might appropriately report to the NASA administrator. If there are CSIOs for each enterprise, they should report to the heads of the respective enterprises. The important point is that the CSIO(s) must report to a level within NASA that will provide the support and authority needed to ensure that the CSIO is effective in carrying out the functions identified here.

Just as a major corporation assigns substantial budgetary authority to its CIO, the CSIO(s) should have budgetary authority for end-to-end management of data: collection, analysis, distribution, and long-term maintenance. The issue of balance between the funding for designing and deploying a piece of hardware and funding for collecting, analyzing, and storing the data sets produced by a mission has been addressed in earlier National Research Council (NRC) reports.[3] When the cost for the hardware exceeds a particular mission’s budget, funds for data analysis may be reduced, particularly in programs with cost caps. As a consequence, data analysis may have to be funded through research and analysis programs, which are also tightly funded and already oversubscribed. The task group proposes an alternative model in which the CSIO(s) would either have the budget for designing the data collection/analysis/dissemination/archiving function for each mission or would have the right of refusal for projects or programs that do not handle the required balance adequately, or both. When trade-offs must be made between hardware and data components, the CSIO(s) would be responsible for ensuring that the mission investment in data management remained adequate for optimizing the scientific return.

Recommendation. NASA should assign the overall responsibility for oversight and coordination of NASA’s data assets to a chief science information officer (CSIO) (or alternatively to multiple science officers). The CSIO(s) would provide leadership; long-term strategic planning; and advice on the collection, processing, archiving, and dissemination of data and information collected by NASA’s space missions to ensure the preservation and accessibility of these valuable resources. If a single CSIO is named, then this individual should report to the NASA administrator. Alternatively, CSIOs might be appointed for each of the enterprises and report to the heads of the enterprises, but in this case a mechanism should be established to ensure cross-enterprise coordination and communication of best practices.

This recommendation does not imply that NASA should centralize all data aspects of all missions. Rather, a combination of distributed and centralized activities would best serve NASA’s scientific programs. For example, a distributed approach to developing software for managing data has proven to be the most cost-effective means for delivering usable software on the timescales required for scientific missions. Similarly, analysis and production of data products should continue to be performed in a distributed manner by scientists, whereas long-term maintenance is probably best handled centrally. The CSIO(s) would be responsible for overseeing the planning for the production of data products and assessing the outcomes, while leaving the actual production to the scientists. One of the charges to the NASA CSIO(s) should be, however, to facilitate interenterprise sharing of methods and systems for data management. NASA has accumulated a wealth of space and earth science data that are archived, managed, processed, and distributed by a variety of methods with different levels of success. With an agency overview, the NASA CSIO(s) could seek out data management successes from one mission and apply them to future activities. It is possible that the Space Science and Earth Science Enterprises could benefit by emulating each other’s successes.

The CSIO(s) will face many challenges, but none so daunting as the design and implementation of software, hardware, and database systems for processing and storing NASA’s massive data sets. Corporate CIOs have a range of choices of suitable database systems, analysis software, and so on, but there is minimal commercial interest in producing software specifically for use by NASA. However, creating custom software tailored to meet very specific requirements also presents problems, as NASA and other federal agencies such as the Federal Aviation Administration and Internal Revenue Service have discovered (see Chapter 2). Consequently, one of the first tasks that the CSIO should undertake is the development of a long-term software strategic plan. To the maximum extent possible, NASA should make use of commercial off-the-shelf software in executing its mission in order to maximize cost-effectiveness.

To assist with evaluating options, the CSIO(s) should create an advisory panel composed of instrument scientists, computer scientists, an electronic-records expert from the National Archives and Records Administration, and CIOs from major corporations and government organizations with very large and complicated data sets (e.g., Wal-Mart, Sears, Sabre, and USGS). The importance of including corporate CIOs on the panel cannot be overemphasized. In order to be successful financially, corporations today rely on their CIOs to acquire and exploit their data sets to the maximum extent possible. The techniques they use would be invaluable to the success of the NASA CSIO’s mission.

[3] National Research Council, 2000, Assessment of Mission Size Trade-offs for NASA’s Earth and Space Science Missions, National Academy Press, Washington, D.C., p. 14, and references therein.

ISSUES OF BALANCE: ACQUISITION, ANALYSIS, AND ARCHIVING

The goal of a scientific mission is to obtain the greatest scientific yield for a fixed amount of resources. The tasks that must be supported within a fixed budget are the following:

  A. Pre-mission science and technology definition;

  B. Mission development, flight, and operations; and

  C. Analysis of observations, modeling, archive, and education and public outreach.

Optimizing the scientific return from a mission necessarily involves optimizing the relative investment in these three broad categories of mission activities. The current distribution, in which about 75 percent is spent on category B and 25 percent is spent on categories A and C together (see Table 4.1), yields much good science. However, in order to optimize the science per dollar, the relative fraction of funds spent in each category will necessarily depend on the mission. As noted earlier, even after the fractions are fixed, cost overruns during mission development may threaten the investment in data analysis. It is critically important that trade-offs among capabilities that are inevitable in missions and programs with a fixed budget result in a balanced investment in hardware and software that optimizes the overall scientific yield from the mission.
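
The 75/25 split quoted above can be checked against the Office of Space Science (OSS) figures in Table 4.1. The sketch below is a rough cross-check, not the report's own method: it assumes that the research and data analysis lines make up categories A and C and that development, operations, and "other" make up category B. The report does not state that grouping; it is inferred here because it reproduces the percentages printed in the table.

```python
# OSS budget lines from Table 4.1, in millions of dollars.
OSS_BUDGET = {
    "FY 1995": {"development": 1411, "operations": 67, "research": 141,
                "data_analysis": 214, "other": 199},
    "FY 2000": {"development": 967, "operations": 79, "research": 250,
                "data_analysis": 291, "other": 608},
    "FY 2005": {"development": 1425, "operations": 384, "research": 320,
                "data_analysis": 513, "other": 1173},
}

# ASSUMPTION (inferred, not stated in the report): these lines form A+C;
# every other line is counted as category B.
A_PLUS_C_LINES = ("research", "data_analysis")


def category_shares(lines):
    """Return (A+C share, B share) as whole percentages of the total budget."""
    total = sum(lines.values())
    a_plus_c = sum(lines[name] for name in A_PLUS_C_LINES)
    return round(100 * a_plus_c / total), round(100 * (total - a_plus_c) / total)


for year, lines in OSS_BUDGET.items():
    a_plus_c, b = category_shares(lines)
    print(f"{year}: A+C = {a_plus_c}%, B = {b}%")
```

With this grouping the FY 2000 figures give 25 percent for A+C and 75 percent for B, matching the split cited in the text.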

TABLE 4.1 Funding for Mission Development, Operations, and Data Analysis

                               OSS Budget ($M)                  ESE Budget ($M)
Activity                  FY 1995    FY 2000 FY 2005[a]    FY 1995    FY 2000 FY 2005[b]
Development                 1,411        967      1,425        737        643        451
Operations                     67         79        384         71         48        251
Research                      141        250        320        269        371        439
Data Analysis                 214        291        513          -          -          -
EOSDIS                          -          -          -        221        279         69
Other[a]                      199        608      1,173         46        102         69
Total                       2,032      2,195      3,815      1,344      1,443      1,279
A+C (percentage)[c]            17         25         22         20         26         34
B (percentage)[c]              83         75         78         80         74         66

[a] The Deep Space Network is scheduled to be transferred from the Office of Space Flight, greatly increasing the operations budget of the Office of Space Science.
[b] EOSDIS will be split between operations and development in FY 2003, and the ground network activity will be transferred from the Office of Space Flight into mission operations, greatly changing the operations and EOSDIS budgets.
[c] A = pre-mission science and technology definition; B = mission development, flight, and operations; C = analysis of observations, modeling, archive, and education and public outreach.
SOURCE: Joseph Bredekamp, Senior Science Program Executive/Information Systems, Office of Space Science, and Martha Maiden, Code YF Data Network Manager, Earth Science Enterprise.

Data Analysis Funding

The adequacy of data analysis funding for space missions has long been a concern of the scientific community.[4] These concerns are summarized below.

  1. Data analysis funding decreased throughout the 1990s.
A 1998 NRC report on NASA’s research and data analysis programs found that the fraction of the total science-related budget that was allocated to research and data analysis fell by at least 30 percent over the period 1991 to 1998.[5] In response, the Office of Space Science (OSS) planned to “reallocate current budgets and to seek funds for new projects that will provide selected increases in data analysis funding at an overall rate of 8% per year.”[6] Budget numbers provided to the task group showed that funding for space science data analysis has increased from about $140 million in FY 1999 to over $190 million in FY 2002.[7] Moreover, projections to FY 2005 suggest that the trend of declining data analysis funding has been reversed in more recent years. The highly aggregated data in Table 4.1 show that OSS data analysis funding is projected to increase between FY 1995 and FY 2005, both in absolute and percentage terms. Data analysis funds grew from about 10 percent of the total budget in FY 1995 to about 13 percent in FY 2000, and from about 15 percent of the size of the mission development budget in FY 1995 to about 30 percent in FY 2000. The fractional size of the data analysis budget remains about the same in projections to FY 2005 of the growth in the total OSS funding.

The trends in the earth sciences are not as clear, because research and data analysis are not separated in the ESE budget. However, the fraction of the budget devoted to research appears to be growing (Table 4.1). The difficulty in interpreting these budget numbers underscores an important conclusion of a recent NRC report: “The fragmented budget structure for R&DA [research and data analysis] makes it difficult for the scientific community to understand the content of the program and for NASA to explain the content to federal budget decision makers.”[8] The OSS developed plans in 2000 to establish a uniform procedure for collecting data,[9] and the task group encourages OSS to continue this process. However, it is too soon to evaluate the results of the efforts so far.

[4] For example, see National Research Council, 1982, Data Management and Computation. Volume 1: Issues and Recommendations, National Academy Press, Washington, D.C., 167 pp.
[5] National Research Council, 1998, Supporting Research and Data Analysis in NASA’s Science Programs: Engines for Innovation and Synthesis, National Academy Press, Washington, D.C., p. 51.
[6] Interim Assessment of Research and Data Analysis in NASA’s Office of Space Science, letter to Edward Weiler, Associate Administrator for Space Science, National Research Council, September 22, 2000.
[7] Briefing to the task group by Gunter Reigler, director, Research Division, Office of Space Science, January 31, 2001.

  2. Even if an adequate level of data analysis funding is planned, the funds are not always preserved to the end of the mission. The generally tighter mission budgets of the past few years, coupled with the fact that data analysis typically comes at the end of a mission when project funds are near exhaustion, make it difficult to preserve funding for data analysis.[10] The loss of data analysis funding can have a greater impact on small missions than on large missions. According to a recent RAND Corporation report, which analyzed a set of small science missions, the resources devoted to scientific analysis averaged only 1.6 percent of the total mission cost.[11] Given that targets for data analysis are generally an order of magnitude higher, it is unlikely that this level of funding achieved the maximum scientific return.

  3. If extensions in data analysis are required, funding must be obtained from the science programs, which are already oversubscribed. The period of data analysis often has to be lengthened because (1) software delays or unforeseen calibration problems prevent timely delivery of data to the user community; (2) unanticipated discoveries lead to new lines of research equal in importance to the initial goals; or (3) the mission is extended, sometimes for many years, because it continues to collect high-quality data at a small incremental cost or because longer-term monitoring proves important. Some funding to analyze data in this lengthened collection period is available through competitive grants from the science program offices. Two factors provide some guidance as to the adequacy of funding for these areas: (1) the fraction of proposals submitted that can be funded and (2) the quality of the rejected proposals. The task group’s experience is that a 3:1 oversubscription rate is about optimum.[12] If the oversubscription rate is significantly higher, many excellent proposals are rejected; if the oversubscription rate is much lower, choice is limited. A more quantitative measure could be provided by the proposal-review panels. As part of their review, a panel could identify the division point between programs that are likely to yield excellent science and those that will lead only to modest gains. If the number of excellent proposals substantially exceeded the number that could be funded, an increase in funding would be warranted.

The senior reviews conducted in astrophysics, planetary science, and the Sun-Earth connection programs provide another mechanism for identifying where additional investments in spacecraft operations and data analysis after the prime mission phase are likely to yield important scientific returns. The senior reviews take a systems approach to evaluations. In the case of a recent senior review of the Sun-Earth Mission Operations and Data Analysis program, factors taken into account included (1) the health and status of each spacecraft and payload, (2) the scientific strengths of proposed programs, (3) the relevancy to other NASA missions, (4) the accessibility of scientific data products to principal investigator teams and outside investigators, and (5) the record for education and public outreach.

[8] National Research Council, 1998, Supporting Research and Data Analysis in NASA’s Science Programs: Engines for Innovation and Synthesis, National Academy Press, Washington, D.C., pp. 67–68.
[9] Interim Assessment of Research and Data Analysis in NASA’s Office of Space Science, letter to Edward Weiler, Associate Administrator for Space Science, National Research Council, September 22, 2000.
[10] National Research Council, 2000, Assessment of Mission Size Trade-offs for NASA’s Earth and Space Science Missions, National Academy Press, Washington, D.C., 91 pp.
[11] L. Sarsfield, 1998, The Cosmos on a Shoestring, RAND, p. 105.
[12] The oversubscription rate of OSS observing and data analysis proposals is 2:1 to 6:1. Presentation to the task group by G. Reigler, director of the Research Division, Office of Space Science, January 31, 2001.

  4. The task group could not identify a systematic procedure for determining the balance of funding between the flight programs and the associated research and data analysis, especially across science programs.
In a recent assessment of NASA’s research and data analysis programs, a 1998 NRC report recommended that each science office do the following: regularly evaluate the balance between the funding allocations for flight programs and the research and data analysis required to support them; regularly evaluate the balance among various subelements of the R&DA program; and use broadly based, independent scientific peer review panels to define suitable metrics and review the agency’s internal evaluations of balance.[13] In response, the OSS instituted a regular process of senior reviews of the research grants program. Senior reviews provide a mechanism for evaluating programs within a given discipline and within a fixed budget. They also provide a mechanism for terminating programs. Many astrophysics missions, for example, are long-lived, and the costs of operations and data analysis are substantial. Indefinite operation of all functioning satellites cannot be accommodated within available budgets.

A mechanism already in place for considering issues of balance early in the development of individual missions is the non-advocate review, which takes place before a mission is funded and evaluates all aspects of the mission life cycle (Appendix B). As noted above, the CSIO(s) may have a role in shielding data analysis budgets from overruns that occur in missions after the non-advocate review is completed.

While both senior reviews and non-advocate reviews play important roles within NASA, neither is designed to address issues related to the overall budget or issues of balance across disciplines, or between new missions, extended missions, and postmission data management. Senior reviews evaluate only missions that are underway and delivering data. The non-advocate reviews address only a single mission and do not provide program-wide direction. Whatever process NASA chooses for addressing these balance issues, it should be one that (1) is open and engages the research community, (2) is carried out on a regular and systematic basis, and (3) is conducted early enough in the planning cycle to effectively influence mission and program priorities.

[13] National Research Council, 1998, Supporting Research and Data Analysis in NASA’s Science Programs: Engines for Innovation and Synthesis, National Academy Press, Washington, D.C., pp. 3–4.

Timeliness of Data Analysis

When a new mission begins collecting useful data, it is essential that these data be analyzed quickly to discover errors or scientific results needed for follow-on missions. Usually these data are unique and lead to important insights that require rapid follow-up, especially in the case of short-lived missions. To accomplish this, a software system must be in place at launch that is reasonably mature and can provide high-quality, calibrated data products. This goal in turn requires that adequate resources be devoted to the development of the data system, beginning at very early stages in the project. Unforeseen problems often arise after launch (e.g., calibration changes), but such issues can be addressed more quickly if the basic data-processing package has already been developed and tested. The timely development of software is so critical that it should be properly funded even if it leads to a reduction in capabilities of the flight hardware.

Budgets for mission operations and data analysis are usually separated from those for mission development. This makes it difficult to make trade-offs that will optimize the overall knowledge return. However, as suggested by recent program solicitations in both the earth and space sciences, this situation may be changing.
For example, proposals submitted to the OSS Explorer program and the ESE Earth System Science Pathfinder program must encompass all mission phases, including concept study, definition and preliminary design, detailed design, development, mission operations, data analysis, data publication, and delivery of data and metadata to an appropriate archive.[14] The task group encourages NASA to adopt this approach for all its earth and space science programs.

Recommendation. Budgets for mission operations and data analysis should be included as an integral part of mission and/or program funding. Reviews, including NASA’s non-advocate review, which is required to authorize project funding, should include assessment of the data analysis elements, including archiving and timely provision of data to users. While reviews of some projects already follow this recommendation, its implementation is not uniform across all NASA programs. The appropriate balance between hardware and software investment is best determined jointly by NASA managers and the user communities involved in the mission.

[14] For example, see Explorer Program Medium-class Explorers (MIDEX) and Missions of Opportunity, <http://research.hq.nasa.gov/code_s/nra/current/AO-01-OSS-03/index.html>; Earth System Science Pathfinder (ESSP) Missions, Announcement of Opportunity, <http://essp.gsfc.nasa.gov/opportunity.html>.

FEDERATED DATABASES

The ongoing revolution in data collection, storage, and analysis of large data sets will challenge scientists by presenting new opportunities to combine the results from different types of measurements to analyze complex problems from a systems point of view. Disciplines ranging from earth science to astrophysics are actively exploring techniques for providing (1) fast access to geographically distributed data sets through standard, easy-to-use interfaces; (2) seamless interoperability of large data archives; and (3) a usable base of information for scientific explorations. Federations of data systems are a possible mechanism for addressing these requirements.

Federated data systems are most likely to succeed if they are guided by the principles that have proven sound in the context of federated corporations:

  1. Power should be placed at the lowest possible point in the structure;

  2. Interdependence distributes power and avoids the risks of a central bureaucracy;

  3. An effective federation needs a common language and laws, and a uniform way of doing business; and

  4. Participants in a federation must recognize their dual citizenship: they are members in the overall federated structure, but with substantial local autonomy.[15]

Governance (the mechanisms by which the participants share in the design, implementation, management, and operation of the federation) is the most important function for the organization’s future.[16]

Some federated data systems already exist: for example, the Planetary Data System and the Earth Science Information Partners (ESIPs). Other ambitious efforts, such as the NVO, have received initial funding. Still others, such as SEEDS, are in the planning stages.

The National Virtual Observatory

Astrophysics is entering the era of “precision cosmology.” Over the next decade or two, astronomers expect to be able to characterize the size and evolution of the primordial fluctuations that formed the seeds of the structure in the universe, observe galaxies in the earliest stages of formation and test models of how they formed, determine the nature and distribution of both baryonic and dark matter, and characterize the dark energy as a function of the age of the universe. Achieving these objectives will require the collection and integration of petabytes of data from space and ground surveys, each measuring different variables and observing different regions.
The NVO, one of the highest priorities of the recent astronomy and astrophysics survey,[17] will develop mechanisms to federate collections of data and information for an entire scientific discipline. The leadership of the astrophysics community in developing new techniques for data mining has been recognized by the National Science Foundation (NSF), which recently funded a broad-based effort ($10 million over five years) to create a framework for the NVO. Participants in the proposal included ground- and space-based data centers and key players in the university community. Both the astronomical and computer science communities played an active role in devising the implementation plan.

The NVO is predicated on seamless access to ground- and space-based data, and key next steps for NASA are the following: (1) to coordinate closely with the NSF-funded effort to develop the framework for the NVO, (2) at each data node, to identify the costs of making extant space-based data compliant with NVO standards, and (3) to develop and invest in a long-term strategic plan that builds on the NSF framework effort and the existing investment in space-based data centers to meet the scientific requirements of the space science community.

[15] C. Handy, 1992, Balancing corporate power: A new federalist paper, Harvard Business Review, November-December, pp. 59–72.
[16] National Research Council, 1998, Toward an Earth Science Enterprise Federation: Results from a Workshop, National Academy Press, Washington, D.C., 51 pp.
[17] National Research Council, 2001, Astronomy and Astrophysics in the New Millennium, National Academy Press, Washington, D.C., 276 pp.

An important element in NVO planning is the emphasis on developing bottom-up frameworks and toolkits to provide integrated services on whatever scale is appropriate to user needs and the scientific questions being asked. The NVO is not an effort to integrate all astronomical services via top-down control. The intention is to build the NVO as a science-driven, community effort with most of the funding distributed through peer review. The NVO plans to accomplish the following:

  • Establish a common systems approach to data pipelining, archiving, and retrieval that will ensure easy access by a large and diverse community of users and that will minimize costs and times to completion.
  • Enable the distributed development of a suite of commonly usable new software tools for querying, correlation, visualization, and statistical comparisons.
  • Utilize high-speed networks that will provide the connectivity among active archives and terascale computing facilities.[18]

Each institution participating in the NVO will maintain control over its individual data sets but will conform to metadata standards and protocols that are extensible far into the future. With properly designed interfaces, it will be possible for anyone in the community to add analysis tools and data facilities. Interoperability of such a distributed system will require a core management group that maintains standards and tight communications while supporting distributed research and development. Although these goals are challenging, astronomers have an established track record of operating in this manner. The NVO is intended to be evolutionary so that it can respond to changing opportunities and to developments in both hardware and software.
Fortunately, processing, storage, and networking are continuing to improve at an exponential rate, so it is likely that the hardware will keep pace with the growing volumes of astronomical data. Bandwidth remains a limiting factor, and so for the foreseeable future it is likely that the computation capabilities will need to be close to the data so that large data sets do not have to be moved.

The NVO will also take advantage of the development of grid technology, which is being widely embraced by many fields, including medical technology, earth sciences, high-energy physics, and astronomy. In fact, the inclusion of current grid technology in astronomy in the United States is being accomplished in large measure through the NVO. Grid technology allows not only access to remote data facilities but also “single sign on” remote access to computing and analysis facilities.[19] The NVO has established intimate links to the high-energy physics grid program (GriPhyN) and to the information-technology community that is responsible for the development of GriPhyN. One of the principal architects of the high-energy physics grid is also leading the development of the grid architecture for the NVO. The close cooperation between the GriPhyN and NVO projects will ensure that the astronomy community, through the NVO, will have access to current grid-based facilities that are also compatible with the grid networks being established in other fields.

[18] R.J. Brunner, S.G. Djorgovski, and A.S. Szalay, eds., 2001, Virtual Observatories of the Future, Astronomical Society of the Pacific Conference Series, San Francisco, California, p. 357.
[19] For more information on grid technologies and collaborations, see the Particle Physics Data Grid at <http://www.ppdg.net/> or the Global Grid Forum at <http://www.globalgridforum.org/>.

The astrophysics senior review held in June 2000 stressed the importance of providing a coherent data system and recommended that NASA continue to examine what services such a system might realistically be expected to provide, how it might be maintained at the cutting edge of available computational and communications technologies, and what the appropriate trade-offs are between the costs of providing these services and investments in new missions.

While the astrophysics community is providing pioneering leadership in this field, other NASA-supported disciplines are beginning to explore ways of providing similar capabilities. For example, plans have also been developed for a prototyping study for a virtual solar observatory, modeled after the virtual observatory for astrophysics, and a recent senior review of the Sun-Earth Connection program recommended funding for the initiation of the virtual solar observatory.[20] These plans are consistent with an earlier recommendation made by an NRC task group on ground-based solar research.[21]

NSF and NASA should collaborate on the development of a distributed data system with access through the World Wide Web. Such access requires easily searchable catalogs, user-friendly access software, and the capacity to handle large volumes of data. A number of organizations, both in the United States and abroad, have taken the initiative to preserve and provide data sets online. Acknowledging the importance of providing data to the community, the task group encourages the cooperation of observatories and institutions, especially NSF and NASA, in efforts to archive and ensure access to their data.
In fact, the task group believes that provisions for data archiving and distribution should be an integral part of planning for future observing facilities.

In developing these plans, the space science community should take into account the lessons learned from similar endeavors in the earth sciences, such as the ESIP Federation.

Planetary Data System

The Planetary Data System (PDS) has already taken initial steps to achieve the same goals as the NVO by combining geographically distributed active archives, which store data under the supervision of scientific experts, with centralized project management and system engineering. A PDS management council, which includes the node managers as well as the overall project manager and system engineer, makes the major decisions. Nomenclature and data formats have been standardized for all data sets, and all archived data are peer reviewed. While PDS provides its users with high-level catalogs for searching for data, this capability is not yet integrated seamlessly across all nodes. In the future, the PDS plans to develop this capability, store more data online, and increase automation of the validation and ingest processes so that mission data can be archived more quickly and made available sooner. A system is currently being implemented for online distribution of all Mars data, beginning in October 2002 with data from the 2001 Mars Odyssey mission.[22]

[20] National Aeronautics and Space Administration, 2001, Final Report of the Senior Review of the Sun-Earth Connection Mission Operations and Data Analysis Programs, 27 pp.
[21] National Research Council, 1999, Ground-Based Solar Research: An Assessment and Strategy for the Future, National Academy Press, Washington, D.C., 47 pp. + 11 appendixes.
[22] Elaine Dobinson, PDS manager, Jet Propulsion Laboratory, personal communication, March 2002.

Federation of Earth Science Information Partners

The ESIP Federation was created in response to difficulties of the EOSDIS system in responding to rapidly evolving technologies that, among other things, could improve both access to and the usefulness of ESE data, particularly for non-EOS communities.[23] There are four types of ESIPs, each serving a distinct user community. Type 1 ESIPs (the DAACs and NOAA’s National Climatic Data Center) are responsible for standard data and information products. Type 2 ESIPs produce innovative science information products and services, primarily for the global change and earth science communities. Type 3 ESIPs (applications data centers) provide data products specialized for practical applications by nontraditional user communities, including teachers, students, policy analysts, and for-profit businesses. The type 2 and 3 ESIPs were chosen through a competitive proposal process. Finally, type 4 ESIPs are sponsoring agencies (currently only NASA) of the federation.

The ESIPs are an experiment in creating and governing a federated system of heterogeneous units, driven by competition, with each unit relatively small, manageable, and able to respond to changing scientific and technical opportunities. The ESIPs developed their own governance structure in 2000, and the federation became a not-for-profit organization in 2001.[24] Ten new partners have joined since the federation was created in 1998.
The objectives of the federation are (1) to increase the diversity and breadth of users and uses of earth science data, information, products and services; and (2) to explore new ways to provide data and information operability.[25] The first objective is being met by providing services in a wide range of application areas, including land management, commercial fishing, precision farming, K-12 instruction, weather, ranching, urban planning, and energy (see Chapter 3). The second objective is being met by providing catalog-level searching, distributed data exchange, and data discovery and access prototypes. More than 65 new information services are being provided, either by individual ESIPs or by self-organized clusters of ESIPs.[26] However, interoperability at the data level has not yet been achieved. The federation has clearly produced many benefits, but its success has yet to be formally evaluated.

Federation concepts are also being incorporated into plans for SEEDS. One of the objectives of the SEEDS program is to “establish a unifying framework of standards, core interfaces and levels of service to facilitate access to data and information as provided by a distributed, heterogeneous network of data systems and service providers.”[27] An ESIP cluster is assisting with this issue, as well as providing metrics about design and performance, and is looking for ways to leverage current capabilities and expertise in existing data systems.[28] SEEDS is still being formulated, so the task group cannot comment on the adequacy of the planning to meet this objective. However, it can comment on the importance of the objective itself. The task group believes that creation of a federated, distributed system of active archives should indeed be a key component of the SEEDS program and that much can be learned from the approaches being prototyped and evaluated by the space science community and by the ESIP Federation.

[23] National Research Council, 1998, Toward an Earth Science Enterprise Federation: Results from a Workshop, National Academy Press, Washington, D.C., 51 pp.
[24] Bruce Caron, president of the ESIP Federation, personal communication, February 2002.
[25] Briefing to the task group by John Townshend, past president of the ESIP Federation, University of Maryland, July 30, 2001.
[26] There are currently 11 clusters, each of which includes 4 to 13 ESIPs, working on cross-cutting issues. See <http://www.esipfed.org/business/clusters/index.html>.
[27] Briefing to the task group by Steven Wharton, NewDISS program formulation manager, July 30, 2001.
[28] See <http://www.esipfed.org/business/clusters/newdiss/index.html>.

Recommendation. NASA should encourage efforts by the scientific community to develop plans for federations of data centers and services that would enable complex querying, mining, and merging of data from different instruments and missions in order to answer complex, large-scale scientific questions. The National Virtual Observatory, an astrophysics project funded recently by the National Science Foundation (NSF), will develop the architecture, standards, and so forth for creating a distributed system of data centers that can be cross-accessed and queried in a transparent manner by users. NASA should coordinate with the NSF-funded work on the NVO, which is predicated on seamless joint access to ground- and space-based data, to ensure that space data are compliant with NVO standards. NASA should encourage close communications among the groups operating or developing federated systems in order to transfer best practices among its various scientific programs. The successful implementation of methods for making complex queries of multiple databases is likely to be technically challenging and costly. The level of appropriate investment by NASA in federated data systems should be evaluated at regular intervals and should be based on (1) the importance of the scientific questions that can be addressed through the simultaneous mining of multiple databases, (2) demonstrated scientific return from past investments, and (3) the readiness of computational and communications technology to support data mining.
ELEMENTS OF EFFECTIVE DATA MANAGEMENT

In examining the various approaches to archiving and dissemination, the task group has identified a number of elements of the overall data management system that have proven to be important in meeting the expectations of the scientific community. These elements are listed below and should be included in planning for future missions and facilities:

  • Archives and data centers should have (1) scientists on staff with a strong background in the scientific discipline being supported, and (2) scientific working groups to help set priorities for acquiring, managing, and discarding data.
  • Prelaunch funding should be provided for software development to ensure the timely development of pipelines for processing newly acquired data.
  • Multiyear funding should be provided for research, including research using archived data, on the basis of quality of the proposals received. A recent senior review of extended planetary missions, for example, noted the success of the archival research programs maintained in astrophysics and suggested that these programs might profitably be emulated by the Planetary Data System.
  • Guest investigator programs should be established to allow the community to conduct research not planned by the initial project teams.
  • Early and open access to data should be provided to permit follow-on proposals to take advantage of new discoveries.
  • A mechanism should be established (such as the senior reviews in space science) for making trade-offs among operations of long-lived missions and operations of active archives and data centers in a way that reflects the scientific merit of the range of possible investments.

The importance of managing data and information from NASA’s space missions will only continue to grow in coming years. Data volumes are increasing, both because of the accumulation of data from a steadily growing number of space missions and because improvements in technology have enhanced the data rates from individual missions. Maintaining the data in forms that are readily accessible and that meet the needs of very diverse user communities presents intellectual challenges that are at least the equal of the challenges of building and launching hardware into space.

NASA can become a leader in developing the techniques and tools for querying and mining large nonproprietary data sets. Playing this leadership role will require a new emphasis on software management; rigorous review of the balance between investments in software and hardware to optimize the science return from both individual missions and suites of missions; and development of new techniques for exploring and intercomparing data contained in a distributed system of active archives, data centers, and data services located both in the United States and abroad.
