Review of NASA's Distributed Active Archive Centers

2
Overview of the DAAC System

This chapter synthesizes overarching issues that were noted, with varying perspectives, in the individual DAAC reviews. The issues are related to (1) the DAACs as a system, (2) the need for flexible approaches for implementing EOSDIS, (3) the relationship between the DAACs and their users, (4) life-cycle data management, and (5) the role of NASA.

WHAT IS A DAAC? WHY A SYSTEM?

DAAC Versus Data Center

As noted in Chapter 1, the DAACs were created to be different from data centers. Data centers are permanent facilities; their primary focus is on long-term distribution, maintenance, and archive of data and data products. On the other hand, DAACs are meant to exist for only about 15 years and to be involved in the initial, active stages of a satellite program, when the most intense scientific activity is occurring. (The DAACs also have heritage data sets, sometimes going back decades, from preexisting data operations.) Key DAAC tasks include (1) supporting the operational ingest and management of a suite of spaceborne sensors operated as part of the Earth Science Enterprise, (2) producing data products from remotely sensed and complementary in situ data sets as required, and (3) reprocessing data in response to improvements in the algorithms or to correct errors detected in the processing. Providing user services and access to the data is important to both DAACs and data centers. Consequently, the DAACs must operate according to sound principles of data center management (see "Life-Cycle Data Management," below). To fulfill their DAAC mission, however, they must also be responsive to scientific needs in a dynamic environment.

The ORNL DAAC neither manages satellite data nor works with EOS science and instrument teams, even though its in situ data sets are critical for calibration or validation purposes. Thus, by the definition given above, the ORNL DAAC operates more like a data center than a DAAC. Similarly, if the backup plans of the science teams are adopted, few DAACs will ingest or process EOS data. Those that do not will no longer be DAACs in the sense originally envisioned by NASA.

The DAAC System

The DAACs have a dual role within the Earth Science Enterprise. Not only do they operate as discipline centers that serve the needs of a relatively small, specialized constituency, they also cooperate as elements of a larger system, which serves the broader earth science community. The former must be a primary role; otherwise the DAACs cannot operate as effective data centers. The latter is an additional responsibility of the EOSDIS DAACs. Fulfilling the second of these roles is difficult because the DAACs are profoundly different from one another. Comparison of Chapters 3 through 9 indicates the following differences among the DAACs:

Their core constituencies are different. The discipline focus is different for each DAAC (Table 1.2), but even DAACs with overlapping disciplinary interests serve a distinctive set of users. For example, the ASF DAAC serves sea-ice scientists interested in synthetic aperture radar data, whereas the NSIDC DAAC serves the broader polar science community. Similarly, the EDC and ORNL DAACs serve terrestrial ecologists, but the EDC DAAC focuses on users of remote sensing imagery, and the ORNL DAAC focuses on users of in situ data from field campaigns and process studies.

Different disciplines place different demands on the information system.
For example, the cryospheric studies facilitated by the NSIDC DAAC require polar projections, and the field-based data of the ORNL DAAC require a broader metadata model than would be developed for remote sensing data alone. In addition, the standard EOSDIS data format, HDF-EOS, is poorly suited for ASF and ORNL DAAC holdings, and the three HDF data structures supported by the ECS (point, swath, and grid) do not apply to all LaRC DAAC data.

They are hosted by a diverse array of institutions. The GSFC, LaRC, and EDC DAACs are housed in government-operated facilities, the ORNL DAAC and the PO.DAAC are located in facilities managed by private institutions, and the ASF and NSIDC DAACs are housed in universities (Table 1.2).

They vary in size. In terms of the size of the budget and the number of staff, GSFC is among the largest and ORNL is the smallest of the DAACs (Table 1.3). The ASF DAAC has the largest volume of holdings, and the ORNL DAAC has the largest number of data sets. The GSFC DAAC and the PO.DAAC have the largest user base.

They have different readiness requirements. The GSFC, LaRC, and EDC DAACs must focus on preparing for the massive data flows from near-term missions such as the AM-1 platform. On the other hand, the NSIDC DAAC and the PO.DAAC will primarily manage data from later missions and can continue to focus on refining their service to existing users. Preparation time for the ASF DAAC, which has been receiving large volumes of data for several years, has already passed.

Each has developed unique systems for managing data. For example, the GSFC DAAC developed the Archer file management system. In addition, the GSFC and LaRC DAACs developed information systems to handle TRMM data (TRMM support system and the Langley TRMM Information System, respectively). Comparable variability exists in the computing, storage, and communications hardware.

Manipulating the data requires varying levels of technological sophistication from users. GSFC, ASF, and EDC DAAC users typically deal with very large data sets and require substantial computer resources to work with and analyze the data. They also require high-capacity media distribution. In contrast, ORNL DAAC users deal with many small ASCII tables, which are easily transmitted over ordinary Web channels and manipulated using standard personal computer software.

In addition, the ASF DAAC differs from the other DAACs in three important ways. First, it is managed by ESDIS separately from the other DAACs. Second, its processing, distribution, and archive systems are being developed under a contract to the Jet Propulsion Laboratory rather than by the ECS contractor. This is a consequence of the ASF DAAC becoming operational far in advance of the other DAACs.
Third, it manages data collected exclusively from foreign spacecraft. The space agencies that operate the satellites have placed severe restrictions on the amount and geographic coverage of data that U.S. researchers can obtain at affordable prices. All other EOSDIS data are available at no more than the cost of filling a user request.

EOSDIS, however, is meant to be more than a collection of discipline centers. To fulfill their mission to serve the broader earth system science community, the DAACs must adopt a mind-set that they are also components of a coherent (but not necessarily uniform) system that (1) enables users to locate, access, and use various types of data with valuable scientific content, using a common set of tools, whatever the data type; and (2) stimulates collaborative, multidisciplinary research as a means for understanding earth system science processes. Although issues in interpreting and blending multisensor time-varying data sets at different resolutions and sampling strategies are unavoidable, comparable access technologies, a consistent terminology, and unobtrusive authorization procedures are key aspects of an information environment that fosters rather than hinders such integrative inquiry. At present, this outcome could be achieved by a coherent system of DAACs, although this management model may be superseded in the future by a federation of partners (see "The Need for Adaptability," below).

By working together, the DAACs can take advantage of tools and technologies developed at other centers, rather than creating everything in-house. They also enjoy the benefits of collective bargaining with NASA management. Finally, as recent history has shown in the case of highly visible events (e.g., the Mars Pathfinder mission), the Web environment is particularly prone to extraordinary surges in user demand for certain types of data. This is another reason why the DAACs might profitably form an alliance to foster growth of their collective user base and to ensure a reliable level of service through the crises that are sure to arise.

None of the panels detected significant coordination among the DAACs. A noteworthy exception is the User Services Working Group, which identifies and analyzes problems collectively and devises innovative solutions on a systemwide basis. The experience gained in this process, as well as past experience in developing Version 0 of EOSDIS, should provide a practical basis for creating a working system. However, the profound differences between the DAACs have made their integration into a seamless distributed system a daunting task, whose completion has largely eluded them so far. Moreover, the need to develop DAAC-specific systems for handling the AM-1 data streams is driving the DAACs further apart (see "The Need for Adaptability," below). Finally, a less tangible, but perhaps more difficult, barrier to overcome is the apparent reluctance of many of the DAACs to become a part of EOSDIS.
It is clear from the panel reports that the ECS contractor failed to take the DAACs' views into account when designing the system. Thus, the DAACs do not have a sense of ownership in the ECS. Moreover, efforts to coordinate DAAC activities in the past seem to have been motivated more by strong leadership at ESDIS and NASA Headquarters than by self-interest or a belief in the goals of the system. Many of the DAACs feel that the long-range vision of Dixon Butler, former operations director of the Data and Information Systems Division of Mission to Planet Earth, and Gregory Hunolt, former DAAC system manager, is not shared by current managers. For the system to work, there needs to be a reaffirmation of EOSDIS goals at all levels: NASA management, the DAACs, and the scientific community.

If a coherent system cannot be achieved, users will have to learn to deal with a disparate system of DAACs. In doing so, users may benefit from the results of NASA's prototype federation of Earth Science Information Partners (ESIPs). Indeed, some of the ESIPs were funded to find ways to link disparate data repositories and management centers together, and they are reportedly making progress. Through requests for proposals, the federation mechanism may promote innovative and effective ways to help users access data from a wide range of sources (including the DAACs) easily and efficiently.

Redundancy in the DAAC System

It is probably intrinsic to the nature of distributed systems that they should suffer from some degree of redundancy. However, reckless tracking and elimination of redundancies may lead to a monolithic architecture that rapidly loses the ability to evolve. It is clear from the following chapters that each DAAC occupies an important, unique niche in the Earth Science Enterprise. None of the panels raise concerns about functional overlaps between DAACs or about other possible redundancies. In fact, one is hard pressed to find a clear-cut case in which a DAAC's function could easily be eliminated without incurring a worrisome loss to the science. Even the ASF and ORNL DAAC panels, which suggest fundamental changes to DAAC operations, argue strenuously for the scientific worthiness of these DAACs.

The fact that the DAACs are housed within science facilities, where the appropriate scientific and data management expertise is readily available, is a strength of EOSDIS. Mergers between DAACs can be devised that might lead to economies of scale, but they would also lead to a loss in expertise and talent. The cost of this loss in expertise, although difficult to quantify, should not be underestimated. This argument holds true particularly in the case of DAACs that have achieved a high level of symbiosis with their user communities. The PO.DAAC is a case in point. JPL's efforts to outsource the operation, as a money-saving measure, would destroy the highly productive team spirit that has been carefully nurtured between the DAAC and its collocated science users. The PO.DAAC panel's central recommendation, to leave the DAAC embedded within the physical oceanography group at JPL, is not addressed to the DAAC, but to JPL and ESDIS.

Conclusions and Recommendations

The committee concludes that there is no obvious redundancy in the roles and responsibilities of the DAACs and that each DAAC has a critical role to play in the overall endeavor. However, the DAACs do not as yet function as a system, although they coordinate operations on a limited scale. The creation of a coherent, seamless system in the sense that was envisaged originally will require (1) a reaffirmation of the scientific advantages of functioning as a system, (2) a collective effort to counter the pressures driving the DAACs apart, and (3) more serious efforts to collaborate. Otherwise, there is a real danger of losing coherence, no matter how many teleconferences and meetings are held by the managers. If a coherent system of DAACs cannot be achieved, it is incumbent on NASA to provide the necessary resources (e.g., through requests for proposals to participate in NASA's federation) so that a set of disparate data centers can serve the multidisciplinary users as effectively as an ideal DAAC system.

Recommendation 1. The DAACs do not yet act as components of a coherent system. They share the responsibility for providing the vision and leadership toward this goal with the science teams, ESDIS, and NASA. If such a coherent system cannot be achieved, NASA should place a greater emphasis on user-generated proposals seeking to help the community deal with a disparate DAAC system.

Recommendation 2. A DAAC alliance with a common goal will better serve the broader community than the collection of individual centers that currently exists. The DAACs should support each other and express a collective point of view on EOSDIS policies.

THE NEED FOR ADAPTABILITY

A Changing Paradigm

EOSDIS is evolving rapidly. The changes are being driven by (1) delays in the ECS, (2) a rapidly changing network environment, and (3) new management approaches to EOSDIS. In the current paradigm that governs the DAAC system, the ECS is supposed to provide the glue for the system, that is, the layer of uniformity that presents a seamless appearance to the users. Uncertainties concerning the ECS, in terms of both performance and delivery schedule, have led to irresistible pressure to turn to local, DAAC-specific solutions, which are not always transparent to the users. Such solutions are not necessarily bad (the development of DAAC-specific information systems was necessary for managing data from the TRMM mission), but they do pose a challenge to creating a coherent system. A question therefore arises: Is EOSDIS a concept that has been overcome by events, so that its time has passed?
The committee believes that technological advances and new management approaches offer hope for achieving the ultimate goal of an integrated DAAC system, albeit by taking a completely different route than originally envisaged by its architects. For example, the Web provides a new way to link the holdings of the centers and to make data easy to find and access. All of the DAACs are well on their way to developing Web-based interfaces for their users, although these efforts still fall short of the full complement of capabilities that the panels deem desirable. Improvements are needed in the realms of tracking users, data requests, and center performance.

Under the federated EOSDIS paradigm, the diversity of the DAACs is a good thing because it allows them to satisfy the needs of equally diverse groups of users. For the federation to work, however, the newly found flexibility must be embraced by the system architects. The panels advocated different approaches to the system architecture of the DAACs: the GSFC DAAC should continue working with the ECS; the LaRC DAAC should work to make its systems compatible with the ECS; and the PO.DAAC should adopt only the ECS components needed to make the system work. The committee concurs with the concept of multiple architecture strategies, noting however that their implementation would make a seamless system more difficult to achieve. Nevertheless, by taking advantage of Web technologies and making a concerted effort to act as components of a system, it should be possible for the DAACs to realize this goal.

Conclusions and Recommendations

The CGED agrees with previous NRC committees (e.g., NRC, 1998a) that the new federated paradigm should meet the stated EOSDIS goals, provided the architecture is implemented in a flexible way. Indeed, with the adoption of DAAC-specific approaches to managing the AM-1 data streams, the increased participation of EOS science and instrument teams in processing data, and the initiation of the prototype federation, the move to an EOSDIS federation is timely and probably inevitable. Instead of requiring the DAACs to accept and implement the ECS as a complete system, the committee believes they should be permitted to select those subsets of the ECS that allow them to function most effectively as components of an adaptable system and to flow with the rapidly changing electronic data environment. With this flexibility, however, comes a clear responsibility: the DAACs will have to earn their title and their place in the system by discharging their duties as system components. Otherwise, they risk turning into relatively trivial custodians of the data sets in their charge or being replaced.

Recommendation 3. To take advantage of the unprecedented flexibility afforded by the new Web-based technologies, ESDIS should allow the DAACs to incorporate only those components of the ECS that they require to satisfy their user community. This flexibility should not come at the cost of reducing the DAACs' ability to function as full-fledged members of the DAAC system. For the DAACs, the price of this flexibility is an increased individual responsibility to contribute to the overall goals of EOSDIS.

DAACs AND THEIR USERS

The DAAC system is the principal element of EOSDIS through which the ESE interacts with its various constituencies. If the users are unable to obtain the data they need in useful form and in a timely manner, the ESE will fail. Therefore, the various panels and the committee devoted considerable attention to DAAC users.

User Community

As described in the recent NRC study on an Earth Science Enterprise federation (NRC, 1998b), the principal ESE constituencies include: data producers, including instrument teams and scientists conducting in situ studies (e.g., scientists contributing data to the ORNL DAAC); global change scientists, who use and synthesize a broad range of data from different sources, and who may also produce higher-level data products; knowledge brokers, including policy makers, teachers, students, and the interested public, who use reliable, interpreted data products or assessments; and for-profit businesses, which generate value-added data products for commercial purposes.

According to the guidelines adopted by NASA Headquarters, EOSDIS will support the following constituencies, which overlap with the ESE constituencies listed above: national and international agencies and entities with whom NASA has written agreements or legal obligations concerning ESE data; NASA-funded ESE investigators (e.g., data producers, global change scientists); the broader U.S. and international earth science community (primarily global change scientists); and U.S. policy makers (i.e., knowledge brokers). If sufficient resources are available, EOSDIS will also support other knowledge brokers, including the U.S. education community, the U.S. general public, and other interested users (see Appendix C). Given that the Earth Science Enterprise is a science program, the committee agrees with these broad priorities, noting that most DAACs have a substantial outreach activity (see below).

Most panels report that the DAACs assign their highest priority to data producers, global change scientists, and NASA's partner agencies. The data producers are important to the DAACs because they generate a significant fraction of the DAAC holdings. The science community, often labeled the "customers," is considered the primary user group. Failure to satisfy this user group would constitute a failure of the DAACs (although one must be careful not to attribute all shortcomings of EOSDIS to the DAACs). However, the committee feels that the best data centers (and DAACs) go beyond the minimum requirements and that failure to serve a broader community should be considered less than stellar performance.

Some DAACs anticipate that for-profit businesses, although not a high-priority NASA constituency, will be a fast-growing segment of the user community. For example, the NSIDC DAAC sees a growing constituency among geotechnical engineers in permafrost areas, notably in other countries. Sea-ice products produced by the ASF DAAC are important to the shipping industry. Finally, potential users of the EDC DAAC are far more likely to be interested in commercial applications of Landsat than in scientific research.

To reach the knowledge brokers, the DAACs sponsor outreach activities. Outreach is typically targeted at K-12 educators but also includes dissemination of information to the general public via the Web and a variety of media (e.g., brochures, flyers), as well as displays at conferences. In addition, it includes the dissemination of near-real-time data of general interest (whenever appropriate) via the Web and the production of data sets in a variety of popular formats.

Although most DAACs are aware of the broad characteristics of their user communities (e.g., U.S. versus foreign users, scientists versus K-12 educators), few have a detailed understanding of their user profiles. Without a detailed understanding of who their users are and how they use the data, the DAACs will not be able to provide the specialized services necessary to get the most out of the data. Moreover, it will be more difficult for the DAACs to expand their user base.
Consequently, most of the panels advise their respective DAACs to characterize their user community more quantitatively, to track its evolution, and to incorporate this information in performance metrics and in a strategic plan. It is important that such tools be implemented as early as possible, before the massive influx of EOS data. The problem of characterizing the user community is compounded by the fast-growing use of the Web. Tracking users who access the DAAC as casual browsers through the Internet is notoriously difficult, but this problem is faced by all providers of products and services on the Web, including commercial providers, so innovative solutions are bound to emerge. Some of these solutions will likely be applicable to the DAACs. Instituting log-in procedures is certainly possible, but such procedures place a high administrative burden on the DAAC and discourage casual browsers. Consequently, several DAACs have generally decided against log-in procedures, except for restricted data sets generated by foreign sources. In addition, national data centers have developed strategies for keeping track of the user communities they serve, updating user profiles, and soliciting user input on the usefulness of their products. Their experience would likely provide a useful guide to the DAACs.

User Survey

Although not the only metric by which to measure the performance of a data center, user satisfaction is central to its long-term health, and dissatisfaction among a measurable fraction of the user base is a sure sign of a dysfunctional center. Several DAACs have conducted user surveys in recent years. Some are superbly comprehensive and informative, such as the most recent ASF DAAC survey mentioned in Chapter 6. Nevertheless, these surveys do not address the systemic aspects of the DAACs. To reduce this gap, the committee conducted its own informal survey, focused on the most important category of users, namely the scientific community. No effort was made to achieve a statistically reliable sampling. Rather, an electronic questionnaire was mailed to the entire membership (~1,000) of the EOS Investigators Working Group (IWG), and survey recipients were free to forward the questionnaire to other users. For instance, the PO.DAAC secured additional responses from a large contingent of users outside the IWG, including many educators and users from abroad. Consequently, the answers should be taken only as a qualitative assessment of opinion trends among users.

The survey and responses (393 responses, including 184 from foreign users) are provided in Appendix D. The responses were divided into three categories: (1) sophisticated users, including members of the IWG and data providers; (2) casual users, including U.S. educators and individuals who used a DAAC a few times a year or less; and (3) foreign users, including scientists and graduate students. Basic patterns of interest are the following:

Only about one-third of sophisticated users obtain data from a single DAAC. Most use several DAACs, sometimes as many as five. Casual and foreign users, on the other hand, tend to use a single DAAC.

As illustrated in Figure 2.1, the majority of survey respondents declare themselves to be satisfied with the DAACs, and judge their performance to be above average. The only significant dissatisfaction was expressed, quite emphatically, by scientific users of the ASF DAAC and, to a lesser extent, the EDC DAAC. The major sources of strong dissatisfaction, described in sometimes long essays, had to do with (1) slow response; (2) difficulties in finding data; and (3) poor user services, particularly in tracking data requests. Significant problems with user services at these DAACs were also noted by the corresponding review panels. It is noteworthy that none of the casual or foreign users expressed anything but satisfaction with the system, sometimes in rather eloquent and glowing terms. For these users, the fact that the data are free and unrestricted far outweighs any difficulties they may have in obtaining them.

A majority of survey respondents claim to access the DAACs several times a year. A sizable minority access the DAACs on a weekly or even daily basis.

FIGURE 2.1. User satisfaction. SOURCE: user survey (see Appendix D).

also in the long-term studies of changes in the global environment. Consequently, the notion of managing scientific data should not be reduced to a set of bureaucratically defined tasks. Life-cycle data management involves looking beyond immediate goals and deliverables and taking steps now in the interest of future generations of scientists and citizens to enhance their ability to make effective use of the unique and irreplaceable records that the EOS program is collecting at considerable expense. It begins with good instrument design and careful calibration, and extends through careful documentation of every step of the data processing and product generation, to reliable long-term archive. The collections must be designed to convey the scientific and operational context of the measurements and all other ancillary information that may assist in their proper interpretation at a time when none of the individuals originally responsible are available for questions. Past experience shows that such design is difficult and is frequently neglected, particularly because the science questions being asked change unpredictably with time. However, a consequence of such neglect is that the value of the archive for future generations of scientists is greatly diminished.

A life-cycle data management strategy is paramount when it comes to documenting long-term phenomena such as global environmental change. In the committee's view, data centers (and DAACs) should participate actively in all major stages in the life cycle of a data set:

Data collection. The credibility and reliability of the data and data products depend on careful attention to calibration, internal consistency, and version control. Consequently, instrument teams and supporting processing staff place a high priority on collecting and recording this information.
Much more difficult is capturing information deemed immaterial or common knowledge at the time of the experiment, or recording the strengths and weaknesses revealed by later use of the data set. Yet, this information may be critical to later interpretation of the data. The DAACs can help by participating in the planning of the data collection, seeking to clarify both the information that is being captured and the inputs or parameters that are imported from other sources, and facilitating the process of recording them. Similar involvement with the metadata during the product generation stage is equally important. The LaRC DAAC has been successful in this regard, particularly in its participation in field experiments. For most ESE missions and experiments, however, the data collection environment is specified with little or no DAAC input.

Management of active data sets. Active data sets require a curator who cares about the quality and completeness of the collection and has at least a general sense of its potential scientific value. Such a curator will (1) understand the strategic data needs of the scientific constituency; (2) prepare guide information to assist users in evaluating the relevance of the data to their purposes; (3) develop tools and services, such as subsetting capabilities and Web interfaces, to help users find and work with the data; (4) contact experts on behalf of users with complex scientific queries; and (5) reprocess data in response to scientific demands. The last of these is exemplified by the PO.DAAC, which reprocesses data regularly to keep up with scientific advances. Also noteworthy are the pre-subsetted data sets prepared by the GSFC DAAC, which enable users to work with manageable amounts of data, and the tools for documenting data sets provided by the ORNL DAAC. Most of the other DAACs, however, need to pay greater attention to subsetting (LaRC and EDC DAACs), reprocessing (GSFC DAAC), or user services (EDC and ASF DAACs).

Long-term archive. Archive of a valuable data set involves more than long-term storage. The assembly and presentation of useful information about a data set is equally important to its preservation for future generations. By planning for long-term archive, the DAACs ensure a greater likelihood that the data will remain useful beyond the period when a high volume of exchange, traffic, access, and manipulation takes place. Although the DAACs will not be responsible for EOS data collected more than 15 years past the end of the mission, their understanding of the data sets in their charge should be incorporated into the data sets before these are moved to a permanent archive. This will increase the probability that the holdings will retain scientific value.

The ECS metadata model lacks flexibility and is not extensible enough to permit adequate content-based access. In particular, better levels of spatial queries must be supported if the system is to fulfill the simultaneous goals of providing better access to information in general and supporting content-based access and subsetting of information in particular.
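As a rough illustration of the kind of spatial query at issue, the following Python sketch selects only those data granules whose spatial footprint overlaps a named region. It is not drawn from the ECS or from any DAAC system: the `Footprint` class, the `GAZETTEER` place-name table, the `find_granules` function, the granule names, and the approximate coordinates are all hypothetical and chosen only for illustration. Footprints are simplified to latitude-longitude bounding boxes that do not cross the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Hypothetical bounding box in decimal degrees (no antimeridian crossing)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "Footprint") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.east < other.west
            or other.east < self.west
            or self.north < other.south
            or other.north < self.south
        )


# Hypothetical place-name table; the coordinates are rough, illustrative values.
GAZETTEER = {
    "Chesapeake Bay": Footprint(west=-77.5, south=36.5, east=-75.5, north=39.7),
}


def find_granules(place: str, catalog: dict) -> list:
    """Return the IDs of granules whose footprint overlaps the named place."""
    region = GAZETTEER[place]
    return [gid for gid, fp in catalog.items() if fp.intersects(region)]


# Hypothetical catalog of granule footprints.
catalog = {
    "granule_a": Footprint(west=-80.0, south=35.0, east=-76.0, north=38.0),
    "granule_b": Footprint(west=10.0, south=40.0, east=20.0, north=50.0),
}

print(find_granules("Chesapeake Bay", catalog))  # → ['granule_a']
```

A query of this form returns identifiers only, so a user could inspect the matching footprints before requesting any data. A production system would need true polygon footprints, handling of the date line and the poles, and a standard geospatial metadata representation, none of which this sketch attempts.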
For this purpose, convenient and efficient geospatial access for scientific images and data sets is essential. This requires effective, generalized gazetteer services, together with the ability to visualize spatial footprints for items in the collections. It is the combination of these two sets of services, together with the adoption of standards for representing geospatial metadata, that will significantly increase most users' ability to access the data they need without extracting large volumes of unneeded data from the archive. Several DAACs reported that they have raised these issues with ESDIS (and the ECS contractor) but have not prevailed, even though the matter of a better, more flexible, and extensible metadata model is fundamental to supporting scientific research (see "The Role of NASA," below).

Long-term custody of data is a difficult issue for NASA. NASA does not have a long-term archive mission, and it is not willing to shift funds from the EOS program to an archive agency such as NOAA. Nevertheless, NASA has concluded a Memorandum of Understanding (MOU) with the USGS, which now has a line item in its budget for the eventual acquisition of EDC DAAC holdings. MOUs are still being negotiated with the Department of Energy (DOE) for ORNL data and with NOAA for the remainder of the DAAC holdings. In practice, however, there is no coherent plan derived from a vision of scientific needs (insofar as these needs have been identified) for long-term archive of the data.

Any plan for long-term archive should consider both the scientific cost and the dollar cost of moving data sets to a different geographic location. The cost to the science comes from dissociating the data set from the scientists and data managers who have the relevant expertise to manage it. It is unlikely that staff at a scientifically unrelated data center, no matter how well intentioned, could manage the data as well. Additional scientific costs arise from the loss of data, which is inevitable in large-scale data transfers. The dollar cost of moving data sets from active archive to long-term storage is poorly quantified, but examples exist that could be used to calibrate the cost of such tasks. For example, the holdings of the now defunct Marshall Space Flight Center DAAC were transferred to several DAACs; the Spaceborne Imaging Radar-C (SIR-C) data set was exported, together with the processing hardware and software, from JPL to the EDC DAAC; and the Ocean Topography Experiment (TOPEX) data set was moved from the principal investigators to the PO.DAAC.

A related issue is the cost of adapting to constantly evolving technology, especially storage media and associated hardware. Given the pace of technological change and the DAACs' strategy of remaining within the technological mainstream, the DAACs will have to address this issue in their mid- and long-term plans. Yet, only the PO.DAAC reports such plans, perhaps because its restoration of SeaSat data illustrated the difficulties that arise from gaps in the metadata, outdated storage media, and machine-specific data formats.
Its excellent document, "SeaSat Data Restoration-Lessons Learned," produced for this review, should serve as a valuable resource to other DAACs and data centers.

Finally, it is well known among data center managers that 10% of the users access 90% of the holdings, whereas 90% of the users access 10% of the holdings (generally at a higher level of data processing). This causes a dilemma for the DAACs, particularly with the enormous volumes of data they will face. Ten percent of the users will require advanced data storage and distribution capabilities for dealing with enormous data sets, and ninety percent of the users will need much smaller, higher-level, interpreted data sets on more accessible media (i.e., CD-ROMs). The DAACs therefore must resolve one or both of the following issues: (1) subsetting large data sets into manageable sizes in a sufficiently short time to satisfy on-line data requests of a large fraction of users and (2) identifying and implementing means of distributing very large data sets to a small fraction of users. The DAACs should plan to satisfy both requirements.

Strategic Planning

To be successful, each DAAC must create a vision of what it wants to accomplish and a strategy for achieving that vision. The vision should go beyond a simple statement of ESDIS requirements such as handling the EOS data flow and producing the planned products. It should influence every aspect of DAAC operations, including participation in the flight missions and experiments, acquisition of data sets, service to an expanding and evolving user community, and accommodation of the rapid evolution in computing, storage, and communications technologies. Similarly, the system-wide Strategic Management Plan produced each year by ESDIS and the DAACs (see Appendix C) should take into consideration the need for the DAACs to act as a coherent system. Once such plans are in place, the DAACs will also have to develop quantitative metrics to monitor the performance of the process or the system as a whole.

Few of the DAACs, however, have engaged in this thought process, and most of the panels recommend some level of strategic planning. (Notable exceptions are the PO.DAAC and the NSIDC DAAC.) For example, a vision and an implementation plan would help the EDC DAAC to serve the needs of its scientific users and its potentially much more numerous applications users, and would help the ORNL DAAC to become involved in EOS flight programs. The latter is particularly important because the unique biogeochemical holdings of the ORNL DAAC are essential to the proper validation and calibration of remotely sensed data. Unless the ORNL DAAC asserts itself, the various flight projects will be forced to develop independent solutions, and large components of the EOS program will fail.

The panels' most common recommendation in this regard, however, has to do with technology. Strategies for acquiring and upgrading hardware and software are difficult to develop because such tasks are partly the responsibility of the ECS contractor.
This issue is particularly important for the AM-1 DAACs, which have received significant amounts of ECS hardware and software (often at their own request) but know little about what will be delivered in the future. Much of this equipment is literally sitting on the computer room floor awaiting the anticipated huge data flows and is slowly falling behind the state of the art even before being used. The system will thus employ multiterabyte technology when the most sophisticated users will expect petabyte capabilities.

Although the DAACs cannot control completely the type of ECS equipment they will receive or its schedule for delivery, they can choose their own hardware and software for managing existing data sets. (The GSFC and LaRC DAACs also chose their own systems for managing data from the TRMM mission.) When acquiring hardware, most DAACs aim to stay within the technological mainstream, a standard goal for data centers because limitations on financial resources prevent much experimentation with technology. Moreover, current wisdom about the evolution of technology (i.e., "Moore's law") is that capacity (processor speed, bandwidth, mass storage) now doubles every 18 months. This time constant is comparable to that of the procurement cycle, which makes it difficult for the DAACs to adopt an agile response to the acquisition of new technologies. The committee notes that this problem is particularly acute with the ECS, which chooses equipment long (sometimes years) before deployment. The panels report that the equipment purchased by the DAACs generally falls within the technological mainstream, although some development efforts are based on technologies that are more advanced than those included in the ECS. For example, the transition to the next generation of processors will likely place the LaTIS software at the LaRC DAAC ahead of the ECS and make it incompatible with planned ECS distributions, or so-called drops. Thus, any strategy for the acquisition and upgrade of hardware and software will also have to take into account compatibility, not only with the ECS, but also with the other DAACs in the system.

A related aspect is that the DAAC system, perhaps overwhelmed by the prospect of facing the huge EOS data flow, has spent little time selecting measurable goals and self-assessment criteria against which to gauge collective performance. Without quantitative measures of performance, it will be difficult for NASA or the DAACs to determine whether the needs of all the EOSDIS constituencies are met and, thus, whether the DAAC system as a whole is a success. At the moment, launch delays and ECS failures have tarnished the image of all components of EOSDIS, including the DAACs. Yet, as pointed out by the panels, most of the DAACs function well individually, even though the same cannot be said of their behavior as a system. A regular peer review that focuses on established performance measures for the DAACs and for the DAAC system would help provide a sound basis for determining the health of the DAACs.

Conclusions and Recommendations

The ultimate success of the Earth Science Enterprise will be judged not only by the immediate scientific gains that arise from use of newly collected data, but also by the ability of scientists to use the data in the long term to study global environmental change.
The data are most likely to retain their usefulness in the long term if the DAACs adopt a life-cycle data management approach and become involved in everything from the collection of data to its eventual archive. The PO.DAAC has such a comprehensive management philosophy, but most DAACs focus on the middle step, management of active data sets, and they do so successfully overall. Only a few DAACs participate in the design of the data collection environment. The remainder either do not see such involvement as one of their roles or are discouraged from participating by the science and instrument teams. Both sides will have to come to the table if a mutually beneficial relationship is to develop. Finally, the committee notes that there is still no concrete plan for the long-term archive of the vast majority of DAAC data. Because NASA has no archive mission, the DAACs will have only a limited role in the transition of DAAC data sets to archives of other agencies. Nevertheless, their knowledge of the data and experience with data transfers should prove valuable in this process.

Implementing a life-cycle approach to data management requires a greater degree of planning than currently exists at most of the DAACs. Only a few DAACs have a clear vision of what they are trying to achieve or strategies for achieving their goals. In addition, all the DAACs could benefit from devising and monitoring quantitative performance measures. Such measures are useful for evaluating the performance of a function or process, as well as for determining the success of a DAAC or the entire DAAC system.

Recommendation 6. Ongoing changes in data volumes, user expectations, and emerging technologies are powerful forces that put pressure on each DAAC to evolve independently of the others. In order to counteract such centrifugal forces, each DAAC should prepare and periodically update a practical strategic plan for dealing with change, while preserving the concept of a coherent system.

Recommendation 7. Excellence in a research enterprise is best gauged through assessment of performance by one's peers, according to a commonly accepted set of performance criteria. The DAACs must develop a set of quantitative performance metrics by which they can measure their own progress and evaluate their success as individual centers and as a coherent system. Periodic peer review aimed at gauging accomplishments against these metrics should be incorporated as part of this ongoing process.

The committee emphasizes that Recommendation 7 is aimed at the DAACs, rather than the ESDIS Project. This is because such peer reviews should not be a bureaucratic imposition, but a means by which the DAACs and their user communities achieve a greater understanding of one another and thus a more effective level of service.

THE ROLE OF NASA

As originally envisioned by its creators, EOSDIS serves three important roles. First, it provides a mechanism for distributing data from EOS-related missions and experiments.
Second, it promotes creative scientific analysis of these data and, as such, must enhance opportunities for scientists to build on the unprecedented information it already contains and on the new data anticipated over the coming decades. Third, it is the largest single component in global efforts to understand, predict, document, and mitigate the impacts of global environmental change (although key data currently reside in other agencies such as NOAA). Consequently, EOSDIS will greatly influence complementary efforts in other agencies and throughout the world. Strong leadership within NASA is critical for implementing this vision and for balancing the potentially conflicting demands of the constituent elements of EOSDIS: the ECS, DAACs, EOS investigators, and users.

Since EOSDIS was designed, the Internet and the World Wide Web, along with greatly reduced costs of computation, have changed the paradigm for scientific collaboration. Scientists no longer need to rely on a centralized warehouse of data equipped to respond to a sharply limited range of predefined queries. Instead, distributed databases and information systems offer the possibility of finding useful information in much more flexible ways. Probably the greatest challenge facing EOSDIS at this time is to adjust its social and management structures to take full advantage of these new opportunities without jeopardizing its ability to manage high-volume data streams from coordinated instrument systems. Changing established ways of doing things is generally painful and involves some risk. It also requires dedicated leaders with initiative and vision.

For EOSDIS to succeed in this new environment, its constituent elements must be empowered to fulfill their special roles. For example, the DAACs (or relevant Earth Science Information Partners) should be vested with the appropriate authority to take all actions necessary to satisfy the needs of the science community. This is particularly true for the ASF DAAC, which lacks authority to compel JPL to develop an information system that is responsive to the needs of the DAAC or its users. Each center should be encouraged to develop its own personality in accordance with its special needs, as long as it is contributing to the evolution of a responsive and dynamic distributed system for the EOSDIS community at large. Neither the ORNL nor the ASF DAAC, for example, fits the "standard" DAAC mold because they do not deal with data from EOS spacecraft. Yet each has a vital role to play in the Earth Science Enterprise and should be integrated conceptually into EOSDIS.
(It is significant that NASA's model of the EOSDIS architecture divides the ORNL DAAC from the others [see Figure 1.1].) In the case of the ASF DAAC, NASA will have to first develop a long-term policy on the acquisition, processing, and use of SAR data for civilian purposes.

ESDIS will have to create incentives for making the constituents respond to the needs of the broader community and safeguards for ensuring that they do not destroy the overall integrity of the data system. This delicate balance requires vision and leadership from both the DAACs and ESDIS. In fact, the ability to exercise leadership in this regard should be a substantial consideration in personnel selection. Finally, metrics of success must be devised and agreed to that recognize that the value of EOSDIS lies in the scientific understanding and reliable information emerging from the entire program, rather than in the cost per byte of data processed. These are not trivial requirements for NASA management, and effective strategic planning requires leadership from the highest levels.

Conclusions and Recommendations

Under the new EOSDIS model, ESDIS will have to forgo the current mode of operations in which all strategic decisions are made centrally and adopt the mode of providing incentives to individual DAACs (1) to serve their individual specialized scientific constituencies most effectively and (2) to collaborate to provide users with a common look and feel to the information system. The former will require close collaboration with science and instrument teams and active scientists. The latter will involve some compromises with respect to the ideal of one-stop shopping. In the right governance structure, the DAACs would task ESDIS and ECS to address common needs, which must be identified from the bottom up rather than from the top down. Such common needs range from technical issues, such as format translations and regridding data in different projections, to high-level issues, such as the completeness and extensibility of the metadata model and what kind of services to provide to users.

In effect, ESDIS will have to foster the creation of a federation of DAACs by delegating some of its authority for serving users to the DAACs. Such a federation differs from NASA's prototype federation (see Chapter 1, "Federation and Recertification") in that the partners require greater stability and continuity. In addition, because of product interdependence, the DAACs have a requirement for reliability and adherence to schedules. In this sense a DAAC federation would constitute a core around which the broader federation of (sometimes ephemeral) ESIPs could grow and function effectively.

Recommendation 8. The DAAC-ECS-ESDIS model for managing EOSDIS data and information has not succeeded. To take advantage of new technological approaches and management models, ESDIS should foster the creation of a federation of DAACs by delegating to the DAACs some of its authority for serving users and by providing incentives to the DAACs to serve the broader community as well as their individual specific constituencies.
OTHER ISSUES

The Cost of the DAAC System

The EOSDIS budget is currently about $2 billion over a 10-year period. The DAACs, including ECS-provided hardware, software, and personnel, account for about 30% of the EOSDIS budget, or $60 million per year. The remainder of the EOSDIS budget provides for data capture and communications, ECS development, science computing facilities, and program management.

It is important to note that the DAAC budget figures provided by ESDIS and presented in this report are only approximations. The true cost of the DAACs is difficult to determine for the following reasons: the DAACs receive funds, services, and personnel from several sources, including NASA Headquarters, the ESDIS Project, and their host institutions; resources are shared (i.e., the ECS development effort benefits several of the DAACs); neither NASA nor the DAACs (except ORNL) practice full-cost accounting; and congressional appropriations for the EOS program are commonly less than NASA's request.

It is also important to note that the budget histories presented in the DAAC chapters were provided by ESDIS in May 1998. The values for FY 1994 to FY 1997 are actual values, but values for FY 1998 to FY 2002 are projections, which are likely to change significantly as funding for EOSDIS declines and backup plans are implemented. Because the numbers were determined by the DAACs' primary funding source (ESDIS), they provide the most complete and consistent picture available of the cost of the DAAC system.

The cost estimate provided by ESDIS is significantly higher than the cost estimates provided by most of the DAACs because it includes the ECS-provided hardware, software, and personnel deployed at the DAACs. The DAACs have little say about these resources and tend to consider them ECS, rather than DAAC, expenses. The committee and its panels, however, believe that all resources at the DAACs should be included in cost estimates of the DAAC system, so both DAAC and ECS-related expenses at the DAACs are itemized in the DAAC budgets (see Chapters 3 through 9).

However, even with the ECS costs factored in, it is clear that the DAAC system is less expensive than is commonly believed. For example, the DAAC budget was presented misleadingly as $100 million per year at a 1995 NRC workshop in La Jolla, California. The apparent high cost of the DAACs was one of the drivers for proposing a federation management model for EOSDIS. Several survey respondents also commented that the DAACs are too expensive. All of the panels attempted to assess the cost-effectiveness of the DAACs, but several DAACs could not even suggest suitable metrics.
(The PO.DAAC and the ASF and GSFC DAACs measure cost-effectiveness as the unit cost of delivering a data set; the ORNL DAAC measures it as the cost per unit of data stored.) Consequently, the issue of cost-effectiveness could not be addressed in a significant way. It is noteworthy, however, that even though most of the DAAC budgets far exceed the budgets of national data centers, none of the national data center directors who served on the panels thought that the DAAC budgets were too high for the amount or complexity of data being handled, the size of the user base, or the services provided.

Impact of Contingency Plans

Because of delays in the ECS, both the DAACs and the science and instrument teams have developed fall-back strategies for processing and disseminating EOS data and data products. Implementing these contingency plans will result in some data and data products becoming available, but not as many as had been hoped under the original EOSDIS design. For even a reduced number of products to be distributed, ESDIS, the DAACs, and the various teams will have to resolve the following issues:

Documentation. The DAACs are likely to play a much smaller role in generating products and will therefore be less knowledgeable about the data sets and data products in their charge. Consequently, documenting what is being done to the data in the product generation stage becomes even more important under the contingency plans than it was before. If the DAACs are to provide an adequate level of user services, they will need to have a much closer relationship with the science and instrument teams than currently exists. In particular, they will have to become more involved in producing the metadata for the data products.

Coordination. Under the 25-50-75 scenario, only 25% of the data will be made available initially. It is up to the individual science team to decide which 25% of the data will be processed. This decision will affect not only users of that data product but also other science and instrument teams and DAACs that need to use the data to produce other data products. To accommodate product interdependencies, the DAACs and teams both will have to place a high priority on coordinating schedules. Otherwise, the production of many important data products is likely to be further delayed.

Dissemination. In the earlier stages of data processing and distribution, the DAACs are likely to be bypassed in favor of scientists calling their instrument team colleagues.
In fact, the Science Computing Facilities where the data products are being generated are likely to become the primary distribution mechanism until the instrument team members become so fed up that they relinquish the distribution task to the DAACs. (The DAACs serve tens of thousands of users each year [see Table 1.3].) As the ORNL DAAC can attest, awaiting data sets that could arrive from principal investigators at any time is frustrating and makes it difficult to allocate personnel and computer resources for distributing the data to the broader community.

It is unclear whether one-stop shopping is still a realistic goal for EOSDIS as a whole. Evolving Web access tools will increasingly permit reliable distributed searches, but only if there is a common terminology of keywords that is also shared by users. The DAACs and their science constituencies together have to develop this terminology. If this is done, it should also be possible to make available complete granules from standard products in a manner that is reasonably consistent overall. Because the granules typical of large, low-level data sets are generally too large to download over the Internet, and because subsetting tools designed by the ECS will not be ready in time for the AM-1 launch, the DAACs most affected will have to incorporate subsetting into their contingency plans.

CONCLUSIONS

With technical problems in the EOSDIS Core System and, more recently, flight operations commanding the attention of NASA and Congress, it is easy to lose sight of what the EOS program is all about: understanding the Earth and the processes that govern it. Because such a wide variety of data will be collected (atmospheric, oceanic, polar, biospheric, and solid earth), scientists will be able to use EOSDIS to support both disciplinary and multidisciplinary research. Although multidisciplinary scientists are only one component of the EOSDIS user community, meeting their needs will be the greatest challenge of the system. For these users, EOSDIS must be more than a collection of discipline data centers; it must be a real system that enables users to access and combine data from more than one DAAC or data center.

The current collection of DAACs does not as yet function as a system. To become a system in reality, the DAACs and ESDIS will need a common vision of the goals of the system and a commitment to developing practical approaches toward achieving these goals. Developing such a system becomes an even greater challenge as EOSDIS evolves to a more distributed federation. NASA leadership is crucial for this transformation to succeed. In the near term, NASA's attention is necessarily focused on fulfilling existing software commitments and on supporting science and instrument teams for existing or near-term flight missions. However, in the longer term, EOSDIS must establish by force of example its role as the creative nerve center for the scientific understanding of the Earth in the next decade. Though in many respects still a dream, this goal is too important to let slip away.