Executive Summary

Environmental data centers have been successfully acquiring, disseminating, and archiving data for decades, but the increasing volume and number of datasets, together with growing demands from an increasingly diverse user base, are making it difficult for data centers to maintain the record of environmental change. At the request of the United States Global Change Research Program (USGCRP), the National Research Council (NRC) held a workshop on Coping with Increasing Demands on Government Environmental Data Centers. The objectives of the workshop were to consider technological solutions that could enhance the ability of users to find, interpret, and analyze information held in environmental data centers and that could help data centers collect, store, share, manage, and distribute large volumes of data.

The workshop focused on technological approaches that should be given consideration not only by data center managers and their sponsoring agencies but also by various user communities. These solutions could improve both data center operations and the ability of a wide variety of users to obtain data. This report is based on discussions from the workshop and committee deliberations, and the focus areas were identified by workshop participants.

Data ingest into the major data centers appears to be well planned and executed. The process of acquiring environmental data from the centers for research or commercial use, however, continues to be difficult. The workshop considered the following areas where advanced technologies would help data centers’ performance:



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

- improved application of standard translatable formats;
- greater reliance on on-line data storage and network access;
- more sophisticated database technologies;
- expanded metadata management and lineage tracking; and
- greater reliance on nonspecialized, easily available hardware and software solutions.

IMPROVED APPLICATION OF STANDARD TRANSLATABLE FORMATS

Data and metadata formats evolve as the priorities of data producers and users change. Although it is not possible to create a single standard format for data and metadata that meets the needs of every purpose for every dataset and user group, greater uniformity would make it easier for users to query, search, subset, access, and integrate data. In particular, using a standard format, such as XML, for metadata would enable some of these data to be generated automatically, stored in searchable databases, and easily translated among user applications.

Recommendation: With their user communities, data centers should accelerate work toward standardizing data and metadata formats and making them more transparent, thereby improving distribution and interoperability among data centers, between data centers and users, and among users. Metadata formatted in XML would ensure that recipients can parse the data automatically and feed them directly to their applications.

GREATER RELIANCE ON ON-LINE DATA STORAGE AND NETWORK ACCESS

Providing network access to datasets in an accessible directory hierarchy would ease access to and distribution of data. This approach vastly increases distribution efficiency when subsetting tools are also made available by the data center holding the dataset: users can treat datasets as local files and use subsetting tools to extract only the portions they need, thereby reducing the network bandwidth needed for the acquisition. Network bandwidth is already widely available for retrieval of large volumes of data.
However, the use of network bandwidth for data delivery relies extensively on the ability to access data randomly and would require the implementation of suitable database management and subsetting tools at the data centers. The off-line and near on-line storage techniques (e.g., tape robots) currently used by many data centers can hinder these solutions. Disk storage is now competitive with tape for long-term, archival-class storage. Over the past decade, disk storage and access have seen a greater increase in performance per unit price than any other part of the computing industry, and other technologies for dense storage of information are the subject of much research activity in both industry and academia.

Recommendation: Data centers and their sponsoring agencies should shift the primary storage medium from tape to disk. In addition, data centers and their sponsoring agencies should enable direct random on-line access through networks and provide support for remote queries on databases.

MORE SOPHISTICATED DATABASE TECHNOLOGIES

Files are a reasonable way to organize data when the physical storage medium is tape; however, disk storage permits data to be organized in much more flexible databases. Database techniques structure ordered and related lists of parameters so that efficient processing algorithms can be applied to them. Their power lies in the ability to relate parameters from one dataset to another, thereby reducing processing and storage requirements. Adopting database technologies could significantly improve data center operations because they change the way users search, query, and access data and the way data centers acquire and store data.

Recommendation: Data centers and their sponsoring agencies should implement database technologies. When applicable, these technologies can improve data search and query, access and acquisition, interoperability, and retrieval from storage.
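The subsetting benefit of database technologies can be sketched in a few lines. The example below is a minimal illustration, not any data center's actual schema: the table name, columns, and station records are all hypothetical, and an in-memory SQLite database stands in for a center's holdings.

```python
import sqlite3

# In-memory database standing in for a data center's holdings
# (table name, columns, and rows are hypothetical, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        station TEXT, lat REAL, lon REAL,
        obs_date TEXT, temperature_c REAL
    )
""")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
    [
        ("KDCA", 38.85, -77.04, "2003-01-01", 3.2),
        ("KDCA", 38.85, -77.04, "2003-07-01", 31.5),
        ("PABR", 71.29, -156.77, "2003-07-01", 4.6),
    ],
)

# A subsetting query: only the rows the user needs are retrieved,
# instead of shipping the whole dataset over the network.
rows = conn.execute(
    "SELECT station, temperature_c FROM observations "
    "WHERE obs_date = '2003-07-01' AND lat < 50"
).fetchall()
print(rows)  # [('KDCA', 31.5)]
```

The same query interface works whether the database sits on the user's machine or at the data center, which is what makes remote queries a natural complement to on-line disk storage.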

EXPANDED METADATA MANAGEMENT AND LINEAGE TRACKING

As more precise means of measuring and monitoring the environment are developed, the number and volume of the resulting data products increase, and the management of metadata, or data about data, becomes increasingly important. Metadata must be stored, queried, communicated, and maintained just like the data they describe. Data centers have spent considerable effort preserving metadata by routinely documenting information on data lineage, such as the source data, transformation processes, and quality assurance information, for their datasets. Open access to summaries of dataset assembly processes and lineage has contributed significantly to user confidence in data product quality. However, the lack of a definitive universal system for lineage metadata has resulted in incomplete or missing information. The practice of retaining complete data lineage and authenticity information as metadata should be extended to the large volumes of scientific data being produced today.

In addition, although data centers encourage citation, there is a need for an accepted universal method for citing data products, their origin, or the processing that has been applied to them. Most centers and even some scientific journals have a preferred mode of citation, but dataset citation remains uncommon. Routine documentation of the original data sources and the subsequent transformation and fusion steps used to develop a processed dataset would be most efficiently carried out by automated tools. Fortunately, database technology and standard formats can be as useful for metadata management as they are for data management. The self-describing approach adopted in the definition of extensible languages, such as XML Schema, is an important step toward technologies that support metadata management in government data centers.

Recommendation: To ensure that the greatest use is made of environmental data, (1) data producers should include data lineage and authenticity information in the metadata; (2) data centers should improve management of and access to metadata through standard formats and database technologies; and (3) users should routinely cite the data products they use in their investigations, using agreed-upon dataset identifiers. To the greatest extent possible, data centers and data producers should rely on automatic tools for creating and managing metadata.
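Automated lineage capture in a standard format could look like the sketch below, which uses only the Python standard library. The element names, the two processing steps, and the checksum-based authenticity record are illustrative assumptions, not a prescribed metadata schema.

```python
import hashlib
import xml.etree.ElementTree as ET

def lineage_element(data: bytes, source: str, process: str) -> ET.Element:
    """Build one lineage/authenticity record as XML.

    Element and attribute names are hypothetical, chosen for
    illustration rather than taken from any metadata standard.
    """
    step = ET.Element("lineage_step", source=source, process=process)
    # A checksum lets any recipient verify that the data are authentic.
    ET.SubElement(step, "sha256").text = hashlib.sha256(data).hexdigest()
    return step

# Each transformation appends a step, so the full processing history
# travels with the dataset as metadata.
meta = ET.Element("metadata")
meta.append(lineage_element(b"raw instrument counts",
                            source="instrument X (hypothetical)",
                            process="ingest"))
meta.append(lineage_element(b"calibrated radiances",
                            source="ingest output",
                            process="calibration"))

xml_text = ET.tostring(meta, encoding="unicode")

# Because the record is ordinary XML, a recipient can parse it
# automatically and inspect every processing step.
steps = ET.fromstring(xml_text).findall("lineage_step")
print([s.get("process") for s in steps])  # ['ingest', 'calibration']
```

Because such records are machine-generated at each processing step, they avoid the incomplete or missing lineage information that manual documentation tends to produce.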

GREATER RELIANCE ON NONSPECIALIZED, EASILY AVAILABLE HARDWARE AND SOFTWARE SOLUTIONS

Because development and support costs for widely used products are lower, in an environment of constrained resources more and more data solutions are likely to be adapted from market-driven and market-proven technologies. Data centers have dedicated substantial funds to custom hardware and software development that was the “right answer” two decades ago, but today they should embark on collaborations with industry to apply proven, easily available technologies. The problems of managing large datasets have begun to receive the attention of the commercial sector, with the result that innovative, easy-to-use methods and tools for data search, retrieval, and analysis are widespread. Moreover, easily available commodity hardware can also be used for data ingest, storage, and distribution. In addition, the open-source movement (software with its source code made available without any restrictions) has addressed many requirements of the data centers (e.g., Nepster, ModSTER, authentication, lineage tracking). A combination of commercial and open-source software therefore minimizes the need for expensive custom development. Finally, while most data centers are managed as centralized organizations, a federated distributed system would formalize the current user practice of obtaining some scientific products from colleagues and data projects instead of from data centers, and could help reduce infrastructure and management costs for data centers.

Recommendation: Data centers should adopt commodity hardware and commercial and open-source software solutions to the widest extent possible and concentrate their own efforts on problems that are unique to environmental data management. In addition, data centers and user communities should take advantage of federated distributed systems for making data available.

IMPLEMENTATION

To balance the risks of adopting new technologies, smaller-scale prototypes can provide a framework for testing with operations and with users. Demonstration data centers are one means of effectively jump-starting new applications and the sharing of new technology.
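The federated idea mentioned above can be sketched as a catalog that maps one logical dataset to several holders, with a client that simply tries each in turn. Every dataset name and URL below is made up for illustration, and reachability is injected as a function so the sketch stays self-contained; a real client would probe the network instead.

```python
# A toy federated catalog: one logical dataset, several holders.
# All names and URLs here are hypothetical.
CATALOG = {
    "sst_monthly_mean": [
        "https://datacenter-a.example.gov/sst_monthly_mean",
        "https://university-b.example.edu/mirrors/sst_monthly_mean",
        "https://project-c.example.org/sst_monthly_mean",
    ],
}

def resolve(dataset: str, is_reachable) -> str:
    """Return the first reachable holder of `dataset`."""
    for url in CATALOG[dataset]:
        if is_reachable(url):
            return url
    raise LookupError(f"no reachable holder for {dataset}")

# Simulate the primary center being down: the client transparently
# falls back to a mirror held by another participant.
down = {"https://datacenter-a.example.gov/sst_monthly_mean"}
chosen = resolve("sst_monthly_mean", lambda url: url not in down)
print(chosen)  # https://university-b.example.edu/mirrors/sst_monthly_mean
```

The point of the sketch is the division of labor: the shared catalog is the only centralized piece, while the data themselves can remain with colleagues and data projects as well as with the data centers.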

Recommendation: Data centers and their sponsoring agencies should create independent demonstration data centers aimed at testing applicable technologies and satisfying the data needs of a range of users, including interdisciplinary and nontechnical users. These centers might best prove technological approaches through several participants working in parallel.

Deliberate and appropriate transition to new technology will require planning and testing of technology concepts. In many cases, it is possible to make a gradual transition with periodic migration of datasets and updates to data systems. In other cases, a disruptive transition may be justified. New technologies can help deal with increasing amounts of data, differing data types, changing user communities, and the steadily increasing demands of users and data providers. However, in some cases the transition will require that software for data ingest, data processing, and data access be rewritten.

Recommendation: Data centers should aggressively adopt newer, “bleeding edge” technical approaches where there might be significant return on investment. This should be done carefully to minimize the inevitable failures that will occur along the way. Even with the failures, the committee believes the cost savings and improvements for end users will be substantial compared to the methods practiced today.

The nation’s data centers have achieved notable successes. They store huge volumes of data reliably and provide some widely used and trusted products. The challenges posed by the rapidly expanding quantity and diversity of environmental data and by increasing user demands can be met in part through technological solutions. The approaches identified in this report would substantially improve users’ abilities to search for, find, and retrieve information from data centers. The user community would grow, users would work more efficiently, and scientific researchers would benefit, all of which would improve the information on which policy makers depend for decisions on climate change. Although technology can contribute to the solution of important environmental data management problems, human effort remains central to data center operations. Data centers should ensure that the latest technologies are assessed for their relevance and utility, but they should not rely on technology alone: continued investment in the scientific and human elements of data management and data center operations is essential.