Government Data Centers: Meeting Increasing Demands

2
Challenges and Opportunities

Although data management is often viewed as the least glamorous aspect of science, access to well-managed data is critical to the work of many environmental researchers, as well as to an expanding pool of commercial and nontechnical users (NRC, 2001). This chapter reviews technological approaches for data management and storage that could improve the ability of users to search, query, subset, and access data. Consideration and implementation of these approaches have already begun at some data centers but are not yet pervasive. The committee based this chapter on the working group reports presented at the workshop (Appendix D), subsequent discussions, and background information provided to the committee. The committee’s expertise and deliberations form the basis of the conclusions and recommendations.

CHALLENGES IN DATA AVAILABILITY AND ACCESS

Data ingest into the major data centers appears to be well planned and well executed. The process of acquiring environmental data for research or commercial use, however, continues to be difficult. Users must first seek out the data they need, which can be time consuming and difficult because there is no comprehensive list of or universal access point to all government data holdings. Although multiple means exist to find data, the chance of missing key datasets is high. In addition,





knowing specifically what to ask for in a data search is not straightforward when query terms and procedures vary from center to center. For users who are less knowledgeable about the datasets they want, searches frequently require help from the centers’ customer service representatives. However, NOAA’s report to Congress, The Nation’s Environmental Data: Treasures at Risk, notes that, although requests for NOAA’s data increased from about 95,000 in 1979 to over 4 million in 1999, staffing levels decreased from 582 to 321 (NOAA, 2001).

Another challenge for data centers is to deliver only the data that the user needs and requests, neither more nor less. Subsetting is the process of extracting portions of data, such as time slices or spatially defined sections. Subsetting is especially important in large datasets, such as those generated by remote sensing. However, despite consistent user demand, there continues to be a dearth of subsetting tools. Scientific products from the data are also available, but their coverage and diversity are sparse.

Once users have found what they need, they face the challenge of obtaining the data, which can require complex skills. Although frequent users typically become adept at manipulating the infrastructure, access and retrieval methods differ from center to center, so even skilled users may be familiar with only one center’s approach. Inexperienced users and investigators using many different data sources require a substantial investment of time to acquire data. Almost without exception, data centers offer multiple methods of retrieving data in their holdings (e.g., file transfer protocol (FTP), which permits users to copy files stored on data center computers, and media order, in which centers copy the data of interest onto compact disk or tape). This provides flexibility but complicates the retrieval process.
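The idea of subsetting, extracting a time slice or a spatially defined section so that only the needed portion is delivered, can be illustrated with a toy example. The gridded dataset and variable names below are invented for illustration; plain Python lists stand in for a real gridded product.

```python
# A toy gridded dataset: grid[time][lat][lon], 4 time steps over a 6x6 grid.
grid = [[[t * 100 + lat * 10 + lon for lon in range(6)]
         for lat in range(6)]
        for t in range(4)]

def subset(data, t_slice, lat_slice, lon_slice):
    """Extract a time slice and a spatially defined section of the data."""
    return [[row[lon_slice] for row in plane[lat_slice]]
            for plane in data[t_slice]]

# Only the portion the user actually needs is delivered: two time steps
# over a 2x2 spatial window, instead of the full dataset.
window = subset(grid, slice(1, 3), slice(0, 2), slice(4, 6))
```

The same slicing pattern is what a server-side subsetting tool would perform before transmission, so that the full dataset never has to cross the network.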
Even with the appropriate query term, knowledge of the best access methods, and available subsetting tools, access to data still depends upon the ability of the centers to store data on media that can be retrieved and manipulated easily. Data centers rely too heavily on off-line or near-line (e.g., tape robots) storage. The consequences of this are that retrieval can be slow and that searching and subsetting can be difficult.

For interdisciplinary users, the real challenge arises with integrating disparate datasets, usually obtained from different data centers. Data interoperability remains difficult because standards, formats, and metadata were chosen to optimize the usefulness of a particular dataset, rather than a collection of diverse data. The growth of on-line distributed data archives has prompted many environmental research programs to address their own interoperability needs through data formats and metadata conventions (e.g., Federal Geographic Data Committee, 1998).

However, data exchange between even the most advanced of these communities remains complex and unwieldy.

As more precise means of measuring and monitoring are developed, the number and volume of the resulting data products increase, and the management of metadata, or data about data, becomes increasingly important (Sidebar 2.1). Proper metadata management is essential for government data centers to achieve their missions. Metadata must be stored, queried, communicated, and maintained just like the data they describe. Increasingly, metadata will be a key enabling element for use by communities (e.g., interdisciplinary and nontechnical user groups) that did not originally collect the data.

SIDEBAR 2.1 Metadata

Metadata describe data and data products, allowing users to find, understand, process, and reuse data and data products. Although metadata can require increased storage capacities, they are essential for establishing confidence in the data products by providing information about the history, or lineage, of the data. Metadata in government data centers should include the following types of information:
- data formats (how information is stored within data files);
- data describing how, when, and where raw data were collected;
- descriptions of how raw data were normalized, calibrated, validated, integrated, cleaned, checked for errors, and processed;
- statistics of value distributions, etc., needed for efficient database storage and access of data;
- descriptions of data use, such as how frequently a dataset is used, whether it is subsetted, etc.; and
- data specifically designed to enhance use by interdisciplinary scientists and/or nontechnical users.

In the following sections, the committee describes some steps that would improve data availability and access, including improved application of standard translatable formats;

greater on-line data storage and network access; more sophisticated database technologies; expanded metadata management and lineage tracking; and greater reliance on easily available, nonspecialized hardware and software solutions.

STANDARD TRANSLATABLE FORMATS

Typically, standards for data and metadata management are created by the individuals and organizations collecting the data; community organizations such as professional societies, data centers, and sponsoring government agencies; and international organizations. Formats evolve over time, with new formats introduced and others abandoned as community preferences emerge. This constant evolution results in a bewildering array of standards. Although it is not possible to create a single standard that meets the needs of every dataset and user group, greater uniformity and transparency would make it easier for users to query, search, subset, access, and integrate data. Formats that can incorporate metadata provide added benefits.

Until the early 1990s, data from remote-sensing instruments were stored primarily in binary data files, each unique to a particular sensor. Because of the lack of alternatives and the efficiency of sequential binary data storage, the data had to be stored in files on disk or tape. Metadata, if stored at all, were placed in an accompanying text file. However, in the past decade, computer scientists have devised many self-describing formats for storage of scientific data. These data formats maintain efficient binary storage but allow nonexperts to understand the layout of the data. Two popular formats currently used are netCDF and HDF (network common data form and hierarchical data format, respectively); a version of the latter is a standard used by NASA’s Earth Observing System Data and Information System (EOSDIS). In essence, self-describing scientific data formats provide some level of metadata encapsulation with the data.
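netCDF and HDF are full-featured libraries, but the core idea of a self-describing format, metadata encapsulated alongside efficient binary storage, can be sketched with a toy format: a JSON header followed by a packed binary payload. The file layout and function names here are invented for illustration, not the actual netCDF or HDF layout.

```python
import json
import struct

def write_self_describing(path, values, metadata):
    """Toy self-describing file: a 4-byte header length, a JSON metadata
    header, then the data as packed little-endian doubles."""
    header = json.dumps({**metadata, "count": len(values)}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))
        f.write(header)
        f.write(struct.pack(f"<{len(values)}d", *values))

def read_self_describing(path):
    """A nonexpert reader needs no prior knowledge of the layout: the
    header describes the binary payload that follows it."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        meta = json.loads(f.read(hlen).decode("utf-8"))
        values = list(struct.unpack(f"<{meta['count']}d",
                                    f.read(8 * meta["count"])))
    return meta, values
```

Because the metadata travel inside the file, a recipient can interpret the payload without an accompanying text file or prior knowledge of the sensor.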
Databases are intimately tied to metadata as a means of allowing users to search for data products of interest. Most databases are constructed specifically for their applications; custom software is written to extract metadata from multiple sources, including data files, into these databases. As an example, the database behind EOSDIS was fashioned over many years, with new datasets processed and specific metadata entered using custom software. This process is complicated and time consuming, but it does provide a mechanism for searching remotely

sensed data. Moving toward standardized data and metadata formats would simplify the search process. The next step is to generate databases automatically from the metadata. It is possible to use XML Schema to generate database tables automatically from the structure and content of the metadata, as well as to create Web-based forms for database queries. Such query interfaces allow users to formulate restrictions on the data of interest, which are then translated into selection conditions in a query language, such as SQL or XQuery. This does not relieve sensor operators from generating appropriate metadata for their data, but it eases the search through databases.

Recommendation: With their user communities, data centers should accelerate work toward standardizing and making formats more transparent for data and metadata, thereby improving distribution and interoperability between data centers, between data centers and users, and between users. Metadata formatted in XML would ensure that recipients are able to parse data automatically and feed them directly to their applications.

NETWORK AND ON-LINE RANDOM ACCESS

Providing network file system access would ease obtaining and distributing data. Such a network would allow datasets to be used without the current formal process of copying the data across a network or sending the data physically by tape. The data become available immediately to as many users as want them. This approach can increase distribution efficiency when subsetting tools are also made available: users can treat datasets as local files and use subsetting tools to extract only the portions they need or only a transformation of the data, reducing the network bandwidth needed for the acquisition. Furthermore, once the data have been distributed, authenticity can still be guaranteed by digital signatures supplied by the national data centers.
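Bandwidth savings can come not only from subsetting but also from transparent compression of the bytes in flight. A minimal sketch with the standard library follows; the function names are illustrative, and a real deployment would negotiate compression in the transfer protocol itself.

```python
import gzip

def compress_for_transfer(payload: bytes) -> bytes:
    """Sender side: compress the payload before it crosses the network."""
    return gzip.compress(payload)

def expand_on_receipt(blob: bytes) -> bytes:
    """Receiver side: expand the payload transparently on arrival."""
    return gzip.decompress(blob)

# Structured scientific text (here, a repetitive CSV-like record stream)
# typically compresses very well.
payload = b"temperature,273.15\n" * 1000
blob = compress_for_transfer(payload)
```

For highly repetitive records the compressed blob can be orders of magnitude smaller than the original, which is exactly the effect a compress-on-transmit protocol would exploit.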
Protocols to compress and expand data automatically when they are transmitted would assist with effective network use. Network bandwidth, the capacity to move large data files electronically, is already widely available for retrieval of large volumes of data. However, dependence on network bandwidth as a solution to the data delivery problem requires the implementation of

suitable database management and subsetting tools at the data centers. Few users will want gigabyte-sized datasets.

In addition, network-based solutions rely extensively on the ability to access data randomly. The off-line and near on-line storage techniques (e.g., tape robots) used by many data centers can hinder these solutions. The transfer rates of modern tape systems are on the order of a few megabytes per second; common network transfer rates are 100 times faster. While disk storage capacities continue to increase dramatically, tape capacities and transfer speeds have barely increased during the past five years. In addition, without random access to on-line data, subsetting through a network is unworkable, as users cannot capture slices of the linearly stored datasets. Data that are kept off-line or near on-line cannot be used in database systems. Even databases that direct users to off-line data products must create well-defined delivery timelines. Tape systems at data centers can time out on user requests, thus requiring a technician to process orders manually.

In 1994 computing experts forecast that disk storage would become cheap and efficient enough to eliminate the need for off-line storage (Davis et al., 1994). However, in some cases this transition to disk will require that software for data ingest, data processing, and data access be rewritten. As a result, data centers keep most data off-line, thereby reducing the ability of users to search through and retrieve data rapidly. Data centers are moving toward increasing the availability of on-line data; however, only 3 terabytes of NOAA’s 76-terabyte digital data archive are on-line (NOAA, 2001), despite the fact that disks to accommodate this amount of data would cost about $100,000 at current prices.
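The arithmetic behind these figures is worth making explicit; the back-of-the-envelope check below uses only the quantities quoted in the text (76 TB archived, 3 TB on-line, roughly $100,000 of disk for the whole archive at the prices of the time).

```python
# Back-of-the-envelope check of the archive figures quoted above (NOAA, 2001).
archive_tb = 76          # total digital archive, terabytes
online_tb = 3            # portion currently on-line, terabytes
disk_cost_total = 100_000  # approximate cost in dollars to hold it all on disk

# Under 4 percent of the archive is on-line...
online_fraction = online_tb / archive_tb

# ...even though disk works out to only on the order of $1,000-2,000/TB.
cost_per_tb = disk_cost_total / archive_tb
```

The point of the calculation is that the cost of full on-line access is small relative to the cost of collecting the data in the first place.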
Over the past decade, disk storage and access have had a greater increase in performance for a given price than any other part of the computing industry, and other technologies for dense storage of information are the subject of much research activity in both industry and academia. The price per unit of storage has fallen steadily during the past 10 years. Satellite missions of the next decade will generate about 1 petabyte of information per year. As recently as 1995, NASA estimated that the cost to store a petabyte off-line today would approach $100 million, but it is now possible to obtain 1 petabyte of disks for on-line storage for less than $2 million, a very small fraction of the cost of the missions that generate the raw information. Disk storage is now competitive with tape for long-term, archival-class storage.

Recommendation: Data centers and their sponsoring agencies should shift the primary storage medium from tape to disk. In addition,

data centers and their sponsoring agencies should enable direct random on-line access through networks and provide support for remote queries on databases.

DATABASE TECHNOLOGIES

Files are a reasonable way to organize data when the physical storage medium is tape; however, disk storage permits data to be organized in much more flexible databases. Database techniques structure sets of parameters for the application of efficient processing algorithms. Traditionally, a database is composed of a number of interrelated tables containing sets of parameters such as numbers or text strings. The power of database techniques lies in the ability to relate parameters from one dataset to another, thereby reducing the processing and storage requirements. For example, using a numeric parameter, such as a zip code, to refer to a name, such as a city, makes it easier to store and search the information. Complex databases can have many layers of such associations.

In the early 1990s the Structured Query Language (SQL) was formalized; it is now used by most database software. The language provides a standard for defining data structures; defining indices; formulating content-based queries; and maintaining data through inserts, deletes, and updates. Most database software (e.g., Oracle, MySQL, SQL Server) uses SQL as a core language for database interaction. Each has a unique method of optimizing the storage of data on disk or in memory. Capabilities for formulating spatial and temporal database queries are part of the most recent database query languages (e.g., SQL3), and support for indexing data on their spatial and temporal attributes enables efficient query execution. The complexity of the SQL query relates directly to the complexity of the database. Contemporary database technology permits random access to subsets of data stored on disk.
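The zip-code example above can be made concrete with the standard library's sqlite3 module standing in for a production database; the tables, place names, and readings below are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE places (zip INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE readings (zip INTEGER, temp_c REAL);
""")
# The compact numeric key (zip) relates readings to place names.
db.executemany("INSERT INTO places VALUES (?, ?)",
               [(20001, "Washington"), (80301, "Boulder")])
db.executemany("INSERT INTO readings VALUES (?, ?)",
               [(20001, 21.5), (20001, 19.0), (80301, 12.0)])

def mean_temp(city):
    """Relate the numeric key back to the city name at query time."""
    row = db.execute(
        "SELECT AVG(r.temp_c) FROM readings r "
        "JOIN places p ON p.zip = r.zip WHERE p.city = ?", (city,)).fetchone()
    return row[0]
```

Storing the small integer once per reading, and the city name once per place, is exactly the storage and search saving the text describes.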
In addition, object-relational databases are now capable of handling large, structured data, such as aerial photographs of the entire United States. For example, since its launch in June of 1998, TerraServer has delivered 108 terabytes of U.S. Geological Survey imagery to 63 million visitors (T. Barclay, Microsoft, personal

communication, 2002). Concurrent requests from multiple users to read data can be supported efficiently without the waiting time typically incurred when many applications are writing to a database simultaneously. However, since database tables are constantly being accessed, they must be stored on-line rather than on tape.

Although databases are commonly used by data centers for metadata management, they are not in widespread use for environmental data. However, application of database technology to environmental data is possible and may be useful for some environmental datasets. For example, Sky Server utilizes database technology to provide public access to Sloan Digital Sky Survey data (Szalay et al., 2001).

Recommendation: Data centers and their sponsoring agencies should implement database technologies. When applicable, these technologies can improve data search and query, access and acquisition, interoperability, and retrieval from storage.

METADATA MANAGEMENT

Data centers have spent considerable effort preserving metadata by routinely documenting information on data lineage, such as the source data, transformation processes, and quality assurance information of their datasets. Open access to summaries of the dataset assembly processes and lineage has contributed significantly to ensuring user confidence in data product quality. For example, in most cases, users interested in data from a particular center can find information on the available data on the center’s Web site. In the past it was sufficient for data producers simply to develop good local data conventions and exercise the discipline necessary to generate the data and metadata in accordance with those conventions. However, the lack of a definitive universal system for lineage metadata can result in incomplete or missing data lineage information.
In most cases, it is not possible to re-create data assembly information after the fact; in others it is costly and prone to error. Formatting data and creating metadata robust enough to be discovered and ingested by the emerging national and international data interchange networks would ensure that the data are as useful as possible, especially to other user communities. The practice of retaining complete data lineage information as metadata should be incorporated into the large volumes of scientific data being produced today. This will only be effective if accomplished with the participation and acceptance of the user communities.
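Retaining lineage as metadata can be as simple as appending one structured record per transformation. The sketch below is a minimal illustration; the dataset name, operations, and field names are all invented, and a real system would follow a community convention rather than this ad hoc shape.

```python
def record_step(metadata, operation, **details):
    """Append one transformation to the dataset's lineage trail, rather
    than overwriting a single broad 'history' attribute."""
    step = {"operation": operation, **details}
    metadata.setdefault("lineage", []).append(step)
    return metadata

# Each processing stage adds its own record at the time it runs, so the
# assembly information never has to be re-created after the fact.
meta = {"dataset_id": "sst_demo", "units": "kelvin"}
record_step(meta, "ingest", source="station_raw_v1")
record_step(meta, "calibrate", coefficients="cal_2002_03")
record_step(meta, "quality_check", flagged_fraction=0.02)
```

The trail preserves ordering and per-step detail, which is what a later user (or an automated tool) needs to establish the data's lineage and context.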

Authenticity is another important aspect of data archival. Users often obtain data from the easiest source, some of which may be three or four steps removed from the data centers. At each step, the data may have been processed or reformatted to suit one user’s particular purposes. Through neglect or, less likely, malicious intent, data products may become contaminated or altered, endangering their value and use. Consequently, information on authenticity should be included in the metadata.

A related issue, specific to the research community rather than to data centers, involves citing data products in the peer-reviewed literature. The scientific practice of citing past research and methods, necessary for independent verification, has been neglected when citing data supporting an investigation’s findings. While this has been discussed for more than 10 years, the various publishing groups have not reached consensus on an accepted universal method for citing data products, their origin, or the processing that has been applied to them, or on how to deal with the inherent challenges (e.g., numerous investigators for very large datasets). Most centers and even some scientific journals (e.g., American Geophysical Union journals) have a preferred mode of citation, but dataset citation remains uncommon. Dataset citation helps both data centers and data providers learn what data are being used and how.

Routine documentation of the original data sources and the subsequent transformation and fusion steps used to develop a processed dataset would be most efficiently carried out by automated tools. Many practices in the software engineering field, such as testing, configuration management, and bug tracking, matured only after automated tools were developed to handle the complicated bookkeeping in a systematic manner.
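As a sketch of what such an automated documentation tool might emit, the lineage records can be serialized to XML so that generic tools, rather than custom software, can parse them. The element names and example steps below are invented, not a recognized lineage semantics (which, as noted, does not yet exist).

```python
import xml.etree.ElementTree as ET

def lineage_to_xml(dataset_id, steps):
    """Emit a lineage trail as XML so generic tools can ingest it."""
    root = ET.Element("lineage", dataset=dataset_id)
    for step in steps:
        node = ET.SubElement(root, "step", operation=step["operation"])
        for key, value in step.items():
            if key != "operation":
                # Each detail becomes a child element named after its key.
                ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_doc = lineage_to_xml("sst_demo", [
    {"operation": "ingest", "source": "station_raw_v1"},
    {"operation": "calibrate", "coefficients": "cal_2002_03"},
])
```

Because the output is self-describing, a recipient can recover both the ordering of steps and their details without any knowledge of the producing software.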
Moreover, the generation of structured lineage metadata suitable for ingest into other software presumes the existence of automated documentation tools. However, neither such tools nor recognized semantics to describe data lineage currently exist.

Fortunately, database technology and standard formats can be as useful for metadata management as they are for data management. The self-describing approach adopted in the definition of extensible languages such as XML Schema is an important step in realizing technologies to support metadata management in government data centers. This self-describing approach would allow tools developed for data management to be applied to metadata. The data centers have worked to document data lineage, both by compliance with rich metadata standards (e.g., U.S. Geological Survey, 1995) and by the use of automated metadata tools such as the Science Data Production (SDP) toolkit (National Aeronautics and Space

Administration, 2002b), both of which encourage detailed lineage information. However, a large body of scientific data generated outside of the data centers still lacks sufficient metadata to establish the data’s lineage and context. Examples of this are the Cooperative Ocean/Atmosphere Research Data Service (COARDS) and the recent climate and forecast metadata conventions, which use only a single broad “history” attribute to document the dataset’s lineage.

Recommendation: To ensure that the greatest use is made of environmental data, (1) data producers should include data lineage and authenticity information in the metadata; (2) data centers should improve management of and access to metadata through standard formats and database technologies; and (3) users should routinely cite the data products they use in their investigations, using agreed-upon dataset identifiers. To the greatest extent possible, data centers and data producers should rely on automatic tools for creating and managing metadata.

HARDWARE AND SOFTWARE

Because development and support costs for widely used products are lower, more and more data solutions are likely to be adapted from market-driven and market-proven technologies in an environment of constrained resources. The on-line database, entertainment, and gaming communities are all driving advances in large-scale data management, delivery, and visualization. Many researchers have learned how to construct plain-language database queries using Web search engines (e.g., Google). The data centers should be prepared to embark on collaborations with industry to apply such proven technologies and thereby reduce expensive custom development. The problems of managing large datasets have begun to receive the attention of the commercial sector, with the result that innovative, easy-to-use methods and tools for data search, retrieval, and analysis are widespread.
For example, Google manages billions of individual records, yet searches return nearly instantaneously; digiMine, Inc. processes nearly a terabyte of data nightly (B. Nayfeh, digiMine, Inc., personal communication, 2002); and together America Online and Microsoft’s Hotmail handle the email accounts of more than 150 million people (Caslon Analytics, 2002). The challenges facing the data centers are small compared to the load experienced by any of the above enterprises.
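The plain-language query style these services popularized can be sketched in a few lines: score each catalog entry by how many query words its description contains. The catalog entries below are invented, and real search engines use far more sophisticated ranking, but the shape of the idea is the same.

```python
# A toy plain-language catalog search in the spirit of Web search engines.
catalog = {
    "sst_monthly_v2": "monthly sea surface temperature from satellite radiometer",
    "precip_daily": "daily precipitation gauge measurements north america",
    "chl_modis": "ocean chlorophyll concentration from modis imagery",
}

def search(query):
    """Rank datasets by how many query words their description contains."""
    words = set(query.lower().split())
    scores = {name: len(words & set(desc.split()))
              for name, desc in catalog.items()}
    return [name for name, score in
            sorted(scores.items(), key=lambda kv: -kv[1]) if score > 0]
```

A user who types "sea surface temperature" never needs to know the center's query terms or dataset identifiers; the match is made against the descriptions.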

Large computational problems can be solved in small pieces by harnessing the power of desktop computing. For example, SETI@home and climateprediction.net use the processing power of millions of desktop computers to solve computationally intensive problems. The Center of Excellence in Space Data and Information Sciences (CESDIS) has constructed computing farms (commonly referred to as Beowulf clusters) to handle and process large datasets (Scyld Computing Company, 1998).

Commodity hardware can also be used for data ingest, storage, and distribution. These computers generally have far smaller capabilities than the scientific computing hardware currently in the data centers. To be useful for scientific applications, the data segments, or granules, have to be broken into smaller units that can be ingested, processed, stored, and served with larger numbers of small processors. Moving from current proprietary operating systems, such as SGI or Sun, to open-source platforms, such as Linux or FreeBSD, could ease recompiling software on new computing architectures. In addition, the open-source movement[2] has created the potential for data centers to meet future needs without enormous resource expenditures. Unrelated open-source projects (e.g., the Gnutella project, the XML standard) provide software tools at no cost that in some cases are better than unique proprietary solutions.

Forms of authentication and lineage tracking common in the open-source communities should be adopted for improving metadata management. For example, one common practice in the open-source community is to publish an MD5—message-digest algorithm 5—listing the 32-character signature of the files with any piece of software or data that is distributed. The authoritative source publishes the digest, so that users can check the authenticity of their copies, regardless of where they got them.
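This digest-publication practice takes only a few lines with the standard library's hashlib module; the payload below is invented, and the helper names are illustrative.

```python
import hashlib

def md5_signature(payload: bytes) -> str:
    """The 32-character hex digest an authoritative source would publish."""
    return hashlib.md5(payload).hexdigest()

def verify_copy(payload: bytes, published_digest: str) -> bool:
    """A user checks a copy from any mirror against the published digest."""
    return md5_signature(payload) == published_digest

# The data center publishes the digest once; mirrors need not be trusted.
original = b"granule contents"
published = md5_signature(original)
```

Any alteration of the copy, accidental or malicious, changes the digest and is caught at verification time, regardless of where the copy was obtained.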
In summary, the commercial sector and the open-source movements have created robust software that meets many needs of the data centers. Using and adapting these codes minimizes the need for expensive custom development.

Since each is generally funded by a single agency and deals with a relatively narrow range of scientific disciplines (Table 1.1), data centers tend to be managed as centralized organizations. However, a federation of distributed systems, in which data centers remain the sources of

[2] Open-source movement—software with its source code made available without any restrictions.

authenticated environmental science data but not the only sources capable of distributing data, could help reduce infrastructure and management costs (NRC, 1995). Widely distributed data sources and grid infrastructures reduce resource contention at the data centers and provide a natural backup of earth science data.

For example, Napster provided a global directory of on-line music. Users searching for particular music were redirected to numerous locations where search matches were encountered. Users then chose where to download the music, based on bandwidth availability between their computers and the source, the authenticity of the source, and the exact characteristics of the music being sought. The process is more complex for environmental science data than it was for Napster. In the environmental science community, the analogy would be to identify (by whatever means) a desired dataset and then request the dataset by name (not parameters) from a Napster analog. This approach would formalize current user practices of obtaining data from colleagues and data projects instead of from data centers. It would strengthen the data centers’ partnership with science by encouraging the development of scientifically sound, useful products, reduce data transmission needs, and improve the effectiveness and efficiency of the whole system. Multiple copies of products would be available from various sources; the data centers would become authenticators of data and the final archive and would implement production of new scientific products once a design is in hand; and users would have multiple options for retrieving data. Three current projects are attempting to implement this approach: MODster, NEpster, and the Distributed Oceanographic Data System (Sidebar 2.2).
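The Napster-analog directory can be sketched as a mapping from dataset names to known sources, with the client choosing a mirror by bandwidth. Every host name below is fictitious, and a real federation would add authentication and policy checks; this only shows the name-based lookup itself.

```python
# A toy directory for a federated archive: dataset name -> known sources.
# The data center remains the authoritative source and final archive.
directory = {
    "sst_monthly_v2": [
        {"host": "datacenter.example.gov", "authoritative": True, "mbps": 10},
        {"host": "university-a.example.edu", "authoritative": False, "mbps": 100},
    ],
}

def locate(dataset_name):
    """Request a dataset by name alone; receive every known source."""
    return directory.get(dataset_name, [])

def choose_source(sources):
    """Pick a mirror by bandwidth, as a Napster-style client would."""
    return max(sources, key=lambda s: s["mbps"]) if sources else None
```

Note that the directory answers with locations, not the data itself, so multiple copies spread the transfer load while the center's copy remains the reference.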
Recommendation: Data centers should adopt commodity hardware and commercial and open-source software solutions to the widest extent possible and concentrate their own efforts on problems that are unique to environmental data management. In addition, data centers and user communities should take advantage of federated distributed systems for making data available.

IMPLEMENTATION

New “bleeding edge” technical approaches offer ways to reduce costs and significantly improve data center performance. However, it is important to recognize that some new technical approaches may not prove successful and that even those that are

successful may cause disruptions to center operations when implemented. Therefore, the data centers need to be able to test, prioritize, and develop the most promising new approaches at a smaller scale.

SIDEBAR 2.2 Distributed Solutions

Several ongoing environmental science projects are already benefiting from easily available, nonspecialized solutions. Selected examples are described below.

MODster

The Moderate Resolution Imaging Spectroradiometer (MODIS) provides global datasets with data on surface temperature, concentration of chlorophyll, fire occurrence, cloud cover, and others. Instruments on board several NASA missions gather datasets covering a swath 2,330 kilometers wide, capturing 36 spectral bands of data at three resolution levels every two days for six-year periods. Because of the number and complexity of these datasets, searching for a specific one is not a trivial task. To address this, the Federation of Earth Science Information Partners is supporting the development of MODster to support the decentralization and distribution of MODIS data and services and to promote sharing of remote-sensing standard products. Organizations within the federation can retrieve standard MODIS data granules (the smallest increment of processed MODIS data that can be ordered, containing data for an area of 2,330 by 2,330 kilometers). The retrieval of these granules will be implemented via the Hypertext Transfer Protocol (HTTP) from a simple inventory server. The system will allow clients to reference MODIS granules by name alone. SOURCES: National Aeronautics and Space Administration (2002c); Federation of Earth Science Information Partners (2002).
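Referencing a granule by name alone over HTTP reduces, in the simplest case, to building a request URL from the granule name. The server host and path layout below are hypothetical, not MODster's actual interface; the granule name is illustrative of the MODIS naming style.

```python
from urllib.parse import quote, urljoin

def granule_url(server, granule_name):
    """Build an HTTP request for a granule referenced by name alone.
    (Server and path layout are invented for illustration.)"""
    return urljoin(server, "granules/" + quote(granule_name))

url = granule_url("http://inventory.example.org/",
                  "MOD09.A2002185.h10v05")
```

The client needs no catalog query and no knowledge of the inventory server's internals; the name is the whole request.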
NEpster

The Earth Science Technology Office (ESTO) and the National Polar-Orbiting Operational Environmental Satellite System (NPOESS) Preparatory Project (NPP) support the development of the NPP-ESTO Portal for Science, Technology and Environmental Research (NEpster) to better serve the remote-sensing community. The peer-to-peer architecture of the data archive system is based on the Napster model, a system developed for sharing music files. NEpster adds several features to facilitate the handling of remotely sensed data, specifically (1) a temporary data storage area for sites that do not allow continual access to their servers; (2) an intelligent broker that controls data access in accordance with the distribution policies of each data source; and (3) a comprehensive geographically based query interface to expedite data searches. The NEpster system is made up of two major components: the data notification and entry subsystem and the query engine. The first phase of NEpster development will focus on accessing and managing real-time data; the second phase will focus on access to the MODIS Direct Broadcast data archives through the Goddard Space Flight Center DAAC.

SOURCE: National Aeronautics and Space Administration (2002d).

DODS

The Distributed Oceanographic Data System (DODS) is a highly distributed software framework for requesting and transporting data across the Internet that allows users to control both how their data are distributed and how they access remote data. Because data users prefer to work with the software they know best, DODS servers make data available regardless of local storage formats, and DODS libraries allow existing data analysis and visualization applications to be transformed into applications able to access remote DODS data. Because DODS data are distributed by the same scientists who develop them, the DODS protocol and software rely on the user community to use, improve, and extend the system. The current DODS Data Access Protocol (DAP) frames requests and responses using HTTP. The project has already developed a transport protocol, a software framework, C++ and Java implementations of the data model and transport protocol, and a set of DODS servers and clients.
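The HTTP framing used by the DAP can be illustrated by how a client forms a request URL. The server address below is hypothetical; the “.dods” suffix and the bracketed constraint expression follow the DAP conventions for requesting data and subsetting a variable.

```python
def dap_request(dataset_url, variable=None, ranges=None):
    """Build a DODS/DAP data-request URL (a sketch, not a full client).

    A bare dataset URL plus ".dods" requests the data themselves;
    appending a constraint expression such as ?sst[0:1:10][0:1:20]
    restricts the response to a subset of one variable, with one
    [start:stride:stop] triple per array dimension.
    """
    url = dataset_url + ".dods"  # ".dds"/".das" would fetch structure/attributes
    if variable:
        dims = "".join("[%d:%d:%d]" % r for r in (ranges or []))
        url = url + "?" + variable + dims
    return url

# Request an 11 x 21 subset of sea-surface temperature from a
# hypothetical DODS server:
url = dap_request("http://dods.example.edu/nph-dods/sst.nc",
                  "sst", [(0, 1, 10), (0, 1, 20)])
```

Because every request is an ordinary HTTP URL, existing web infrastructure (caches, proxies, firewalls) works with DAP traffic unchanged.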
Users can access any data on a DODS server via the Internet regardless of native format, without disrupting local functions and access. Although DODS was originally developed for sharing oceanographic data, the design can be applied to other user communities.

SOURCE: University Consortium for Atmospheric Research (2002).

One way to accomplish this is to create independent demonstration data centers, each of which would build small functional prototypes with small, efficient teams that would distribute data from a few substantial, well-documented datasets (such as those from NASA and NOAA). This would be similar to the smaller-scale Sky Server project (Szalay et al., 2001). The costs of implementing demonstration data centers can be minimized by building on work that is already in progress (e.g., Sidebar 2.2). Finally, the demonstration centers would also help the data centers and communities adapt to serving and interacting with a wider range of users.

One possible choice for testing new technologies is Moderate Resolution Imaging Spectroradiometer (MODIS) data. In this example the goals of the demonstration data center could include the following:

- Define an XML Schema with the standard format definitions for the datasets.
- Show how the standard format definitions can be used to formulate queries on the data collection.
- Allow multiple avenues of network access to data that are already available; specifically, provide real-time access to all data. Example access protocols include:
  - FTP browsing via a hierarchical tree (sorted by date/time and location).
  - Network File System (NFS) access via read-only network drives.
  - An implementation similar to NEpster/MODster (Sidebar 2.2), in which multiple sites maintain subsets of the entire MODIS dataset. The participating data center could solicit the participation of the MODIS science team and other sites with MODIS downlink systems, which hold some (if not all) of the data. This might entail acquiring read-only access to datasets at non-data-center locations. The goal would be to leverage the work of researchers seeking to make science data community property.
  - An FTP subscription service, if one is not already provided.
- Enhance and publish XML-based metadata related to the datasets. This entails adding certain metadata to that already captured by EOSDIS, such as an MD5 signature for authentication.
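The MD5 signature mentioned above is straightforward to compute; a minimal sketch (Python is used here purely for illustration) that streams the file so that large granules need not be loaded into memory at once:

```python
import hashlib

def md5_of_bytes(data):
    """MD5 digest of an in-memory buffer, as a hex string."""
    return hashlib.md5(data).hexdigest()

def md5_signature(path, chunk_size=1 << 20):
    """MD5 digest of a file, read in 1 MB chunks so that
    multi-gigabyte granules never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A user who downloads a granule can recompute the digest and compare it with the value published in the metadata record to confirm that the file arrived intact.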
- Publish the metadata schema describing the layout of the demonstration data center, along with a method for providing users direct SQL access to the database. The enhanced metadata will allow a wide range of researchers to explore the dataset in innovative ways.
- Utilize database technologies for user queries and searches.
- Identify and provide limited subsetting tools that run on the host computers. At a minimum, allow users to subset simple spatial grids and temporal intervals. Users would not need direct access to the data storage computers; perhaps a small application in a language such as Java could accept user subsetting boundaries, subset the data (and accompanying metadata), and deliver the data via FTP.
- Use commodity-level hardware and software where possible and cost effective.
- Monitor access statistics of FTP, NFS, and MODster, and actively pursue user feedback.

Recommendation: Data centers and their sponsoring agencies should create independent demonstration data centers aimed at testing applicable technologies and satisfying the data needs of a range of users, including interdisciplinary and nontechnical users. These centers might best prove technological approaches through several participants working in parallel.

While the costs of implementing new solutions are likely to be significant, careful strategic planning and phasing in of new solutions could greatly reduce the need to invest substantial new resources in technology. By taking opportunities to adopt incremental changes in technology, data centers can spread the costs of hardware and software acquisition over time.

Recommendation: Data centers should aggressively adopt newer, more “bleeding edge” technical approaches where there might be significant return on investment. This should be done carefully to minimize the inevitable failures that will occur along the way. Even with the failures, the committee believes the cost savings and improvements for end users will be substantial compared to the methods practiced today.

After decades of development and at least one decade of substantial investment, the nation’s data centers have achieved successes. They store huge volumes of data reliably and provide some widely used and trusted products. The challenges posed by the rapidly expanding quantity and diversity of environmental data and by increasing user demands can be met in part through technological solutions.
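The limited subsetting tools described above reduce, at their simplest, to clipping a gridded dataset to user-supplied spatial bounds and a temporal interval. A minimal sketch follows (the report suggests a language such as Java; Python is used here for brevity, and the function name and in-memory grid layout are illustrative, not an actual data-center interface):

```python
# Host-side subsetting sketch: given inclusive (min, max) bounds for
# time, latitude, and longitude, return only the matching cells of a
# simple grid indexed as grid[t][i][j], with times/lats/lons labeling
# the three axes.

def subset_grid(grid, times, lats, lons, t_range, lat_range, lon_range):
    t_idx = [k for k, t in enumerate(times) if t_range[0] <= t <= t_range[1]]
    i_idx = [k for k, v in enumerate(lats) if lat_range[0] <= v <= lat_range[1]]
    j_idx = [k for k, v in enumerate(lons) if lon_range[0] <= v <= lon_range[1]]
    return [[[grid[t][i][j] for j in j_idx] for i in i_idx] for t in t_idx]
```

Running on the host, such a tool would hand the clipped result (and its metadata) to an FTP delivery step, so users never touch the storage computers directly.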
Although technology can contribute to the solution of important environmental data management problems, human effort is still central to data center operations. Therefore, data centers should ensure that the latest technologies are assessed for relevance and utility but should not rely solely on technology without continuing to invest in the scientific and human elements of data management.