knowing specifically what to ask for in a data search is not straightforward when query terms and procedures vary from center to center. For users who are less knowledgeable about the datasets they want, searches frequently require help from the centers’ customer service representatives. However, NOAA’s report to Congress, The Nation’s Environmental Data: Treasures at Risk, notes that, although requests for NOAA’s data increased from about 95,000 in 1979 to over 4 million in 1999, staffing levels decreased from 582 to 321 (NOAA, 2001).

Another challenge for data centers is to deliver only the data that the user needs and requests, neither more nor less. Subsetting is the process of extracting portions of data, such as time slices or spatially defined sections. Subsetting is especially important in large datasets, such as those generated by remote sensing. However, despite consistent user demand, there continues to be a dearth of subsetting tools. Scientific products from the data are also available, but their coverage and diversity are sparse.
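To make the idea of subsetting concrete, the following is a minimal sketch in Python of extracting a time slice and a spatially defined section from a small collection of observations. It is illustrative only: the `Observation` record, the `subset` function, and the sample values are hypothetical and do not describe any data center's actual tools or holdings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """Hypothetical record: one measurement at a place and time."""
    day: date
    lat: float
    lon: float
    value: float


def subset(observations, start, end, lat_range, lon_range):
    """Return only the observations inside the time window and bounding box.

    All bounds are inclusive. This is an assumed, simplified interface,
    not the API of any real subsetting service.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return [
        obs for obs in observations
        if start <= obs.day <= end
        and lat_min <= obs.lat <= lat_max
        and lon_min <= obs.lon <= lon_max
    ]


# Made-up sample data, for illustration only.
observations = [
    Observation(date(1999, 1, 1), 10.0, 20.0, 1.5),
    Observation(date(1999, 1, 2), 10.0, 20.0, 1.7),
    Observation(date(1999, 1, 2), 45.0, -120.0, 2.1),
    Observation(date(1999, 2, 1), 46.0, -121.0, 2.4),
]

# Time slice: January 1999. Spatial section: 40-50 N, 125-115 W.
selected = subset(
    observations,
    start=date(1999, 1, 1),
    end=date(1999, 1, 31),
    lat_range=(40.0, 50.0),
    lon_range=(-125.0, -115.0),
)
```

In this sketch the user receives only the records that fall inside both the requested period and the requested region, rather than the full dataset.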

Once users have found what they need, they face the challenge of obtaining the data, which can require complex skills. Although frequent users typically become adept at navigating a center’s infrastructure, access and retrieval methods differ from center to center, so even skilled users may be familiar with only one center’s approach. Inexperienced users, and investigators who draw on many different data sources, must invest substantial time to acquire data. Almost without exception, data centers offer multiple methods of retrieving the data in their holdings (e.g., file transfer protocol (FTP), which permits users to copy files stored on data center computers, and media order, in which centers copy the data of interest onto compact disk or tape). This variety provides flexibility but complicates the retrieval process.

Even with the appropriate query term, knowledge of the best access methods, and available subsetting tools, access to data still depends upon the ability of the centers to store data on media that can be retrieved and manipulated easily. Data centers rely too heavily on off-line or near-line (e.g., tape robots) storage. As a result, retrieval can be slow, and searching and subsetting can be difficult.

For interdisciplinary users, the real challenge arises with integrating disparate datasets, usually obtained from different data centers. Data interoperability remains difficult because standards, formats, and metadata were chosen to optimize the usefulness of a particular dataset, rather than a collection of diverse data. The growth of on-line distributed data archives has prompted many environmental research programs to address their own interoperability needs through data formats and metadata conventions (e.g., Federal Geographic Data Committee, 1998).


