Appendix D
Workshop Discussions
WORKING GROUP REPORTS
Workshop participants divided into two working groups: (1) data access and ingest and (2) data distribution and processing. The questions posed to the working groups (given in italics) and their conclusions (indented) are listed below.
Data Access and Ingest Working Group
The working group on data access and ingest assessed the ways users access data and the ways that data centers collect data.
What is good and bad about the way users access data?
Subsetting capabilities should be improved, so that users obtain only the data they want.
Users do not always know of opportunities or “windows” for easier access to data. For example, potential users should be alerted before a data center transfers data to tape for storage or if the data are available elsewhere in a format that is more easily used. Data centers should track the diverse access opportunities and improve the way users are informed about these opportunities.
Some users purposefully retrieve more data than they require, either because of uncertainty that the data will always be available or because it is often easier than retrieving subsets. This practice unnecessarily strains the network. In addition, hoarding data can waste users’ storage resources and result in datasets that are not kept up-to-date.
Data collected by individual researchers are not available to the community in a timely manner and are lost when the researcher retires.
Duplication of effort in data management has many benefits and some drawbacks. Duplication can lead to new ideas, better metadata, increased access options, and greater data security. On the other hand, duplication can make tracking the data lineage more complicated and can be a waste of resources.
The user community is broader and more diverse than the community for which the data centers were originally planned. Facilitating access to and understanding of data by interdisciplinary and non-technical users should be a priority for the data centers.
What kind of infrastructure/technology would make it easier for users to access and exploit data? What search tools would be useful for isolating the requested data and obtaining them in a useable format? Is a common format (e.g., HDF) the right answer, or are there better formats for archiving, storage, and transmission?
Better dataset visualization tools would ease user access.
Using translatable structured formats would be a logical way to allow both independence and interoperability. The working group noted that XML, which was developed for the World Wide Web, might be a good starting point for standardizing metadata formats.
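As a hypothetical illustration of the working group's suggestion, the sketch below uses Python's standard `xml.etree.ElementTree` module to build and parse a simple XML metadata record. The element names and dataset identifiers are invented for illustration and do not correspond to any actual metadata standard; the point is that a structured record can be parsed and translated by another center regardless of its internal format.

```python
# Illustrative sketch: dataset metadata as structured XML, so that
# different data centers can parse and translate it. Element names and
# values are hypothetical, not an actual metadata standard.
import xml.etree.ElementTree as ET

def build_metadata(dataset_id, variable, start, end):
    """Build a minimal XML metadata record for a dataset."""
    root = ET.Element("dataset", id=dataset_id)
    ET.SubElement(root, "variable").text = variable
    coverage = ET.SubElement(root, "temporal_coverage")
    ET.SubElement(coverage, "start").text = start
    ET.SubElement(coverage, "end").text = end
    return ET.tostring(root, encoding="unicode")

record = build_metadata("sst-v2", "sea_surface_temperature",
                        "1990-01-01", "1999-12-31")

# Because the record is structured, a receiving center can parse it
# and map each field into its own internal format.
parsed = ET.fromstring(record)
print(parsed.find("variable").text)  # sea_surface_temperature
```

Because the structure is explicit, translating such a record into a different center's schema reduces to mapping element names, which is the interoperability property the working group was after.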
Libraries might be a key new player in the digital world as archival entities for global climate change data. Libraries have a long tradition of preserving and indexing information, and many are expanding their scope as providers of information science services. University libraries could cache and copy datasets and enable users to access relevant information at other libraries.
How can the use and refinement of data be tracked? How can pedigree effectively be made a part of the data? How can the quality control and pedigree of data products best be assured?
It is not enough simply to document the data. Obtaining historical perspective from data requires the ability to query the entire sequence of data use and refinement. Some technological solutions exist, but this remains an active area of research.
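One minimal way to make the sequence of use and refinement queryable is to carry a lineage record forward with each derived product. The sketch below is a hypothetical illustration in Python; the class, dataset names, and operation descriptions are invented, and real provenance systems are considerably richer.

```python
# Minimal sketch of data lineage tracking: each processing step appends
# a provenance record, so the full refinement history of a product can
# be queried later. Names and fields are illustrative, not a standard.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    lineage: list = field(default_factory=list)

    def derive(self, new_name, operation):
        """Produce a derived dataset, carrying the full history forward."""
        return Dataset(new_name, lineage=self.lineage + [operation])

raw = Dataset("station_obs_raw")
calibrated = raw.derive("station_obs_cal", "apply 1998 calibration")
gridded = calibrated.derive("station_obs_grid", "grid to 1-degree cells")

# Querying the pedigree of the final product:
print(gridded.lineage)  # ['apply 1998 calibration', 'grid to 1-degree cells']
```

With such a record attached, a user can answer "how was this product made?" directly from the data, which is the kind of historical perspective the working group identified as missing.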
What are the greatest problems getting data into data centers?
Shortage of ingest staff and difficulty with maintaining state-of-the-art hardware and software are challenges.
File transfer protocol (FTP) is not always an effective means of transferring data, especially if data volume rates are high.
Computing or creating metadata is the most time-consuming and labor-intensive part of data processing and may create a bottleneck. Some metadata could be computed and stored automatically when data are processed. However, before this can happen, it must be determined which metadata should be stored and how. In addition, software that can extract and store the metadata must be developed. Other metadata cannot be automatically computed and stored but must be identified, created, and entered by human experts. Even in those cases, software should be developed to aid the human expert.
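The division of labor described above can be sketched briefly: summary statistics are computed automatically at processing time, while descriptive fields still come from a human expert. The field names and values below are hypothetical, chosen only to illustrate the pattern.

```python
# Sketch of automatic metadata computation at processing time.
# Summary statistics are derived from the data themselves; descriptive
# fields are supplied by a human expert. Field names are illustrative.
import statistics

def compute_metadata(values, human_supplied):
    """Merge automatically computed and expert-entered metadata."""
    auto = {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }
    return {**auto, **human_supplied}

temps = [14.0, 15.0, 13.0, 16.0]
meta = compute_metadata(temps, {
    "instrument": "buoy thermistor",            # expert-entered
    "quality_notes": "post-storm drift corrected",  # expert-entered
})
print(meta["count"], meta["mean"])  # 4 14.5
```

Even this trivial example shows where the bottleneck lies: the statistics cost nothing to produce, while the instrument description and quality notes require expert time, which software can assist but not replace.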
The practice of retaining versions of data at each stage of processing places heavy demands on storage space. Users should be able to reproduce different versions from archived raw data; however, hardware changes make it difficult, if not impossible, to do so.
Overall, the working group concluded that the main limitation in data ingest is not technology but the human expertise and time for building knowledge into datasets. In addition, although small datasets, such as those resulting from observing stations, are often quite useful, they are time consuming to maintain. Finally, better coordination and communication among agencies, data producers, data archivists, and producers of metadata would improve data ingest.
Data Distribution and Processing Working Group
The group on data distribution and processing examined the different data processing strategies at the NOAA, DOE, and NASA data centers. The participants were asked to consider how data can be accessed efficiently by increasingly diverse users once data become part of the national archive.
How do we handle both data- and compute-intensive processing? What about reprocessing? What about supply-driven versus demand-driven processing? Is it advantageous to do all processing on demand?
Data- and compute-intensive processing are typically handled separately, so it is not necessary to address both simultaneously. Reprocessing demands vary by data type and depend on the information needs. Reprocessing is done when there are new data or models and thus a chance to improve the usefulness of the data.
A data center’s decision to adopt a supply-driven or a demand-driven data processing model reflects the scientific and economic needs of its user base. Supply-driven data processing results in high data quality and availability at the cost of having to carry out continuous high-volume data reprocessing. Under a demand-driven model, only those data that a user requests are processed, resulting in lower processing and storage costs.
Can this processing be distributed to commodity-level computing equipment?
Although commodity hardware and software are easily applied to generic processing, widespread adoption of commodity solutions is hindered by data center needs for high bandwidth, fast processing speeds, and specialized error handling capabilities.
What technologies could make data distribution more efficient? Are there efficient and globally applicable subsetting tools that would dramatically decrease distribution costs while simultaneously simplifying exploitation of data?
Few users need all of the data in a database. Rather, they want to extract only the data relevant to their application (i.e., subsetting). Consequently, data subsetting is a necessary component of efficient data distribution. Although subsetting tools, enabled by technologies such as all-disk storage and databases, are available, lack of familiarity within the file-centric science community has hindered their adoption.
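The subsetting idea can be illustrated with a minimal sketch: rather than shipping an entire dataset, the center evaluates the user's query and returns only matching records. The record layout, latitude band, and values below are invented for illustration.

```python
# Hypothetical subsetting sketch: the data center returns only records
# matching the user's space/time query instead of the whole dataset.
# The record layout and values are invented for illustration.

records = [
    {"lat": 40.0, "lon": -105.0, "year": 1995, "value": 12.3},
    {"lat": 10.0, "lon": 20.0,   "year": 1995, "value": 25.1},
    {"lat": 42.5, "lon": -100.0, "year": 1998, "value": 11.7},
]

def subset(records, lat_range, year):
    """Return only records inside the requested latitude band and year."""
    lo, hi = lat_range
    return [r for r in records if lo <= r["lat"] <= hi and r["year"] == year]

result = subset(records, (35.0, 45.0), 1995)
print(len(result))  # 1: only one record falls in 35-45N during 1995
```

In a database-backed center this filtering happens server-side (e.g., as a query predicate), so distribution cost scales with the size of the answer rather than the size of the archive.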
How can access or resource restrictions be managed?
Restrictions can be managed by limiting consumption via charges based on resource usage, such as media use, consulting time, or data volume. Such charges must be in accordance with U.S. data policy (i.e., OMB Circular A-130 [OMB, 1996]).
REACTION PANEL SUMMARY
On the second day of the workshop, a panel of workshop participants representing data center managers, agency sponsors, data users, and the information technology industry was asked to react to the previous day’s working group presentations. Panelists were asked the questions given in italics below; responses are summarized with each question.
What is your reaction to the first day’s deliberations? Are we on the right track?
The panelists noted the following:
- Improvements for user access are needed (especially for interdisciplinary and non-technical users).
- Data centers should focus on tools for finding data and for decision making.
- Technology is but one of the challenges facing data centers.
- Some of the technological challenges facing data centers have already been addressed in the information technology fields for other applications.
- Humans are the limiting factor in adopting and adapting to new technological capabilities.
What technologies can be resource multipliers (near versus long term)? If there’s one you’d apply in the short term, which would it be?
Techniques for tracking, searching, and sharing metadata, especially through standardizing the use of a format such as XML Schema, would be a substantial benefit for data search and access. In addition, on-line datasets and databases and market-driven technologies have great potential applications in data center operations.
What have we missed?
Data centers and their sponsoring agencies must still consider the interaction between people and technology, rather than simply focusing on technology. One panelist suggested that it would be useful to improve the way available resources are promoted and communicated to users. Data centers should define the true metric of their performance carefully: better user services, decreased costs, increased number of users?