"3 Improving Current Capabilities for Data Integration in Science." Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington, DC: The National Academies Press, 2010.
Further, there are multiple ways in which reuse might be hindered. The structure selected by the original researcher to organize the data might be inconvenient for a subsequent user. For example, the data might be stored as geographic images, one for each time step, whereas the new researcher needs a time series for each spatial location. Or an underlying choice that was not even explicitly considered by the original researcher, perhaps the projection that was used to map the data from Earth’s surface onto two dimensions, might not be suitable for the reuse context. (Even for research areas that have matured, such challenges can arise whenever data are applied in unanticipated ways.) The parameters that characterize the projection, or even the units, might not be clear because of incomplete metadata. Lastly, the second researcher’s software tools may not be able to handle the individual data elements. Massaging the data into the correct format and organization may pose a tedious data-manipulation problem. It can take weeks or more of effort to convert data into a form suitable for reuse. Many new researchers give up before they get to this stage. In short, it is often just too difficult to reuse data gathered by other researchers.
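The reorganization mentioned above, from one image per time step to one time series per spatial location, can be illustrated with a minimal sketch. It is not taken from the workshop. It assumes each image is a small in-memory grid of numbers on the same row/column layout, and the function name `frames_to_series` is hypothetical. Real data sets would also require the projection and unit metadata discussed above.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]  # one image: a list of rows, each a list of cell values


def frames_to_series(frames: List[Grid]) -> Dict[Tuple[int, int], List[float]]:
    """Reorganize per-time-step grids into one time series per (row, col) cell.

    Assumption (hypothetical layout): frames[t][r][c] is the value at time
    step t for the cell in row r, column c, and all frames share one shape.
    """
    if not frames:
        return {}
    n_rows = len(frames[0])
    n_cols = len(frames[0][0]) if n_rows else 0
    # Refuse ragged or mismatched frames rather than silently misaligning cells.
    for t, frame in enumerate(frames):
        if len(frame) != n_rows or any(len(row) != n_cols for row in frame):
            raise ValueError(f"frame {t} does not match the {n_rows}x{n_cols} grid")
    return {
        (r, c): [frame[r][c] for frame in frames]
        for r in range(n_rows)
        for c in range(n_cols)
    }


# Three time steps of a 2x2 grid (made-up illustrative values).
frames = [
    [[1.0, 2.0], [3.0, 4.0]],  # time step 0
    [[1.5, 2.5], [3.5, 4.5]],  # time step 1
    [[2.0, 3.0], [4.0, 5.0]],  # time step 2
]
series = frames_to_series(frames)
print(series[(0, 1)])  # time series for the cell in row 0, column 1
```

The transform itself is short. The effort described in the text lies mostly in knowing the grid layout, projection, and units well enough to trust the result.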
It is crucial to focus on this transformation problem. Several workshop participants noted that it is not difficult to write clear transforms if the relevant metadata are available. Most popular transforms have been written multiple times by multiple labs, which is, of course, inefficient. Workshop participants said it was rarely easy to locate existing transformation software of interest, and some suggested that an online service to share transforms could be established. Such a service would allow scientists to avoid having to reinvent tools, but it would require publishing and documenting transforms in a systematic way so that others could locate them.
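As a rough illustration of what publishing and documenting transforms "in a systematic way" could involve, the sketch below models a tiny in-memory catalog in which each transform carries searchable metadata. Every name here (`TransformRecord`, `TransformCatalog`, the format labels) is a hypothetical assumption made for illustration. The workshop did not specify a design, and a real service would need persistence, versioning, and richer metadata.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TransformRecord:
    """Hypothetical metadata published alongside a transform."""
    name: str
    source_format: str   # what the transform consumes, e.g. "meters"
    target_format: str   # what the transform produces, e.g. "kilometers"
    description: str
    func: Callable[[float], float]


class TransformCatalog:
    """A minimal in-memory stand-in for a shared transform service."""

    def __init__(self) -> None:
        self._records: List[TransformRecord] = []

    def publish(self, record: TransformRecord) -> None:
        # Require unique names so a published transform can be cited unambiguously.
        if any(r.name == record.name for r in self._records):
            raise ValueError(f"a transform named {record.name!r} already exists")
        self._records.append(record)

    def find(self, source_format: str, target_format: str) -> List[TransformRecord]:
        # Locate transforms by what they consume and produce.
        return [
            r for r in self._records
            if r.source_format == source_format and r.target_format == target_format
        ]


catalog = TransformCatalog()
catalog.publish(TransformRecord(
    name="m_to_km",
    source_format="meters",
    target_format="kilometers",
    description="Convert a length in meters to kilometers.",
    func=lambda meters: meters / 1000.0,
))

matches = catalog.find("meters", "kilometers")
print(matches[0].name, matches[0].func(2500.0))
```

Indexing on source and target formats reflects the point made above: a transform is only reusable if others can find it from a description of the data they have and the form they need.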
In our Internet-savvy world, one should be able to locate data sets and transforms of interest using the Web. At present this is a hopeless task. Workshop participants identified four steps that would make this task possible:
Repositories. Several participants noted the need for domain-specific (as well as general) repositories where scientific data sets can be archived. Because data decay over time and require periodic maintenance, such repositories must be staffed with professionals who can do such maintenance as well as assist scientists trying to use data sets in the repository. Good search tools are needed so the contents of a repository can be easily browsed and objects of interest located. Lastly, curation facilities are also needed so that the precise semantics of data sets can be documented. Obviously, the curation cannot be such an onerous human task that the repository will not