tal conditions, equipment calibrations, preprocessing algorithms, and so on can also be important. If data are being used for research outside the field for which they were collected, the risk of misinterpretation is severe, because research communities can have unstated assumptions about what to document or what to assume, and these assumptions can be overlooked during the integration process.

There are technical and policy challenges associated with the actual aggregation of data. If some data were collected with privacy guarantees, how should those guarantees be interpreted if only a subset of the data, or a summary of it, is used for a secondary analysis? There are also technical challenges in translating disparate data sets so that they can be merged: for example, putting maps into the same coordinate system, aligning data that were collected on different sampling grids, correcting for systematic differences among equipment, and so on.
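As an illustration of two of the merging steps named above (aligning data collected on different sampling grids and correcting for a systematic difference between instruments), the following is a minimal sketch, not a method described at the workshop. It assumes one-dimensional data, linear interpolation as the resampling rule, and a known constant offset for the second instrument; the function names `resample_linear` and `remove_offset` and all sample values are hypothetical.

```python
from bisect import bisect_right


def resample_linear(src_x, src_y, target_x):
    """Linearly interpolate the series (src_x, src_y) onto target_x.

    Assumes src_x is strictly increasing. Target points outside the
    source range yield None instead of an extrapolated value.
    """
    out = []
    for x in target_x:
        if x < src_x[0] or x > src_x[-1]:
            out.append(None)  # refuse to extrapolate
            continue
        i = bisect_right(src_x, x) - 1  # index of the grid point at or left of x
        if i == len(src_x) - 1:  # x coincides with the last grid point
            out.append(src_y[-1])
            continue
        x0, x1 = src_x[i], src_x[i + 1]
        t = (x - x0) / (x1 - x0)
        out.append(src_y[i] + t * (src_y[i + 1] - src_y[i]))
    return out


def remove_offset(values, offset):
    """Subtract a known constant instrument bias (assumed, not estimated)."""
    return [None if v is None else v - offset for v in values]


# Hypothetical data: instrument A samples at 0, 2, 4, 6;
# instrument B samples at 1, 3, 5 and is assumed to read 0.5 too high.
a_x = [0.0, 2.0, 4.0, 6.0]
a_y = [10.0, 14.0, 18.0, 22.0]
b_x = [1.0, 3.0, 5.0]
b_y = [12.5, 16.5, 20.5]

a_on_b = resample_linear(a_x, a_y, b_x)  # A's values on B's grid
b_corrected = remove_offset(b_y, 0.5)    # B's values with the bias removed
merged = list(zip(b_x, a_on_b, b_corrected))
```

In practice the choice of interpolation rule and the size of any instrument correction are themselves assumptions that would need to be documented alongside the merged data, since later users cannot recover them from the merged values alone.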

For the purposes of the workshop, “large-scale data integration” was taken to refer to the aggregation of data sets that are so large that searching or moving them is nontrivial, a technical challenge that is becoming ever more common as it becomes easy to produce and store terabytes of data. Workshop participants were also aware that a growing number of opportunities require the aggregation of large numbers of modest-size data sets, and some of the workshop discussion reflects the challenges associated with those situations. To bound the discussion and produce the most useful outcomes, the workshop planning committee decided to focus on issues related to integrating scientific research data.¹ The particular disciplines discussed include physics, biology, chemistry, Earth sciences, satellite imagery, astronomy, geospatial data, and research medical data. By and large, these are all structured data, that is, records of fairly rigidly formatted information. In contrast, many data integration efforts outside scientific research deal more with unstructured data (text) and semistructured data (want ads, personnel records, and so on). Unstructured data and the needs of nonresearch users with an interest in data integration were not a focus of the workshop. Of course, there is a substantial gray area. For example, even when one is seeking and aggregating structured scientific data, tools designed for unstructured data might be necessary because the structure may not be readily recognizable.

Michael Marron of the National Institutes of Health (NIH), a cosponsor of the workshop, explained NIH’s interest in the topic. The long-


¹ The statement of task and original work plan for the project documented in this report presumed two workshops and a committee consensus report. The project was scaled back to one workshop and a rapporteur-authored summary in order to align with available resources. The workshop planning committee decided that focusing the subject matter coverage on scientific research data and related communities would allow for the most productive discussion of issues and possible solutions during a single two-day workshop.

Copyright © National Academy of Sciences. All rights reserved.