increases in the amount of data available to science and engineering researchers. This includes not only data from experiments and observations but also data generated by computer simulations. It is becoming common for research groups to quickly gather or generate terabytes of data, and a number of programs are accumulating petabytes of data. (One terabyte equals 10¹² bytes, and one petabyte equals 10¹⁵ bytes.) Data integration must overcome the challenge of finding disparate, distributed sources of data, which is often referred to as “data discovery,” and the challenge of effectively utilizing the collective information in those sources to produce new insight—a process known as “data exploitation.” The workshop on which this report is based did not try to characterize comprehensively the various ways in which data integration is useful or necessary for the advance of science.
The term “data integration” first emerged in connection with the need for organizations to provide data users “with a homogeneous logical view of data that is physically distributed over heterogeneous data sources” (Ziegler and Dittrich, 2004). The concept of data integration used here is a broad one, encompassing any technology, process, or policy that affects a scientist’s or engineer’s ability to find, interpret, and aggregate, mine, or analyze distributed sources of information. Data interoperability and knowledge discovery are both intended to fall within the concept’s scope.
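The Ziegler and Dittrich notion of a “homogeneous logical view” over heterogeneous sources can be made concrete with a minimal sketch. The two “sources,” their field names, and their units below are invented for illustration; the point is only that a small mediator layer can hide schema and unit differences behind one common view.

```python
# Illustrative sketch (names and data invented): two hypothetical sources
# store temperature records under different schemas and units. A mediator
# function presents one homogeneous logical view over both.

# Source A: records as dicts, temperatures in degrees Celsius
source_a = [{"site": "alpha", "temp_c": 21.5}]

# Source B: bare tuples, temperatures in degrees Fahrenheit
source_b = [("beta", 70.7)]

def unified_view():
    """Yield every record in one common schema: (site, temperature in kelvin)."""
    for rec in source_a:
        yield (rec["site"], rec["temp_c"] + 273.15)
    for site, temp_f in source_b:
        yield (site, (temp_f - 32) * 5 / 9 + 273.15)

records = list(unified_view())
```

A consumer of `unified_view` never sees the source-specific schemas; both records above come out as the same (site, kelvin) pairs, which is the sense in which the view is “homogeneous” even though the underlying storage is not.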
All too often, data discovery depends on word of mouth: A researcher happens to have heard about a data set that might be useful in his or her own research, or makes inquiries of colleagues in order to find relevant data. In fields where there are a limited number of large facilities (for example, high-energy physics and astronomy) or a predictable administrative structure for data storage (for example, national weather bureaus), the challenge may be manageable, although meeting it still often depends on a haphazard, serendipitous process. But in research fields where small groups can accumulate and store large amounts of data, valuable data sets can exist in many places. In particular, useful data might be held by someone who is outside the network of a researcher who is seeking those data. More problematic still are instances where a researcher seeks to integrate data from very different communities, such as geospatial data with sociological, medical, and other overlays. Such creative merging of knowledge can lead to genuinely novel insights, but it is hindered by the data discovery challenge.
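At bottom, the discovery problem described above is a search over descriptions of data sets rather than over the data themselves. A toy illustration, with a catalog whose entries, titles, and keyword fields are all invented here, shows the kind of systematic metadata matching that word-of-mouth discovery lacks:

```python
# Hypothetical data-discovery sketch: match a researcher's query keywords
# against a small catalog of data-set metadata records. All entries are
# invented for illustration.

catalog = [
    {"title": "County-level hospital admissions survey",
     "keywords": {"medical", "geospatial", "county"}},
    {"title": "Gridded surface temperature reanalysis",
     "keywords": {"climate", "geospatial", "gridded"}},
]

def discover(required_keywords):
    """Return titles of data sets whose keyword sets cover the query."""
    query = set(required_keywords)
    return [entry["title"] for entry in catalog
            if query <= entry["keywords"]]

# A cross-community query, e.g. geospatial data with a medical overlay:
matches = discover(["geospatial", "medical"])
```

A query like the one above finds only the catalog entry tagged with both fields, which is exactly the cross-community match (geospatial plus medical) that is hardest to stumble upon through personal networks.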
Once data sources have been found, data exploitation presents another set of challenges. A researcher must develop a clear understanding of the meaning of each of the data sets. Achieving such an understanding is difficult, because documentation of the conditions under which the data were collected can be spotty. Simple aspects such as the units of measure must be known definitively, and more subtle aspects such as environmen-