Steps toward Large-Scale Data Integration in the Sciences: Summary of a Workshop
given processing will not be optimal for everyone. To facilitate the flexibility that scientists need, one may have to make available the raw data and not just a highly processed derived data set.
Put the raw data in a DBMS and then run the processing inside the DBMS engine. The only feasible way to allow a scientist to insert his or her own components into the processing pipeline is to make the processing a collection of DBMS tasks. Otherwise, the complexity of altering the pipeline is just too daunting.
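As a rough illustration of this idea, the sketch below keeps raw data in a table and expresses each processing step as a DBMS object, so that one step can be swapped without touching the rest. It uses Python's built-in sqlite3 module only for convenience. The table, view, and function names (raw_readings, calibrated, calibrate, run_pipeline) and the sample values are hypothetical and do not come from the workshop.

```python
import sqlite3

# Hypothetical raw data: (sensor, value) pairs invented for illustration.
RAW_READINGS = [("s1", 10.0), ("s1", 12.0), ("s2", 20.0)]


def default_calibration(value):
    """Assumed default processing step: pass the raw value through unchanged."""
    return value


def run_pipeline(calibrate=default_calibration):
    """Load raw data into the DBMS and run every processing step inside it.

    The `calibrate` argument is the component a scientist may replace.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO raw_readings VALUES (?, ?)", RAW_READINGS)

    # Register the (possibly user-supplied) step as a SQL function.
    conn.create_function("calibrate", 1, calibrate)

    # Step 1: a derived data set defined as a view over the raw data.
    conn.execute(
        "CREATE VIEW calibrated AS "
        "SELECT sensor, calibrate(value) AS value FROM raw_readings"
    )

    # Step 2: an aggregate computed from the derived data set.
    rows = conn.execute(
        "SELECT sensor, AVG(value) FROM calibrated "
        "GROUP BY sensor ORDER BY sensor"
    ).fetchall()
    conn.close()
    return rows


default_result = run_pipeline()
custom_result = run_pipeline(lambda value: value * 2.0)
```

Because the raw table is kept, the second call reprocesses the same raw data with a different calibration instead of being limited to the default derived values.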
Record the provenance (lineage) of the data carefully, with an automated system. This is necessary for the raw data, of course, but it is also crucial to record precisely the semantics of any derived data, thus carefully maintaining the provenance of those data sets. This is not something that current application code or system software is good at. Also, anything that requires human effort is not going to be widely used, and so systems are needed that record provenance as a side effect of natural science inquiry and processing, not as an additional step. One of the big advantages of a DBMS is that it can record provenance automatically by recording every query and update that has been run.
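The idea of capturing provenance as a side effect can be sketched as follows. This is an application-level approximation, not a feature of any particular DBMS: a thin wrapper around Python's built-in sqlite3 module writes every statement it runs to a log table. The class name ProvenanceConnection, the provenance_log table, and the sample obs table are assumptions made for this example.

```python
import sqlite3
from datetime import datetime, timezone


class ProvenanceConnection:
    """Hypothetical wrapper that logs each statement run through it."""

    def __init__(self, path=":memory:"):
        # isolation_level=None puts the connection in autocommit mode.
        self._conn = sqlite3.connect(path, isolation_level=None)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS provenance_log ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "executed_at TEXT NOT NULL, "
            "statement TEXT NOT NULL, "
            "parameters TEXT NOT NULL)"
        )

    def execute(self, sql, params=()):
        """Run a statement, then record it; no extra user action is needed."""
        cursor = self._conn.execute(sql, params)
        self._conn.execute(
            "INSERT INTO provenance_log (executed_at, statement, parameters) "
            "VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), sql, repr(tuple(params))),
        )
        return cursor

    def history(self):
        """Return the logged statements in the order they were run."""
        return self._conn.execute(
            "SELECT statement, parameters FROM provenance_log ORDER BY id"
        ).fetchall()


db = ProvenanceConnection()
db.execute("CREATE TABLE obs (sensor TEXT, value REAL)")
db.execute("INSERT INTO obs VALUES (?, ?)", ("s1", 10.0))
db.execute("UPDATE obs SET value = value * ? WHERE sensor = ?", (1.5, "s1"))
history = db.history()
```

The log lists the statements that produced the current state of obs, so a derived value can be traced back to the raw insert and the update applied to it. A production system would also need to capture who ran each statement and which software versions were involved, which this sketch omits.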
A better DBMS is obviously needed for science applications, one of the challenges called for in Chapter 2. Scientists who spoke at the workshop did not like current relational DBMSs, which were built for business data processing, because they do not work well, if at all, on science data. The six messages presented at the beginning of this chapter are unlikely to be successful with current commercial DBMSs. Self-documenting data sets, via RDF with reference to code systems, will be needed, along with separation of the data from the application/analysis software.
At present, most fields of science do not have systematic means for a scientist to make data available. They do not have public repositories in which to deposit data, standards for provenance to describe the exact meaning of data sets, or easy ways to search the Internet for data sets of interest. In addition to data repositories, repositories of standards and translators are also needed.
While there was some discussion of these ideas at the workshop, no attempt was made to capture the range of opinions, and the thoughts presented in this chapter do not necessarily represent a consensus of the workshop participants.