Steps toward Large-Scale Data Integration in the Sciences: Summary of a Workshop
data in reusable form, such as by giving special consideration to proposals that include plans for careful data publication.
Moreover, funding agencies can encourage the establishment and maintenance of data repositories and work to improve the tools available for data curation and sharing.
An open-source toolkit to assist with data transformations would be of immense value. This is something that agencies can budget for, solicit proposals for, and fund.
An open-source science-oriented DBMS would also be of immense value. Again, this is something that agencies can budget for, solicit proposals for, and fund.
Dr. Stonebraker offered his own thoughts on how to improve the software that enables data integration. Noting that scientists often build the entire software stack for each new project, he pointed out that this limits, and sometimes precludes, the reuse of software modules and the leveraging of well-established tools. This build-from-scratch approach was taken by the Mission to Planet Earth project a decade ago and, more recently, by the Large Hadron Collider project. In contrast, the Sloan Digital Sky Survey (SDSS) made its data available in an SQL server database and allowed astronomers to run a collection of queries of interest against it.
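The SDSS model can be sketched in miniature: instead of shipping a whole catalog to each astronomer, the server holds the data in a relational database and executes each scientist's query in place, returning only the small reduced result. The sketch below uses Python's built-in sqlite3 module as a stand-in for a server-side DBMS; the table name, columns, and magnitude cutoff are illustrative assumptions, not the actual SDSS schema.

```python
import sqlite3

# Stand-in for a server-side survey database. The schema (objects with
# right ascension, declination, and magnitude) is hypothetical, chosen
# only to illustrate the query-at-the-data pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (obj_id INTEGER, ra REAL, dec REAL, mag REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [(i, (i * 0.37) % 360, (i * 0.11) % 90, 12 + (i % 80) / 10)
     for i in range(100_000)],
)

# The scientist sends the query to the data; only the reduced result
# set would cross the network, not the full 100,000-row catalog.
query = """
    SELECT obj_id, ra, dec
    FROM objects
    WHERE mag < 12.5 AND dec BETWEEN 10 AND 20
"""
result = conn.execute(query).fetchall()
print(f"rows returned: {len(result)} of 100,000 stored")
```

Here the reduction happens upstream, inside the database, so the result set is orders of magnitude smaller than the stored catalog, which is exactly the bandwidth argument Dr. Stonebraker makes below.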
Dr. Stonebraker suggested a number of ways to improve the common state of practice:
Send the query to the data, not the other way around. Under current publication schemes, a central system typically sends entire data sets to scientists, who load them into their favorite software system and then reduce them locally to find the data of actual interest. This is an inefficient use of bandwidth, because large data sets are sent over networks only to be reduced by two or three orders of magnitude on arrival. It would be much more efficient to reduce the data upstream, in response to a request, and save the bandwidth.1 An alternate bandwidth-saving approach, which is sometimes practiced today, is to store the data in a processed form so that transmittal is easier. But this has the shortcoming that requesting scientists have different needs, so any
1 An anonymous reviewer pointed out that, in general, this approach may not scale, as some centralized stores would have to support an ever-increasing number of queries. A complementary approach is replication on demand, in which subsets of the data are replicated to secondary sites based on local demand. A form of this approach was taken by the LHC with its predetermined tier structure.