Suggested Citation: "5 Workshop Lessons." National Research Council. 2010. Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/12916.

5 Workshop Lessons

At the end of the workshop, Michael Stonebraker presented the following list of messages that he thought were brought out by the discussions:

  • Many research groups leave the task of developing data integration software to science postdoctoral students, which is wasteful of the students’ time and can lead to inadequate results. Good DBMSs are difficult to write and take many person-years of effort. A better idea is to apply computer science expertise early in the process. A partnership of equals between computer scientists and natural scientists can pay off admirably. The successful collaboration between Alex Szalay and Jim Gray is a prime example.

  • It is impossible to build a complete software stack quickly. The best way to progress is to specify modest short-term goals and get them accomplished. Once something is working, one can build the next phase. In other words, one should take “baby steps,” always going from something that works to something that continues to work. What often kills projects is the desire to take a giant leap in functionality, without having intervening milestones.

  • Funding agencies can help scientists establish the capability for data integration by encouraging (or, indeed, requiring) the researchers they support to publish and curate their data. Agencies should strengthen the incentives for scientists to preserve their data in reusable form, such as by giving special consideration to proposals that include plans for careful data publication.

  • Moreover, funding agencies can encourage the establishment and maintenance of data repositories and work to improve the tools available for data curation and sharing.

  • An open-source toolkit to assist with data transformations would be of immense value. This is something that agencies can budget for, solicit proposals for, and fund.

  • An open-source science-oriented DBMS would also be of immense value. Again, this is something that agencies can budget for, solicit proposals for, and fund.

Dr. Stonebraker offered his own thoughts on how to improve the software that enables data integration. Noting that scientists often build the entire software stack for each new project, he pointed out that this limits, and can even preclude, the reuse of software modules and the leveraging of well-established tools. This build-afresh approach was followed by the Mission to Planet Earth a decade ago and, more recently, by the Large Hadron Collider project. In contrast, the Sloan Digital Sky Survey (SDSS) made data available in a SQL Server database and allowed astronomers to run a collection of queries of interest.

Dr. Stonebraker suggested a number of ways to improve the common state of practice:

  • Send the query to the data, not the other way around. Currently, publication schemes typically send data sets to scientists who load these data into their favorite software system and then further reduce them to find actual data of interest. In effect, a central system sends data to scientists who query them locally to discover items of interest. This approach is an inefficient use of bandwidth, because large data sets are sent over networks only to then be reduced two or three orders of magnitude. It would be much more efficient to reduce the data upstream in response to a request and save the bandwidth.¹ An alternate approach for saving bandwidth, which is sometimes practiced today, is to store the data in a processed form, so that their transmittal is easier. But this has the shortcoming that requesting scientists have different needs, so any given processing will not be optimal for everyone. To facilitate the flexibility that scientists need, one may have to make available the raw data and not just a highly processed derived data set.

    ¹ An anonymous reviewer pointed out that, in general, this approach may not scale, as some centralized stores will have to support an ever increasing number of queries. A complementary approach is to have replication on demand, where subsets of the data are replicated to secondary sites based on local demand. A form of this approach was taken by the LHC with its predetermined tier structure.

  • Put the raw data in a DBMS and then run the processing inside the DBMS engine. The only feasible way to allow a scientist to insert his or her own components into the processing pipeline is to make the processing a collection of DBMS tasks. Otherwise, the complexity of altering the pipeline is just too daunting.

  • Record the provenance (lineage) of the data carefully, with an automated system. This is necessary for the raw data, of course, but it is also crucial to precisely record the semantics of any derived data, thus carefully maintaining the provenance of those data sets. This is not something that current application code or system software is good at. Also, anything that requires human effort is not going to be widely used, and so systems are needed that record provenance as a side effect of natural science inquiry and processing, not an additional step. One of the big advantages of a DBMS is that it can record provenance automatically by recording every query and update that has been run.

  • A better DBMS is obviously needed for science applications, one of the challenges called for in Chapter 2. Scientists who spoke at the workshop did not like current relational DBMSs, which were built for business data processing, because they do not work well, if at all, on science data. The six messages presented at the beginning of this chapter are unlikely to be successful with current commercial DBMSs. Self-documenting data sets, via RDF with reference to code systems, will be needed, along with separation of the data from the application/analysis software.

  • At present, most fields of science do not have systematic means for a scientist to make data available. They do not have public repositories in which to insert data, standards for provenance to describe the exact meaning of data sets, or easy ways to search the Internet looking for data sets of interest. In addition to data repositories, repositories of standards and translators are also needed.
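Three of the suggestions above (send the query to the data, run processing inside the DBMS, and record provenance as a side effect) can be illustrated with a small sketch. It is not drawn from the workshop: it uses Python's built-in sqlite3 module as a stand-in for a central science archive, and the table `observations`, its columns, and the `calibrate` function are hypothetical names invented for illustration.

```python
import sqlite3


def build_archive() -> sqlite3.Connection:
    """Create a toy in-memory 'central archive' (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (id INTEGER PRIMARY KEY, flux REAL)"
    )
    # 10,000 synthetic rows; flux cycles through 0..999.
    rows = [(i, float(i % 1000)) for i in range(10_000)]
    conn.executemany("INSERT INTO observations VALUES (?, ?)", rows)
    conn.commit()
    return conn


def calibrate(flux: float) -> float:
    """Hypothetical scientist-supplied processing step."""
    return flux * 1.05 + 0.5


conn = build_archive()

# Provenance as a side effect: from here on, every statement the engine
# runs is appended to a log, with no extra effort from the scientist.
provenance_log = []
conn.set_trace_callback(provenance_log.append)

# Processing inside the engine: register the scientist's own component
# so it can be called from SQL like a built-in function.
conn.create_function("calibrate", 1, calibrate)

# Query to the data: the archive filters and processes the rows, and only
# the reduced result would need to travel over the network.
selected = conn.execute(
    "SELECT id, calibrate(flux) FROM observations WHERE flux >= ?",
    (990.0,),
).fetchall()

total = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
print(f"rows in archive: {total}, rows returned: {len(selected)}")
for statement in provenance_log:
    print("logged:", statement)
```

With this synthetic data the archive holds 10,000 rows and the query returns 100, a two-order-of-magnitude reduction made before any data would be shipped. A production system would also need to persist the log and record which version of each user-defined function was run; the sketch leaves both out.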

While there was some discussion of these ideas at the workshop, no attempt was made to capture the range of opinions, and the thoughts presented in this chapter do not necessarily represent a consensus of the workshop participants.


Steps Toward Large-Scale Data Integration in the Sciences summarizes a National Research Council (NRC) workshop to identify some of the major challenges that hinder large-scale data integration in the sciences and some of the technologies that could lead to solutions. The workshop was held August 19-20, 2009, in Washington, D.C.

The workshop examined a collection of scientific research domains, with application experts explaining the issues in their disciplines and current best practices. This approach allowed the participants to gain insights about both commonalities and differences in the data integration challenges facing the various communities. In addition to hearing from research domain experts, the workshop also featured experts working on the cutting edge of techniques for handling data integration problems. This provided participants with insights on the current state of the art. The goals were to identify areas in which the emerging needs of research communities are not being addressed and to point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.
