5
Workshop Lessons
At the end of the workshop, Michael Stonebraker presented the
following list of messages that he thought were brought out by the
discussions:
• Many research groups leave the task of developing data integration
software to science postdoctoral students, which is wasteful of the
students’ time and can lead to inadequate results. Good DBMSs
are difficult to write and take many person-years of effort. A better
idea is to apply computer science expertise early in the process. A
partnership of equals between computer scientists and natural
scientists can pay off admirably. The successful collaboration between
Alex Szalay and Jim Gray is a prime example.
• It is impossible to build a complete software stack quickly. The
best way to progress is to specify modest short-term goals and get
them accomplished. Once something is working, one can build the
next phase. In other words, one should take “baby steps,” always
going from something that works to something that continues to
work. What often kills projects is the desire to take a giant leap in
functionality, without having intervening milestones.
• Funding agencies can help scientists establish the capability for
data integration by encouraging (or, indeed, requiring) the
researchers they support to publish and curate their data. Agencies
should strengthen the incentives for scientists to preserve their
data in reusable form, such as by giving special consideration to
proposals that include plans for careful data publication.
• Moreover, funding agencies can encourage the establishment and
maintenance of data repositories and work to improve the tools
available for data curation and sharing.
• An open-source tool-kit to assist with data transformations would
be of immense value. This is something that agencies can budget
for, solicit proposals for, and fund.
• An open-source science-oriented DBMS would also be of immense
value. Again, this is something that agencies can budget for, solicit
proposals for, and fund.
Dr. Stonebraker offered his own thoughts on how to improve the
software that enables data integration. Noting that scientists often build
the entire software stack for each new project, he pointed out how this
limits, even precludes, the reuse of software modules and the leveraging
of well-established tools. The Mission to Planet Earth took this
build-afresh approach a decade ago, as did the Large Hadron Collider
project more recently. In contrast, the Sloan Digital Sky Survey (SDSS)
made data available in an SQL Server database and allowed astronomers
to run a collection of queries of interest.
Dr. Stonebraker suggested a number of ways to improve the common
state of practice:
• Send the query to the data, not the other way around. Currently,
publication schemes typically send data sets to scientists who load
these data into their favorite software system and then further
reduce them to find actual data of interest. In effect, a central
system sends data to scientists who query them locally to discover
items of interest. This approach is an inefficient use of bandwidth,
because large data sets are sent over networks only to then be
reduced by two or three orders of magnitude. It would be much more
efficient to reduce the data upstream in response to a request and
save the bandwidth.¹ An alternative approach for saving bandwidth,
which is sometimes practiced today, is to store the data in a
processed form, so that their transmittal is easier. But this has the
shortcoming that requesting scientists have different needs, so any
given processing will not be optimal for everyone. To facilitate the
flexibility that scientists need, one may have to make available the
raw data and not just a highly processed derived data set.
¹ An anonymous reviewer pointed out that, in general, this approach may not scale,
as some centralized stores will have to support an ever increasing number of queries. A
complementary approach is to have replication on demand, where subsets of the data are
replicated to secondary sites based on local demand. A form of this approach was taken by
the LHC with its predetermined tier structure.
• Put the raw data in a DBMS and then run the processing inside the
DBMS engine. The only feasible way to allow a scientist to insert
his or her own components into the processing pipeline is to make
the processing a collection of DBMS tasks. Otherwise, the complexity
of altering the pipeline is just too daunting.
• Record the provenance (lineage) of the data carefully, with an
automated system. This is necessary for the raw data, of course, but it is
also crucial to precisely record the semantics of any derived data,
thus carefully maintaining the provenance of those data sets. This
is not something that current application code or system software
is good at. Also, anything that requires human effort is not going to
be widely used, and so systems are needed that record provenance
as a side effect of natural science inquiry and processing, not an
additional step. One of the big advantages of a DBMS is that it can
record provenance automatically by recording every query and
update that has been run.
• A better DBMS is obviously needed for science applications, one of
the challenges called for in Chapter 2. Scientists who spoke at the
workshop did not like current relational DBMSs, which were built
for business data processing, because they do not work well, if at
all, on science data. The six messages presented at the beginning
of this chapter are unlikely to be successful with current commercial
DBMSs. Self-documenting data sets, via RDF with reference
to code systems, will be needed, along with separation of the data
from the application/analysis software.
• At present, most fields of science do not have systematic means for
a scientist to make data available. They do not have public
repositories in which to insert data, standards for provenance to describe
the exact meaning of data sets, or easy ways to search the Internet
looking for data sets of interest. In addition to data repositories,
repositories of standards and translators are also needed.
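The first three suggestions in the list above (send the query to the
data, run the processing inside the DBMS, and record provenance as a
side effect) can be illustrated with a minimal sketch. It uses SQLite
through Python's standard sqlite3 module purely as a stand-in for a
science DBMS. The table raw_obs, its columns, the calibrate function,
and its coefficients are hypothetical names and values invented for
illustration; nothing here reflects a system discussed at the workshop.

```python
import sqlite3

# Provenance log (illustrative): every SQL statement the engine runs is
# appended here as a side effect, with no extra step for the scientist.
provenance = []

conn = sqlite3.connect(":memory:")
conn.set_trace_callback(provenance.append)

# Hypothetical raw-data table: one row per observed object.
conn.execute(
    "CREATE TABLE raw_obs (obj_id INTEGER PRIMARY KEY, flux REAL, band TEXT)"
)
conn.executemany(
    "INSERT INTO raw_obs (obj_id, flux, band) VALUES (?, ?, ?)",
    [(i, float(i % 1000), "r" if i % 2 else "g") for i in range(10_000)],
)
conn.commit()


def calibrate(flux):
    """Hypothetical user-supplied processing step (made-up coefficients)."""
    return flux * 1.05 + 0.1


# Register the scientist's own component so it runs inside the engine,
# next to the raw data, instead of in a separate downstream pipeline.
conn.create_function("calibrate", 1, calibrate)

# The query travels to the data; only the reduced result comes back.
rows = conn.execute(
    "SELECT obj_id, calibrate(flux) FROM raw_obs "
    "WHERE band = 'r' AND flux > 998.5"
).fetchall()

total = conn.execute("SELECT COUNT(*) FROM raw_obs").fetchone()[0]

print(f"rows stored at the source: {total}")
print(f"rows returned to the scientist: {len(rows)}")
print(f"statements recorded in the provenance log: {len(provenance)}")
```

In this toy data set the filter returns 10 of the 10,000 stored rows,
a reduction of three orders of magnitude that happens before anything
would cross a network. The trace callback captures the inserts and the
queries alike, which is one simple way a DBMS-style engine could log
every query and update without asking the scientist for extra effort.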
While there was some discussion of these ideas at the workshop, no
attempt was made to capture the range of opinions, and the thoughts
presented in this chapter do not necessarily represent a consensus of the
workshop participants.