3. How Might Open Online Knowledge Discovery Advance the Progress
of Science?
Technological Factors
Session Chair: Hal Abelson
-MIT-
FUTURE OF SCIENTIFIC KNOWLEDGE DISCOVERY IN OPEN NETWORKED ENVIRONMENTS 69
Interoperability, Standards, and Linked Data
James Hendler
-Rensselaer Polytechnic Institute-
I will focus on the technological challenge of online knowledge discovery. I will first
address interoperability issues and then talk about some promising developments.
Scientists can learn a lot about data sharing just from observing what is happening in
the outside world. There are many data initiatives and developments outside academia to which
we should be paying more attention.
Expressive ontologies are not a scalable solution. They are a necessary solution in
certain domains, but they do not solve the interoperability problem widely. They allow a
scientist to build his or her silo better, and sometimes they even let the silo get a little wider,
but they are not good at breaking down the silos unless the scientist can start linking
ontologies, in which case he or she has to deal with the ontology interoperability issue, as well
as the costs and difficulty of building them and their brittleness.
I am known for the slogan "A little semantics goes a long way." I have said this so
often that about 10 years ago people started calling it the Hendler hypothesis at the Defense
Advanced Research Projects Agency (DARPA).
We are used to thinking that a major science problem is searching for information in
papers, and we have forgotten that we also have to find and use the data underlying the science.
A traditional scientific view of the world might be, "I get my data, I do my work, and then I
want to share it." But the sharing should be part of how we do science, and other issues such as
visualization, archiving, and integration should be in that life cycle too. I will talk about these
issues, and then I will discuss the kinds of technologies that are starting to mature in this area.
These technologies have not yet solved the problems of science and, in fact, largely have not
been applied to the problems of science.
Scientists do use extremely large databases, and many of these data are crucial to
society. On the other hand, we scientists tend to be fairly pejorative about something like
Facebook, because Facebook is not being used to solve the problems of the world. I wish
science could get the number of minutes per day that Facebook gets, which is roughly
equivalent to the entire workforce of a very large company for a year. We are also not used to
thinking about Facebook as confronting a data challenge per se, but it collects, according to the
published figures, 25 terabytes of logged data per day. That is on the scale of the numbers
for large science data collections. Facebook's valuation is estimated to be well over $33
billion, which is the size of the entire National Institutes of Health budget. Not surprisingly,
they are able to deal with some of these data issues that are discussed here. We need to look at
what they are doing and determine if we could take advantage of some of their approaches.
I do not have similar numbers for Google, because they have not been published. The
last good estimate I could find was in 2008, which was 20 petabytes per day, but that was 3
years ago. That also does not include the exchange of or the storage of YouTube data, which I
cannot find good numbers about either. Google's valuation in 2011 is about $190 billion,
which is roughly a third of the Department of Defense budget. Not the research budget--the
entire budget.
Therefore, if we think that this kind of work is expensive, yes, it is, but there are other
people doing it, and they are investing large amounts of money. It is not surprising that they
have been focusing on big data problems in a way different from scientists, and they have been
able to explore some areas that are very difficult for us to explore. If a researcher wanted to
buy 10,000 computers for data mining purposes, it would be difficult to do because of the lack of
resources.
Several speakers talked about semantics in the context of annotation and finding related
work in the research literature. It is an important problem, but I do not think it is the key
unaddressed problem in science. In fact, I would contend that we have spent a huge amount of
money on that problem, much of it on trying to reinvent things that already exist in a better
form from open sources in the real world.
Most companies today that want to work with natural language processing start with
open-source software. For example, there is a company that is taking everything from the
Twitter stream, running it through a number of natural language tools, and doing some
categorization, visualization, and other similar work. They did not build any of their language
processing software. I am also told that Watson, the IBM Jeopardy computer, had a large team
of programmers, but, in fact, that the basic language technology used was mostly off the shelf,
and that it was mostly statistical.
Semantic MedLine is a project that the Department of Health and Human Services has
invested in. It does a fairly good job but does not understand the literature sufficiently to find
exactly what we want. But we are not able to do that in any kind of literature, and Google is
working on that problem as well. Hence, I do not see the point of yet another program to do
that for yet another subfield or against yet another domain. Instead, we need to start thinking
about how to put these kinds of technologies together.
The Web is a hard technology to monitor and track because it moves very fast. It has
been moving very fast for a long time, however, and, as scientists, we need to start taking
advantage of it much more than reinventing it.
There are a few different tools and models on the Web that are worth thinking about.
One is to move away from purely relational models, that is, from assuming that the only way to share
data is to have a single common model of the data. In other words, to put data in a database or
to create a data warehouse, we need to have a model, and that is done for a particular kind of
query efficiency.
Sometimes, however, that efficiency is not the most important factor. Google had to
move away from traditional databases to deal with the 20 petabytes of data generated every
day. Nobody has a data warehouse that does that, so Google has been inventing many useful
techniques. "Big Data" was one of their names for their file system. We cannot easily get the
details of how Google does it, but we can get published papers from Google people that will be
useful in learning how to do similar work. The NoSQL community is a fairly large and
growing movement of people who are saying that when you are dealing with large volumes of
data there is a need for something different from the traditional data warehouse.
I had a discussion with some server experts who said, "Just give us the schemas, and
we can do this work." I pointed out that we cannot always get the schemas, that some of these
datasets were not even built using database technologies with schemas, and in some cases
someone else owns the schema and will not share it. We can still obtain the data, however,
either through an Application Programming Interface (API) or through a data-sharing Web site.
Although Facebook's and Google's big data solutions are proprietary, the economics of
applications, cell phones, and other new technologies are pushing toward much more
interoperability and exchangeability for the data, which is mandating some sort of a semantics
approach. Consider, for example, the "like" buttons on Facebook. Facebook basically wants
their tools and content to be everywhere, not just in Facebook, which means it has to start
thinking more about how to get those data, who will get those data, whether it wants data to be
shared or not, and what formats and technologies to use. As a result, there are many
technologies behind that "like" button. Similarly, there are a number of approaches that Google
employs to find better meanings for their advertising technologies. What happens is that
Google recognizes that at some point it will not be able to do the work by itself, because there is a
long-tailed distribution on the Web. This means that it has to move the work to where
Webmasters and other people will be able to develop the semantic representations for their
domains.
That is happening with all the search engines, and the big issue now in that area is
simple metadata and lightweight semantics. The notion of the complex ontology is getting
replaced by the notion of fairly simple descriptive terms--that is, I can probably describe the
data in my dataset with 10 or 12 different terms that will be enough for me to put in a federated
catalog for people to decide whether they want to read my data dictionary.
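The idea of describing a dataset with a handful of terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DatasetDescriptor` type, the `find_datasets` helper, and the catalog entries are hypothetical names invented here, not part of any catalog system mentioned in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetDescriptor:
    """A few descriptive terms for one dataset, not a full ontology."""
    title: str
    publisher: str
    terms: frozenset = field(default_factory=frozenset)


def find_datasets(catalog, wanted_terms):
    """Return descriptors that carry every wanted term (case-insensitive)."""
    wanted = {t.lower() for t in wanted_terms}
    return [d for d in catalog if wanted <= {t.lower() for t in d.terms}]


# Hypothetical catalog entries, invented for illustration only.
catalog = [
    DatasetDescriptor("Street crime reports", "Example City",
                      frozenset({"crime", "street-level", "monthly"})),
    DatasetDescriptor("River gauge readings", "Example Agency",
                      frozenset({"hydrology", "daily", "sensor"})),
]

matches = find_datasets(catalog, ["Crime", "monthly"])
print([d.title for d in matches])
```

A reader of such a catalog would use the short term list only to decide whether a dataset is worth a closer look; the detailed data dictionary stays with the dataset itself.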
The idea here is that if the data are going to be in a 200-page, carefully constructed
metadata format with the required field-specific types, and they have to be compliant with the
standards of the Internet Engineering Task Force and ISO (International Organization for
Standardization), the vocabularies get harder to work with. It is an engineering problem that is
very similar to the integration problem, and what many people are realizing is that you can
arrange the data hierarchically and get good results for small investments.
Here is an example. Many governments are putting raw data on the Web now. They are
not just putting visualizations of the data online; they are putting the datasets themselves. From
a political point of view, the two biggest motivations are enhanced transparency and the chance
to inspire innovation. Promoting innovation can proceed in two different ways. First, the
governments are hoping that some people will figure out how to make useful and innovative
tools using the data. More important, especially for local governments, is that they pay many
people to build Web sites and interactive applications that they cannot afford to build anymore.
For instance, a government agency may be able to pay one contractor to build a big
system for a problem that is of high priority, but it may have 57 more priorities that it cannot
afford to fund. The governments need someone to develop those applications, so if it makes the
data available, at least some of those applications are done by other people. The development
of such applications is out of the agency's control, but if the work is getting done at no cost or
for some small amount of money, it can start planning strategically for other priorities.
The United States and the United Kingdom are the two leading providers in terms of
organizing their data. The United States has about 300,000 datasets, most of them geodata. If
we take out the geodata, there are probably 20,000 to 30,000 datasets that have meaningful raw
data. The United Kingdom has fewer, but it probably has the best in the sense that, for example,
you can get the street crime data for the entire country for a number of years. They are
releasing very high-quality data down to a local level.
Many countries, including ours, are releasing scientifically interesting data, but you
would have to work to find them. Using these data in combination with other data can be
more important than just looking at them. Those entities releasing data include countries,
nongovernmental organizations, cities, and other groups. For example, one of the Indian tribes
in New York State has released much of its casino data.
There are groups all over the world working with these data. My group is one, and the
Massachusetts Institute of Technology has a group working jointly with Southampton
University, mostly on U.K. data. Many of these groups are in academia, but there are many
nonacademic groups doing this kind of work. One of my suggestions is to think about data
applications. You may have a large database, and if parts of it can be made available through
an application, an interface, or through an API, people would be using the data in a sharable--
and often an attributable--way.
Getting back to science issues such as attribution and credit, we have one of our
government applications in the iTunes store. I know exactly how many people downloaded it
yesterday, and how many people are still using an old version.
When some students and I were in China, we discovered that China was releasing much
of its economic data, so we took China's gross domestic product (GDP) data and the U.S. GDP
data to do a comparison. To do that comparison, we needed to find the exchange rate. Luckily,
there is a dataset from the Federal Reserve that has all of the exchange rates for the U.S. dollar
weekly for the past 50 years or so. We got those data, we mashed them together, and we got a
chart that looks like Figure 3-1:
FIGURE 3-1 GDP Chart.
SOURCE: James Hendler
We also divided the GDP by the population (data we found in Wikipedia), so that we
could click a button to go between total GDP and per capita GDP. The model was built in less
than 8 hours, including the conversion of the data, the Web interface, and the visualization.
That is a game-changer. It would completely change the way we work if we could get this
down to a few minutes. When my group started working with this kind of technology several
years ago, it would take us weeks to do this kind of work. Part of the improvement is because
some of the technology was immature, part of it is because the tools are now available
commercially, and another part is because there is now visualization technology freely
available on the Web. Building visualizations is labor intensive, so by moving to simple
visualizations, we can use visualizations much earlier. We are also able to link government
data to social media.
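The arithmetic behind such a mashup is simple, which is part of the point. The following Python sketch joins three year-keyed tables and derives total and per capita GDP in U.S. dollars. It rests on assumptions not in the talk: the function names are hypothetical, the exchange rates are assumed to have already been reduced to one value per year, and the numbers are placeholders chosen to make the arithmetic easy to follow, not real statistics.

```python
def to_usd(local_amount, local_per_usd):
    """Convert a local-currency amount to U.S. dollars."""
    return local_amount / local_per_usd


def per_capita(total, population):
    """Divide a total by the population it covers."""
    return total / population


def build_rows(gdp_local, rates, population):
    """Join three year-keyed tables on the years that all of them share."""
    years = sorted(set(gdp_local) & set(rates) & set(population))
    rows = []
    for year in years:
        total_usd = to_usd(gdp_local[year], rates[year])
        rows.append({
            "year": year,
            "gdp_usd": total_usd,
            "gdp_per_capita_usd": per_capita(total_usd, population[year]),
        })
    return rows


# Placeholder values only; NOT real GDP, exchange-rate, or population figures.
gdp_local = {2001: 800.0, 2002: 1000.0, 2003: 1200.0}
rates = {2001: 8.0, 2002: 8.0}  # local currency units per U.S. dollar
population = {2001: 10.0, 2002: 10.0, 2003: 10.0}

for row in build_rows(gdp_local, rates, population):
    print(row)
```

Years missing from any one source are dropped rather than guessed, so the chart shows only points that all three datasets support.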
One interesting question that has not really been part of the discussion that we have had
in the scientific community is how to find data. For example, we noticed that most of the U.S.
government data were about the states, and the metadata would represent the data as being
about the 50 states, but very few of the sets actually covered all the states. Some states were
missing. Some databases included American Samoa, Guam, Puerto Rico, et cetera, and the
District of Columbia (which is not officially a state). In this case, there is a very loose
definition of a state as opposed to a territory, and no one has much problem with that. But if a
researcher wants to find datasets about Alaska, it can be erroneous to just assume that all of the
datasets that say they cover the states will actually have Alaska data.
The other problem is that we cannot search for the keyword "Alaska" within the
datasets. If there is a column that represents the states, it may be called Alaska, it may be called
AK, it may be called state two, or it may be called S17:14B/X. The government has terabytes
of data, so how do we find the data that are for Alaska, for example? We cannot just call for
building a data warehouse and rationalizing the process, because these datasets are being
released by different people in different agencies in different ways.
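One lightweight way to approach the labeling problem is a small alias table that maps the labels publishers actually use onto a common code, and then checks which datasets really contain the state in question. The Python sketch below is an assumption-laden illustration: the alias table, the function names, and the sample datasets are all hypothetical.

```python
# Hypothetical alias table: different publishers label the same state differently.
STATE_ALIASES = {
    "alaska": "AK",
    "ak": "AK",
    "alabama": "AL",
    "al": "AL",
}


def normalize_state(label):
    """Map a raw label to a two-letter code, or None if it is unrecognized."""
    return STATE_ALIASES.get(label.strip().lower())


def datasets_covering(state_code, datasets):
    """Return names of datasets whose state column actually contains the state."""
    hits = []
    for name, raw_labels in datasets.items():
        codes = {normalize_state(label) for label in raw_labels}
        if state_code in codes:
            hits.append(name)
    return sorted(hits)


# Invented examples of what a "state" column might contain.
datasets = {
    "dataset_a": ["Alabama", "Alaska"],
    "dataset_b": ["AL", " ak "],
    "dataset_c": ["Alabama"],    # described as covering the states, but no Alaska rows
    "dataset_d": ["S17:14B/X"],  # opaque code that cannot be resolved without more metadata
}

print(datasets_covering("AK", datasets))
```

The opaque code in the last dataset is deliberately left unresolved: an alias table helps only with labels someone has already documented.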
Thus, metadata becomes very important. Simple and easy-to-collect metadata can allow
building faceted browsers and similar tools. It is a real research challenge, however, to
determine the kind of metadata for real scientific data that are powerful enough to allow useful
searches. With the government data, one problem is that all of the foreign databases are in their
own languages, and it is hard for those of us who are English language speakers to figure out
what is in the Chinese dataset, unless you hire a Chinese student. You cannot just use Google
Translate and expect the result to be academically useful.
People are starting to consider integrating text search and data search. This is an
application that one of my staff did, joined with Elsevier's new SciVerse, which was featured
on the U.S. Data.gov site (Figure 3-2).
When we do a keyword search for scientific data, this application also looks for datasets
that might be related to that same term. We are using very lightweight ontologies that
mostly just use the keyword. We are working on making it better. We think it would be good
if, when someone is looking for papers, the data in or about those papers were available,
along with whatever is in the world's data repositories that might be useful.
FIGURE 3-2 Search in SciVerse Hub on Climate Change.
SOURCE: James Hendler
When we start thinking about data integration, using data, or searching for data, one
thing that happens in the discovery space is that the data become part of the hypothesis
formation. That means that looking at, visualizing, and exchanging the data cannot be an
expensive add-on to a project. It has to become a key part of the standard scientific
workflow, with appropriate tools to reduce the costs.
What is promising in this area? We have been looking at linked open data issues
outside of science, and some of its promise has been explored (mostly within the ontology
area). Genomics and astronomy are two fields where we have actually seen semantic Web
technologies deployed, but many other science communities are still mostly thinking about
their own data holdings, not about being part of a much larger data universe. It is interesting
that when we talk to the Environmental Protection Agency, or to the National Oceanic and
Atmospheric Administration, or similar agencies, they say that they are providing data to many
communities, so they cannot easily use the ontologies of a particular one. Hence, we need to
learn how to map between these approaches.
How do I know what is in a large data store? How do I know what is in a virtual
observatory? I need services, metadata, and APIs. I also need to know the rules for using the
data.
Other issues we need to think about are related to policies. If I take someone's data
from a paper, mix them with someone else's data, and republish them, those people would
probably like to get credit for the data being reused. Or if I do some work on your data that you
do not like, you might want to rectify it. How do we do that?
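One possible answer to the credit question is to carry a record of every original source along with any derived dataset. The Python sketch below shows the idea only; the `Source` and `DerivedDataset` types, the `combine` helper, and the sample names and licenses are hypothetical and do not represent any existing provenance standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """An original dataset and the party who should be credited for it."""
    name: str
    creator: str
    license: str


@dataclass(frozen=True)
class DerivedDataset:
    """A mashup that remembers every original source it was built from."""
    name: str
    sources: tuple

    def credits(self):
        """Creators to credit, deduplicated, in first-seen order."""
        seen = []
        for source in self.sources:
            if source.creator not in seen:
                seen.append(source.creator)
        return seen


def combine(name, *parts):
    """Combine sources and/or derived datasets, carrying all provenance forward."""
    sources = []
    for part in parts:
        items = part.sources if isinstance(part, DerivedDataset) else (part,)
        for source in items:
            if source not in sources:
                sources.append(source)
    return DerivedDataset(name, tuple(sources))


# Invented sources for illustration.
a = Source("table_1", "Researcher A", "CC-BY")
b = Source("survey", "Researcher B", "CC0")
mix = combine("mix", a, b)
remix = combine("remix", mix, a)
print(remix.credits())
```

Because the record travels with the data, a second or third round of reuse still names the original creators, and a creator who objects to a derived result can see which mashups drew on their data.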
National Technological Needs and Issues
Deborah Crawford
-Drexel University-
I am currently the vice provost for research at Drexel University, but I used to work at
the National Science Foundation (NSF). I spent almost 20 years at NSF and was fairly
significantly involved in the agency's cyberinfrastructure activities. My experience within NSF
and now at Drexel has provided me with an interesting perspective on what data-intensive
science means in a research university that aspires to be more research intensive. The main
message of my talk is that there have been many advances in data-intensive science in some
fields, but we have massive amounts of work still to do if data-intensive science is to realize its
full potential across all of the disciplines.
There have been many reports issued over the past decade that address the importance
of data-intensive science and the role of information technology in advancing science. An
important one is the Atkins report that Dr. Hey referred to earlier. In that report, Dan Atkins of
the University of Michigan and his committee speak of revolutionizing the conduct of science
and engineering research and education through cyberinfrastructure, and they examine
democratizing science and engineering. Those were tremendously powerful statements in
2003, and they stimulated a great deal of excitement within the scientific community.
Since then, there have been a number of reports that have specifically addressed data-
intensive science, several of which were released in the past couple of years. This is a quote
from a joint workshop between the NSF and the U.K. Joint Information Systems Committee
that was held in 2007: "The widespread availability of digital content creates opportunities for
new kinds of research and scholarship that are qualitatively different from traditional ways of
using academic publications and research data."
What we have heard so far in this workshop was from those whom I would describe as
the visionaries and the trailblazers in science and engineering, representing those communities
that were very motivated to take advantage of information technology in their work in order to
advance their field of science.
I want to focus now on the long tail of science--the researchers in those fields where
the immediate advantages of information technology and collaboration are not so readily
apparent. There is tremendous opportunity in those communities, but we do not quite know
how to realize those opportunities.
I would like to provide a snapshot of some surveys of researchers working in different
communities. The main message is that computer-mediated knowledge discovery practices
vary widely among scientific communities and among colleges and universities. In the United
States and in the United Kingdom, for example, this is certainly true. There are some colleges
and universities that know how to take advantage of their digital technologies and their digital
capabilities, while there are others that simply are way behind the curve.
There are three fundamental issues that communities or universities must address: (1)
What kinds of data and information are made open, at what stage in the research process, and
how? (2) To which groups of people are the data and information made available and on what
terms or conditions? (3) Who develops and who has access to the tools and training to leverage
the power of this discovery modality? These are fundamental questions and are key to the
realization of data-intensive science.
I am going to talk about two case studies, one conducted in the United Kingdom by the
Digital Curation Center, and one conducted within Yale University. Both point to some key
features we need to address.
The study done by the Digital Curation Center in the United Kingdom was called
"Open to All." Its purpose was to understand how the principles of digitally enabled openness
are translated into practice in a range of disciplines. These are the kinds of questions we need
to be asking ourselves to determine the actions that we need to take to make sure that this
modality of science is accessible and advantageous to everyone.
In this study, the authors characterized a research life-cycle model and asked different
communities how they were using digital technologies within the context of that life cycle to
further their science. They surveyed groups among six communities: chemistry, astronomy,
image bioinformatics, clinical neuroimaging, language technology, and epidemiology. The
range of responses within those different communities was fascinating to see. Surprisingly,
chemistry was the trailblazing community, at least among the individuals surveyed for this
study. The chemists were using social networking tools, Open Notebook Science, wikis, and all
the modalities of digital technologies to collaborate, and to collect, analyze, and publish their
data. Everything was quite seamless from a community that I had not anticipated to be one of
the trailblazing communities.
It was interesting to hear from the clinical neuroimaging group. They were so skeptical
of the value of data-intensive science and open data sharing that in this study they refused to
even disclose their identities as individuals. We therefore went from one extreme to the other,
and we saw the range of practices and values within different scientific communities.
From their conversations with these communities, the group that conducted these case
studies in the United Kingdom came up with a list of the perceived benefits of open data-
intensive science. It included improving the efficiency of research and not duplicating previous
efforts, sharing findings, sharing best practices, and increasing the quality of research and
scholarly rigor. This last one was especially true for the members of the chemistry community
who were surveyed. They saw a great opportunity in making available in blogs the day-to-day
data that they collected--not just raw data, but derived data. They found tremendous benefits
in making that open to the community to provide more scholarly rigor.
Among the other perceived benefits were enhancing the visibility of access to science,
enabling researchers to ask new questions, and enhancing collaboration in community-
building, which all of the groups surveyed agreed was a benefit. There was also a perceived
benefit of increasing the economic and social impact of research about which the report was
ambiguous. Although economic and social impacts each are treated separately as a perceived
benefit, the sense was that the real value could not actually be measured. So, there was a
question: Can we create economic value by making our data--essentially our intellectual
property--much more openly accessible?
One of the perceived impediments was the lack of evidence of benefits. Researchers
who felt this way were not motivated to make their data openly available. Other impediments
were the lack of clear incentives to change and the values in academe not being consistent with
open data sharing. With tenure, promotion, and the drive to publish in mind, researchers perceived
that the only way to publish is in the open literature and that it is not in a researcher's interest to
share data, because someone else may use the data to advance farther than the person who shared
them. Competitiveness is a big issue.
The conflict with the culture of independence and competition is absolutely related to
the lack of clear incentives to change. Other impediments were inadequate skills, a lack of
time, and insufficient access to resources. Another big concern was how to train both the
scientists who are practicing today and the scientists and engineers of the future. Researchers
were also worried about how long it took them to prepare their data for open access, about
quality, and about ethical, legal, and other restrictions to access. Those were big issues,
especially in the life sciences community.
The report's recommendations called for policies and practices for data managing and
sharing. Communities are desperate for guidance on these issues. What have the trailblazers
learned that can be shared and applied more broadly? Contributions to the research
infrastructure should be valued more--that is something we have heard often. Training and
professional development should be provided, and there should be an increased awareness of
the potential of open business models. That is related to the attitude among researchers that
their data are their intellectual property and they want to derive some value from that; thus,
they feel that if they make their data openly available, they are giving that value away.
Assessment and quality assurance mechanisms should be developed.
The study done by the Yale University Research Data Taskforce was conducted by an
organization within Yale called the Office of Digital Assets and Infrastructure, which is an
organization established to accrue the value over time of the digital assets that result from
research and scholarly activity within the institution. The office conducted this study to
determine the requirements and components of a coherent technical infrastructure, to provide
service definitions, and to recommend comprehensive policies to support the life-cycle
management of research data at Yale.
Given the discussion earlier about research libraries, this is interesting. Here is Yale
University doing a survey that includes both the librarians and the information technology
enterprise staff at the university to determine what their faculty base most needs for managing
digital data. Very much like the other survey, this one found that data-sharing practices vary
widely among the disciplines they surveyed. The researcher has the most at stake when
determining what the data-sharing practices are. Yale is going to create an institutional
repository to secure, store, and share large volumes of data.
There are certainly some institutional pioneers in this area, such as the Massachusetts
Institute of Technology and Indiana University, and many that I characterize as "slow
followers." A major concern is how an institution can afford to create and maintain
infrastructure like this.
Yale University understands that it needs to develop and deliver research data curation
services and tools to all of its interested parties, not only in science and engineering, but also in
the humanities and in the arts. Recognizing the importance of persistent access and
OCR for page 122
addressed in the other two branches of the scientific method, the deductive and the empirical
branches.
In the deductive branch in mathematics and logic, people have worked out what it
means to have a proof, to really communicate the thinking behind the conclusions that are
being published. Similarly, in the empirical sciences, there are standards that have been
developed, such as controlled experiments and the machinery of hypothesis testing, and how
they are communicated. In a methods section, there is a very clear way that these results are to
be written for publication, designed so that other people can reproduce the thinking and the
results themselves.
In computational science, we are now struggling with this issue: how to communicate
the innovations that are happening in computational science in such a way that they will meet
the standards that have been established in the deductive and empirical branches of science.
My approach is to understand these issues in terms of the reproducibility of
computational science results, and this gives me the imperative to share the code and the data.
We have seen many interesting examples of how reuse can be facilitated and what happens
when someone actually shares open data. This gives rise to a host of issues about ontologies,
standards, and metadata. This is framed within the context of reproducibility. The reason that
we are putting the data and the code online is to make sure that the results can be verified and
reproduced.
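A small, concrete piece of this is being able to tell whether the code and data someone downloads are the same ones behind the published result. The Python sketch below fingerprints each published file with SHA-256 and later reports any file that is missing or has changed. It is an illustration under assumptions: the file names and contents are invented, and the helper names are hypothetical. Matching fingerprints show only that the files are unchanged, not that the analysis is correct.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(files):
    """Map each file name to the fingerprint of its content."""
    return {name: fingerprint(data) for name, data in sorted(files.items())}


def verify(files, manifest):
    """Return names whose content is missing or differs from the manifest."""
    problems = []
    for name, expected in manifest.items():
        data = files.get(name)
        if data is None or fingerprint(data) != expected:
            problems.append(name)
    return problems


# Invented file contents standing in for shared code and data.
published = {
    "analysis.py": b"print('result')\n",
    "observations.csv": b"id,value\n1,2.5\n",
}
manifest = build_manifest(published)
print(json.dumps(manifest, indent=2))

# A copy in which one data value was altered after publication.
tampered = dict(published)
tampered["observations.csv"] = b"id,value\n1,9.9\n"
print(verify(tampered, manifest))
```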
Here is an example. In 2007, a series of clinical trials were started at Duke University
that have since been terminated, but it took a few years to terminate them. They were based on
computational science results in personalized medicine that had been published in prestigious
journals, such as Nature Medicine. Researchers at the M.D. Anderson Cancer Center tried to
replicate the computational work that had gone into the underlying science, and uncovered
serious flaws undetected by peer review. The study was plagued with a myriad of issues, such
as flipped drug labels in the dataset and errors of alignment between observations in treatment
and control groups--errors that are simple to make. The clinical trials were canceled in late
2010, after patients had been assigned into treatment and control groups and had been given
drugs. One of the principal investigators resigned from Duke at the end of 2010. The point is that
we have to assume that errors are everywhere in science, and our focus should be on how we
address and root out those errors.
There was a discussion earlier of how the data deluge is a larger manifestation of issues
that have been seen before. Also, Dr. Hey gave the example of Brahe and Kepler and how what
must have been a data deluge in their context ended up engendering significant scientific
discoveries. In that sense, there is nothing fundamentally new here. We are doing the same
thing in a methodological sense as we have always done, but we are doing it on a much larger
scale. The scope of the questions that we are addressing has changed. In that sense, the nature
of the research has changed.
Dr. Hey told us that we need more skills to address this concern. Dr. Friend then said
we need verifiability, a point that I have also attempted to make in this talk. This means that
the infrastructure and incentives need to adapt to the research reality even though the process
of science is not changing in any fundamental way. That in turn means that it will be important
to develop tools for reproducibility and collaboration. For example, some presenters also talked
about provenance- and workflow-tracking tools and openness in the publication of discoveries.
In short, there are many different efforts that are needed. The solutions in this area will
not be something that comes down as an executive order and then all scientists are suddenly
open with their data and code. The problems are much too granular, and so the solutions must
emerge from within the communities and within the different research and subject areas.
VisTrails is a scientific workflow management system. It was developed by a team
from the University of Utah that is now moving to New York University. It tracks the order in
which scientists call their functions and the parameter settings that they have when generating
the results. These workflows can then be shared as part of the publication, and they can be
regenerated as necessary. VisTrails also promotes collaboration.
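To make the idea of tracking calls and parameter settings concrete, here is a minimal sketch in Python. It is not the workflow system's own code and does not use its interface; the ProvenanceLog class and the scale and threshold functions are hypothetical names invented for illustration, and the sketch assumes that each step is an ordinary function whose arguments can be written out as JSON.

```python
import functools
import json


class ProvenanceLog:
    """Hypothetical recorder of which functions ran, in what order, with what settings."""

    def __init__(self):
        self.steps = []       # one record per call, in call order
        self.functions = {}   # undecorated functions, kept so steps can be re-run

    def track(self, func):
        """Decorator that logs the name and parameter settings of every call."""
        self.functions[func.__name__] = func

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            self.steps.append(
                {"function": func.__name__, "args": list(args), "kwargs": dict(kwargs)}
            )
            return result

        return wrapper

    def to_json(self):
        """Serialize the recorded workflow so it can be shared with a paper."""
        return json.dumps(self.steps, indent=2)

    def replay(self):
        """Regenerate every result by re-running the recorded steps in order."""
        return [
            self.functions[s["function"]](*s["args"], **s["kwargs"])
            for s in self.steps
        ]


log = ProvenanceLog()


@log.track
def scale(values, factor=1.0):
    return [v * factor for v in values]


@log.track
def threshold(values, cutoff=0.0):
    return [v for v in values if v >= cutoff]


raw = [1.0, 2.0, 3.0]
scaled = scale(raw, factor=2.0)
kept = threshold(scaled, cutoff=3.0)
print(log.to_json())
```

Calling log.replay() re-runs both recorded steps with the same settings, which is the sense in which a shared workflow can be regenerated. A real system would also need to record software versions and input data, which this sketch leaves out.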
Some recent work by David Donoho and Matan Gavish was presented for the first time
in a symposium on reproducibility and computational science, held at the American
Association for the Advancement of Science. They have developed a system for automatically
storing results in the cloud in a verifiable way as they are generated, and creating an identifier
that is associated with each of the published results. For example, if a paper contains a figure,
then we would be able to click on it and see how it was done and reproduce the results, as the
means to do this are automatically in the cloud.
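One way to picture an identifier that is tied to a published result in a verifiable way is a content hash. The sketch below is an assumption made for illustration, not a description of the system just mentioned: the function names result_identifier and verify are hypothetical, and the choice of SHA-256 over a canonical JSON encoding is only one of several reasonable designs.

```python
import hashlib
import json


def result_identifier(code_text, parameters, result):
    """Return a SHA-256 hex digest computed over the code, parameters, and result."""
    # sort_keys and fixed separators give the same bytes for the same content,
    # regardless of the order in which dictionary keys were inserted.
    payload = json.dumps(
        {"code": code_text, "parameters": parameters, "result": result},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify(identifier, code_text, parameters, result):
    """Check that a regenerated result matches the published identifier."""
    return identifier == result_identifier(code_text, parameters, result)


code_text = "mean = sum(x) / len(x)"
parameters = {"x": [1, 2, 3]}
result = 2.0

identifier = result_identifier(code_text, parameters, result)
print(identifier)
print(verify(identifier, code_text, parameters, result))
```

Because the identifier depends on the code, the settings, and the output together, a reader who regenerates the result and gets a different value would see the check fail.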
Another useful tool is colwiz, a name derived from "collective wisdom." Its purpose is
to manage the research process for scientists.
One of the major problems with reproducibility is that, unless a scientist is using these
specialized tools, there is no automatic mechanism for researchers to save their steps as they
advance. After they have finished an experiment and written the paper, they may find that
going back and reproducing the experiment is even more painful than going through it the first
time. Thus, tools like colwiz could help both with communicating scientific discovery and with
reproducibility.
These issues are all related to the production of scientific data and results. There are
also some aspects related to publication and peer review. It is a lot of work to request
reviewers, who are already overworked, to review code or data and incorporate them into the
review process. Maybe we will get there one day for computational work, but certainly not
now.
The journals Molecular Systems Biology and The EMBO Journal (the journal of the European
Molecular Biology Organization) are publishing the correspondence between the reviewer and the authors,
anonymously but openly, along with the actual published results. This is one approach that is
being tried to be more open and transparent.
One of the reasons for this practice is that there is a great inequality in the power of the
reviewers and the authors. In particular, reviewers can ask for additional experiments and more
exploration of data from the person who is trying to get the paper published. Particularly for
prestigious journals, reviewers have a lot of power. These journals are trying to balance this
power by allowing readers to see the dialogue that took place between editors and authors
before a publication.
Furthermore, many journals now have supplemental materials sections in which
datasets and code can be housed and made available for experimentation. The sections are not
reviewed and so far have had varying amounts of success.
In the February 11, 2011, issue of Science, there was an editorial emphasizing that, in
addition to data, Science is now requiring that code be available for published computational
results. That is extremely forward thinking on the part of Science. Science is folding this into
its existing policy, which is that if someone contacts an author after publication and asks for
the data--and now the code--the author must make it available.
An approach that other journals have taken is to employ an associate editor for
reproducibility. The journal Biostatistics and the Biometrical Journal do that. The associate
editor for reproducibility will regenerate and verify the results, and if the editor can produce
the same results that are in the manuscript, the journal will kitemark the published article with
a "D" for data or "C" for code. In this case the journal can advertise that readers can have
confidence in the results, because they have been independently verified.
There are also new journals that are trying to address the lack of credit authors get for
releasing and putting effort into code or data or for attaching metadata. They are trying to
address the issue of incentives, and their focus is on open code and open data. Open Research
Computation offers data notes, for example, and BioMed Central has research notes.
PubMed Central and open access are older concepts embedded within the infrastructure
of the National Institutes of Health (NIH). But why does it stop with NIH? Could we have a
Pub Central for all areas and allow people to deposit their publications when they publish,
similar to the NIH policy?
There has been much discussion about the peer-review data management plan at the
National Science Foundation (NSF). This is a very important step even though it has been
called an unfunded mandate. It is an important experiment in that it creates the possibility of
gathering information about how much it will cost and how data should be managed. It is, in a
way, a survey of researchers on how they are conceiving of these issues. Maybe the costs are
less than NSF worries about, or maybe they are more, but at least we will be able to get a sense
of this.
One report that I was involved with along with John King was for the NSF Office of
Cyberinfrastructure on virtual communities. We advocated reproducibility as part of the way
forward for the collaborative, very high-powered computing problems that we are addressing.
As part of the fallout from the problem I mentioned with the Duke University clinical
trials, the Institute of Medicine convened a committee to review omics-based tests for
predicting patient outcomes in clinical trials. "Omics" refers to genomics, proteomics, and so
on. The committee is chaired by Gil Omenn, and part of its mandate concerns issues of
replicability and repeatability and how the articles published in Nature Medicine that led to the
clinical trials could have gotten through with what were, in hindsight, such glaring errors.
There seems to be a hesitation on the part of some funding agencies to fund the
software development or infrastructure necessary to address reproducibility and many of the
other issues that we have discussed so far. Let me give an example of an e-mail that was sent to
a group mailing list. The author of the e-mail was talking about how his group develops open-source
software for research. He is a prominent researcher, who is very well known and very
influential. His group develops open-source tools, and it is very difficult for him to get funding
even when applying to NIH programs that are targeted at promoting software development and
maintenance. In particular, he said, "My group developed an open-source tool written in Java.
We started with microarrays and extended the tool to other data. There were 25,000 downloads
of this tool last year. So we submitted a grant proposal. Two reviewers loved it. The third one
did not because he or she felt it was not innovative enough. We proposed three releases per
year, mapped out the methods we would add, included user surveys, user support, and
instructional workshops. We had 100 letters of support."
This quote is from the negative review: "This is not science. This is software
development. This should be done by a company." We can see that there seems to be a
bifurcation in understanding the role that software plays in the development of science.
One idea for the funders of research might be: Why not fund a few projects to be fully
reproducible to see what barriers they run into? Is the problem that they do not have
repositories where they can deposit their data? Is the problem that they encounter issues of
maintaining the software? Where are the problems? Let us do a few experiments to see the
stumbling blocks that they encounter.
On the subject of citation and contributions, as we incorporate databases and code, we
need to think about how to reward these contributions. Many contributions to databases now
are very small, and there are databases where 25,000 people have contributed annotations.
Hence, there are questions about how to cite and how to reward this work. What is the relative
worth of, for example, a typical article with a scientific idea versus software
development versus maintaining the databases? Typically the last two have not been well
rewarded, and our discussion here calls that practice into question.
I will end with a figure from a survey I did of the machine-learning community (Figure
3-11). These are the barriers that people said that they faced most dramatically when they were
sharing code and sharing data.
FIGURE 3-11 Barriers to data and code sharing in computational science.
SOURCE: Victoria Stodden
A Government Perspective
Walter L. Warnick
-Department of Energy-
I am the director of the Office of Scientific and Technical Information, which manages
many of the scientific and technical information operations of the Department of Energy
(DOE). Our goal is to maximize the value of the research and development (R&D) results
coming from the department.
To put this into perspective, each government agency has an organization that manages
information. Those organizations have gotten together and formed an interagency group called
CENDI. Bonnie Carroll is the executive secretary of CENDI. The National Science Foundation
is represented in CENDI by Phil Bogden, and the National Institutes of Health (NIH) is
represented by Jerry Sheehan. I represent the DOE, and all the other agencies have
representatives too, including the Library of Congress, Government Printing Office,
Department of Agriculture, Department of the Interior, and practically every other organization
that has a big R&D program. Ninety-eight percent of the government's R&D is represented,
and several organizations that do not have R&D programs are also in CENDI.
The results of the U.S. government R&D investment, which amounts to about $140
billion a year, are shared via different formats. There is the text format, which includes
journals, e-prints, technical reports, conference proceedings, and more. There is nontext data,
which includes numeric datasets, visualizations such as geographic information systems,
nonnumeric data such as genome sequences, and much more. And there are other formats,
including video and audio.
Each format is in a state of change and presents its own set of challenges. With journal
articles, for example, the big issue within the government is public open access versus
proprietary access. I think we all agree that the gold standard for text-based scientific technical
information is the peer-reviewed journal article. Many highly respected journals are available
only by proprietary subscription access. NIH has pioneered a transition to make journal
literature publicly accessible. The effort has attracted a lot of attention, and it is still getting a
lot of attention within the government. Principal investigators are asked to submit journal-
accepted manuscripts for publication in the NIH public-access tool, PubMed Central.
What is significant now is that the America COMPETES Reauthorization Act, which
became law in December 2010, calls upon the Office of Science and Technology Policy
(OSTP) to initiate a process to determine if public access to journal literature sponsored by
government should be expanded. For now, NIH is the only agency that makes a requirement of
public access to journal articles. The DOE and other agencies are already empowered by law to
adopt that requirement, but we do not have to adopt it, and as a matter of practice we do not. I
think that OSTP will soon form a committee, which the COMPETES Act calls for, to get
input from stakeholders, consider the issues, and develop some recommendations.
Beyond the journal literature, there are gray literature issues, and integrating them with
journal literature is important. Gray literature includes technical reports, conference
proceedings, and e-prints. It is typically open access, but not all of it is. All of the DOE's
classified research is reported in gray literature, and of course that is closely held, but I am
talking here about the open access part of DOE's offerings. Gray literature is often organized
into single-purpose databases. For example, in the DOE we have something called the
Information Bridge, which has 300,000 technical reports produced by DOE researchers from
1991 to the present. It is all full-text searchable. The average report is 60 pages long, so they
are fairly detailed. Many other agencies have similar resources. Other databases handle e-
prints, conference proceedings, patents, and more.
DOE pioneered novel and inexpensive ways to make multiple databases act as if they
are an integrated whole, one example of which is Science.gov. Science.gov posts the
publications of all the agencies that are in CENDI, so it is a very large virtual collection of
databases. It is all searchable, and a single query brings back results ranked by relevance. It has
won awards for being easy to use and is an example of transparency in government.
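The pattern of making several databases act as an integrated whole can be sketched as follows. This is an illustration under stated assumptions, not the architecture behind the service described above: the function names, the two in-memory databases, and the toy relevance score (the fraction of query terms found in a title) are all hypothetical.

```python
def relevance(query, title):
    """Toy score: fraction of query terms that appear in the title."""
    terms = set(query.lower().split())
    words = set(title.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def make_database(titles):
    """Build a stand-in database whose search returns titles sharing any query term."""
    def search(query):
        terms = set(query.lower().split())
        return [t for t in titles if terms & set(t.lower().split())]
    return search


def federated_search(query, databases):
    """Send one query to every database and merge the hits into one ranked list."""
    hits = []
    for name, search in databases.items():
        for title in search(query):
            hits.append((relevance(query, title), name, title))
    # Highest score first; ties are broken by source name, then title.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits


databases = {
    "reports": make_database(["Solar cell efficiency report", "Wind turbine noise"]),
    "eprints": make_database(["Solar wind measurements", "Genome sequence notes"]),
}

results = federated_search("solar wind", databases)
for score, source, title in results:
    print(f"{score:.2f}  {source:8s}  {title}")
```

Scoring every hit with the same function after the results come back is a design choice made here so that hits from different sources are comparable; a production system would have to reconcile whatever ranking each source provides.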
The DOE's largest virtual collection that integrates gray literature, and some journal
literature, is WorldwideScience.org, which is a computer-mediated scientific knowledge
discovery tool. WorldwideScience.org makes the knowledge published by, or on behalf of, the
governments of 74 countries, including the United States, all searchable by a single query. The
amount of content is huge, about 400 million pages. A user can enter a query in any one of nine
languages, and the system will search all the databases in the language of the database and then
bring back the list of hits in the language requested. It is new, and it is growing very rapidly
under the supervision of the international WorldWideScience Alliance.
We also manage nontext sources--the numeric data, genome sequences, and so forth.
The main questions are to what extent should such sources be made accessible and for how
long. Some agencies are grappling with the issue by formulating data policies. Some agencies
require principal investigators to propose data-management plans. The America COMPETES
Reauthorization Act calls upon OSTP to initiate a process to encourage agencies to consider
these issues, and it is the same part of the act that I mentioned previously that addresses journal
literature. Hence, committees stemming from the act are handling both text items and nontext
items.
Everything we do entails cost. Whether it is just sharing information or doing analysis
of the information, there is always a cost. Here is a way that I talk to my funding sources about
cost. Imagine a graph whose vertical axis is the pace of scientific discovery and whose
horizontal axis is the percentage of funding for sharing of scientific knowledge (see Figure 3-
12). I think everybody agrees that science advances only if knowledge is shared. Therefore, let
us postulate an imaginary situation in which no one shared any knowledge. That would take us
to the origin of this graph, because there would be no funds expended for the sharing, but there
would be no real advance in science either. At the other end of the x-axis, at the 100 percent
mark, if we spent all our money sharing and none of it doing bench research, soon our pace of
scientific discovery would draw down close to zero too.
FIGURE 3-12 Knowledge investment curve.
SOURCE: Walter Warnick
We have two data points on this graph, both lying on the x-axis. In between the two
points, there is a curve, which we call the Knowledge Investment Curve. We do not know the
shape of the curve, but it is likely to have a maximum. The point is that decision makers affect
the pace of discovery when they determine the fraction of R&D funding dedicated to sharing.
That is the argument I make to my funders.
The point of the Knowledge Investment Curve is to make funders realize that while
they can dedicate funds to buy computers, hire more researchers, or build a new facility, they
should also weigh that investment against the benefits of getting information out better, sharing
it with more people, making searches more precise, and doing the kinds of analyses and data
mining we have talked about here today.
It would require a significant research program to calculate what the shape of this curve
is and where that optimum is, but we know that such an optimum exists somewhere.
Furthermore, the optimum is not the minimum, which is another message I give to the funding
sources. If we think that the purpose of an information organization is to be a repository where
information goes in, seldom comes back out, and seldom sees the light of day, that is not the
optimal expenditure for sharing.
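The argument about the Knowledge Investment Curve can be illustrated numerically. The true shape of the curve is unknown, so the formula below is purely an assumed stand-in chosen because it is zero at both ends of the axis; the function names pace and best_share and the exponents a and b are hypothetical.

```python
def pace(share, a=1.0, b=3.0):
    """Assumed pace of discovery when `share` (0 to 1) of funding goes to sharing."""
    # Zero when nothing is shared (share = 0) and zero when nothing is left
    # for bench research (share = 1); positive in between.
    return (share ** a) * ((1.0 - share) ** b)


def best_share(steps=100, a=1.0, b=3.0):
    """Find the funding share on an evenly spaced grid that gives the highest pace."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda s: pace(s, a, b))


print(pace(0.0), pace(1.0))   # both endpoints give zero pace
print(best_share())           # the optimum lies strictly between them
```

For this assumed family of curves the peak sits at a / (a + b), so the location of the optimum depends entirely on exponents that would have to be estimated from data. The sketch supports only the qualitative point: a curve that is zero at both ends and positive in between has its maximum at neither the minimum nor the maximum level of spending on sharing.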
Discussion
DISCUSSANT: My question to Dr. Warnick and Dr. Stodden is related to the
knowledge investment curve presented by Dr. Warnick. In a sense, the peak of that graph is the
amount of funding spent on infrastructure that enables research versus the amount of funding
spent on research. Where do you think that should fall for any example you choose?
DR. WARNICK: We would probably reach a consensus that there is not enough money
being spent on, for example, sharing of knowledge and analysis, and the development of some
of the tools that were discussed earlier. As to how much below the optimum it is, let me give
an example.
Consider the National Library of Medicine as an excellent example of an information
management organization. The funding for that organization exceeds the funding for all the
other information organizations combined. I am not suggesting that the National Library of
Medicine is overfunded, but I will say that the others are underfunded.
DR. STODDEN: I absolutely agree. I think it has become much harder than it has been
traditionally to share our work. As our science becomes more data intensive and involves code,
those two areas add extra expenses involved with sharing that are not wholly taken care of in
funder budgets. The science itself, through technology, has changed, and it is making ripple
effects through our funding agencies, which have not quite caught up yet, I think.
DISCUSSANT: Dr. Hey talked about the new data-intensive work as a new paradigm,
yet so much of the discussion has been about things like reproducibility in the traditional sense,
but with code and data added, or sharing in the traditional sense, but with code and data added.
So where does thinking about new paradigms or new ways of doing things come in? Where do
you see that falling in the spectrum of who is responsible and how that affects this whole
question?
DR. WARNICK: Even the sharing part is being subjected to new paradigms. Just to
make a point, the infrastructure behind Science.gov and WorldwideScience.org is something
that we see very rarely in everyday experience on the Web, and it was developed and matured
as a result of some government investment.
To take your point directly about the other kinds of analysis that were discussed earlier,
however, since the government is providing $140 billion of funding for research, then the
analysis that gets more mileage out of that research ought to be funded by the government too.
Of course, the government always welcomes the idea that the private sector can take and utilize
these results, but it must be the funding agencies that do the initial work. I think that the reason
why people have not heard the Department of Energy mentioned in this discussion before now
is because we were doing very little in this regard compared to the National Institutes of Health
(NIH) and the National Science Foundation, and that ought to change.
DR. STODDEN: My perspective within academia is that processes are changing for
hiring, promotion, and work committees. The scientists who are clued into these issues of
reproducibility, open code, and open data seem to be a little more interested than people who
are carrying on in a different paradigm. Academia is conservative in the sense that things
change slowly. Therefore, it takes time for these practices to percolate through.
DR. BERMAN: The issue of the gap really intrigues and concerns me, because in our
real life, if we want to find a restaurant, we can go to Yelp. We can find which restaurants are
near us, what is available, who likes them, and so on. There is an application for that. We can
get this information on an iPhone.
Consider taking all of our scientific products and putting them in that world. Is there a
place where we can find scientific databases and see who liked them? Can we see who added
metadata to them in a very user-friendly way? Can we access them easily?
We are starting to see crossover between the academic world and the world of
commercial applications. Phil Bourne has a project called SciVee, where we can show how to
do different kinds of experiments or give a talk on data, YouTube-style. We can imagine
using many of these commercial types of applications and technologies in academia. Some
interesting questions arise: What does it mean if we have a data collection and many people
like it? Does that mean it is a good data collection? Does that mean it is an economically
sustainable data collection? We should not utilize one set of tools for our academic work and
another set of tools for applications in the real world without bridging the gap.
MR. UHLIR: There is a proposed act in Congress, the Federal Research Public Access
Act, that broadens the NIH PubMed Central grantee deposit policy to include other agencies
with annual research and development budgets of $100 million or more, although I do not
know if it is going to actually become law.
Also, in the list of peer reviews presented earlier, there is one other model that was
missing: postpublication review. It is not a traditional peer review. It is an open peer review, it
is moderated, and it is ongoing. There are two kinds of this model. One is just commentary,
and one is papers generated in response to a big paper. The model I am thinking of is the
European Geosciences Union's journal Atmospheric Chemistry and Physics in Munich. I do
not know how many other journals do that, but it is yet another model for a review. Even if the
code and the data are not available, people can ask very pointed questions that can be answered
by the researchers. That is a potentially valuable way of reviewing results.
In response to Dr. Hendler's comment, there have been several people who have made
some intermediate suggestions, such as Dr. Friend's journal of models being published. But the
fundamental problem is that we have moved all the print journals wholesale onto the Web
while hardly changing the model at all. To obtain good models, one would deconstruct the
scholarly communication process used in the print paradigm and reconstruct it in a way that
makes sense on the Web. Thus, we are repeating everything we have done before and not
really thinking about what the Web allows us to change in order to achieve greater efficiency. I
think the print journal system itself is an outmoded way of communicating. I am sure there are
all kinds of new paradigms, but I will leave it at that.