4. Summary of Workshop Results from Day One and Discussion of
Additional Issues
THE FUTURE OF SCIENTIFIC KNOWLEDGE DISCOVERY IN OPEN NETWORKED ENVIRONMENTS 133
Introduction
Bonnie Carroll
-Information International Associates-
Today, we will take the results of the discussion from the first day of the workshop and
build on them. We would like to hear about some options for what could be done over the next
few years and what the value proposition is. Closely coupled with that is the issue of how we
will know when we have succeeded.
We had some excellent concepts and ideas during the first day of the workshop that
highlight much about the nature of the changes that are ongoing. I heard that we should not
stay static and that the community might benefit from a less linear and more dynamic
approach. We talked about trailblazers versus the long tail of data activities, and whether we
are getting into a credibility crisis, because many of the things we are doing now are not as
shareable as they were when they were in print. Many codes are not available, for instance.
The title of the workshop is "The Future of Scientific Knowledge Discovery in Open
Networked Environments." The title recognizes that we are adding a fourth paradigm to the
research process. We already have three paradigms--the theoretical, the experimental and
observational--but now we have added the computational or the data intensive. The
presentations we heard during the first day of the workshop focused mainly on the latter
paradigm.
How do these approaches build on each other and work together? Why did we
immediately jump to the data-intensive sciences when the title referred to a networked
environment? The answers are obvious. First, the technology is so enabling, and
second, there is the data deluge.
In the text that follows, we discuss the four sets of topics identified in the statement of
task. Each section begins with an Introductory Summary of the Issues Identified in Day One of
the workshop. In the first and fourth sections, Puneet Kishor reviews the sets of issues from the
first and fourth items in the statement of task, namely, the Opportunities and the Benefits and
the Range of Options. In the second and third sections, Alberto Pepe addresses the Techniques
and Methods, and then the Barriers, respectively. Each of these Introductory Summaries of the
Issues Identified on Day One of the meeting by these two rapporteurs is then followed by a
Summary of the Discussion of each topical area by all the workshop discussants on Day Two
of the meeting.
4. Opportunities and Benefits for Automated Scientific Knowledge Discovery
in Open Networked Environments
Introductory Summary of the Issues Identified on Day One
Puneet Kishor
-University of Wisconsin-
Among the issues we heard about during the first day of the workshop was that certain
problems in science, no matter what domain, are really the same. A scientist performing
observational research, for instance, generates data and then manipulates, organizes, analyzes,
visualizes, and reports on them, or maybe preserves them, and then the cycle repeats. The
digital medium is one of the best ways to communicate knowledge and is probably also the
best way to create knowledge. The concept of a knowledge ecosystem, the whole system of
producing, storing, and retrieving knowledge, depends on standards, metadata, discovery,
association, and dissemination.
One presentation focused on creating a precompetitive commons, using examples from
drug discovery, where fierce competitors can cooperate to produce evolving models of
diseases. Cooperation can improve the translation of publicly available molecular data into
biomarkers, and increase the opportunities for using drugs meant for one disease to treat
another disease.
The "crowd" plus the "cloud" concepts are important to the future of the network.
There was an assertion that science can learn about data sharing from the World Wide Web,
which could be argued either way. Discussants also mentioned exploring linked open data for
interdisciplinary uses. Access to many data is very restricted.
Also, there was much focus on the long tail of science, which is defined as those fields
where the benefits of information technologies are not immediately apparent. There were
conversations about quality, format, access, financial support, and policy issues.
Other topics that were discussed related to opportunities for innovation that we cannot
even imagine right now, that is, serendipitous innovation. Still others focused on government policies
that encourage or even require scientists to share their scientific data and information, such as
the America COMPETES Act.
Summary of the Discussion of Opportunities and Benefits
This session examines the opportunities and benefits in a 5- to 10-year time horizon, so
it is important that we understand the value proposition. Do we really know what the benefits
and the opportunities are?
Building and Sustaining Knowledge Networks and Infrastructure
Can we say that one opportunity is to provide more and better infrastructure? If a
common infrastructure is not there, is it possible to have such a common infrastructure so that
anybody can use it?
When we hear a term such as knowledge ecosystem, it raises the concept of a
knowledge network. The discussion on the first day of the workshop about a nonlinear way of
thinking or a linear way of doing science reminds us of a neural network. That is, there are
many different synapses in the brain that connect different facts and different memories and
experiences that lead to an understanding of a situation. The same kind of process happens in
science.
We saw the example of the pharmacological studies that found linkages between drugs
that were used to treat one disease and that could be used to treat another disease because there
was a common gene sequence. There was a connection in the network that showed that the
same thing that is being used somewhere else actually does link back. This suggests moving
toward not just linked data or a knowledge ecosystem but a linked knowledge system. That is,
we could merge those two ideas. We could benefit across all disciplines and truly enable long-
tail science--the science that is so removed from our experience that we never would have
thought of doing it.
We are likely in a period of persistent restricted resources, however. Hopefully it is not
a zero-sum game, but assuming it is such an era, what is the best way to deal with that
situation? Do we need to think of ways of developing systems that could be used by many
different kinds of scientists? Are there general characteristics of such systems? Again, we do
not need something only for computer science; we may also want something for the agencies
funding research.
Demonstrating the Value of Data Sharing
Change is always going to be before us, so how do we work together in our public-
private partnerships to adapt to change? We need to show the value of sharing data to our
principal investigators. Science is going to be done differently than it was by scientists 20 or 30
years ago, so we have an opportunity to redefine the science as well.
We can imagine a scenario in which a scientist is speaking to a congressional staffer
who has heard that scientists are just "welfare queens" in white coats and wants to know why
they are being supported. We need some convincing examples where data have led to
important discoveries. Earth sciences data and medical data could provide some useful
anecdotes, for instance.
We could survey the number of hours people spend analyzing the databases as opposed
to the number of hours they spend reading traditional journals. For example, Princeton
University spends about $10,000 per faculty member per year subscribing to journals. If we
could quantify the relative costs, we could put a value on the database. The bottom line is that
people who are not scientists do not necessarily start with the
assumption that science is worth something, so we need more descriptive stories. For example,
the digital library program is justified by pointing to Google. We need more compelling stories
that appeal to outsiders.
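The survey idea mentioned above can be illustrated with a small calculation. This is only a sketch: it assumes, purely for illustration, that value scales in proportion to the hours researchers spend with each resource, and the survey hours below are hypothetical. Only the $10,000 journal figure comes from the discussion.

```python
def implied_database_value(journal_cost_per_year, hours_on_journals, hours_on_databases):
    """Scale a known journal cost by the ratio of database hours to journal hours.

    Assumption (not from the workshop): value is proportional to time spent.
    """
    if hours_on_journals <= 0:
        raise ValueError("hours_on_journals must be positive")
    return journal_cost_per_year * hours_on_databases / hours_on_journals


# Journal subscription cost per faculty member per year, as cited in the discussion.
JOURNAL_COST = 10_000

# Hypothetical survey answers (hours per year); real values would come from a survey.
hours_on_journals = 200
hours_on_databases = 300

value = implied_database_value(JOURNAL_COST, hours_on_journals, hours_on_databases)
print(f"Implied database value per faculty member per year: ${value:,.0f}")
```

A proportional rule is the simplest possible choice; a real study would need to justify how time spent maps to value.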
Also, a protein chemist cannot work without the protein data bank. Many scientists
no longer work the way research was done before there were digital databases. They are
data-mining researchers. If you took away the databases, their type of research would stop. You
will get different answers in different kinds of research, however. For instance, there is much
more to online astronomical research than the official National Virtual Observatory or Virtual
Astronomical Observatory (VAO) projects. Although the VAO has led to many useful results,
so far it has focused only on infrastructure, without building end-user tools. In astronomy or in
other sciences, there are databases that are even larger than the protein data bank. Researchers
in these fields use them every day, but the results in some seminal papers may not be cited.
Astronomers use data collection facilities such as the Sloan Digital Sky Survey, the Two
Micron All Sky Survey (2MASS), and the upcoming Large Synoptic Survey Telescope.
Scientists see these huge databases as something on which they have come to rely. What they
may not understand well, however, is what would happen if the medium- and small-scale
datasets that are attached to literature were also openly contributed. Incentives alone might
not work. Scientists may not be convinced that their datasets are going to be worth depositing,
unless the funding agencies make that a condition of their grants.
Big science examples may be the wrong approach to convince scientists, however,
because they are too common. In the physics community, researchers would rather work with
the Large Hadron Collider (LHC) than their own small data collections, because the LHC will
have a huge amount of data. There are other fields in which the generation of the data may not
be very expensive, especially when amortized over a longer time period, but may nonetheless
be costly to maintain, because many people are involved or for some other reason.
There also may be a potential problem in asserting that it is more efficient to share data,
because if this were true, why are scientists asking for more money to do that? If researchers
are going to be more efficient, they should be able to perform better science with fewer dollars.
If this is not the case, then there is a question about the efficiency of the system. Also, when we
say that it is going to be more efficient, another issue is: Where will the savings go? The
savings could go into the analysis of existing data, rather than into collecting and organizing
new data. Put differently, the budgets may not go down, but the analytical capabilities could
increase.
The research funders could identify the different areas where data sharing is extremely
important, and they could then establish a reasonable time span for making the data available.
Some guidelines to the scientific community could be useful in this regard and constitute an
opportunity. For example, we might eliminate the 2-year gap between obtaining the research
results and sharing those results through publication. This could allow the timelier sharing
of the methodology, the data, and even the codes developed for the study. It could lead to real
benefits.
Too much specificity concerning what should be shared, however, can be a problem as
well. If science were segmented into different categories according to a level of what should be
shared, it could be counterproductive. A specific time frame for sharing data absent
countervailing considerations, such as the Health Insurance Portability and Accountability Act
(HIPAA) or similar requirements, could be very useful and would signal to scientists that there
is recognition that sharing is very important. It would encourage more scientists to start
thinking about the next steps after they produce the data, what infrastructure they will need,
how they will have to reorganize their own budgets and funding to do that, and so on.
A rigorous deadline can be a problem. Voltaire said, "The best is the enemy of the
good." Everybody is quick to complain that data curation is too much work. What is the cost,
however, to the nation's scientific effort to peer review weak papers? In some fields, preprint
distributions and conference presentations have essentially taken over the publication function,
and the journals exist so that people can get tenure. A system relying on something like the
Cornell arXiv might be better for scholarly communication, if there were something equivalent
to page ranking that would help people decide what they should be reading. If the publication
system were less strongly tied to tenure, it might be easier for the universities to determine that
they could use other metrics. This could save a lot of money if the universities no longer relied
on the ranking of publications to determine who would get tenure.
On the one hand, the promise of data sharing is becoming real. On the other hand, there
is a powerful reinforcement of Max Planck's observation that science proceeds funeral by
funeral. In the health sciences, the progress of science may be slower than it has been
historically. Some scientific processes may even retard the pace of scientific progress.
The most difficult organizations to consult for are the ones that are doing well. When
they are asked to change the practices that made them successful or rich, their view, of
course, is that they are fine, and they are not receptive to good ideas. The difficulty is getting the
existing order to change its approach when change is needed.
This is a collective action problem. Individual scientists who would like to fulfill their
ideals as scientists often find that they cannot. Framing the problem in terms of reproducibility
of scientific results can be very appealing to scientists, however. This can be a way that gives
them guidance about when to share data and what types of data to share. The same argument
can be helpful in convincing people of the importance of openness in this networked
environment, because this is a core principle of science. Nobody argues that reproducibility is
not important in the sciences, so this can be useful as a guiding framework for what steps are
important.
The climate data controversy at the University of East Anglia in the United Kingdom
and elsewhere has indicated that transparency is also key. The average person is not going to
reproduce those climate simulations, but citizens want some assurance that there is access to
how the model simulation was made and what data came out of it. Therefore, transparency and
reproducibility are both important.
We also may separate the different axes along which we can argue that there are
benefits. We talked about the speed of communicating research results, so presumably the
discovery of new results is one aspect. We have talked about value, and a couple of discussants
reminded us that the value of discovery, at least when we are talking to the public, is predicated
on the listener's view of the value of the underlying science. If someone believes that many
discoveries in the underlying science are not particularly valuable in the first place, being able
to make more of them faster does not seem to accomplish very much.
A strong case for value can be made in the biomedical sciences, particularly as research
is translated into health care. For example, it might be possible to determine the real dollar
value in repurposing pharmaceuticals. It might be worth picking an area in which the economic
value is fairly noncontroversial, as opposed to something like basic mathematics or astronomy,
and then start quantifying the value and take a look specifically at data reuse in those settings.
One thing that has not been mentioned in this discussion, but was talked about
yesterday, is the idea of data as an unexploited resource, because data have not been part of the
traditional journal-based way of communicating science. Negative results in early clinical trials
were traditionally not reported, for instance. Because doing that work costs money, there might
be some quantitative way to get at the value of what is now being withheld.
A couple of the discussants at this meeting were recently at a geosciences data
workshop, and the question arose about how to demonstrate the value of initiatives that
integrate data and make those data easily available. For example, congressional staff members
are interested in problems such as energy, food, and water. If we are asking for additional
resources to enable data sharing, it can be useful to tie that to solving some of the issues
important to them--to medical benefits, for instance, or better ways of managing natural
resources. That also bears on the discussion of incentives and mandates, because when
scientists feel that their work is directly related to producing a social benefit, there is additional
incentive for them to share the data.
Speeding Up the Pace of Science
The repurposing of data can be a real opportunity in speeding up research. How do we
learn from the World Wide Web? What are the inherent opportunities? Companies are
investing a lot of money and making advances in communication that people are using every
day. Scientists may have a problem in seizing the opportunities, yet everyone is using the
commercial media to communicate, to make decisions, and do many other things. Is there
something we can learn from the commercial community to apply to science in this regard?
Can we encapsulate very concisely some of the opportunities or the benefits?
In March 2011, none of the U.S. newspapers had any headlines about the earthquake
and tsunami in Japan, because, of course, it happened early in the morning, and they had
already gone to press. In the old days of newspaper-based journalism it would be in the
evening edition before most of us knew what was happening. Then when radio and TV
journalism came along, they became the media that provided news during the day. Now we can
watch videos of the tsunami almost instantaneously on the Internet.
Think about science at that pace. An astronomy course 20 years ago had films that were
not very compelling. It was like playing chess. Not only did you have to be smart but also very
patient while the other player moved. To keep up with the modern world, the pace of science
has to change. It is clear that data science--not just the big data science but the type of science
where we can find the appropriate data, pull it together, and quickly get it into a workflow--is
what will change the face of science. We need to keep up with that kind of speed.
Much of the reason that we may use the Internet as an analogy for how science can be
done is the speed of change. Just a few years ago, Twitter did not exist. Many
people who use it now cannot live without it. We are being asked to solve problems that are
changing at the pace of the Internet.
The geosciences are well positioned to react to this real-time aspect that was just
alluded to, whether it is in the study of earthquakes, tsunamis, flash floods, or tornadoes.
Scientists have the capability today to bring that information to be used in real time within
seconds of the event happening. This also bears on the point that was made earlier about tying
the research to societal benefits that resonate with people in Congress. What does this mean for
the protection of lives and property? The rudimentary informatics are in place that can be
greatly expanded in the future with the mobility that comes with the smartphones and other
mobile devices, so that you do not have to be at your computer anymore to get real-time
information.
The Web and semantic data, and linked open data, have had major effects, but not all
science may be done at the speed of Twitter. It is one thing to report some results quickly,
which is what Twitter and similar tools do; it is another thing to trust science that has been
rigorously tested. For example, there is a 10-year study on body fat as a predictor for heart
diseases. A 10-year study cannot be compressed into 24 hours, however. Some things just
cannot be hastened to completion. This comparison with the speed of Twitter and Facebook,
therefore, can be misleading, because it is comparing apples to oranges. The problem with
putting these sorts of ideas in a recommendation is that a policy maker might latch onto this
and say we want science to be done overnight, which could be more of a problem
than a benefit.
Preserving the rigor in science is a very important message, but in the current system
there is a major delay from obtaining the research results to publishing them. We should not
say: "I am sorry you cannot see my data, because that is how I am going to win my Nobel Prize
30 years from now." These practices and attitudes are barriers to the goal of meeting people's
real needs with science and of better understanding the phenomena around them. The 10-year
study is great when you need a 10-year study, but not if it is the primary way of doing science.
For the next 9 years, you may have people dying who could have been treated by something
that might have been ready to work. Verifying, validating, and doing rigorous science, of
course, are important. It is the sharing of information that can naturally speed up some of these
processes, however.
People in unrelated fields right now may not easily discover prominent work until it has
appeared in the archival journal. This means it has been written up, the visualization of data
has been prepared, and the paper has been sent to the journal, reviewed, and carried past the publication deadline. There are
faster approaches that tend to be very different from field to field, but the main issue is the
delay of disclosure, the long time it takes to go through this life cycle, workflow, or ecosystem.
We also ought to keep in mind that one size does not fit all for these processes. When
we examine the management of data, we may be talking about long-term experiments or about
datasets that are developed to help in an emergency. Just those two instances present very
different kinds of data types and uses.
Supporting Interdisciplinary Research
Most of the presentations on the challenges from the first day of the workshop were
very interdisciplinary, and most of the presentations on the solutions that described what
people are building were inside disciplinary boundaries. That actually has to do more with how
the funding mechanisms work than with what people would work on in a natural setting, but
that interdisciplinarity of the key challenges is important. What happens on the Web is that a
small number of people who can work between areas can transfer huge amounts of information
back and forth between those areas. That kind of work tends to be viewed as an ancillary
process, not as a major part of doing science. Further, most scientists who have done this kind
of work did so at the peril of their careers and may have been fortunate.
Another point is the importance of informal communication. Scientists are used to
formal communication. Some of the systems that were mentioned yesterday and some of what
the astronomers are doing now have, in addition to the primary scientific channel, an extra
social and communicative back channel. For example, informal communication is how
scientists may get information from conferences they have not attended personally. Journal
papers are the end product that gets put in the library for historical reasons. That is why they
are called archival journals and are not always the primary reporting of the scientific activity.
Recognizing the need for much more of that informal interchange is also important. Those are
two key aspects of research--an interdisciplinary approach and informal social
communication.
Informal communication can be structured like the World Wide Web: as a network of
information and knowledge that you can search and in which you can find linkages. We
can take advantage of that type of structure, which does not eliminate the journal system, but
improves the communication among different sciences, and not just within a single science.
Finding the commonalities that allow us to cross boundaries in
data sharing is a problem that can be addressed. There are high-dimensional image
formats, there are complex text formats, there are big table formats, and it does not matter
which discipline community produces them. The tools for analyzing, visualizing, and
integrating them depend on the structure of the data, not on the contents of the data. Thus, we
can look at what those different kinds of datasets are and what the solutions might be.
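The point that tools depend on the structure of the data rather than its contents can be sketched in a few lines. Everything here is a hypothetical illustration, not something presented at the workshop: the function names, the three structure labels, and the sample records are assumptions, and a nested list stands in for an image.

```python
def detect_structure(dataset):
    """Classify a dataset by its shape only, ignoring which discipline produced it."""
    if isinstance(dataset, str):
        return "text"
    if isinstance(dataset, list) and dataset:
        if all(isinstance(row, dict) for row in dataset):
            return "table"
        if all(isinstance(row, list) for row in dataset):
            return "image"
    raise ValueError("unrecognized structure")


def summarize_text(text):
    return {"words": len(text.split())}


def summarize_table(rows):
    columns = sorted({key for row in rows for key in row})
    return {"rows": len(rows), "columns": columns}


def summarize_image(pixels):
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat) if flat else None
    return {"height": len(pixels), "width": len(pixels[0]), "mean": mean}


# One tool per structure; the same table tool serves every discipline.
TOOLS = {"text": summarize_text, "table": summarize_table, "image": summarize_image}


def summarize(dataset):
    return TOOLS[detect_structure(dataset)](dataset)


# Hypothetical records from two unrelated fields that share a tabular structure.
star_catalog = [{"id": "s1", "magnitude": 4.2}, {"id": "s2", "magnitude": 5.1}]
leaf_observations = [{"id": "p1", "first_leaf_day": 97}]

print(summarize(star_catalog))
print(summarize(leaf_observations))
```

The astronomy and plant records are handled by the same table tool because only their shape is inspected.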
We have talked about knowledge networks and collaborative environments as being
able to communicate and get results out more quickly. One caution is that there are many
models for those kinds of practices. As new ways of communicating and collaborating are
developed, it will be important to think about what actually is happening (in knowledge and
experience), how we collaborate and communicate, and how we capture that so that it can be
reused and mined.
One of the issues of citizen science and informal education is the quality of the data,
how good the data are for "real scientific research," and how to describe the quality of the data.
That remains an ongoing question.
The National Ecological Observatory Network (NEON) is sponsoring another project,
called Budburst. It is a project in which people with cell phones go around and take photos
when the first leaves come out from deciduous trees and when flowers first appear, because
that timing is expected to change with climate change. It is a very potent opportunity to provide
a bit of informal education to citizen scientists. They try to figure out what a plant is, they learn
something about how climate and water affect the plants, and when they upload their data, the
Web site offers some educational opportunities.
Reviewing Computational Research Results
One point Dr. Stodden made on the first day was that the thing people hate to do more
than anything else is to review someone else's software. It therefore is unclear how many
people would review software that someone else has written. There may be some ways to do
that, however. If a piece of software has 500 or 1,000 users, that is a fairly good indication that
it is a usable and useful program, and the review could then be much less difficult.
This issue comes up frequently in the discussions about reproducibility and other
aspects of working science. A code review as part of the peer-review process before
publication is difficult, because it is an enormous amount of work that people do not feel
equipped to do. There are middle-ground approaches that could be implemented,
however.
Nature published two articles in October 2010 on software in science that discussed
how it is used and often broken, that it is generally not reviewed, that it should be open, and so
on. Mark Gerstein, a bioinformatics professor at Yale, and Victoria Stodden both wrote a letter
to Nature commenting on this and saying that the scientific community needed to move toward
code review. They suggested a broader adoption of what some journals have done, which is to
have an associate editor for reproducibility who will look at the code and try to reproduce it.
Then it would not be such a burden that is imposed on reviewers. That is another possible way
forward, having code submitted, made open, and then incorporating this aspect of review.
Nature did not want to publish that letter, but it is on Dr. Stodden's blog, and the idea is being
discussed further.
Another discussant noted that she has reviewed code plus data for journal publication.
An example of the kind of problem that comes up is "the compiler did not run because they
changed something in the most recent version of Unix." Every now and then you find real code
errors. There are a number of domains that do this as a regular task. A list of approved
software is not likely, but different audiences and different kinds of follow-up are definitely
worthwhile.
Incentives versus Mandates
Of the 10 stories that Wired magazine listed as the biggest science breakthroughs of last
year, 5 seemed to be based on the use of databases--2 concerning GenBank and 3 concerning
astronomy. The year before, there were three GenBank successes identified, but no astronomy
and no earth science breakthroughs.
There are two fundamental human motivations: greed and fear. Greed is working, but
very slowly. This carrot-stick dichotomy indicates that it is still something to be sold to
researchers rather than something that is a so-called "killer app." Should we be appealing to
fear, telling researchers that the real problem with not having data is that somebody in India or
China will be running rings around them? Perhaps the most effective approach is to pick one
message, whether it be enforcing the data management mandates, supporting tool building, or
supporting sociological analysis of what scientists do with data, describe why just that
message should be used, and focus on it, because 15 seconds of attention is all that we are
likely to get from the research community and the decision makers.
Much of the successful work happens through bottom-up approaches, and this is a
lesson we should learn. Identifying and rewarding the people who are already doing this is a
strong option. Focusing on what can be done from the bottom-up to acknowledge good work
and to help keep researchers doing that work can get them more visibility across the sciences
and other benefits. The Web can be a very forgiving medium in the sense that if something
wrong is fixed, it is okay as long as it is fixed fast. The scientific community generally does not
have the kind of culture that gets things moving quickly, however. We do a lot of design and
architecture. There is nothing wrong with that, but as this area evolves from the top down, the
bottom-up efforts can be rewarded as well.
Journals also can help enforce reporting guidelines if there are standard metadata on
which you can get a community to agree. In this case, journals can tell authors that anybody
who submits a certain kind of experiment has to include certain information about their
experiment. Journals are effective in applying pressure to authors, because publishing is one of
the authors' motivating factors.
Regarding prizes, a couple of years ago something like Kiva [19] was proposed for
science--a type of micro-loan for which researchers could apply, say, $50,000 from the National
Science Foundation (NSF), to try something new. Perhaps it would be $100,000 or $200,000,
depending on the scope. To determine how scientists should meet the data management plan
requirements, we could have a contest for a couple of years where people could try different
approaches.
If a technology has the right branding, then other people will discover it more easily.
Referring to the "app store" example, the app store has applications that get recommended by
Apple, and suddenly millions of people download them. Another option is for NSF or some
other entity to have a competition with small grants to find approaches that seem to work as
widely applicable solutions. The prize would be that they get a brand that says something like
"NSF Approved."
Another similar suggestion focused on some sort of sandbox, where people can
experiment with different approaches. The people who would take on these kinds of projects
would not take on all the ones that were listed earlier in this summary, but they would have
various missions. Some individual investigators might accept a 5 percent tax so that
technologies for data management systems could be developed.
There is already a group of people who are reusing data and making good discoveries.
We have seen some policies started, such as the call for data management plans, which are
helping to get people who produce data to think about managing and sharing them. One of the
best incremental steps on the production side right now, other than letting some of those
mandates play out and trying to extend them to other funding agencies, will be to continue to
encourage people to contribute to collective databases. There are many mechanisms for doing
that, such as journals requiring registry numbers. We actually know a good deal about how to
get collective databases implemented successfully.
Another option that can be pursued is to enable innovators who are doing productive
work. This can be done in different ways. One option would be to highlight the researchers
who are responsible for breakthroughs, as suggested earlier. Another option would be to
initiate a prize for the most innovative piece of data reuse for the year--not for the people who
supplied the data, but for the person who had the bright idea and went after the data and did
something extraordinary with them. Another option is to work with people who are producing
results by reusing other people's data to get them more data. If the theory is that "more data
wins," let us try in some very focused areas to further enable the people who are most active.
[19] Kiva is a nonprofit organization with a mission to connect people through lending to alleviate poverty.
Leveraging the Internet and a worldwide network of microfinance institutions, Kiva lets individuals lend as little
as $25 to help create opportunity around the world. See http://www.kiva.org/about
The NSF data management plan can be seen as the first iteration of the network plan.
That is, from the top-down perspective, the way to break through the discipline silos is to
require people within each silo to say how they are going to use their funding to be good
network citizens. In what way are they going to be a responsible steward of the research funds
with the network in mind? Tell us how you are thinking about the network in your plan to use
these resources. Will you make your database available so that it becomes a networked
resource, and will you annotate it? The point is that if government agencies start requiring
people to plan for using the networks, that could lead to a powerful shift in the way people
think about it themselves.
If a checklist of the properties of a good network scientist citizen were compiled, not
every scientist would check off all the boxes, because science is heterogeneous. If scientists
checked a certain number of those boxes, however, then you would tend to move toward those
goals. That would be a documentable approach. When a scientist submits a proposal for
funding, he or she can point to a previous project and say, "I was a good network citizen based
on this." For example, one of the boxes could be about reproducibility. If a researcher's field is
not amenable to reproducibility because of the size of its datasets, it might not apply, but it
would apply to someone else. There would be some basket of these properties that would
determine how good a network citizen you are. That would be something measurable and
doable.
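As a rough illustration of how such a checklist could be made measurable, the sketch below scores a scientist only against the boxes that apply to his or her field. The item names, the `ChecklistItem` type, and the scoring rule (fraction of applicable items met) are assumptions made for this example; the discussion did not specify a particular checklist or formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One hypothetical property of a good network citizen."""
    name: str
    applicable: bool  # False when the property does not fit the field
    met: bool         # True when the scientist satisfied the property


def citizenship_score(items):
    """Return the fraction of applicable items that were met (0.0 if none apply)."""
    applicable = [item for item in items if item.applicable]
    if not applicable:
        return 0.0
    return sum(item.met for item in applicable) / len(applicable)


# Hypothetical record for a previous project. Reproducibility is marked as
# not applicable, for example because the datasets are too large.
example = [
    ChecklistItem("data deposited in a public repository", True, True),
    ChecklistItem("dataset annotated with metadata", True, False),
    ChecklistItem("results reproducible from shared data", False, False),
]

print(citizenship_score(example))  # 0.5
```

Excluding inapplicable items from the denominator reflects the point that not every scientist would check off all the boxes; a funder could equally choose a simple count or a weighted basket.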
The Department of Defense (DOD) has gone completely net-centric. It talks about
everything in terms of being net-centric. It might be interesting to look at what the DOD is
doing and see how civilian science agencies might think about net-centricity. Keep in mind,
however, that the DOD is very command-and-control oriented, so it can do these things top
down.
A study could be conducted to look at the impact of the NIH access policy and ask
whether it should be extended to any other federal research agencies. Other agencies also could
follow NIH's lead in requiring that those who submit grant proposals include in their
bibliographies not only lists of their papers but also lists of the places where that information is
publicly available. When a scientist reports on results of prior research, the scientist could list
where those data are publicly available. The National Library of Medicine (NLM) has
developed many process metrics. These include, for example, determining the percentage of
the documents produced with NIH grants that are getting deposited in PubMed Central. That
number has substantially increased.
At this juncture, rather than investing heavily in making data available, maybe the time is
ripe to put some resources into the people who are actively trying to get the data to do some
valuable scientific work. Maybe that would be a good strategy for a year or two, and maybe
that would be a good thematic response.
Developing the Supporting Infrastructure
The NIH's National Center for Research Resources (subsequently abolished) and the
Chinese have a strategic plan for research that includes infrastructure development and
funding. Since there is a lot of talk about funding the enabling tools for research, one option
could be a similar kind of mechanism in the U.S. government for general science, not just for
biomedical science.
This discussion has identified a problem or barrier to the development of infrastructure
that cuts across science domains. This could be further investigated at a higher level. In
fact, some previous reports have noted that, and the President's Council of Advisors on
Science and Technology (PCAST) report on the Federal Networking and Information
Technology Research and Development (NITRD) [20] Program also looked at this.
This infrastructure is hard to develop, however, because it depends in part on a large
research group that can support it. The community would have to determine how best to
promote and cite that to help solve some of these problems. A top-down management approach
is one option, but it could also be good to think about doing it through bottom-up scientific
innovation. Whatever approach may be taken requires an advocate to make that happen, and it
cannot be done very easily.
Technology Transfer Mechanisms
The role of the university technology transfer office in all of these processes remains a
question. The typical instantiation is toward promotion of commercial interests and licensing.
What is the relationship between the technology transfer office at the institutional level and the
funders themselves? Could that relationship be a bridge toward addressing some of these
unfunded areas?
If there is a way to follow up on this idea of a public arm of the technology transfer
office, that could be a way to disseminate more broadly the technology, information, and
datasets that have commercial potential, but also to do other things, such as resolve ownership
issues. If academic datasets were shared more broadly, we would see certain patterns evolving,
and those could become templates for how those issues--in a legal and a citation sense--could
be sorted out. The technology transfer offices could be one venue in which to build
partnerships to encourage this, at least at the institutional level.
Would there be a possibility of organizing some projects? Among the universities and
other entities, could a few demonstration projects be organized to show how we see it being
done? They would experience various difficulties, but in the end they could see how it can be
done.
If a project cannot get funding because it is too "applied," could the NSF have a
technology transfer operation? The idea would be not to fund things that could be
commercialized, but rather to fund things that would be used to advance scientific research. If
there were a GenBank-like division at NSF, for example, it could encourage the application of
the results of research that it funds, although not necessarily all the way through to
commercialization. In some cases, such as drug discovery mechanisms, it could lead to
commercialization, but in other cases, such as discovering stars in the sky, probably not.
[20] Available at http://www.nitrd.gov/.
If NSF did have a technology transfer concept, how would it be different from a
university's technology transfer function? The NSF does have the interdisciplinary Office of
Cyberinfrastructure, which is supporting the development of new software systems, but it tends
to be a much higher-level activity at the moment. One of the challenges is determining if this
could be done at a specific domain level, or whether we would want to do it more generally.
In addition, a 5 percent tax on research budgets is an option that could be explored
further. It is not clear how many people, particularly the program officers at funding agencies,
would endorse a 5 percent tax of the funding, but they might. Maybe the agreement could be
that if one agreed to pay the tax, it could come with certain benefits, such as access to some
repository. That is, researchers could have the carrot and the stick at the same time.
Another model within government science agencies could be to collect technologies
that might be broadly used from various directorates or projects and have them available in one
place, or at least have pointers to all of them. This is already being done to some extent, but
informally. Technologies are created using public research funding, and if they are of value,
then they may be put on the Web to let others know about them. There is nothing very
systematic, however, that gives NSF or some other agencies credit for having funded
something that resulted in a technology or a tool that can be broadly used. So one question
would be: Is there a mechanism within the Office of Cyberinfrastructure that could be used to
make these tools or technologies available?
There are other government entities that are not explicitly funding agencies--NITRD is
a good example--that have the mission of supporting information technology (IT) research and
development and the use of that IT for national priorities, including science. NITRD
could be a good ally in this area. The NITRD director is very interested in data issues and data-
intensive science, but the organization has no money. All it has is the right to convene groups
of experts, mostly in IT hardware, but it could be beneficial for NITRD to be involved.
The recent PCAST report that was discussed above makes the point that the issue of
infrastructure needs to be moved up to a level where it is not individual agencies competing
with the individual scientists' budgets, but rather at a level where somebody is looking at this
infrastructure as a national priority for innovation and science. There is now much more
discussion at the top levels and the research agency managers are beginning to see the value of
it.
It is harder to do than it sounds, however. At the University of California, for example,
anything that costs more than $500 has been considered capital equipment for many years. This
is not rational, because the university has enormous amounts of space. It has junkyards of
technologies that cost more than $500, because if they cost more than $500, they have to be on
the capital equipment inventory, and it costs more to dispose of them than to store them. The
reason for such a ridiculously low limit is that anything that is considered to be capital
equipment does not get taxed for overhead on grants. The principal investigators have wanted
to keep the capital equipment threshold as low as possible, because it benefits them in the
grants. It sounds preposterous, but California still has a $500 threshold for capital equipment,
and it has been this way for some 15 years.
Publicly funded research organizations are doing projects that overlap or are
similar to each other, whether in medicine or climate change or other areas. It would seem as
though they could repackage those projects to demonstrate the multi-disciplinary needs and
benefits of such work. Such projects sometimes also have international partners, such as Brazil,
Australia, and Europe. That would allow them to demonstrate something similar
on an international scale.
The Virtual Acquisition Office science advisory committee made the mistake of asking
the members of the committee to come up with test projects. Some people were very opposed
to limiting something like that to a small number of investigators. A regular (small) grant
competition could specify that it is closer to infrastructure than research. This would have to be
made extremely clear to the reviewers, because otherwise they will not understand. Grant
competitions have become a grandiose vision with millions of dollars to do research. That was
not what was originally intended.
Cultural Change
Another important feature of this data-intensive phenomenon is that there is a general
sense that we ought to change practices and change culture so that the new and better ways are
accommodated along with the existing ways of doing things.
How do we change culture by changing practices? How do you change culture? How
do you change practices? It is an easy thing to say, but it is a very hard thing to do. It takes a lot
of time and effort, and often a lot of pressure, because people's interests at the local level are
not served by changing practices to which they have become accustomed.
You can also turn it around to make it "change practices by changing culture." The idea
that you need voluntary kinds of encouragement to stimulate open access has been found not to
work very well. That is why we have had to resort to mandates, whether through the journals or
legislation or funding mechanisms, to change both culture and practice.
At the same time, so much has been added to the list of mandates that accompany
proposals. We have to promote minorities, help K-12 education, and so on. Of course, those
who submit proposals claim that they are totally behind these mandates and support them, but
frequently they do not do anything to advance these goals.
One of the things that was proposed to the NSF leadership was letting grantees pick
what they are going to affect and then incorporate that in the proposal. These interest groups,
however, have worked for a long time to get themselves into that mandate list. The fact that not
much happens from them being in a mandate list is not as important to them as being in the
mandate list itself. There is a kind of culture of incumbency just around being listed. It does not
necessarily have anything to do with meaningful change.
We can push hard on this changing-practice, changing-culture approach, because it is
hard to do. What can the agencies do to help change practices and cultures? Probably no
amount of studies will result in changing practices. Adding something to the mandate list may
not change practice either, because the mandate list is already so long that it is just an exercise.
The steps involved in influencing culture are not interchangeable, however. The way to
change culture is by changing practice, and this example of NSF requiring data management
plans is a good first step, although it is a very tentative step, and there is still a long way to go.
Everyone now is holding workshops on how to write the correct data management plan. Maybe
in 3 years there will be many more people thinking that this is a good thing to do, because they
would have been made to do it and they would have started seeing the benefits of doing so.
Thus, there could be some value in recommending specific things for changing practices.
One answer to this question is the old adage "you get what you measure." If we are not
measuring the impact of the data plan, not tracking it, and not enforcing it, then
nobody cares. Suppose, for example, researchers were asked how many minorities they
supported on their last NSF grant. If it was one-third of what was promised, then the
researchers could get one-third of the requested money on the new grant. You would suddenly
have many more people matching those mandates.
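The arithmetic of that hypothetical can be sketched as follows. This is only an illustration of the discussant's thought experiment, not an actual NSF rule; the function name `scaled_award`, its parameters, and the decision to cap the ratio at 100 percent are assumptions made for the example.

```python
def scaled_award(requested, promised, delivered):
    """Scale a new funding request by the share of a prior commitment that was met.

    requested: amount requested on the new grant
    promised:  quantity committed to on the previous grant (must be positive)
    delivered: quantity actually achieved on the previous grant
    """
    if promised <= 0:
        raise ValueError("promised must be positive")
    if delivered < 0:
        raise ValueError("delivered cannot be negative")
    # Cap at the promised amount so over-delivery does not inflate the award.
    credited = min(delivered, promised)
    return requested * credited / promised


# A researcher who delivered one-third of what was promised would receive
# one-third of a $300,000 request under this hypothetical rule.
print(scaled_award(300_000, promised=3, delivered=1))  # 100000.0
```

As the discussion notes, it is easier to state such a rule than to make it work; the sketch leaves open how the promised and delivered quantities would be verified.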
It is easy to talk about measurement and enforcement. It is harder to make that work.
We may be coming to a point where we need to start thinking about what is going to work.
The NSF seems to be trapped in this situation where it is supposed to be getting ideas
from the community about where to go with the science. Although there may be many people
interested in this issue, the population of people doing science and being funded by NSF is
much larger, so the view being expressed by those sufficiently interested in any particular issue
is a minority view within that sourcing process. How do you change this? Where are the
pressure points and the leverage that will work, given the limited resources available?
Another discussant made a comment about the socio-cultural attitudes that we may
wish to change. Reflecting on the tradition of science, it is important to celebrate that tradition
and use it as a basis to build upon. Understanding traditions is very effective in introducing
change to different cultures. In working with different cultures you discover that you need to
take different approaches. If, for example, we are talking about the transmission of HIV in a
culture that has polygamy, it is irrelevant to talk about being faithful. Therefore, if you know
the important traditions to select, you can build upon them and influence change more rapidly.
Role of Libraries
In setting up a national information-organizing center that oversees and manages
standards and vocabularies, it is important to remember that when data and information are
created, they may have to be maintained over long periods of time and adapted as things
change. Therefore, there is a certain amount of infrastructure that has to be supported from the
top down. It is not possible to get universities to do this in a distributed way.
Libraries are giving much of the advice for data management plans right now. Research
libraries have seized upon this as an opportunity to do outreach to the science departments at
their universities and to help them figure out how to do the data management plans that the
researchers are now being required to do. Hence, research librarians are allies on the ground at
universities, and they reach out to the scientists to help disseminate cultural ideas and strategies
that funders and policy makers would like also to see implemented.
One of the big questions circulating in the library community currently is what to do
with print collections now that we are moving much more toward digital materials. There are a
number of issues that have never been encountered before. We have been building up print
collections for the past several hundred years, so we know a lot about that. We do not know
much about reducing the collections, however. It is hard. It is more of an ecological approach
than telling others what they should do.
A problem is that it is very easy for scientists to say that issues or concerns expressed
by one community have no corresponding value in their community. There are many ingrained
practices, attitudes, and traditions that are candidates for change, but can we deal with them
sensibly? Research librarians may be able to provide assistance to scientists that would not
otherwise be available, particularly at the college and university level.
One thing that may be helpful is delineating the roles of different kinds of experts who
are involved in the data management process. The funding agencies are like the data investors.
The data producers are interested in their data, but not necessarily in how the data will be
reused. A problem in some cases is that research funders want data producers to let other
people reuse their data, but it is extra work for the producers to deposit their data someplace
and to annotate the data in a way that makes them useful. Then you have the data reusers or
analyzers as well as those who are the intermediaries between the data producer and the data
reuser, acting as the data translator, manager, or marketer. This latter type of person will help
the data producers--who really do not want to do this kind of work or think they do not have
the tools available to do it--become aware of the practices or tools that exist for them to do
this kind of work, and also manage the data and train them in how to do that more efficiently.
Research librarians do that kind of work in genomics, for example.
In addition to the NIH Center for Research Resources that was mentioned earlier, there
is another institute at NIH that deals entirely with information science and information
resources: the National Library of Medicine. There is not an equivalent library for the basic
sciences; although NLM has done some work expanding its focus into other related areas, it
does not cover geosciences and other disciplines.
In the Agricultural Research Service (ARS) there is the National Agricultural Library.
The ARS includes librarians in research project planning and data management. Without trust
among the parties, however, we cannot open up a legitimate dialogue, innovative direction, and
mutual support. How to build trust really depends on those who are involved in a research
project, but without trust they may fight against each other.
Finally, one of the options that was suggested for changing and improving data
management practices is to reduce the costs. Better leadership can direct people to what those
practices need to be. One research agency has a saying concerning data management: "Should
we, could we, and will we?" There is no argument about whether data should be managed, but
the scientists want to know how to do that. Clear answers can foster effective leadership in
different disciplines for preserving data. Some sort of basic minimum guidelines would be
helpful--if not necessarily just tools or methods. Technologies have changed what we can do,
but they have not changed what we do. Appropriate leadership can help us define what we do.
Communicating and Influencing Understanding of Scientific Knowledge
Discovery in Open Networked Environments
It appears that there are at least four communities that are listening and thinking about
these issues. One is government leaders and elites, who believe that there is something big here
and are looking for guidance on how to think about it.
Then there are the people in the institutions--librarians, research leaders, and others--
who are thinking about these issues from a different perspective. For example, many state
universities are now focusing on what to do with the flagship institutions of higher education
among the others. The flagship schools are ostensibly the research institutions, but the
distinction that some schools are for research and some are for teaching is somewhat dated.
One big question here is what the flagships should do for the others. Should they do anything?
This will have an effect on access. A lot of science and scholarship can be done by secondary
sources. As those secondary sources get better, the quality of scientific scholarship will
improve, and it can be ever more distributed. That is a productivity issue as well as a
participation issue. Therefore, this is a second audience, the combination of locally based
research entities and the elites of those locales.
A third audience is researchers who are thinking seriously about the approaching
change and wondering how to respond. This is important. It is not like it was 40 years ago
when the more senior scientists were trained. This is different. What do we do about this?
Perhaps that is a channel to approach some of the younger people, because they are coming up
through the apprenticeship structure.
The fourth and final group, which is usually not addressed, is the creators and innovators--
the Facebook and Google types and the people who are doing research in these hacking
centers. In short, good things can happen in this self-motivating way. Moreover, they do not
have to ask for federal money or congressional approval to do it.
There is a big difference between just communicating versus the process of scientific
knowledge discovery. Some of what is being discussed is the need for better tools and
applications to do science over Facebook and other new approaches. According to some
studies, data practices are driven very much by the kinds of tools researchers use. By talking to
people, we can know immediately the kinds of data practices and the kinds of research that
they are doing, just by knowing the kind of tools that they are using.
One discussant noted that he just got on a big grant with people at the Mayo Clinic that
he has never met in person. He only met them through Twitter, but he is now on their grant.
His best paper recommendations come through people he follows on Twitter. Many of his own
paper citations can be attributed to tweeting about his work. He also gets invited to conferences
through people who follow him on Twitter. This is an example of the value of scientific
networking.
Another discussant noted that the topics of creating scientific discoveries and
communicating knowledge are being conflated to some extent in this discussion. He uses
software and writes code, but does it for a science that can never be practiced on a social
network. For example, there are foresters who work with trees that take 50 or 60 years to grow.
There is much science that cannot be done socially. At the same time, the communication of
science or even enabling the process of science can be done with these tools. He is in user
groups, and when he asks for help, people respond to that. He is able to read up quickly and
does not have to go through a hierarchy to acquire that knowledge. He can utilize tools that are
changeable, because they are all open-source. Therefore, separating the functions of creating
knowledge from communicating knowledge is important. There may be some overlap, but they
are not the same things at all.
It is helpful to discriminate between the two and to ask what we get from doing that. It
is also useful to note that things that we think of as nailed down, such as mathematical proofs,
are not settled until the mathematicians say so. Mathematics is essentially a social process.
There are some things, such as the proof of Fermat's Last Theorem, that not all mathematicians
agree on at this point. A preponderance of mathematicians think it has been proved, but there
are some holdouts who are very strongly opposed to that position.
Scholarship is communication. We should not assume that geology is going to be done
on the Web. Geology is probably going to be done with rocks and other things gathered onsite.
But the phenomenon of geology--what people agree about or come to hold as geological
fact--is going to be socially constructed over time, and that is a communication process. The
idea that science and communication are two different things is not valid.