-Information International Associates-
Today, we will take the results of the discussion from the first day of the workshop and build on them. We would like to hear about some options for what could be done over the next few years and what the value proposition is. Closely coupled with that is the issue of how we will know when we have succeeded.
We had some excellent concepts and ideas during the first day of the workshop that highlight much about the nature of the changes that are ongoing. I heard that we should not stay static and that the community might benefit from a less linear and more dynamic approach. We talked about trailblazers versus the long tail of data activities, and whether we are getting into a credibility crisis, because many of the things we are doing now are not as shareable as they were when they were in print. Many codes are not available, for instance.
The title of the workshop is “The Future of Scientific Knowledge Discovery in Open Networked Environments.” The title recognizes that we are adding a fourth paradigm to the research process. We already have three paradigms—the theoretical, the experimental, and the observational—but now we have added a fourth, the computational or data-intensive paradigm. The presentations we heard during the first day of the workshop focused mainly on this latter paradigm.
How do these approaches build on each other and work together? Why did we immediately jump to the data-intensive sciences when the title referred to a networked environment? The answers are obvious: first, the technology is so enabling, and second, there is the data deluge.
In the text that follows, we discuss the four sets of topics identified in the statement of task. Each section begins with an Introductory Summary of the Issues Identified on Day One of the workshop. In the first and fourth sections, Puneet Kishor reviews the sets of issues from the first and fourth items in the statement of task, namely, the Opportunities and the Benefits and the Range of Options. In the second and third sections, Alberto Pepe addresses the Techniques and Methods, and then the Barriers, respectively. Each of these Introductory Summaries by the two rapporteurs is then followed by a Summary of the Discussion of that topical area by all the workshop discussants on Day Two of the meeting.
Introductory Summary of the Issues Identified on Day One
-University of Wisconsin-
Among the issues we heard about during the first day of the workshop was that certain problems in science, no matter what domain, are really the same. A scientist performing observational research, for instance, generates data and then manipulates, organizes, analyzes, visualizes, and reports on them, or maybe preserves them, and then the cycle repeats. The digital medium is one of the best ways to communicate knowledge and is probably also the best way to create knowledge. The concept of a knowledge ecosystem, the whole system of producing, storing, and retrieving knowledge, depends on standards, metadata, discovery, association, and dissemination.
One presentation focused on creating a precompetitive commons, using examples from drug discovery, where fierce competitors can cooperate to produce evolving models of diseases. Cooperation can improve the translation of publicly available molecular data into biomarkers, and increase the opportunities for using drugs meant for one disease to treat another disease.
The “crowd” plus the “cloud” concepts are important to the future of the network. There was an assertion that science can learn about data sharing from the World Wide Web, which could be argued either way. Discussants also mentioned exploring linked open data for interdisciplinary uses. Access to many data sets, however, remains very restricted.
Also, there was much focus on the long tail of science, which is defined as those fields where the benefits of information technologies are not immediately apparent. There were conversations about quality, format, access, financial support, and policy issues.
Other topics concerned opportunities for innovation that we cannot even imagine right now, that is, serendipitous innovation, while still others focused on government policies that encourage or even require scientists to share their scientific data and information, such as the America COMPETES Act.
Summary of the Discussion of Opportunities and Benefits
This session examines the opportunities and benefits in a 5- to 10-year time horizon, so it is important that we understand the value proposition. Do we really know what the benefits and the opportunities are?
Building and Sustaining Knowledge Networks and Infrastructure
Can we say that one opportunity is to provide more and better infrastructure? If a common infrastructure does not yet exist, is it possible to build one that anybody can use?
When we hear a term such as knowledge ecosystem, it raises the concept of a knowledge network. The discussion on the first day of the workshop about moving from a linear to a less linear, more dynamic way of doing science reminds us of a neural network. That is, there are many different synapses in the brain that connect different facts and different memories and experiences that lead to an understanding of a situation. The same kind of process happens in science.
We saw the example of the pharmacological studies that found linkages between drugs that were used to treat one disease and that could be used to treat another disease because there was a common gene sequence. There was a connection in the network that showed that the same thing that is being used somewhere else actually does link back. This suggests moving toward not just linked data or a knowledge ecosystem but a linked knowledge system. That is, we could merge those two ideas. We could benefit across all disciplines and truly enable long-tail science—the science that is so removed from our experience that we never would have thought of doing it.
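The drug-repurposing linkage described above can be sketched as a tiny linked-knowledge graph, in which a shared gene node connects a drug to a disease it was never designed for. The drug, gene, and disease names below are invented placeholders, not real pharmacological data; this is only a minimal illustration of the idea, not any system discussed at the workshop.

```python
from collections import defaultdict

# Hypothetical linked data: which genes each drug targets,
# and which genes are implicated in each disease.
drug_targets = {
    "drug_A": {"GENE1", "GENE2"},
    "drug_B": {"GENE2", "GENE3"},
}
disease_genes = {
    "disease_X": {"GENE1"},
    "disease_Y": {"GENE2"},
}

def repurposing_candidates(drug_targets, disease_genes):
    """Suggest (drug -> diseases) pairs that share at least one gene node."""
    candidates = defaultdict(set)
    for drug, genes in drug_targets.items():
        for disease, dgenes in disease_genes.items():
            if genes & dgenes:  # any common gene links the two records
                candidates[drug].add(disease)
    return dict(candidates)

candidates = repurposing_candidates(drug_targets, disease_genes)
```

In this sketch, drug_A is linked to both diseases through its two gene targets, while drug_B reaches only disease_Y; the point is simply that traversing shared nodes surfaces connections that neither data set shows on its own.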
We are likely in a period of persistent restricted resources, however. Hopefully it is not a zero sum game, but assuming it is such an era, what is the best way to deal with that situation? Do we need to think of ways of developing systems that could be used by many different kinds of scientists? Are there general characteristics of such systems? Again, we do not need something only for computer science; we may also want something for the agencies funding research.
Demonstrating the Value of Data Sharing
Change is always going to be before us, so how do we work together in our public-private partnerships to adapt to change? We need to show the value of sharing data to our principal investigators. Science is going to be done differently than it was by scientists 20 or 30 years ago, so we have an opportunity to redefine the science as well.
We can imagine a scenario in which a scientist is speaking to a congressional staffer who has heard that scientists are just “welfare queens” in white coats and wants to know why they are being supported. We need some convincing examples where data have led to important discoveries. Earth sciences data and medical data could provide some useful anecdotes, for instance.
We could survey the number of hours people spend analyzing the databases as opposed to the number of hours they spend reading traditional journals. For example, Princeton University spends about $10,000 per faculty member per year subscribing to journals. If we could quantify the relative costs, we could put a value on the database. The bottom line is that people who are not scientists do not necessarily start with the assumption that science is worth something, so we need more descriptive stories. For example, the digital library program is justified by pointing to Google. We need more compelling stories that appeal to outsiders.
Also, a protein chemist cannot work without the protein data bank. There are many scientists who do not work the way it was done before there were digital databases. They are data-mining researchers. If you take away the databases, their type of research would stop. You will get different answers in different kinds of research, however. For instance, there is much more to online astronomical research than the official National Virtual Observatory or Virtual Astronomical Observatory (VAO) projects. Although the VAO has led to many useful results, so far it has focused only on infrastructure, without building end-user tools. In astronomy or in other sciences, there are databases that are even larger than the protein data bank. Researchers in these fields use them every day, but the results in some seminal papers may not be cited.
Astronomers use data collection facilities such as the Sloan Digital Sky Survey, the Two Micron All Sky Survey (2MASS), and the upcoming Large Synoptic Survey Telescope. Scientists see these huge databases as something on which they have come to rely. What they may not understand well, however, is what would happen if the medium- and small-scale datasets that are attached to the literature were also openly contributed. Incentives alone might not work. Scientists may not be convinced that their datasets are going to be worth depositing, unless the funding agencies make that a condition of their grants.
Big science examples may be the wrong approach to convince scientists, however, because they are too common. In the physics community, researchers would rather work with the Large Hadron Collider (LHC) than their own small data collections, because the LHC will have a huge amount of data. There are other fields in which the generation of the data may not be very expensive, especially when amortized over a longer time period, but may nonetheless be costly to maintain, because many people are involved or for some other reason.
There also may be a potential problem in asserting that it is more efficient to share data, because if this were true, why are scientists asking for more money to do that? If researchers are going to be more efficient, they should be able to perform better science with fewer dollars. If this is not the case, then there is a question about the efficiency of the system. Also, when we say that it is going to be more efficient, another issue is: Where will the savings go? The savings could go into the analysis of existing data, rather than into collecting and organizing new data. Put differently, the budgets may not go down, but the analytical capabilities could increase.
The research funders could identify the different areas where data sharing is extremely important, and they could then establish a reasonable time span for making the data available. Some guidelines to the scientific community could be useful in this regard and constitute an opportunity. For example, we might eliminate the 2-year gap between obtaining the research results to the sharing of those results through publication. This could allow the timelier sharing of the methodology, the data, and even the codes developed for the study. It could lead to real benefits.
Too much specificity concerning what should be shared, however, can be a problem as well. If science were segmented into different categories according to a level of what should be shared, it could be counterproductive. A specific time frame for sharing data absent
countervailing considerations, such as the Health Insurance Portability and Accountability Act (HIPAA) or similar requirements, could be very useful and would signal to scientists that there is recognition that sharing is very important. It would encourage more scientists to start thinking about the next steps after they produce the data, what infrastructure they will need, how they will have to reorganize their own budgets and funding to do that, and so on.
A rigorous deadline can be a problem. Voltaire said, “The best is the enemy of the good.” Everybody is quick to complain that data curation is too much work. What is the cost, however, to the nation’s scientific effort to peer review weak papers? In some fields, preprint distributions and conference presentations have essentially taken over the publication function, and the journals exist so that people can get tenure. A system relying on something like the Cornell arXiv might be better for scholarly communication, if there were something equivalent to page ranking that would help people decide what they should be reading. If the publication system were less strongly tied to tenure, it might be easier for the universities to determine that they could use other metrics. This could save a lot of money if the universities no longer relied on the ranking of publications to determine who would get tenure.
On the one hand, the promise of data sharing is becoming real. On the other hand, there is a powerful reinforcement of Max Planck’s observation that science proceeds funeral by funeral. In the health sciences, the progress of science may be slower than it has been historically. Some scientific processes may even retard the pace of scientific progress.
The most difficult organizations to advise are the ones that are doing well. Because the things they did made them successful or rich, their view, of course, is that they are fine, and they are not receptive to new ideas. The difficulty is getting the established order to change its approach when change is needed.
This is a collective action problem. Individual scientists who would like to fulfill their ideals as scientists often find that they cannot. Framing the problem in terms of reproducibility of scientific results can be very appealing to scientists, however. This can be a way that gives them guidance about when to share data and what types of data to share. The same argument can be helpful in convincing people of the importance of openness in this networked environment, because this is a core principle of science. Nobody argues that reproducibility is not important in the sciences, so this can be useful as a guiding framework for what steps are important.
The climate data controversy at the University of East Anglia in the United Kingdom and elsewhere has indicated that transparency is also key. The average person is not going to reproduce those climate simulations, but citizens want some assurance that there is access to how the model simulation was made and what data came out of it. Therefore, transparency and reproducibility are both important.
We also may separate the different axes along which we can argue that there are benefits. We talked about the speed of communicating research results, so presumably the discovery of new results is one aspect. We have talked about value, and a couple of discussants reminded us that the value of discovery, at least when we are talking to the public, is predicated on the listener’s view of the value of the underlying science. If someone believes that many discoveries in the underlying science are not particularly valuable in the first place, being able to make more of them faster does not seem to accomplish very much.
A strong case for value can be made in the biomedical sciences, particularly as research is translated into health care. For example, it might be possible to determine the real dollar value in repurposing pharmaceuticals. It might be worth picking an area in which the economic value is fairly noncontroversial, as opposed to something like basic mathematics or astronomy, and then start quantifying the value and take a look specifically at data reuse in those settings.
One thing that has not been mentioned in this discussion, but was talked about yesterday, is the idea of data as an unexploited resource, because data have not been part of the traditional journal-based way of communicating science. Negative results in early clinical trials were traditionally not reported, for instance. Because doing that work costs money, there might be some quantitative way to get at the value of what is now being withheld.
A couple of the discussants at this meeting were recently at a geosciences data workshop, and the question arose about how to demonstrate the value of initiatives that integrate data and make those data easily available. For example, congressional staff members are interested in problems such as energy, food, and water. If we are asking for additional resources to enable data sharing, it can be useful to tie that to solving some of the issues important to them—to medical benefits, for instance, or better ways of managing natural resources. That also bears on the discussion of incentives and mandates, because when scientists feel that their work is directly related to producing a social benefit, there is additional incentive for them to share the data.
Speeding Up the Pace of Science
The repurposing of data can be a real opportunity in speeding up research. How do we learn from the World Wide Web? What are the inherent opportunities? Companies are investing a lot of money and making advances in communication that people are using every day. Scientists may have a problem in seizing the opportunities, yet everyone is using the commercial media to communicate, to make decisions, and do many other things. Is there something we can learn from the commercial community to apply to science in this regard? Can we encapsulate very concisely some of the opportunities or the benefits?
In March 2011, none of the U.S. newspapers had headlines about the earthquake and tsunami in Japan because, of course, it happened early in the morning, and they had already gone to press. In the old days of newspaper-based journalism, it would have been in the evening edition before most of us knew what was happening. Then, when radio and TV journalism came along, they became the media that provided news during the day. Now we can watch videos of the tsunami almost instantaneously on the Internet.
Think about science at that pace. An astronomy course 20 years ago had films that were not very compelling. It was like playing chess: not only did you have to be smart, but you also had to be very patient while the other player moved. To keep up with the modern world, the pace of science has to change. It is clear that data science—not just the big data science, but the type of science where we can find the appropriate data, pull them together, and quickly get them into a workflow—is what will change the face of science. We need to keep up with that kind of speed.
Much of the reason we may use the Internet as an analogy for how science can be done is the speed of change. Just a few years ago, Twitter did not exist. Many people who use it now cannot live without it. We are being asked to solve problems that are changing at the pace of the Internet.
The geosciences are well positioned to react to this real-time aspect that was just alluded to, whether it is in the study of earthquakes, tsunamis, flash floods, or tornadoes. Scientists have the capability today to bring that information to be used in real time within seconds of the event happening. This also bears on the point that was made earlier about tying the research to societal benefits that resonate with people in Congress. What does this mean for the protection of lives and property? The rudimentary informatics are in place that can be greatly expanded in the future with the mobility that comes with the smartphones and other mobile devices, so that you do not have to be at your computer anymore to get real-time information.
The Web and semantic data, and linked open data, have had major effects, but not all science may be done at the speed of Twitter. It is one thing to report some results quickly, which is what Twitter and similar tools do; it is another thing to trust science that has been rigorously tested. For example, there is a 10-year study on body fat as a predictor of heart disease. A 10-year study cannot be compressed into 24 hours, however. Some things just cannot be hastened to completion. This comparison with the speed of Twitter and Facebook, therefore, can be misleading, because it is comparing apples to oranges. The problem with putting these sorts of ideas in a recommendation is that a policy maker might latch onto this and say we want science to be done overnight, which could be even more of a problem rather than a benefit.
Preserving the rigor in science is a very important message, but in the current system there is a major delay from obtaining the research results to publishing them. We should not say: “I am sorry you cannot see my data, because that is how I am going to win my Nobel Prize 30 years from now.” These practices and attitudes are barriers to the goal of using science to meet people’s real needs and to help them better understand the phenomena around them. The 10-year study is great when you need a 10-year study, but not if it is the primary way of doing science. For the next 9 years, you may have people dying who could have been treated by something that might already have been ready to work. Verifying, validating, and doing rigorous science, of course, are important. It is the sharing of information, however, that can naturally speed up some of these processes.
People in unrelated fields right now may not easily discover prominent work until it has appeared in the archival journal. This means it has been written up and the visualization of data is prepared, sent to the journal, reviewed, and gone through the publication deadline. There are faster approaches that tend to be very different from field to field, but the main issue is the delay of disclosure, the long time it takes to go through this life cycle, workflow, or ecosystem.
We also ought to keep in mind that one size does not fit all for these processes. When we examine the management of data, we may be talking about long-term experiments or about datasets that are developed to help in an emergency. Just those two instances present very different kinds of data types and uses.
Supporting Interdisciplinary Research
Most of the presentations on the challenges from the first day of the workshop were very interdisciplinary, and most of the presentations on the solutions that described what people are building were inside disciplinary boundaries. That actually has to do more with how the funding mechanisms work than with what people would work on in a natural setting, but that interdisciplinarity of the key challenges is important. What happens on the Web is that a
small number of people who can work between areas can transfer huge amounts of information back and forth between those areas. That kind of work tends to be viewed as an ancillary process, not as a major part of doing science. Further, most scientists who have done this kind of work did so at the peril of their careers and may have been fortunate.
Another point is the importance of informal communication. Scientists are used to formal communication. Some of the systems that were mentioned yesterday and some of what the astronomers are doing now have, in addition to the primary scientific channel, an extra social and communicative back channel. For example, informal communication is how scientists may get information from conferences they have not attended personally. Journal papers are the end product that gets put in the library for historical reasons. That is why they are called archival journals; they are not always the primary report of the scientific activity. Recognizing the need for much more of that informal interchange is also important. Those are two key aspects of research—an interdisciplinary approach and informal social communication.
Informal communication can be structured like the World Wide Web: as a network of information and knowledge that you can search and also find the linkages between them. We can take advantage of that type of structure, which does not eliminate the journal system, but improves the communication among different sciences, and not just within a single science.
If we were going to try to find these commonalities that allow us to cross boundaries in data sharing, those are problems that can be addressed. There are high-dimensional image formats, there are complex text formats, there are big table formats, and it does not matter which discipline community produces them. The tools for analyzing, visualizing, and integrating them depend on the structure of the data, not on the contents of the data. Thus, we can look at what those different kinds of datasets are and what the solutions might be.
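The point that tools depend on the structure of the data rather than the contents can be sketched as a generic loader that dispatches on structural type (tabular rows, structured records) with no knowledge of the scientific domain that produced the data. The format names and handlers below are hypothetical illustrations, not a description of any system mentioned at the workshop.

```python
import csv
import io
import json

def load_table(stream):
    """Handle any discipline's tabular data: rows of named columns."""
    return list(csv.DictReader(stream))

def load_records(stream):
    """Handle any discipline's structured records, e.g., JSON metadata."""
    return json.load(stream)

# Dispatch by structural format, not by the discipline that made the data.
HANDLERS = {"csv": load_table, "json": load_records}

def load_dataset(fmt, raw_text):
    """Load raw text using the handler registered for its structural format."""
    try:
        handler = HANDLERS[fmt]
    except KeyError:
        raise ValueError(f"no structural handler for format {fmt!r}")
    return handler(io.StringIO(raw_text))
```

The same `load_table` handler serves an ecologist's species counts and an astronomer's photometric catalog equally well, which is the sense in which the tooling problem cuts across disciplinary boundaries.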
We have talked about knowledge networks and collaborative environments as being able to communicate and get results out more quickly. One caution is that there are many models for those kinds of practices. As new ways of communicating and collaborating are developed, it will be important to think about what actually is happening (in knowledge and experience), how we collaborate and communicate, and how we capture that so that it can be reused and mined.
One of the issues of citizen science and informal education is the quality of the data, how good the data are for “real scientific research,” and how to describe the quality of the data. That remains an ongoing question.
The National Ecological Observatory Network (NEON) is sponsoring another project, called Budburst. It is a project in which people with cell phones go around and take photos when the first leaves come out from deciduous trees and when flowers first appear, because that timing is expected to change with climate change. It is a very potent opportunity to provide a bit of informal education to citizen scientists. They try to figure out what a plant is, they learn something about how climate and water affect the plants, and when they upload their data, the Web site offers some educational opportunities.
Effective Incentive Systems
There is a 2003 report from the National Academies Press called Sharing Publication-Related Data and Materials: Responsibilities and Authorship in the Life Sciences14. It has many kinds of recommendations that focus on how it is good to share research inputs and what it means to promote the progress of science. It probably has not had a tremendous effect, however. Given that we have been there before, a case can be made for the greater use of deterrents.
The stick-versus-carrot approach has other dimensions. Recently, there was a thought-provoking book called Switch by Chip and Dan Heath15, which talked about the elephant, the rider, and the path. It made the argument that you are never going to get the elephant to move in the desired direction with a stick. It has to be persuaded to go in that direction, and you need the right incentives, the right carrots. More important, however, is that you need the path to be paved. Therefore, we should not focus just on the carrot-versus-stick approach. How can we lower the barriers for the scientists and investigators to share their data, information, and products with the broader world? Not all scientists are opposed to sharing their data because they want to hold onto them for the Nobel Prize. They just do not have the time, resources, or incentives to be able to share them. Agencies can focus on enabling and motivating the investigators to share their data, and on giving them the right tools for effective data sharing and data citation.
It was pointed out that in one discussant’s organization, it is the bench scientists who are demanding that national data systems be built, but this is not happening broadly. One of the U.S. climate change data systems has been adopted by 20 countries, which are supporting it as well. It is the bench scientists who support this, because they want to do better science and they understand it. In that organization, they are the ones who are demanding, so a stick is not necessary. They would love to do it faster and have a greater impact, but they do not have the funding to expand their scope. The money may serve as a carrot or a stick, but if researchers had a little more funding they might expand their impact quite significantly.
In that agency the bench scientists are demanding greater sharing of data, and they are the ones who are realizing that their work is going to be affected. They are changing the way they are doing science. They need some direction from the general science community so that they are not going to be wasting their limited resources. For them, the incentives can be additional funding and a clear direction on how they ought to be documenting and preserving their data.
The separation of science into gathering and analyzing data would mean that people could be scientists with a smaller skill set than is now possible. Such an approach could open up the field to wider participation by people who at the moment cannot easily do so, because they have not gotten all of the training to do both sides of the job.
The Heritage Health Prize is a $3 million Netflix-like challenge to work on hospital data. It has motivated a huge amount of interdisciplinary effort at Stanford University among the students and faculty in business, computer science, and biology, as well as medical doctors. People have formed teams to work on this prize, because there is no barrier to forming such teams. Many of these prizes have no barriers to participating if the person has the required skills.
15 Chip and Dan Heath, Switch: How to Change Things When Change Is Hard, Broadway Books, 2010.
The America COMPETES Act explicitly authorizes the National Science Foundation (NSF) to put forth scientific challenges and to have reward money for solving them. It is being “legitimized” with the federal agencies. The question thus arises whether in an open network environment well-documented crowd-sourcing can be an opportunity for advancing knowledge discovery.
Policies of Research Agencies
There was a very compelling diagram presented during the first day of the workshop about the optimum curve that depicted how much of the scientific community’s research investment should go toward data sharing. We could pose that question to the National Institutes of Health (NIH) or to the NSF and ask them to develop a theory based on that and then to manage their portfolios accordingly.
Federal science agency managers may generally understand these areas related to data sharing. Every agency has, in its own way, made a commitment to funding different kinds of resources. For example, NSF has been supporting the National Center for Atmospheric Research along with other research centers for a long time, which has put those institutions in a very good position. Also, NIH, the Department of Energy, and the Department of Agriculture have provided a great deal of genomic funding, which has helped the funded projects to be able to harvest the resulting data and use them for other useful and interesting ends.
A one-size-fits-all solution, a monolithic approach that everybody has to use, is probably not appropriate. There are small and simple solutions, but there are many of them; they are interconnected, they are designed for specific purposes, and they run at economies of scale. The NSF and other research-funding organizations have their own mechanisms—ones that are good for their scientists. They can offer solutions to their scientists in the way that GenBank or the Protein Data Bank were offered, and the scientists can choose to use them or, in some cases, have to use them. For many scientists, however, that opportunity for a solution is missing.
Introductory Summary of the Techniques and Methods Discussed on Day One
-Harvard—Smithsonian Center for Astrophysics-
Fewer techniques than barriers were identified in the discussion on the first day. One technique that we have already been using for the discovery of scientific knowledge is to develop and install open-access repositories that hold not only full text but also images, data, and software.
The need for alternatives to traditional ranking mechanisms for scholarship was also mentioned. For example, the University of Southampton could be ranked higher than Oxford or Cambridge universities in the United Kingdom, if we look just at the impact of the university on the Web. Therefore, when looking at the number of citations that a university receives, we could use not only traditional scholarship but also the peer-reviewed scholarship that exists online.
Another technique that was mentioned as a need by many presenters is semantic computing, or the development of systems that take semantic meaning into account. As one of the discussants noted, computers are great at storing, managing, indexing, and accessing information, but in the future, they will also need to make sense of all the information that they store, manage, index, and access.
Many presenters also spoke about linked data and the idea of a faceted search. For example, we have seen the new Astrophysics Data System (ADS) Labs interface, which provides a faceted view of the information as we search and lets us filter down to the information we are looking for.
Yet another technique for scientific knowledge discovery is providing direct access to the data. The example of the ADS Labs and the faceted interface is relevant here as well. Also, one of the presenters referred to using data as a filter to literature. In order to allow the greatest possible access and to enable discovery of scientific knowledge, we could use the data as a filter to the literature and not just use the publications as a filter to the data.
Versioning was another technique that was mentioned throughout the discussion yesterday. A presenter noted that we could probably use software versioning systems to capture scientific data, practices, and the artifacts that are produced across the scientific life cycle.

Two types of integration were raised. One technique required for enabling scientific knowledge discovery is the integration of existing scientific tools with Web 2.0 tools, or simply with existing tools on the Web. An example is the astrometry.net site. It is a research tool that is primarily intended for the scientific community. It determines the sky position of images uploaded by users on Flickr. This is an excellent example of a tool that seamlessly integrates with an existing application on the Web.
The second type of integration is with social networking sites. Scientists can leverage the user base that already exists in these services to accomplish several objectives. The first is the scientific discovery process. How can a researcher use data and content that are already published by users on social networking sites to enable scientific discovery? Another objective is to enable citizen science. Computer-supported technologies and social media can be used by nonscientists to take part in different scientific enterprises. Finally, how do we use social networking sites to allow people to collect data, either citizen scientists or actual scientists who want to collect and publish data on the Web?
Another issue identified in the discussion was applying large-scale statistical methods and computational models of public scientific data to both predict and track scientific knowledge discoveries and constructs. One example was the heat map used in the biomedical research area that was described on the first day of this meeting. It examined the use of drugs against diseases. This is a kind of research that is made possible by the use of computational science—data science—on publicly available scientific data. It also could be done in parallel with computational social sciences, which attempt to do exactly the same thing using public social data.
Some people suggested that we need discovery tools for scientific knowledge that are data driven—that is, they are bottom-up rather than top-down. Google uses distributed data to organize, rank, and provide access to information rather than using a prior classification scheme. Researchers might do something similar and use the features of the data themselves to discover scientific knowledge.
Data visualization is probably the most efficient and popular technique for scientific knowledge discovery today. Statistical visualization and other visualization techniques enable knowledge discovery and make possible the observation of scientific phenomena that data alone may not reveal. We heard about different examples of using visualization in astronomy, as well as in geospatial and atmospheric research. We may not be able to see some features of the data just by doing data analysis, so there are some phenomena that visualization allows us to understand and observe.
Finally, returning to the issue of semantics, the notion of using simple metadata and lightweight semantics to annotate the data was brought up. This was described as micro-level semantics. Some presenters suggested that in some cases the use of lightweight semantics—very light annotation schemes—is enough to describe scientific data and provide metadata for the data that someone is trying to discover and explore.
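To make the notion of lightweight semantics concrete, the following sketch attaches a handful of simple, Dublin Core-style fields to a dataset and serializes them alongside the data file. The field names and values are purely illustrative assumptions, not a scheme prescribed by any repository or by the presenters.

```python
import json

# A minimal "lightweight semantics" record: a few Dublin Core-style fields
# is often enough to make a dataset discoverable and harvestable.
# Every value below is a hypothetical example.
metadata = {
    "title": "Sample spectral survey, field 42",
    "creator": "A. Researcher",
    "date": "2011-02-15",
    "subject": ["astronomy", "spectroscopy"],
    "identifier": "doi:10.9999/example.12345",  # hypothetical identifier
    "format": "FITS",
    "license": "CC0",
}

# Serialize the record so an indexer or harvester can pick it up
# without needing to understand the data format itself.
record = json.dumps(metadata, indent=2)
print(record)
```

The point of the sketch is that the annotation burden on the scientist stays small: a flat list of fields, no ontology engineering required.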
Summary of the Discussion of Techniques and Methods
The title of this workshop is “The Future of Scientific Knowledge Discovery in the Open Networked Environment.” Although much scientific research is taking place in this open networked environment, it is not proceeding as quickly as it could, because of the various barriers that have been identified. It is important to understand these barriers, as well as to take advantage of the opportunities that we have cited, to help us move more quickly into the open networked environment. It will take a mix of both bottom-up and top-down approaches, however, with young and senior scientists working together, and there will be success stories in both directions.
Computational Models and Semantic Computing
It should be noted at the outset that the discussion focused on data, rather than the scientific, technical, and medical literature. The sheer size of the datasets that are being managed is a potential barrier, but there is also a potential solution or a technique to deal with that issue. We can ship the algorithm to the data rather than ship the data to the algorithm. Through Web services or some other approach, we could allow a scientist to process the data wherever they are rather than bringing the data to the scientist. It is still difficult to store a petabyte of data on a desktop. Shipping the algorithm may be needed, however, not only because it is hard to move a petabyte dataset, but also because it is a way for the data owner to control what other people are allowed to do.
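A minimal sketch of the ship-the-algorithm idea follows, with the data service simulated in-process; a real deployment would expose it as a Web service. The dataset, the operation names, and the whitelist are all hypothetical, but the whitelist illustrates the control point noted above: server-side execution lets the owner decide what others may do.

```python
# Instead of downloading a large dataset, the client names a small piece
# of computation and receives only the result.

DATASET = list(range(1_000_000))  # stands in for data too large to move

# Operations the data owner permits -- this is how the owner controls
# what remote users are allowed to do with the data.
ALLOWED_OPS = {
    "sum": sum,
    "count": len,
    "max": max,
}

def run_remote(op_name):
    """Execute an approved operation server-side and return only the result."""
    if op_name not in ALLOWED_OPS:
        raise ValueError(f"operation {op_name!r} not permitted by data owner")
    return ALLOWED_OPS[op_name](DATASET)

result = run_remote("sum")
```

The client never holds the petabyte; it only ever sees a scalar answer.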
The idea of shipping the computational programs to the data makes good sense, because the software is much smaller than the data in size. The problem is that scientists often do not work with one dataset. They may be analyzing 10 different datasets. What will they do in those cases? Ship software to 10 different sites and have researchers perform part of the work there, then ship it somewhere else, perform part of the work, etc.? It does not work that way. For example, geographers may do a simple analysis and have perhaps 16 geographic layers for analysis. Therefore, the issue still comes down to access to data.
Researchers often need to develop an intimate knowledge of the data they are using. Sometimes it is a great idea just to ship the algorithm to the data, but if scientists do not actually see the data, if they do not have the data on their computer, and if they are not running some tests and examples and establishing that sort of relationship with the data, then they also cannot run an algorithm. Consequently, sending the algorithm to the data can be a great idea, but in some cases there is a barrier that has to do with how we manage scientific data.
The following anecdote is relevant in this regard. Years ago an astronomer started working with computer scientists. The astronomer’s interest was in distributed data mining, because she had data that the virtual astronomy observatory was making available. For the first few months of their collaboration, the parties were not communicating well. Not only were they using different terminology, but they were not even talking about the same paradigm of research. They finally realized that the astronomer’s interest was to mine distributed data, while the computer scientists’ research interest was the distributed mining of data. They reached a compromise to use both approaches. It was relevant, because for it to be considered research in the computer scientists’ domain, it had to be computer science research, not astronomy research, so just downloading data to the astronomer’s laptop and enabling her to run an algorithm was not computer science research.
To perform distributed principal component analysis (PCA), we need to do an iterative solution on the PCA by shipping parts of it to distributed data sites. The data are never brought to one central location. It is possible to get very good accuracy on the PCA vectors in only a few iterations, however. That was an interesting result for both parties. They had to have enough network bandwidth to get the data to the astronomers, but also, for cases where that was not possible, to have the computer science researchers ship the code to the data.
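The same goal can be illustrated in one round with sufficient statistics rather than iteration: each site ships only its row count, its column sums, and a small Gram matrix, and the coordinator recovers the exact pooled covariance, whose eigenvectors are the principal components. This is a sketch, not the iterative scheme the collaborators used; the site sizes and simulated data are illustrative, and the approach assumes the number of features d is small enough that a d-by-d matrix is cheap to move.

```python
import numpy as np

# Each site computes compact sufficient statistics; raw rows never move.
rng = np.random.default_rng(0)
sites = [rng.normal(size=(n, 3)) for n in (200, 350, 450)]  # 3 features

def local_stats(X):
    """What a site ships to the coordinator: n, column sums, X^T X."""
    return X.shape[0], X.sum(axis=0), X.T @ X

stats = [local_stats(X) for X in sites]

# The coordinator aggregates the per-site statistics into the exact
# pooled covariance matrix.
n_tot = sum(n for n, _, _ in stats)
sum_tot = sum(s for _, s, _ in stats)
gram_tot = sum(g for _, _, g in stats)

mean = sum_tot / n_tot
cov = (gram_tot - n_tot * np.outer(mean, mean)) / (n_tot - 1)

# Principal components are the eigenvectors of the pooled covariance,
# reordered by descending variance.
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, ::-1]
```

Because the aggregation is exact, the components match what a single-site PCA on the pooled rows would produce, without any site revealing its data.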
Offering many collections of data for downloading is a fairly simple service that can be provided either by an individual researcher without much sophisticated work or on an institutional basis with a one-size-fits-all approach through a repository, for example. When you open a database to queries, however, you open a Pandora’s box of computational security issues. You need to make sure that queries do not go astray in various ways. These risks can be managed, but they require a greater degree of operational sophistication. The last thing researchers who are being encouraged to share more data need is media coverage of a high-profile security breach.
One discussant noted that he has been lucky enough to work with some of the top people in the Web community. He got his credibility when he gave a talk and had the word “the” with a kill ring over it. He said, “On the Web there is no the.” However, he is still hearing: “the way to do this,” “the way to think about this,” “the community I work with could only do it this way.”
The open network is not bounded narrowly, which means that some things will work for one group and some things will work for others. It is important not to try to propose the same architecture for all scientific data sharing. We can examine a litany of such approaches going back approximately 40 years. Remember collaboratories and the access grid, and how that was going to change science? Part of the reason it did not was because such concepts were built on a single model that worked great for some people, but not for others. We therefore should try to explore many different ideas, and not have a closed boundary: explore different ways of doing things.
There will be some small solutions that are the only solutions that support certain communities, and we should not resist those solutions because they do not work for the rest of us. At the same time, there will be some very large communities or some aspects of our work that cut across the sciences, and those techniques may benefit greatly from the sort of soft infrastructure that was discussed on the first day of this meeting. It is important to remember that we are talking about a very large-scale, multiple-science, and interdisciplinary set of issues. There is not going to be a single architecture that solves all of that, including the Web itself.
One discussant wanted to underscore a question that was raised earlier in the workshop about how effective some of these techniques are, particularly the text-mining techniques. It could be very useful to find some specific successes for those kinds of methods of knowledge discovery. This is an area of technology that has had a lot of investment. It would be possible to make a lot more investment, but it would be helpful to understand the efficacy of the work before significant investment is made.
Another discussant wanted to know to whom the techniques or the methods or the improvements are specifically directed. There are many stakeholders involved. If the techniques try to address everybody, they address nobody. In the case of an open-access
repository for the medical community to improve pharmaceuticals, who are those stakeholders? There could be equity fund managers, program managers, Congress, the public, and others. They are all different, and they are going to look at it differently. If we do not associate the techniques, methods, and barriers with those people, we may miss the audience in showing them and explaining the impact. As these topics are discussed, the primary stakeholder at issue ought to be identified.
When talking about open-access data repositories, we should look at models too. Much effort has been devoted to open-access repositories for scientific literature and data, but various models could be added to the analysis, because understanding different ones can be key to the future of data-intensive science.
Other issues include semantics, data visualization, and data mining. The purpose of semantics is to help the computer understand. Data visualization is important once there is something to see, while data mining is the computer discovering what is in a database. Those are very different things, each of which comes from a different subdiscipline, but they all need to be dealt with, each on its own terms. Paul Peters, who died in 1996, said something along the following lines: “In the Paleolithic Age, there was a saying that information was food for thought, but in the Future Age, information will be the predator and the human will be the prey.” If you think about it, this is a remarkably prescient insight into data mining and what is really going to happen. Information is going to be finding us, so how do we end up there?
Another issue is that if data are going to be a filter to the literature, then we need to be able to cite data and their connections to the literature. There are many issues in data citation; thus, it becomes much more important that we know how to define what we are talking about and how to name it.
Two more issues were raised in this context. The first is that reproducibility is really hard. Simply having the code and the data does not give you scientific reproducibility today, let alone in 10 or 20 years. There is much more to making sure that things work together and that we can use that code effectively and intelligently to get the right results and be able to interpret them. It is not at all clear that if we just had the data and the code, the results could be reproduced; most likely they could not.
Also, when discussing code reproducibility or reusability, brand is important. For example, there are Google Labs and all of its products. There are millions of lines of code in SourceForge, however, that never get reused by anybody, simply because they are not recognized as a reliable brand of quality software that can be easily obtained and plugged into different tools.
Visualization is one of the key tools that can enable knowledge discovery. Visualization is important, because it is needed not just to understand the data after they have been analyzed but to understand them while they are being analyzed. For example, sometimes we cannot figure out what is going wrong with a thousand-by-thousand matrix, and it would be useful to just be able to see it and have something that would allow a scientist to do that. It would be good to focus on visualization tools for that reason. However, visualization is not sufficient
alone, because there is an infrastructure that needs to be in place before we can take advantage of visualization.
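As a toy illustration of the point about seeing a large matrix, the sketch below renders a matrix as a crude character heatmap so that an anomalous block, hard to spot in a table of numbers, is visible at a glance. Real work would use a plotting library; the matrix, the injected anomaly, and the character palette here are all fabricated for the example.

```python
import random

random.seed(1)
N = 12
matrix = [[random.gauss(0, 1) for _ in range(N)] for _ in range(N)]
# Inject an anomalous block of the kind that is hard to spot numerically.
for i in range(4, 7):
    for j in range(4, 7):
        matrix[i][j] = 50.0

PALETTE = " .:*#@"  # low magnitude -> high magnitude

def ascii_heatmap(m):
    """Map each entry's magnitude to a character so structure jumps out."""
    lo = min(abs(v) for row in m for v in row)
    hi = max(abs(v) for row in m for v in row)
    span = hi - lo or 1.0
    rows = []
    for row in m:
        idx = [int((abs(v) - lo) / span * (len(PALETTE) - 1)) for v in row]
        rows.append("".join(PALETTE[i] for i in idx))
    return "\n".join(rows)

print(ascii_heatmap(matrix))
```

Even in this crude rendering, the block of extreme values stands out immediately, which is exactly the kind of mid-analysis inspection the discussants had in mind.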
Data mining and data visualization were described as separate areas. There was a provocative conversation at one of the workshops on data and visualization that Dan Atkins and Tony Hey convened for their advisory group to the National Science Foundation’s (NSF) Office for Cyberinfrastructure. Their topical area had covered both data management and visualization, and visualization seemed to be a very siloed kind of technique. We can take a more holistic view of data analytics, however, that comprises mining, visualization, and perhaps some other intermediate or allied techniques. It could be useful to think about these issues, because there is a tremendous amount of siloed work going on in various places, and as the data science scales up, a more holistic approach would be desirable.
One discussant noted that the word “analytics” is seen frequently in the literature these days, but “informatics” better describes what we are talking about. Informatics includes visualization, mining, statistics, semantics, knowledge organization, data management, data organization, and data structures. That is a holistic viewpoint.
What was indicated during the first day of the workshop is that people are willing to use generic visualization tools that are repurposed—which they can get from companies such as Google—as a substitute for what they might have done with higher-end software. We can call this “modular craftsmanship,” which involves people who are very good at making tools and making them more accessible to the point that anyone can plug them into a system. Those tools will most likely continue to get better.
Such tools also could have useful connections to highly specialized data formats that might be specific to a field. The scientific community could use applications that are being developed commercially but are available free of charge. For example, a YouTube for visualizing data, a “VizTube,” could be useful because it represents a class of data functions. There really is a huge amount of work done in information visualization that goes way beyond what scientists themselves have ever done.
Most scientists do not make the connection between amazing graphics in The New York Times or the Guardian that were produced with some easy, reusable tool and something that they could do for their own data. There is thus an educational aspect here, informing scientists that there are easy-to-use, reusable tools. There is something like this in astronomy, but it is not as easy and as well-behaved technologically as, for example, what newspapers such as the New York Times or the Guardian do. The Guardian is actually an incredible model for graphics and visualization.
When someone takes these tools from the analytics or informatics communities that are generic and freely available, and then tailors them to a scientific use, there can be a hidden cost sometimes to make it work well. It may not be a major cost, but it does need to be covered somehow. That can be a problem, because it is not research in the eyes of the funding entities, even though it can be valuable in enabling much more research.
One of the visualization issues is related to standards. Scientists use tools to create the visualizations and other formats that can be put online. Many of the formats and tools are proprietary, and the vendors who own them are not supporting many of the open visualization
frameworks. The scientific community could benefit from being more aware of the standards for these visualization tools and who is determining them.
A similar situation exists for workflow software. The problem with workflow standards is that there are too many of them, and yet the tools of some groups do not support any of these standards. The tools are just a black box to the user. Therefore, pushing the tool vendors and pushing the scientific community to be smart purchasers can be very important.
Integration of Scientific Discovery Tools with Web 2.0
Another topic concerns the social network sites. How do we get scientists to participate in them? That is a big question that has come up in the Search for Extraterrestrial Intelligence (SETI) community. The issue of encouraging more participation, and of how we compete for mindshare and people’s time, then leads to the “app store” model. The app store can be a real opportunity to get more participation. Apple put a lot of effort into building their App Store brand to make it useful and productive for people. It may be worth thinking about it from a science perspective, especially how to build our professional brands—an NSF brand or some other trusted brand—to indicate that a tool is high quality and should be generally useful to a large swath of scientists. Also, a greater focus on public-private partnerships could be beneficial. Academics then might be able to exploit the tools developed in the private sector more efficiently with the public’s money.
One discussant from a company that works a lot with the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA) noted that he has been thinking about democratizing access to data and analysis for more than 15 years. The notions of citizen science and public data collection are still seen as fringe activities and not in the center of scientific research. The open-source community likes to say, “If you put enough eyeballs on a problem, all bugs become shallow.” The same situation is potentially true of scientific discovery.
For example, a few years ago, there was a well-publicized case involving ordinary citizens who discovered some unique geological formations just by browsing the satellite imagery available through Google Maps or Google Earth. These formations had been there in plain sight, but no one had found them, even though relatively few of them exist, so it was big news.
On a much larger scale, NASA is about to launch the NPOESS Preparatory Project (NPP) satellite.16 It was never intended to be an operational satellite for applications such as weather forecasting, so a near-real-time data-access stream is still being discussed. There is an opportunity for direct broadcast systems to share those data for near-real-time use.
Another methodological issue is how people know whether what they are doing is successful or an improvement. The Web metrics that Dr. Hey mentioned were just one example of a different kind of metric that now exists. Dr. Stodden had a slide in the discussion yesterday about peer review, how we evaluate computational research results that are published, and the different modes of doing that.
16 National Polar-orbiting Operational Environmental Satellite System (NPOESS).
There are techniques for evaluating the literature that are outdated and do not take into consideration the evolving Web paradigm. They are based on the old print paradigm. That is one example of evaluation, and virtually everyone in the academic sector works toward those evaluation criteria, such things as the ISI Citation Index. If they are not updated, those types of evaluation can retard the progress of science and not advance it.
New tools, criteria, metrics, and indicators of progress of what is valuable and of what these new scientific technologies and processes enable could be useful. In particular, there are very few useful indicators for data management and use. Tenure is not based on data work; it is based on outdated publication indicators. It might be useful to discuss the kinds of research that could help change that culture, and the metrics and the evaluation tools, to get a better understanding of what is actually happening rather than what happened a few decades ago.
Introductory Summary of the Barriers Discussed on Day One
-Harvard-Smithsonian Center for Astrophysics-
This is a summary of the barriers that were repeatedly mentioned in the presentations yesterday. One of the first issues that arose from the discussion was that scientific knowledge discovery efforts exist at the national and international levels, such as the virtual observatory in astronomy, but without any proper coordination mechanisms. Also, ad hoc data standards and protocols are sometimes developed and adopted at the national or organizational levels.
The second barrier that was mentioned yesterday was the issue of interoperability. Infrastructure solutions to knowledge discovery often are not interoperable. Some people referred to the need for a Google-like search and data management service for science, which is something we do not have. Some people mentioned efforts in the virtual observatory that similarly failed to interoperate. The different systems that exist interoperate to a certain point, but usually not across different disciplines.
Funding was another issue that was repeatedly mentioned throughout the presentations. Such things as data hosting, data management, data mining, and cloud computing can be expensive.

What are publishable artifacts? What should end up in journals? Some people referred to the importance of considering data, software, algorithms, and other supplemental materials as publishable items.
Regarding proprietary data issues, discussants addressed locked and unlocked data. Some people referred to the rich genomic data that are held by pharmaceutical companies as an example of locked data. Unlocked data include much of the data in astronomy. It was suggested that even with locked data, the owners of such data might want to allow data annotation, because it can add value to the data and, in most cases, makes them reusable. If data are not annotated, they may not be usable at all.
Another topic that was discussed referred to the multiplicity of efforts, solutions, and tools that have been developed for scientific knowledge discovery and data management. We have seen and discussed many of these efforts, and here again the interoperability issue arises. In some cases the various tools are connected to one another, but in other cases they are not connected in a way that makes sense and that allows the tools to communicate.
One topic that was repeatedly highlighted in the presentations was the need for a standardized and widely adopted mechanism to cite and reference scientific data. Although there are some efforts to standardize data citation techniques and mechanisms, there is not a widely adopted standard. For example, many scientists cite data online or just present links to data in the footnotes.
There was also a discussion of sociocultural barriers. Specifically, one of the questions raised was related to whether researchers or social scientists who are interested in scientific knowledge discovery fit in the academic or the industry job market. Another question was related to how they get recognition for their work.
This topic is connected to the next point that was raised, which is the issue of rewards. On many occasions, discussants talked about the changes related to the shape and role of the academic paper and noted that the reward and recognition systems, including such things as citation and authorship, are also changing. These functions are likely to change even more in the next generation. Today there are not enough incentives to move in the direction of open scholarship and open data. In a way, as mentioned yesterday, such openness conflicts with the culture of independence and competition that we traditionally have seen in scholarship.
A barrier that was mentioned throughout the discussion was the difficulty of integrating computer science topics and ideas into the curricula of the various research disciplines. Refocusing scientific disciplines and their curricula around computer science and data issues enables not only the solving of problems much more quickly and in an automated fashion but also the asking of novel questions.
Many presentations pointed out that big datasets already exist on the Web and that science is not the only enterprise that produces large datasets. Enterprises such as Facebook and Twitter deal with large amounts of data and other types of large-scale user-generated content, and they are very well funded.
Another barrier that was raised is the simple fact that data searches are hard to do. Indexing, searching, and retrieving scientific data can be difficult. Similarly, extracting knowledge from scientific data in publications can be problematic. For example, extracting hypotheses and results from scholarly articles is very hard to do. It can be accomplished using annotation systems, but usually not in an automated fashion.
The theme of diversity recurred throughout the presentations. Some presenters noted that computer-mediated scientific knowledge discovery has significant differences among scientific communities, disciplines, and institutions.
The issue of disciplinary versus institutional repositories arose several times during our discussions as well. Should there be data and literature repositories at the disciplinary level or at the institutional level? Whatever we choose to do, interoperability of the two platforms is important.
Physical barriers and the fact that distance matters were also discussed. Many large cyberinfrastructure initiatives assume that collaboration can just happen using computer-supported technologies and that scientists can perform work remotely. What we have seen is that most of the time scientific output and trust improve with face-to-face interaction.
The issue of temporal barriers, the idea that software and metadata requirements change with time, was also highlighted. What we collect today may not be what we want tomorrow.
Towards the end of the day, we learned from the lawyers that the law does not always help. The laws that deal with data are not clearly defined. It was also mentioned that with the current legal system, it is difficult to achieve simplicity. The system is designed for securing property, not for sharing knowledge.
Finally, the last point that was discussed was related to the idea that scientific enterprises become progressively more computational, and therefore data and code sharing at the time of publication becomes crucial to the reproducibility of scientific results.
Summary of the Discussion
Certain problems may be common to all fields. We ingest data, manage large amounts of data, organize, analyze, visualize, and preserve them, and therefore the data life cycle is something that might be examined more closely. What many have said is that for us to manage all of these data, we ought to lower the barriers to ingesting, managing, and visualizing them.
A point about the need for a Google-like search and data management service for scientific data was raised during the first day of the workshop. There are a number of such mechanisms. One of them is for astronomy: the Astrophysics Data System (ADS), which was discussed during the first day of the workshop and which has complete user adoption. All astronomers use it for electronic publications, but they do not use the internet to access data. There are many interesting techniques, but we could benefit from having more of them, and they could be more broadly available.
Some of these mechanisms allow us to find where data are, but they do not allow us to get the data. Many of them are not at the point of allowing us to have the tools to use the data, so there are plenty of improvements that can be made.
The worldwidescience.org and science.gov portals also were mentioned yesterday. These are tools for deep Web searching that can be better than searching Google for science. They have fairly good use statistics, although not like Google, of course.
Another problem is that while more researchers can generate large amounts of data, they often cannot visualize or even analyze them properly. As Mr. Dudley said earlier, he has a group of 20 people that have 500 computers in a room, but that is not typical. That can be a big problem, because scientists can generate huge amounts of data, but the data may just sit there or be lost.
The greatest barriers can be based on the social or “soft” infrastructure, not the technical aspects. The technology moves very rapidly, and the solutions, even if they are not straightforward, are worked on constantly. Institutional structures, models for communication from a human infrastructure standpoint, legal aspects, and funding mechanisms are the things that tend to change very slowly; they do not adapt quickly to new technology and always lag behind technological progress.
We are always catching up from an institutional standpoint. The point was made yesterday that we are still mostly stuck in the print paradigm in scholarly communication, even though we have moved wholesale onto the Web. It would be useful to rethink how to organize scholarly communication and how to be more responsive and adaptive to technological opportunities in a way that promotes greater productivity from the technology that is available.
In this particular activity, we should look to the generation that is called the digital natives. If you know any staffers in Congress, you know that many of them are young, and
senior members of Congress tend to trust these younger people. One of the barriers that could be overcome is the standard structural barrier related to older people making decisions when younger people should be included more in the conversation.
Also, one of the most common explanations given when you ask somebody about barriers is the difficulty of doing the job. Therefore, one issue is getting tools that will make a given task easier. On the soft barrier side, people who have the computing skills to do some of the data work find it difficult to be rewarded in the scientific discipline in which they are working, but also find it difficult to get credit in computer science if they are perceived as working as an assistant to somebody in a scientific area. The same problem exists in statistics. Therefore, another issue is the unwillingness of universities to reward people for activities that seem to be helping someone in a competing department rather than in their own department.
Another institutional barrier is related to legal issues. A large midwestern university, which shall go unnamed, decided to review its copyright policies with respect to faculty. The Statute of Anne, which many people believe to be the first copyright act, was enacted more than 300 years ago, and we still have not figured out exactly what copyright is. At that university, the faculty were upset with a decision by the Technology Transfer Office (TTO), which had been given plenary jurisdiction by a Regental bylaw over all issues related to intellectual property, including copyright. The TTO’s position, in effect, held that university researchers may obtain copyright to their works, because historically they have received the copyright, but that the university administration would decide after the fact whether the researchers used extraordinary university resources. If the researchers had done so, then they had to share the proceeds with the university. The university faculty thought this rule was unfair and lobbied to get it changed. This evolved into a standoff between the TTO and the faculty that lasted for 2 years before it was resolved in favor of traditional copyright privileges of faculty.
This is an example of a routine step-by-step process that occurs in some universities regarding arcane problems that we would think have been settled by now. When a university’s budget is being cut by tens of millions of dollars a year, this does not sound like a very interesting topic to discuss at the regent level. A separate topic is the complex set of issues concerning the sharing of human subjects data, which has not been looked at nearly enough.
One set of barriers involves geographic information systems (GIS) mapping. If the data are unavailable for research, scientists can be disadvantaged in the GIS market, which is one of the markets most in demand besides Twitter and Facebook. A key question is whether the GIS being used is proprietary or open source. If we cannot get to the data, then we cannot do anything; we cannot create any maps without the data. There are no standards for releasing the data, and if the data have not been released, the scientific results suffer as well. For example, data.gov only tells us what is there; it does not make the data interoperable.
In the ideal world, it would be wonderful if everybody could reach everybody else’s data. Simply being able to find the data—even if it only gets users to a human-readable page that tells them whom to contact to get the actual data—would still be a major breakthrough. Unfortunately, we are not there yet. Data.gov is a good example: every dataset is listed if it is in one of the catalogues, but few other features are available to find out more about the data.
GIS is a very important area, but the standards needed to interoperate do not exist. For example, the guidelines for putting boundaries on a map are not the same as those for putting items on the map, which, in turn, are not the same as those for reporting where things are. Solving that is a complex problem in which the scientific community should participate, but does not have to lead. Finding out whether someone has done a study and has the ground-truth pixel map for some region, or has scientific data about weather patterns, can be difficult across different discipline communities.
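One concrete way communities have lowered this barrier is to agree on a neutral interchange format. As a hedged illustration, GeoJSON (RFC 7946) is one widely adopted open standard in which a study-area boundary and a point observation share the same structure; the property names below are invented for illustration:

```python
import json

def feature(geom_type, coords, **properties):
    """Build a GeoJSON Feature so boundaries and point observations
    share one interchange structure that generic tools can map."""
    return {
        "type": "Feature",
        "geometry": {"type": geom_type, "coordinates": coords},
        "properties": properties,
    }

# A study-area boundary (closed polygon ring) and a weather observation
# (point) expressed identically, despite being very different data.
boundary = feature(
    "Polygon",
    [[[-90.0, 43.0], [-89.0, 43.0], [-89.0, 44.0], [-90.0, 43.0]]],
    name="study area",
)
observation = feature("Point", [-89.4, 43.1], rainfall_mm=12.5)

collection = {"type": "FeatureCollection",
              "features": [boundary, observation]}
print(json.dumps(collection)[:60])
```

Because both features serialize into the same schema, a mapping tool needs no knowledge of which community produced which record.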
The geosciences community is advanced in the processes required for integrating and interoperating. It is important to separate those two problems and to recognize that the first barrier is finding the data. A second barrier is obtaining them. Then there is the technical integration of the data. An important issue now being addressed is the soft infrastructure for finding data, for knowing whom to talk to and what is in a database before investing the money, rather than discovering afterward that it does not have what you want.
The Hubble Space Telescope Archive is fairly large, but people do not have a problem finding Hubble data. They know where to go and it is relatively easy. The problem is serving the data, because the difference between the volume of data we hold and the volume we serve is gigantic. We serve “N” times the data we have in our archives, where N is a very large number. That number is not sustainable when we get to petascale data, and there is no infrastructure that we can develop that can support these kinds of requests. Thus, the help that we need is in figuring out how to stop this from happening. We do not want to stop people from doing the research, but we cannot serve petabytes of data, because then we would become a service that just gives out data and never does anything else; that would take most of our resources. This is a very fundamental problem: After scientists collect the data, how do they allow others to apply data mining to the data? How do they allow people to apply algorithms to the data without moving petabytes?
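One mitigation, sketched here as a hedged illustration rather than anything the Hubble archive actually runs, is to move the computation to the data: the archive evaluates a user-supplied filter and aggregate locally and ships back only a few numbers instead of the raw observations. The catalog fields (`ra`, `dec`, `mag`) below are illustrative:

```python
# Toy illustration of "send the algorithm to the data": the archive
# evaluates a filter and an aggregate locally, so only a small summary
# crosses the network, never the underlying observations.
ARCHIVE = [  # stand-in for a petascale catalog store
    {"ra": 10.2, "dec": -5.1, "mag": 18.3},
    {"ra": 10.4, "dec": -5.0, "mag": 21.7},
    {"ra": 210.9, "dec": 33.2, "mag": 19.5},
]

def server_side_query(predicate, aggregate):
    """Runs at the archive; only the aggregate result is returned."""
    selected = [row for row in ARCHIVE if predicate(row)]
    return {"count": len(selected), "result": aggregate(selected)}

# Client asks: how many sources brighter than magnitude 20 near ra < 20,
# and what is their mean magnitude? Only two numbers come back.
answer = server_side_query(
    predicate=lambda r: r["ra"] < 20 and r["mag"] < 20,
    aggregate=lambda rows: sum(r["mag"] for r in rows) / len(rows),
)
print(answer)
```

The design choice is that bandwidth scales with the size of the answer, not the size of the archive.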
We have separated the public outreach portion into a completely separate network from the research one. We know those numbers exactly, and they are very large. The issue is also that new instruments are going to come online and produce a huge amount of data. Even if a scientist wants data for a small portion of the sky, the data are still considerably large. Many people want the same area of the sky, however. Hubble is a flagship mission, and it is possible to do incredible science with it, but how the Hubble archive can sustain the data that scientists want served directly to them is not clear.
Interoperability also is a very important issue for soft and technical infrastructure. Can we imagine a startup today, such as Twitter, trying to build its own cloud without being interoperable, meaning not having an application programming interface (API) with which data can be mined? That cannot happen if someone wants to be successful as a company, but it happens every day in science, because the barrier is so high. How do scientists publish the data? They do not have access to some data programmatically, so that is a very high barrier that really hinders interoperability.
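As a hedged sketch of what such programmatic access looks like, the snippet below exposes a toy dataset catalog over HTTP using only Python's standard library; the dataset names, fields, and the `/datasets` route are invented for illustration, not any real archive's API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical catalog a lab might expose; a real archive's schema
# would differ -- the point is machine-readable, programmatic access.
DATASETS = {
    "m31-photometry": {"records": 1200, "license": "CC0"},
    "sn2011fe-spectra": {"records": 88, "license": "CC-BY"},
}

class CatalogAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /datasets returns the whole catalog as JSON.
        if self.path == "/datasets":
            body = json.dumps(DATASETS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo output quiet
        pass

# Bind to an OS-assigned port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), CatalogAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any script, in any language, can now mine the catalog.
url = f"http://127.0.0.1:{server.server_port}/datasets"
with urllib.request.urlopen(url) as resp:
    catalog = json.loads(resp.read())
server.shutdown()
print(sorted(catalog))
```

Even a toy endpoint like this makes the data mineable by software rather than only browsable by humans, which is the interoperability gap the discussants describe.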
Interoperability is a deeper issue. It is not just about discovery and search. It is the ability to take somebody else’s data and mix them with your own data, which is what many scientists want to do, rather than work with someone else’s data alone. The barriers to that are technological, semantic, and legal. If licenses do not permit scientists to interoperate and mix
the data, they cannot do that. Thus, there are many different barriers that go along with interoperability. Interoperability is being conflated with discovery.
The issues of a reward structure, workforce development, and the computer science curriculum can be linked. The university department heads that are enlightened and have figured out that they need to add particular courses to their curricula may still forget the data science component. There are university departments that include the data science component, but not that many. Informatics courses include things like data mining, visualization, statistics, and data management. This leads into the issue of workforce development. If we are training the next generation to take over the data-intensive science that we are creating today, then there ought to be courses and curricula that reflect that.
This in turn goes to the issue of the reward structure. For example, one discussant spent 20 years in the National Aeronautics and Space Administration (NASA) system and then went to teach at a university. He reached tenure in his late fifties. It turned out to be harder than he thought, despite having had a successful career and having published. There was a lot of resistance in his college of science—although not in his own department—to astro-informatics, or even just to data science, as a research discipline. Why is that? One reason is that university administrators do not understand it. A second reason is that most of the places where he can publish are not the typical astronomy journals; they are peer-reviewed conference proceedings. In the physical sciences, unlike in computer science, peer-reviewed conference proceedings have much lower standing than traditional science journals. The impact numbers for astronomy journals are in the high double digits, whereas those for peer-reviewed conference proceedings are in the very low double digits. He had success because he spoke up. He is a senior scientist now, so he can argue with deans and tenured full professors. Younger people cannot, so this is definitely a barrier.
One possible solution would be to have some kind of mentorship or advocacy for young scientists who are doing this kind of work and who cannot themselves counter the arguments made by senior professors and deans who say that what they are doing is not research. Younger scientists, unless they are very secure, will not be able to react this way. The younger generations need advocates and mentors to help in this process as they are trained in data-intensive science—or data-oriented science. Statisticians may say that the data they are working with are not large data; they are complex data. It is not so much the size of the datasets, but the orientation of the research, that is different.
Focus of Funding Mechanisms and Science Policy
Another barrier is that the funding mechanisms for science in the United States are set up primarily along disciplinary areas. Even within the larger funding agencies like the National Science Foundation (NSF), funding is separated by discipline. Crosscuts for infrastructure and for interdisciplinary work are mainly handled by each agency trying to do it on its own. There is very little crosscut infrastructure work between the agencies, although there are notable exceptions. Generally it is difficult to get funding for the crosscuts. The National Academies have written a number of recommendations about the funding of interdisciplinary
work, but these have been almost completely ignored. Both the technical infrastructure and especially the soft infrastructure are absolutely crucial for data work.
Over the past few years there has been a lot of emphasis within the federal government on scientific data policies and plans. For example, the new NSF-issued policy guidance addresses issues related to data management plans. Many research agencies are moving aggressively forward on data policies. The lack of an agency-wide data policy was considered a barrier, and addressing it set the stage for managing data as an agency resource. Not every agency is doing this, because they are all somewhat different, but among the leaders are NASA and the National Oceanic and Atmospheric Administration (NOAA). Also, the Environmental Protection Agency is moving quickly now to develop a complete agency data policy.
In this regard, the report Harnessing the power of digital data for science and society17 discussed data policies and data plans. The Office of Science and Technology Policy’s Interagency Working Group on Digital Data is continuing the work on this. One thing that did not become a final recommendation in that report, because it was not acceptable across all agencies, was to have a chief data officer in federal agencies. Data are an asset that needs to be managed, and thus, somebody who understands science and data may be needed at a senior level. There was a follow-up workshop that reemphasized all of these concepts, including the need for a chief data officer. Consequently that is another challenge within the federal government. Is it also a barrier in universities?
To follow up on this point, the incentives of technology transfer offices for data and code dissemination are somewhat orthogonal to those of most research scientists. The primary interest of TTOs is in licensing the technology or working with it in a commercial sense. It is worth recognizing that part of the question that is raised about legal barriers to sharing, at least in the context of this meeting, is a distortion of the traditional incentives for university scientists. That can be a very real barrier at the institutional level to data dissemination.
Following up on the comment about TTOs, this certainly has been a problem for a long time. A counterforce has been emerging in the universities through the interests in institutional repositories, in open scholarly communication, and in other similar initiatives. We need to be careful where to ask questions in the university, depending on the answer we want to get.
Demonstrating Work Impact and Value
One of the biggest barriers within the scientific documentation community is how to prove that you have had an impact. How do you prove that the resources that you use to run an office of scientific and technical information or a defense technical information center really have an impact? There is a lack of metrics, so we are forced to revert to anecdotes, to cases that prove the point or at least incite the imagination. We tried to do that in the Harnessing report that was mentioned earlier, and we got nice vignettes, but they are all philosophical: “This will change science” was the assertion, but none of them actually showed how it did change science. That is one of the biggest barriers, just proving the point.
The anecdote problem is very real. It has been surprisingly hard to get more than a small handful of now overly retold good stories. It seems as if there are two models. One is a
17 Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. Available at http://www.nitrd.gov/about/harnessing_power_web.pdf.
model of data contribution toward a collective resource that everybody uses, like GenBank or the Protein Data Bank, with many people pooling into a common resource to create peer-to-peer data sharing and reuse. That seems to be much more amenable to measuring levels of reuse and impact and maybe even to measuring contribution. For example, if someone says, “I have sequenced four organisms that are in four species that are in GenBank,” it makes a statement that is much more quantifiable than saying, “I collected a lot of ecological data and shared it with three colleagues who used it in their studies.”
That relates to the reward-structure barrier: there is the reward structure, and then there is the business case, the value, and the impact. Regarding evaluation, there was a meeting in January 2011 organized by the Federation of Earth Science Information Partners (ESIP). The people involved in this organization talk a lot about interoperability and discoverability of data. At this meeting, one of the guest speakers was Ann Doucette, the director of the Evaluators’ Institute at George Washington University. One of the takeaway points she raised is that if we want to develop metrics that prove the value of an initiative, we need to invest in that upfront. We should not wait until the end of the performance period before asking evaluators to get involved. If they get involved very early in a research initiative, her group can help do some baseline assessments, so at the end of an initiative that lasts 5 or 6 years, they will be able to develop specific metrics that people can use to prove how the investments have been beneficial.
Introductory Summary of the Range of Options Discussed on Day One
-University of Wisconsin-
This is a summary of the options for future research that were discussed yesterday. The options may also be seen as opportunities.
Optimizing science for the community was one option. This is a meta-level issue, so it would be useful to distill it to specifics. Other issues included analyzing knowledge as an ecosystem, rich authoring (integrating data with the literature), and semantic data storage. Building a precompetitive digital commons seems possible.
Then there was comprehensive monitoring of many biomedical traits at the population level. Breaking the siloed principal investigator mentality is important. That is an institutional issue; the funding agencies can enable that.
Another option was to have some sort of code versioning for data. Other options included tools for adaptive modeling, lightweight integration instead of full-scale ontology, and incentives for those who collect data. This last issue came up repeatedly. How do you reward people for collecting data?
Discussants also mentioned the development of decision-support tools, simple metadata and better semantics for data, and making visualization a first-class citizen of science. Quick and easy visualization would be very useful, and it does not have to be complicated. For example, there are many fun things that could be in a YouTube for science.
Simple metadata that can be searched and localized in different languages could be helpful. One discussant noted that we can change culture by changing practices. This is important, because the culture of different scientific disciplines cannot change by itself.
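As a minimal sketch of what simple, searchable, localized metadata could mean in practice, the records below use illustrative, loosely Dublin-Core-like fields (not a real standard profile) with titles in more than one language:

```python
# Illustrative metadata records with localized titles; a real catalog
# would follow an agreed profile, but the search idea is the same.
RECORDS = [
    {"id": 1,
     "title": {"en": "Lake temperature survey",
               "es": "Estudio de temperatura del lago"},
     "keywords": ["limnology", "temperature"]},
    {"id": 2,
     "title": {"en": "Urban air quality",
               "es": "Calidad del aire urbano"},
     "keywords": ["aerosols", "air"]},
]

def search(query, lang="en"):
    """Case-insensitive substring match over the requested language's
    title and the language-neutral keywords."""
    q = query.lower()
    return [r["id"] for r in RECORDS
            if q in r["title"].get(lang, "").lower()
            or any(q in k for k in r["keywords"])]

# The same record is findable from either language.
print(search("temperatura", lang="es"))
print(search("air"))
```

Nothing here requires heavyweight semantics; even this level of structure makes datasets discoverable across language communities.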
Incentives to produce results versus a mandate to share data were mentioned as future options: There now are incentives to produce results, but few incentives to share data. There are some mandates, however, to share data.
Finally, there were four licensing options for academic data: (1) Do not put any label on your data, in which case, the default rule protects the copyrightable portions of a database; (2) put all of the data in the public domain with a waiver of rights; (3) use an orderly set of well-recognized licenses, such as the Creative Commons family of licenses; or (4) have a complete free-for-all, a babble of licenses. These are the four conceptual buckets for making data available with or without a license.
Summary of the Discussion of the Range of Options for Further Research
Fostering Innovative Approaches
Based on this discussion, there seems to be an important feature of the data-intensive phenomenon. This area is moving relatively quickly and in very interesting ways, and much of that is being done outside of the traditional channels of academia. For example, people are taking data and information and putting them on Flickr or similar tools, even if they are misconstruing Flickr as an archive rather than as a dissemination mechanism. Therefore, there is much that falls outside of the normal boundaries of science and scholarship in the university context.
The younger generations of researchers will set the standards. For example, Facebook was a major global activity in only 5 or 6 years. It may be a young researcher who understands why all this matters and who will start something that is going to spread through the scientific community. In this budget environment, the top-down approach is not effective.
One example is the SuperHappyDevHouse located in the San Francisco Bay Area. It is a huge mansion in Cupertino, California, in which many hackers live. They open it to the public every month or so, and whoever wants to can go and hack all night. At least one very successful company started from SuperHappyDevHouse. There is a similar idea in Mountain View, California, called the Hacker Dojo. For a relatively small amount of money people have rented a huge abandoned warehouse, and the Hacker Dojo is open to anyone who wants to come in and start a company, for example. There are biohackers in there now starting a lab with some old polymerase chain reaction machines they bought on eBay. That is one interesting bottom-up approach.
We are in a new networked environment and in the middle of a change that we cannot completely understand. We can look at other parts of the networked environment to see what has happened there, and this creates an opening to examine the bottom-up approach, which has resulted in many of the successful developments identified here and elsewhere.
A scientist who identifies a problem looks for a solution. Young scientists have new approaches. What do they do with their images, for instance? They deposit their images into Flickr, so Flickr suddenly becomes a scientific database. Some hackers are also biologists. They identify data and they hack on them, and suddenly they are advancing drug discovery in ways that never would have been expected. Many scientists might now recognize themselves in these examples. Scientists have always collaborated in some way, but with the new digital, networked capabilities combining with the social reality, science is becoming even more collaborative.
In the first day of this workshop, there was an interesting discussion about the sociological dimensions to these issues. There might be a new vector of development that happens through the research libraries community. There might be another vector that occurs through the vice presidents for research at universities. There may be other channels in which people respond with more immediate action. It would be interesting to examine what those channels might be.
Identifying, recognizing, rewarding, and supporting the entrepreneurs in this area is another potential approach. It could be useful to compare some of the entrepreneurial efforts—
wherever they fall on the generational spectrum, whether it is people coming through the research pipeline at the beginning of their career or people who are more senior and advanced in their careers—to better established models of research science so that the efforts are not characterized as so dramatically new and transgressive. This would both serve a rhetorical purpose, so that these developments do not come across as threatening to the various audiences, and show that what is going on in these instances does echo some very traditional understandings of collaboration in the sciences. Many scientists have valued the sharing of their efforts and knowledge. Making those values explicit would be a way of being authentic to what the progression is, as well as being politically sensitive to the audiences.
Encouraging Good Network Uses
One interesting thing about the modeling of how the networked entrepreneurial versions tie back to more traditional understandings is that there is a lack of well-defined research on how these collaborative or commons mechanisms work in the networked environment. That is, there is a great deal of descriptive, qualitative, or anecdotal research on the character of science, but there is not much well-modeled, empirical research about the role of the research infrastructure and networked platforms, and how sociological, economic, and legal factors are related.
It is not necessarily an either-or issue, however. There could be multiple messages going out to different communities, mobilizing those communities in positive ways. There are some arguments that are essentially inevitability arguments: that something is going to happen regardless of what the community wants. If it is inevitable, we can argue that progress should be made quickly; then again, if it is inevitable, what is the rush? Thus, if this change is happening irrespective of what people think or prefer, then one option is to get ready to take advantage of it.
On the issue of the bottom-up and top-down approaches, it is worth comparing the United States and the Chinese management methods. There is a real difference in how the two countries approach research and policy formation. Most innovators here—and also largely in Europe—work from the bottom up. The system supports risk-taking and failure, especially in the private sector, and so there is a tremendous amount of creativity and initiative that happens at the working level. Then finally, the policy-making community catches on, and ideally it institutionalizes approaches that are productive. The Chinese model is very top down, and the potential innovators at the bottom generally will not take initiatives that are not approved.
Although the bottom-up approach has produced a great deal of creativity, it has many inefficiencies because there are many dead ends, it is uncoordinated, and it does not always work. Each system has its good aspects, but neither performs optimally. The bottom-up issue is important for this discussion, because virtually all the examples and everything that has been discussed has happened at the initiative of individuals and communities.
From the government top-down perspective, it is worth noting that significant change is occurring. Consider the tenor of the reports, or look at the programs that now include scientific data. The focus used to be on the grid until that first Atkins report came out.18
18 Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. January 2003. Available at http://www.nsf.gov/od/oci/reports/atkins.pdf.
One approach to network citizenship from the top down is to emphasize the “network” and articulate the values that are implicit in the idea of the network as the hub of that conversation. A related, but distinct, approach is to put more emphasis on the idea of values and citizenship as part of the scientific research community. That is, how do the traditional values in that domain get expressed in this new technological environment?
The theme of the good network citizen is very interesting. One discussant noted that he works with a project called VIVO (see http://vivo.ufl.edu), which uses semantic faculty profiles for “enabling national networking of scientists.” VIVO has been trying to surpass the 80 percent completion rate that is typical for LinkedIn members who get their profiles up to date. VIVO is aiming to get an automatic 80 percent from administrative data and from bibliographic resources, but getting beyond that 80 percent rate has been difficult.
One of the things VIVO has done is to use information in grants. They have been able to do that with some of their funding because the National Institutes of Health (NIH) and the NIH National Center for Research Resources were amenable to supporting that. VIVO has six projects that will have some interesting results with short turnaround times.
The 1970s film Network has the classic line “I’m mad as hell, and I’m not going to take this anymore” by a broadcast journalist working in the television networks. That was superseded fairly soon by the current network, which is entirely different. The discussion here has made it clear that the emerging network is an incredibly enabling, individually empowering, and self-organizing system. It is the kind of technology that supports bottom-up research and business, but also enables a Google or a Facebook to arise and become a multibillion-dollar enterprise. The new network’s versatility is potentially available for all the science examples described in the past day. The characteristics of the internet are important, because this environment can completely change the way research and data sharing are done. It is also important to identify the type of research that is needed to get a better understanding of how to make better use of the emerging network for science and applications.
Issues in Reproducing or Reusing Data
Facebook and Twitter open up the science of social networks in a tremendous way. For example, consider data mining capabilities: When you have a huge portion of the planet connected like that, there is a big opportunity for research, just to learn from data mining.
Existing social network firms certainly lend themselves to statistical analysis and research problems involving large datasets, but there is a lot of informatics work that could be done by simple tools that still needs to be funded and developed. As an example, one discussant worked for a confocal microscopy lab. The images from a confocal microscope are not flat, two-dimensional images that you can put onto Flickr. They are three-dimensional. They have multiple channels for different wavelengths of light. In short, they are multidimensional images that a scientist needs to be able to visualize in three dimensions, turn around in a computer, and isolate the different channels of light. Flickr will not do that, so special tools are necessary to make those images useful.
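To make the multidimensional point concrete, here is a toy sketch, with tiny pure-Python lists standing in for real voxel data, of isolating one wavelength channel and flattening its z-stack with a maximum-intensity projection, a common quick-look operation that a flat image-sharing site cannot provide (the channel names are invented for illustration):

```python
# Toy confocal stack: two wavelength channels, each a z-stack of
# 2x2 image planes (lists stand in for real voxel arrays).
stack = {
    "gfp": [
        [[1, 2], [3, 4]],   # z = 0
        [[5, 0], [1, 2]],   # z = 1
    ],
    "dapi": [
        [[9, 1], [0, 0]],
        [[2, 2], [7, 3]],
    ],
}

def max_projection(planes):
    """Maximum-intensity projection: flatten a z-stack to one 2-D
    image by keeping the brightest voxel along z at each (x, y)."""
    return [[max(plane[y][x] for plane in planes)
             for x in range(len(planes[0][0]))]
            for y in range(len(planes[0]))]

# Isolate one channel and flatten it for quick 2-D inspection.
print(max_projection(stack["gfp"]))
```

Real tools additionally handle rotation, channel overlays, and the many proprietary file formats mentioned below, but even this sketch shows why flat-image services are insufficient.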
Furthermore, there are about 150-plus different file formats for confocal microscopy, and it is a huge problem, because most of those come bundled with the microscope as part of some proprietary software. The data become unusable after a couple of years, because nobody is using that software or that microscope any longer. That is a good example of a failure of the
bottom-up approach, because the demand for a better file format has not come from the scientists who use those microscopes. They might wish that it would happen, but when they purchase the equipment, they are purchasing a lens first, and a file format is not high on the list. The researchers have not yet come together to demand that the vendors provide a better file format.
Another discussant had a different experience, however. She coauthored a paper, published in Nature, whose PDF was built from high-dimensional datasets. Part of the point of doing that was to take the file formats, of which there are many proprietary ones, and have a format in which people could publish the data, not necessarily in their original richness, but in a way that somebody could reuse them. There are many people in the vanguard of this community who think the PDF is an obsolescent format. However, this discussant and her coauthors were able to do that because she was at one of the hacker places in Silicon Valley on a trip and learned that it was possible to make three-dimensional (3-D) PDFs easily. Peter Jackson employed so many 3-D engineers in New Zealand that some formed a company called Right Hemisphere and then licensed its technology to Adobe. Adobe had put a 3-D feature in its free Reader that few people knew about. The authors made a deal with Nature that if they got their paper accepted by that publication, the paper would have 3-D content that people would be able to use.
The way that Right Hemisphere normally makes money is with car companies and other companies that use computer-aided design in their pipeline. They set up million-dollar customized pipelines for a community that has 40 or 50 different file formats and wants something in a standard format—a job, a tool, a PDF, a PowerPoint—to come out the other end in such a way that their engineers and their marketing people can just press a button. Nature thought it would do this on a trial basis and then get a million dollars or so to buy one of these tools for the magazine. The only problem was that they never got the million dollars to set this up. The discussant and her colleagues are now about to publish the first one in the Astrophysical Journal. It may take another 6 years until many more researchers do it.
The extent to which people care about the file formats they are using, or are interested in having their data reproduced or reused, varies by discipline and application. The kind of investment worth making would be to find some smart people who could develop a streamlined way to deploy this million-dollar system on the Web, so that others could instantly turn those 40 different confocal microscopy formats into a PDF that they could just send to Nature. That is the kind of thing that would enable science: not the actual creation of the content of the article, but the pipeline for making that a whole new kind of article.
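The core of such a pipeline is a dispatch layer that normalizes many vendor formats into one vendor-neutral representation before any standard output is generated. The following is a minimal illustrative sketch of that idea; the extensions, field names, and reader functions are invented for the example and do not correspond to any real confocal software.

```python
# Illustrative sketch (not a real confocal toolchain): a reader registry that
# normalizes several hypothetical proprietary formats into one open structure.

READERS = {}

def reader(extension):
    """Register a reader function for a given (hypothetical) file extension."""
    def register(func):
        READERS[extension] = func
        return func
    return register

@reader(".fmtA")
def read_fmt_a(raw):
    # Vendor A stores channels and stack depth under its own field names.
    return {"channels": raw["bands"], "z_slices": raw["depth"]}

@reader(".fmtB")
def read_fmt_b(raw):
    # Vendor B records the same information but names the fields differently.
    return {"channels": raw["wavelengths"], "z_slices": raw["stack_size"]}

def normalize(extension, raw):
    """Dispatch on extension and return a vendor-neutral record."""
    if extension not in READERS:
        raise ValueError(f"no reader registered for {extension}")
    return READERS[extension](raw)
```

Once every format is funneled through `normalize`, adding support for a new vendor is one registered function rather than a change to the whole pipeline, which is what makes the "press a button" workflow described above feasible.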
There may be one group that could be a very receptive community of end users. There are many resource managers—for example, people in state governments in various areas—who tend to be very appreciative when someone from the scientific community talks to them about what their needs are. It is so important to get the research results to the end user. It is not just putting the data out into the world; it is the data usability that is so important. In many cases, simply having access to the data does them no good at all. They do not know what to do with the data until the scientists and technicians provide the tools. It is important to think not just in terms of publishing research papers and data but also how those results can be used.
Reviewing Computational Research Results
One point Dr. Stodden made on the first day was that the thing people hate to do more than anything else is to review someone else’s software. It is therefore unclear how many people would review software that someone else has written. There may be some ways to do that, however. If a piece of software has 500 or 1,000 users, that is a fairly good indication that it is a usable and useful program, and the review could then be much less difficult.
This issue comes up frequently in the discussions about reproducibility and other aspects of working science. A code review as part of the peer-review process before publication is difficult, because it is an enormous amount of work that people do not feel equipped to do. There are middle-ground approaches that might be able to be implemented, however.
Nature published two articles in October 2010 on software in science that discussed how it is used and often broken, that it is generally not reviewed, that it should be open, and so on. Mark Gerstein, a bioinformatics professor at Yale, and Victoria Stodden wrote a letter to Nature commenting on this and saying that the scientific community needed to move toward code review. They suggested a broader adoption of what some journals have done, which is to have an associate editor for reproducibility who will look at the code and try to reproduce the results. The burden would then not fall on the reviewers. That is another possible way forward: having code submitted and made open, and then incorporating this aspect of review. Nature declined to publish that letter, but it is on Dr. Stodden’s blog, and the idea is being discussed further.
Another discussant noted that she has reviewed code plus data for journal publication and an example of the kind of problems that come up is “the compiler did not run because they changed something in the most recent version of Unix.” Every now and then you find real code errors. There are a number of domains that do this as a regular task. A list of approved software is not likely, but different audiences and different kinds of follow-up are definitely worthwhile.
Incentives versus Mandates
Of the 10 stories that Wired magazine listed as the biggest science breakthroughs of last year, 5 seemed to be based on the use of databases—2 concerning GenBank and 3 concerning astronomy. The year before, there were three GenBank successes identified, but no astronomy and no earth science breakthroughs.
There are two fundamental human motivations: greed and fear. Greed is working, but very slowly. This carrot–stick dichotomy indicates that it is still something to be sold to researchers rather than something that is a so-called “killer app.” Should we be appealing to fear, telling researchers that the real problem with not sharing data is that somebody in India or China will be running rings around them? Perhaps the most effective approach is to pick one message, whether it be enforcing the data management mandates, supporting tool building, or supporting sociological analysis of what scientists do with data, and to explain why just that one message should be the focus, because 15 seconds of attention is all that we are likely to get from the research community and the decision makers.
Much of the successful work happens through bottom-up approaches, and this is a lesson we should learn. Identifying and rewarding the people who are already doing this is a
strong option. Focusing on what can be done from the bottom up to acknowledge good work, and to help keep researchers doing that work, can get them more visibility across the sciences, along with other benefits. The Web can be a very forgiving medium in the sense that if something is wrong, it is okay as long as it is fixed fast. The scientific community generally does not have the kind of culture that gets things moving quickly, however. We do a lot of design and architecture. There is nothing wrong with that, but as this area evolves from the top down, the bottom-up efforts can be rewarded as well.
Journals also can help enforce reporting guidelines if there are standard metadata on which you can get a community to agree. In this case, journals can tell authors that anybody who submits a certain kind of experiment has to include certain information about their experiment. They are effective in applying pressure to authors, because publishing is one of their motivating factors.
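In practice, such a reporting guideline reduces to a machine-checkable list of required fields that a journal can run against each submission. The sketch below illustrates the idea; the field names and types are invented for the example and are not taken from any actual journal's metadata standard.

```python
# Illustrative sketch: a submission check a journal could run against an
# agreed community metadata standard. Field names are invented for the example.

REQUIRED_FIELDS = {
    "experiment_type": str,
    "instrument": str,
    "data_repository_url": str,
    "accession_number": str,
}

def missing_metadata(submission):
    """Return the required fields that are absent, empty, or of the wrong type."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        value = submission.get(field)
        if not isinstance(value, expected) or value == "":
            problems.append(field)
    return problems
```

A check of this kind could run at submission time, so that the pressure the journal applies is automatic rather than dependent on reviewers noticing missing information.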
Regarding prizes, a couple of years ago something like Kiva19 was proposed for science—a type of micro-loan that researchers could apply for, say, $50,000 from the National Science Foundation (NSF), to try something new. Perhaps it would be $100,000 or $200,000, depending on the scope. To determine how scientists should meet the data management plan requirements, we could have a contest for a couple of years where people could try different approaches.
If a technology has the right branding, then other people will discover it more easily. Referring to the “app store” example, the app store has applications that get recommended by Apple, and suddenly millions of people download them. Another option is for NSF or some other entity to have a competition with small grants to find approaches that seem to work as widely applicable solutions. The prize would be that they get a brand that says something like “NSF Approved.”
Another similar suggestion focused on some sort of sandbox, where people can experiment with different approaches. The people who would take on these kinds of projects would not take on all the ones that were listed earlier in this summary, but they would have various missions. Some individual investigators might accept a 5 percent tax so that technologies for data management systems could be developed.
There is already a group of people who are reusing data and making good discoveries. We have seen some policies started, such as the call for data management plans, which are helping to get people who produce data to think about managing and sharing them. One of the best incremental steps on the production side right now, other than letting some of those mandates play out and trying to extend them to other funding agencies, will be to continue to encourage people to contribute to collective databases. There are many mechanisms for doing that, such as journals requiring registry numbers. We actually know a good deal about how to get collective databases implemented successfully.
Another option that can be pursued is to enable innovators who are doing productive work. This can be done in different ways. One option would be to highlight the researchers
19 Kiva is a nonprofit organization with a mission to connect people through lending to alleviate poverty. Leveraging the Internet and a worldwide network of microfinance institutions, Kiva lets individuals lend as little as $25 to help create opportunity around the world. See http://www.kiva.org/about
who are responsible for breakthroughs, as suggested earlier. Another option would be to initiate a prize for the most innovative piece of data reuse for the year—not for the people who supplied the data, but for the person who had the bright idea and went after the data and did something extraordinary with them. Another option is to work with people who are producing results by reusing other people’s data to get them more data. If the theory is that “more data wins,” let us try in some very focused areas to further enable the people who are most active.
The NSF data management plan can be seen as the first iteration of the network plan. That is, from the top-down perspective, the way to break through the discipline silos is to require people within each silo to say how they are going to use their funding to be good network citizens. In what way are they going to be a responsible steward of the research funds with the network in mind? Tell us how you are thinking about the network in your plan to use these resources. Will you make your database available so that it becomes a networked resource, and will you annotate it? The point is that if government agencies start requiring people to plan for using the networks, that could lead to a powerful shift in the way people think about it themselves.
If a checklist of the properties of a good network scientist citizen were compiled, not every scientist would check off all the boxes, because science is heterogeneous. If scientists checked a certain number of those boxes, however, then you would tend to move toward those goals. That would be a documentable approach. When a scientist submits a proposal for funding, he or she can point to a previous project and say, “I was a good network citizen based on this.” For example, one of the boxes could be about reproducibility. If a researcher’s field is not amenable to reproducibility because of the size of its datasets, it might not apply, but it would apply to someone else. There would be some basket of these properties that would determine how good a network citizen you are. That would be something measurable and doable.
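The "basket of properties" described above can be made measurable with a simple scoring rule that excludes boxes that do not apply to a given field, rather than counting them against the researcher. This is a minimal illustrative sketch; the property names are invented for the example.

```python
# Illustrative sketch: scoring a "good network citizen" checklist, where
# properties that do not apply to a field (e.g., reproducibility for very
# large datasets) are marked None and excluded from the score.

def network_citizen_score(checklist):
    """checklist maps property -> True (met), False (not met), or None (N/A).

    Returns the fraction of applicable properties that were met.
    """
    applicable = {k: v for k, v in checklist.items() if v is not None}
    if not applicable:
        return 0.0
    return sum(applicable.values()) / len(applicable)
```

Because not-applicable items drop out of the denominator, heterogeneous fields can be compared on the same scale, which is what makes the approach documentable in a funding proposal.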
The Department of Defense (DOD) has gone completely net-centric. It talks about everything in terms of being net-centric. It might be interesting to look at what the DOD is doing and see how civilian science agencies might think about net-centricity. Keep in mind, however, that the DOD is very command-and-control oriented, so it can do these things top down.
A study could be conducted to look at the impact of the NIH access policy and ask whether it should be extended to any other federal research agencies. Other agencies also could follow NIH’s lead in requiring that those who submit grant proposals include in their bibliographies not only lists of their papers but also lists of the places where that information is publicly available. When a scientist reports on results of prior research, the scientist could list where those data are publicly available. The National Library of Medicine (NLM) has developed many process metrics. These include, for example, determining the percentage of the documents produced with NIH grants that are getting deposited in PubMed Central. That number has substantially increased.
At this juncture, rather than investing heavily in making data available, maybe the time is ripe to put some resources into the people who are actively trying to use the data to do some valuable scientific work. Maybe that would be a good strategy for a year or two, and a good thematic response.
Developing the Supporting Infrastructure
The NIH’s National Center for Research Resources (subsequently abolished) and the Chinese government both have had strategic plans for research that include infrastructure development and funding. Since there is a lot of talk about funding the enabling tools for research, one option could be a similar kind of mechanism in the U.S. government for science generally, not just for biomedical science.
This discussion has identified a problem, or barrier, to the development of infrastructure that cuts across science domains. This could be further investigated at a higher level. In fact, some previous reports have noted that, and the President’s Council of Advisors on Science and Technology (PCAST) report on the Federal Networking and Information Technology Research and Development (NITRD)20 Program also looked at this.
This infrastructure is hard to develop, however, because it depends in part on a large research group that can support it. The community would have to determine how best to promote and cite that to help solve some of these problems. A top-down management approach is one option, but it could also be good to think about doing it through bottom-up scientific innovation. Whatever approach may be taken requires an advocate to make that happen, and it cannot be done very easily.
Technology Transfer Mechanisms
The role of the university technology transfer office in all of these processes remains a question. The typical instantiation is toward promotion of commercial interests and licensing. What is the relationship between the technology transfer office at the institutional level and the funders themselves? Could that relationship be a bridge toward addressing some of these unfunded areas?
If there is a way to follow up on this idea of a public arm of the technology transfer office, that could be a way to disseminate more broadly the technology, information, and datasets that have commercial potential, but also to do other things, such as resolve ownership issues. If academic datasets were shared more broadly, we would see certain patterns evolving, and those could become templates for how those issues—in a legal and a citation sense—could be sorted out. The technology transfer offices could be one venue in which to build partnerships to encourage this, at least at the institutional level.
Would there be a possibility of organizing some projects? Among the universities and other entities, could a few demonstration projects be organized to show how we see it being done? They would experience various difficulties, but in the end they could see how it can be done.
If a project cannot get funding because it is too “applied,” could the NSF have a technology transfer operation? The idea would be not to fund things that could be commercialized, but rather to fund things that would be used to advance scientific research. If there were a GenBank-like division at NSF, for example, it could encourage the application of the results of research that it funds, although not necessarily all the way through to commercialization. In some cases, such as drug discovery mechanisms, it could lead to commercialization, but in other cases, such as discovering stars in the sky, probably not.
If NSF did have a technology transfer concept, how would it be different from a university’s technology transfer function? The NSF does have the interdisciplinary Office of Cyberinfrastructure, which is supporting the development of new software systems, but it tends to be a much higher-level activity at the moment. One of the challenges is determining if this could be done at a specific domain level, or whether we would want to do it more generally.
In addition, a 5 percent tax on research budgets is an option that could be explored further. It is not clear how many people, particularly the program officers at funding agencies, would endorse a 5 percent tax of the funding, but they might. Maybe the agreement could be that if one agreed to pay the tax, it could come with certain benefits, such as access to some repository. That is, researchers could have the carrot and the stick at the same time.
Another model within government science agencies could be to collect technologies that might be broadly used from various directorates or projects and have them available in one place, or at least have pointers to all of them. This is already being done to some extent, but informally. Technologies are created using public research funding, and if they are of value, then they may be put on the Web to let others know about them. There is nothing very systematic, however, that gives NSF or some other agencies credit for having funded something that resulted in a technology or a tool that can be broadly used. So one question would be: Is there a mechanism within the Office of Cyberinfrastructure that could be used to make these tools or technologies available?
There are other government entities that are not explicitly funding agencies—NITRD is a good example—that have the mission of supporting information technology (IT) research and development and promoting the use of that IT for national priorities, including science. NITRD could be a good ally in this area. The NITRD director is very interested in data issues and data-intensive science, but the organization has no money. All it has is the right to convene groups of experts, mostly in IT hardware, but it could be beneficial for NITRD to be involved.
The recent PCAST report that was discussed above makes the point that the issue of infrastructure needs to be moved up to a level where it is not individual agencies competing with the individual scientists’ budgets, but rather at a level where somebody is looking at this infrastructure as a national priority for innovation and science. There is now much more discussion at the top levels and the research agency managers are beginning to see the value of it.
It is harder to do than it sounds, however. At the University of California, for example, anything that costs more than $500 has been considered capital equipment for many years. This is not rational, because the university has enormous amounts of space. It has junkyards of technologies that cost more than $500, because if they cost more than $500, they have to be on the capital equipment inventory, and it costs more to dispose of them than to store them. The reason for such a ridiculously low limit is that anything that is considered to be capital equipment does not get taxed for overhead on grants. The principal investigators have wanted to keep the capital equipment threshold as low as possible, because it benefits them in the grants. It sounds preposterous, but California still has a $500 threshold for capital equipment, and it has been this way for some 15 years.
Publicly funded research organizations are doing projects that are overlapping or similar to each other, whether in medicine or climate change or other areas. It would seem as though they could repackage those projects to demonstrate the multi-disciplinary needs and
benefits of such work. Such projects sometimes also have international partners, such as Brazil, Australia, and Europe, for example. That would allow them to demonstrate something similar on an international scale.
The Virtual Acquisition Office science advisory committee made the mistake of asking the members of the committee to come up with test projects. Some people were very opposed to limiting something like that to a small number of investigators. A regular (small) grant competition could specify that it is closer to infrastructure than research. This would have to be made extremely clear to the reviewers, because otherwise they will not understand. Grant competitions have become a grandiose vision with millions of dollars to do research. That was not what was originally intended.
Another important feature of this data-intensive phenomenon is that there is a general sense that we ought to change practices and change culture so that the new and better ways are accommodated along with the existing ways of doing things.
How do we change culture by changing practices? How do you change culture? How do you change practices? It is an easy thing to say, but it is a very hard thing to do. It takes a lot of time and effort, and often a lot of pressure, because people’s interests at the local level are not served by changing practices to which they have become accustomed.
You can also turn it around to make it “change practices by changing culture.” Voluntary kinds of encouragement to stimulate open access have been found not to work very well. That is why we have had to resort to mandates, whether through the journals or legislation or funding mechanisms, to change both culture and practice.
At the same time, so much has been added to the list of mandates that accompany proposals. We have to promote minorities, help K–12 education, and so on. Of course, those who submit proposals claim that they are totally behind these mandates and support them, but frequently they do not do anything to advance these goals.
One of the things that was proposed to the NSF leadership was letting grantees pick what they are going to affect and then incorporate that in the proposal. These interest groups, however, have worked for a long time to get themselves into that mandate list. The fact that not much happens from them being in a mandate list is not as important to them as being in the mandate list itself. There is a kind of culture of incumbency just around being listed. It does not necessarily have anything to do with meaningful change.
We can push hard on this changing-practice, changing-culture approach, precisely because it is hard to do. What can the agencies do to help change practices and cultures? Probably no amount of studies will result in changing practices. Adding something to the mandate list may not change practice either, because the mandate list is already so long that it is just an exercise.
The steps involved in influencing culture are not interchangeable, however. The way to change culture is by changing practice, and this example of NSF requiring data management plans is a good first step, although it is a very tentative step, and there is still a long way to go. Everyone now is holding workshops on how to write the correct data management plan. Maybe in 3 years there will be many more people thinking that this is a good thing to do, because they would have been made to do it and they would have started seeing the benefits of doing so. Thus, there could be some value in recommending specific things for changing practices.
One answer to this question is the old adage “you get what you measure.” If we are not measuring the impact of the data plan, we are not tracking it, we are not enforcing it, and then nobody cares. Suppose, for example, researchers were asked how many minorities they supported on their last NSF grant. If it was one-third of what was promised, then the researchers would get one-third of the requested money on the new grant. You would suddenly have many more people meeting those mandates.
It is easy to talk about measurement and enforcement. It is harder to make that work. We may be coming to a point where we need to start thinking about what is going to work.
The NSF seems to be trapped in this situation where it is supposed to be getting ideas from the community about where to go with the science. Although there may be many people interested in this issue, the population of people doing science and being funded by NSF is much larger, so the view being expressed by those sufficiently interested in any particular issue is a minority view within that sourcing process. How do you change this? Where are the pressure points and the leverage that will work, given the limited resources available?
Another discussant made a comment about the socio-cultural attitudes that we may wish to change. Reflecting on the tradition of science, it is important to celebrate that tradition and use it as a basis to build upon. Understanding traditions is very effective in introducing change to different cultures. In working with different cultures you discover that you need to take different approaches. If, for example, we are talking about the transmission of HIV in a culture that has polygamy, it is irrelevant to talk about being faithful. Therefore, if you know the important traditions to select, you can build upon and influence that change more rapidly.
Role of Libraries
In setting up a national information-organizing center that oversees and manages standards and vocabularies, it is important to remember that when data and information are created, they may have to be maintained over long periods of time and adapted as things change. Therefore, there is a certain amount of infrastructure that has to be supported from the top down. It is not possible to get universities to do this in a distributed way.
Libraries are giving much of the advice for data management plans right now. Research libraries have seized upon this as an opportunity to do outreach to the science departments at their universities and to help them figure out how to do the data management plans that the researchers are now being required to do. Hence, research librarians are allies on the ground at universities, and they reach out to the scientists to help disseminate cultural ideas and strategies that funders and policy makers would like also to see implemented.
One of the big questions circulating in the library community currently is what to do with print collections now that we are moving much more toward digital materials. There are a number of issues that have never been encountered before. We have been building up print collections for the past several hundred years, so we know a lot about that. We do not know much about reducing the collections, however. It is hard. It is more of an ecological approach than telling others what they should do.
A problem is that it is very easy for scientists to say that issues or concerns expressed by one community have no corresponding value in their community. There are many ingrained practices, attitudes, and traditions that are candidates for change, but can we deal with them
sensibly? Research librarians may be able to provide assistance to scientists that would not otherwise be available, particularly at the college and university level.
One thing that may be helpful is delineating the roles of different kinds of experts who are involved in the data management process. The funding agencies are like the data investors. The data producers are interested in their data, but not necessarily in how the data will be reused. A problem in some cases is that research funders want data producers to let other people reuse their data, but it is extra work for the producers to deposit their data someplace and to annotate the data in a way that makes them useful. Then you have the data reusers or analyzers as well as those who are the intermediaries between that data producer and the data reuser, acting as the data translator, manager, or marketer. This latter type of person will help the data producers—who really do not want to do this kind of work or think they do not have the tools available to do it—become aware of the practices or tools that exist for them to do this kind of work, and also manage the data and train them in how to do that more efficiently. Research librarians do that kind of work in genomics, for example.
In addition to the NIH Center for Research Resources that was mentioned earlier, there is another institute at NIH that deals entirely with information science and information resources: the National Library of Medicine. There is not an equivalent library for the basic sciences; although NLM has done some work expanding its focus into other related areas, it does not cover geosciences and other disciplines.
In the Agricultural Research Service (ARS) there is the National Agricultural Library. The ARS includes librarians in research project planning and data management. Without trust among the parties, however, we cannot open up a legitimate dialogue, pursue innovative directions, or provide mutual support. How to build trust really depends on those who are involved in a research project, but without trust the parties may work against each other.
Finally, one of the options that was suggested for changing and improving data management practices is to reduce the costs. Better leadership can direct people to what those practices need to be. One research agency has a saying concerning data management: “Should we, could we, and will we?” There is no argument about whether data should be managed, but the scientists want to know how to do that. Clear answers can foster effective leadership in different disciplines for preserving data. Some sort of basic minimum guidelines would be helpful—if not necessarily just tools or methods. Technologies have changed what we can do, but they have not changed what we do. Appropriate leadership can help us define what we do.
Communicating and Influencing Understanding of Scientific Knowledge Discovery in Open Networked Environments
It appears that there are at least four communities that are listening and thinking about these issues. One is government leaders and elites, who believe that there is something big here and are looking for guidance on how to think about it.
Then there are the people in the institutions—librarians, research leaders, and others—who are thinking about these issues from a different perspective. For example, many state universities are now focusing on the role of the flagship institutions of higher education relative to the others. The flagship schools are ostensibly the research institutions, but the distinction that some schools are for research and some are for teaching is somewhat dated. One big question here is what the flagships should do for the others. Should they do anything?
This will have an effect on access. A lot of science and scholarship can be done by secondary sources. As those secondary sources get better, the quality of scientific scholarship will improve, and it can be ever more distributed. That is a productivity issue as well as a participation issue. Therefore, this is a second audience, the combination of locally based research entities and the elites of those locales.
A third audience is researchers who are thinking seriously about the approaching change and wondering how to respond. This is important. It is not like it was 40 years ago when the more senior scientists were trained. This is different. What do we do about this? Perhaps that is a channel to approach some of the younger people, because they are coming up through the apprenticeship structure.
The final, fourth group, which is usually not addressed, is the creators and innovators—the Facebook and Google types and the people who are doing research in these hacking centers. In short, good things can happen in this self-motivating way. Moreover, they do not have to ask for federal money or congressional approval to do it.
There is a big difference between merely communicating and the process of scientific knowledge discovery. Some of what is being discussed is the need for better tools and applications to do science over Facebook and other new approaches. According to some studies, data practices are driven very much by the kinds of tools researchers use. By talking to people, we can know immediately the kinds of data practices they follow and the kinds of research they are doing, just by knowing the kinds of tools that they are using.
One discussant noted that he had just joined a large grant with people at the Mayo Clinic whom he has never met in person; he met them only through Twitter. His best paper recommendations come from people he follows on Twitter, many citations of his own papers can be attributed to his tweeting about his work, and he is invited to conferences by people who follow him there. This is an example of the value of scientific networking.
Another discussant noted that the topics of creating scientific discoveries and communicating knowledge are being conflated to some extent in this discussion. He uses software and writes code, but does so for a science that can never be practiced on a social network. For example, there are foresters who work with trees that take 50 or 60 years to grow. Much science cannot be done socially. At the same time, the communication of science, or even the enabling of the process of science, can be done with these tools. He belongs to user groups, and when he asks for help, people respond. He can read up quickly and does not have to go through a hierarchy to acquire that knowledge. He can also adapt the tools he uses, because they are all open source. Therefore, separating the function of creating knowledge from that of communicating knowledge is important. There may be some overlap, but they are not the same thing at all.
It is helpful to discriminate between the two and to ask what we gain by doing so. It is also useful to note that things we think of as nailed down, such as mathematical proofs, are not settled until the mathematicians say so. Mathematics is essentially a social process. There are some results, such as the proof of Fermat’s last theorem, that not all mathematicians agree on at this point: a preponderance of mathematicians think it has been proved, but there are some holdouts who strongly oppose that position.
Scholarship is communication. We should not assume that geology is going to be done on the Web. Geology is probably going to be done with rocks and other things gathered onsite. But the phenomenon of geology—what people agree about or come to hold as geological fact—is going to be socially constructed over time, and that is a communication process. The idea that science and communication are two different things is not valid.