

Copyright © National Academy of Sciences. All rights reserved.




4. Summary of Workshop Results from Day One and Discussion of Additional Issues

Introduction

Bonnie Carroll, Information International Associates

Today, we will take the results of the discussion from the first day of the workshop and build on them. We would like to hear about some options for what could be done over the next few years and what the value proposition is. Closely coupled with that is the issue of how we will know when we have succeeded.

We had some excellent concepts and ideas during the first day of the workshop that highlight much about the nature of the changes that are ongoing. I heard that we should not stay static and that the community might benefit from a less linear and more dynamic approach. We talked about trailblazers versus the long tail of data activities, and about whether we are heading into a credibility crisis, because many of the things we are doing now are not as shareable as they were when they were in print. Many codes are not available, for instance.

The title of the workshop is "The Future of Scientific Knowledge Discovery in Open Networked Environments." The title recognizes that we are adding a fourth paradigm to the research process. We already have three paradigms--the theoretical, the experimental, and the observational--but now we have added the computational, or data-intensive, paradigm. The presentations we heard during the first day of the workshop focused mainly on the latter. How do these approaches build on each other and work together? Why did we immediately jump to the data-intensive sciences when the title referred to a networked environment? The answers are obvious: first, the technology is so enabling, and second, there is the data deluge.

In the text that follows, we discuss the four sets of topics identified in the statement of task. Each section begins with an Introductory Summary of the Issues Identified on Day One of the workshop.
In the first and fourth sections, Puneet Kishor reviews the sets of issues from the first and fourth items in the statement of task, namely, the Opportunities and Benefits, and the Range of Options. In the second and third sections, Alberto Pepe addresses the Techniques and Methods, and then the Barriers, respectively. Each of these Introductory Summaries of the Issues Identified on Day One by the two rapporteurs is then followed by a Summary of the Discussion of that topical area by all the workshop discussants on Day Two of the meeting.

4. Opportunities and Benefits for Automated Scientific Knowledge Discovery in Open Networked Environments

Introductory Summary of the Issues Identified on Day One

Puneet Kishor, University of Wisconsin

Among the issues we heard about during the first day of the workshop was that certain problems in science are really the same, no matter what the domain. A scientist performing observational research, for instance, generates data and then manipulates, organizes, analyzes, visualizes, and reports on them, or perhaps preserves them, and then the cycle repeats. The digital medium is one of the best ways to communicate knowledge and is probably also the best way to create knowledge. The concept of a knowledge ecosystem, the whole system of producing, storing, and retrieving knowledge, depends on standards, metadata, discovery, association, and dissemination.

One presentation focused on creating a precompetitive commons, using examples from drug discovery, where fierce competitors can cooperate to produce evolving models of diseases. Cooperation can improve the translation of publicly available molecular data into biomarkers and increase the opportunities for using drugs meant for one disease to treat another.

The "crowd" plus the "cloud" concepts are important to the future of the network. There was an assertion that science can learn about data sharing from the World Wide Web, which could be argued either way. Discussants also mentioned exploring linked open data for interdisciplinary uses. Access to many data sources remains very restricted. There was also much focus on the long tail of science, defined as those fields where the benefits of information technologies are not immediately apparent. There were conversations about quality, format, access, financial support, and policy issues.
Other topics related to opportunities for serendipitous innovation that we cannot even imagine right now, while still others focused on government policies, such as the America COMPETES Act, that encourage or even require scientists to share their scientific data and information.

Summary of the Discussion of Opportunities and Benefits

This session examines the opportunities and benefits over a 5- to 10-year time horizon, so it is important that we understand the value proposition. Do we really know what the benefits and the opportunities are?

Building and Sustaining Knowledge Networks and Infrastructure

Can we say that one opportunity is to provide more and better infrastructure? If a common infrastructure does not exist, is it possible to build one that anybody can use? When we hear a term such as knowledge ecosystem, it raises the concept of a knowledge network. The discussion on the first day of the workshop about a nonlinear, rather than a linear, way of doing science reminds us of a neural network. That is, there are many different synapses in the brain that connect different facts, memories, and experiences, leading to an understanding of a situation. The same kind of process happens in science. We saw the example of the pharmacological studies that found linkages between drugs used to treat one disease and drugs that could be used to treat another because there was a common gene sequence. There was a connection in the network showing that the same thing being used somewhere else actually does link back.

This suggests moving toward not just linked data or a knowledge ecosystem but a linked knowledge system. That is, we could merge those two ideas. We could benefit across all disciplines and truly enable long-tail science--the science that is so removed from our experience that we never would have thought of doing it.

We are likely in a period of persistent restricted resources, however. Hopefully it is not a zero-sum game, but assuming it is such an era, what is the best way to deal with that situation? Do we need to think of ways of developing systems that could be used by many different kinds of scientists? Are there general characteristics of such systems?
Again, we do not need something only for computer science; we may also want something for the agencies funding research.

Demonstrating the Value of Data Sharing

Change is always going to be before us, so how do we work together in our public-private partnerships to adapt to change? We need to show the value of sharing data to our principal investigators. Science is going to be done differently than it was by scientists 20 or 30 years ago, so we have an opportunity to redefine the science as well. We can imagine a scenario in which a scientist is speaking to a congressional staffer who has heard that scientists are just "welfare queens" in white coats and wants to know why they are being supported. We need some convincing examples where data have led to important discoveries. Earth sciences data and medical data could provide some useful anecdotes, for instance. We could survey the number of hours people spend analyzing databases as opposed to the number of hours they spend reading traditional journals. For example, Princeton University spends about $10,000 per faculty member per year subscribing to journals. If we could quantify the relative costs, we could put a value on the database. The bottom line is that

when we are talking to somebody who is not a scientist, that person does not necessarily start with the assumption that science is worth something, so we need more descriptive stories. For example, the digital library program is justified by pointing to Google. We need more compelling stories that appeal to outsiders.

Also, a protein chemist cannot work without the Protein Data Bank. There are many scientists who do not work the way research was done before there were digital databases. They are data-mining researchers. If you take away the databases, their type of research would stop. You will get different answers in different kinds of research, however. For instance, there is much more to online astronomical research than the official National Virtual Observatory or Virtual Astronomical Observatory (VAO) projects. Although the VAO has led to many useful results, so far it has focused only on infrastructure, without building end-user tools. In astronomy and in other sciences, there are databases that are even larger than the Protein Data Bank. Researchers in these fields use them every day, but the results in some seminal papers may not be cited. Astronomers use data collection facilities such as the Sloan Digital Sky Survey, the Two Micron All Sky Survey (2MASS), and the upcoming Large Synoptic Survey Telescope. Scientists see these huge databases as something on which they have come to rely.

What they may not understand well, however, is what would happen if the medium- and small-scale datasets that are attached to the literature were also openly contributed. Incentives alone might not work. Scientists may not be convinced that their datasets are worth depositing, unless the funding agencies make that a condition of their grants. Big science examples may be the wrong approach to convincing scientists, however, because they are too common.
In the physics community, researchers would rather work with the Large Hadron Collider (LHC) than their own small data collections, because the LHC will have a huge amount of data. There are other fields in which generating the data may not be very expensive, especially when amortized over a longer time period, but in which the data may nonetheless be costly to maintain, because many people are involved or for some other reason.

There also may be a potential problem in asserting that it is more efficient to share data, because if this were true, why are scientists asking for more money to do it? If researchers are going to be more efficient, they should be able to perform better science with fewer dollars. If this is not the case, then there is a question about the efficiency of the system. Also, when we say that it is going to be more efficient, another issue is: Where will the savings go? The savings could go into the analysis of existing data, rather than into collecting and organizing new data. Put differently, the budgets may not go down, but the analytical capabilities could increase.

The research funders could identify the different areas where data sharing is extremely important, and they could then establish a reasonable time span for making the data available. Some guidelines to the scientific community could be useful in this regard and constitute an opportunity. For example, we might eliminate the 2-year gap between obtaining research results and sharing those results through publication. This could allow the timelier sharing of the methodology, the data, and even the codes developed for the study. It could lead to real benefits. Too much specificity concerning what should be shared, however, can be a problem as well. If science were segmented into different categories according to a level of what should be shared, it could be counterproductive.
A specific time frame for sharing data absent

countervailing considerations, such as the Health Insurance Portability and Accountability Act (HIPAA) or similar requirements, could be very useful and would signal to scientists that there is recognition that sharing is very important. It would encourage more scientists to start thinking about the next steps after they produce the data: what infrastructure they will need, how they will have to reorganize their own budgets and funding, and so on. A rigid deadline can be a problem, however. Voltaire said, "The best is the enemy of the good."

Everybody is quick to complain that data curation is too much work. What is the cost to the nation's scientific effort, however, of peer reviewing weak papers? In some fields, preprint distribution and conference presentations have essentially taken over the publication function, and the journals exist so that people can get tenure. A system relying on something like the Cornell arXiv might be better for scholarly communication, if there were something equivalent to page ranking that would help people decide what they should be reading. If the publication system were less strongly tied to tenure, it might be easier for universities to determine that they could use other metrics. This could save a lot of money if universities no longer relied on the ranking of publications to determine who would get tenure.

On the one hand, the promise of data sharing is becoming real. On the other hand, there is a powerful reinforcement of Max Planck's observation that science proceeds funeral by funeral. In the health sciences, the progress of science may be slower than it has been historically. Some scientific processes may even retard the pace of scientific progress. The most difficult organizations to consult for are the ones that are doing well: if they would need to change the very things that made them successful or rich, their view, of course, is that they are fine, and they are not receptive to good ideas.
The difficulty is getting the existing order to change its approach when change is needed. This is a collective action problem. Individual scientists who would like to fulfill their ideals as scientists often find that they cannot.

Framing the problem in terms of the reproducibility of scientific results can be very appealing to scientists, however. This framing can give them guidance about when to share data and what types of data to share. The same argument can be helpful in convincing people of the importance of openness in this networked environment, because openness is a core principle of science. Nobody argues that reproducibility is not important in the sciences, so it can serve as a guiding framework for deciding what steps are important. The climate data controversy at the University of East Anglia in the United Kingdom and elsewhere has indicated that transparency is also key. The average person is not going to reproduce those climate simulations, but citizens want some assurance that there is access to how the model simulation was made and what data came out of it. Therefore, transparency and reproducibility are both important.

We also may separate the different axes along which we can argue that there are benefits. We talked about the speed of communicating research results, so presumably the discovery of new results is one aspect. We have talked about value, and a couple of discussants reminded us that the value of discovery, at least when we are talking to the public, is predicated on the listener's view of the value of the underlying science. If someone believes that many discoveries in the underlying science are not particularly valuable in the first place, being able to make more of them faster does not seem to accomplish very much.

A strong case for value can be made in the biomedical sciences, particularly as research is translated into health care. For example, it might be possible to determine the real dollar value of repurposing pharmaceuticals. It might be worth picking an area in which the economic value is fairly noncontroversial, as opposed to something like basic mathematics or astronomy, and then starting to quantify the value, looking specifically at data reuse in those settings.

One thing that has not been mentioned in this discussion, but was talked about yesterday, is the idea of data as an unexploited resource, because data have not been part of the traditional journal-based way of communicating science. Negative results in early clinical trials were traditionally not reported, for instance. Because that work costs money, there might be some quantitative way to get at the value of what is now being withheld.

A couple of the discussants at this meeting were recently at a geosciences data workshop, where the question arose of how to demonstrate the value of initiatives that integrate data and make those data easily available. For example, congressional staff members are interested in problems such as energy, food, and water. If we are asking for additional resources to enable data sharing, it can be useful to tie that request to solving some of the issues important to them--to medical benefits, for instance, or better ways of managing natural resources. That also bears on the discussion of incentives and mandates, because when scientists feel that their work is directly related to producing a social benefit, there is additional incentive for them to share the data.

Speeding Up the Pace of Science

The repurposing of data can be a real opportunity for speeding up research. How do we learn from the World Wide Web? What are the inherent opportunities? Companies are investing a lot of money and making advances in communication tools that people are using every day.
Scientists may have a problem in seizing the opportunities, yet everyone is using the commercial media to communicate, to make decisions, and to do many other things. Is there something we can learn from the commercial community to apply to science in this regard? Can we encapsulate very concisely some of the opportunities or the benefits?

In March 2011, none of the U.S. newspapers had headlines about the earthquake and tsunami in Japan, because, of course, it happened early in the morning, and they had already gone to press. In the old days of newspaper-based journalism, it would have been in the evening edition before most of us knew what was happening. Then when radio and TV journalism came along, they became the media that provided news during the day. Now we can watch videos of the tsunami almost instantaneously on the Internet.

Think about science at that pace. An astronomy course 20 years ago had films that were not very compelling. It was like playing chess: not only did you have to be smart, but you also had to be very patient while the other player moved. To keep up with the modern world, the pace of science has to change. It is clear that data science--not just big data science but the type of science where we can find the appropriate data, pull them together, and quickly get them into a workflow--is what will change the face of science. We need to keep up with that kind of speed. Much of the reason that we may use the Internet as an analogy for how science can be done is the speed of change. Just a few years ago, Twitter did not exist. Many people who use it now cannot live without it. We are being asked to solve problems that are changing at the pace of the Internet.

The geosciences are well positioned to react to this real-time aspect that was just alluded to, whether in the study of earthquakes, tsunamis, flash floods, or tornadoes. Scientists today have the capability to bring that information into real-time use within seconds of an event happening. This also bears on the point made earlier about tying research to societal benefits that resonate with people in Congress. What does this mean for the protection of lives and property? The rudimentary informatics are in place and can be greatly expanded in the future with the mobility that comes with smartphones and other mobile devices, so that you no longer have to be at your computer to get real-time information.

The Web, semantic data, and linked open data have had major effects, but not all science may be done at the speed of Twitter. It is one thing to report some results quickly, which is what Twitter and similar tools do; it is another thing to trust science that has been rigorously tested. For example, there is a 10-year study on body fat as a predictor of heart disease. A 10-year study cannot be compressed into 24 hours. Some things just cannot be hastened to completion. This comparison with the speed of Twitter and Facebook, therefore, can be misleading, because it is comparing apples to oranges. The problem with putting these sorts of ideas in a recommendation is that a policy maker might latch onto them and say we want science to be done overnight, which could be more of a problem than a benefit.

Preserving rigor in science is a very important message, but in the current system there is a major delay between obtaining research results and publishing them. We should not say: "I am sorry you cannot see my data, because that is how I am going to win my Nobel Prize 30 years from now."
These practices and attitudes are barriers to the goal of using science to solve people's real needs and to better understand the phenomena around them. The 10-year study is great when you need a 10-year study, but not if it is the primary way of doing science. For the next 9 years, you may have people dying who could have been treated by something that might have been ready to work. Verifying, validating, and doing rigorous science, of course, are important. It is the sharing of information, however, that can naturally speed up some of these processes. People in unrelated fields right now may not easily discover prominent work until it has appeared in an archival journal. This means the work has been written up, the visualization of the data prepared, the paper sent to the journal, reviewed, and taken through the publication process. There are faster approaches that tend to be very different from field to field, but the main issue is the delay of disclosure, the long time it takes to go through this life cycle, workflow, or ecosystem.

We also ought to keep in mind that one size does not fit all for these processes. When we examine the management of data, we may be talking about long-term experiments or about datasets that are developed to help in an emergency. Just those two instances present very different kinds of data types and uses.

Supporting Interdisciplinary Research

Most of the presentations on the challenges from the first day of the workshop were very interdisciplinary, and most of the presentations on the solutions, describing what people are building, stayed inside disciplinary boundaries. That actually has more to do with how the funding mechanisms work than with what people would work on in a natural setting, but the interdisciplinarity of the key challenges is important. What happens on the Web is that a

small number of people who can work between areas can transfer huge amounts of information back and forth between those areas. That kind of work tends to be viewed as an ancillary process, not as a major part of doing science. Further, most scientists who have done this kind of work did so at the peril of their careers and may simply have been fortunate.

Another point is the importance of informal communication. Scientists are used to formal communication. Some of the systems mentioned yesterday, and some of what the astronomers are doing now, have an extra social and communicative back channel in addition to the primary scientific channel. For example, informal communication is how scientists may get information from conferences they have not attended personally. Journal papers are the end product that gets put in the library for historical reasons. That is why they are called archival journals; they are not always the primary reporting of the scientific activity. Recognizing the need for much more of that informal interchange is also important. Those are two key aspects of research: an interdisciplinary approach and informal social communication.

Informal communication can be structured like the World Wide Web: as a network of information and knowledge that you can search while also finding the linkages within it. We can take advantage of that type of structure, which does not eliminate the journal system but improves the communication among different sciences, not just within a single science. If we are going to try to find the commonalities that allow us to cross boundaries in data sharing, those are problems that can be addressed. There are high-dimensional image formats, complex text formats, and big table formats, and it does not matter which discipline produces them. The tools for analyzing, visualizing, and integrating them depend on the structure of the data, not on the contents of the data.
Thus, we can look at what those different kinds of datasets are and what the solutions might be. We have talked about knowledge networks and collaborative environments as ways to communicate and get results out more quickly. One caution is that there are many models for those kinds of practices. As new ways of communicating and collaborating are developed, it will be important to think about what actually is happening (in knowledge and experience), how we collaborate and communicate, and how we capture that so it can be reused and mined.

One of the issues of citizen science and informal education is the quality of the data: how good the data are for "real scientific research," and how to describe their quality. That remains an ongoing question. The National Ecological Observatory Network (NEON) is sponsoring a project called BudBurst, in which people with cell phones go around and take photos when the first leaves come out on deciduous trees and when flowers first appear, because that timing is expected to change with climate change. It is a very potent opportunity to provide a bit of informal education to citizen scientists. They try to figure out what a plant is, they learn something about how climate and water affect plants, and when they upload their data, the Web site offers some educational opportunities.

Reviewing Computational Research Results

One point Dr. Stodden made on the first day was that the thing people hate to do more than anything else is review someone else's software. It therefore is unclear how many people would review software that someone else has written. There may be some ways to do it, however. If a piece of software has 500 or 1,000 users, that is a fairly good indication that it is a usable and useful program, and the review could then be much less difficult.

This issue comes up frequently in discussions about reproducibility and other aspects of working science. A code review as part of the peer-review process before publication is difficult, because it is an enormous amount of work that reviewers do not feel equipped to do. There are middle-ground approaches that might be implemented, however. Nature published two articles in October 2010 on software in science that discussed how it is used and often broken, that it is generally not reviewed, that it should be open, and so on. Mark Gerstein, a bioinformatics professor at Yale, and Victoria Stodden wrote a letter to Nature commenting on this and saying that the scientific community needed to move toward code review. They suggested a broader adoption of what some journals have done, which is to have an associate editor for reproducibility who will look at the code and try to reproduce the results. Then it would not be such a burden imposed on reviewers. That is another possible way forward: having code submitted, made open, and then incorporated into this aspect of review. Nature did not want to publish that letter, but it is on Dr. Stodden's blog, and the idea is being discussed further.

Another discussant noted that she has reviewed code plus data for journal publication, and an example of the kind of problem that comes up is "the compiler did not run because they changed something in the most recent version of Unix." Every now and then you find real code errors.
There are a number of domains that do this as a regular task. A list of approved software is not likely, but different audiences and different kinds of follow-up are definitely worthwhile.

Incentives versus Mandates

Of the 10 stories that Wired magazine listed as the biggest science breakthroughs of last year, 5 seemed to be based on the use of databases: 2 concerning GenBank and 3 concerning astronomy. The year before, there were three GenBank successes identified, but no astronomy and no earth science breakthroughs. There are two fundamental human motivations: greed and fear. Greed is working, but very slowly. This carrot-and-stick dichotomy indicates that data sharing is still something to be sold to researchers rather than a so-called "killer app." Should we be appealing to fear, telling researchers that the real problem with not sharing data is that somebody in India or China will be running rings around them?

Perhaps the most effective approach is to pick one message, whether it be to enforce the data management mandates, support tool building, or support sociological analysis of what scientists do with data, and describe why just one of these messages should be used and focus on it, because 15 seconds of attention is all we are likely to get from the research community and the decision makers.

Much of the successful work happens through bottom-up approaches, and this is a lesson we should learn. Identifying and rewarding the people who are already doing this is a

strong option. Focusing on what can be done from the bottom up to acknowledge good work and to help keep researchers doing that work can get them more visibility across the sciences, along with other benefits. The Web can be a very forgiving medium in the sense that if something wrong is fixed, it is okay as long as it is fixed fast. The scientific community generally does not have the kind of culture that gets things moving quickly, however. We do a lot of design and architecture. There is nothing wrong with that, but as this area evolves from the top down, the bottom-up efforts can be rewarded as well.

Journals also can help enforce reporting guidelines if there are standard metadata on which you can get a community to agree. In this case, journals can tell authors that anybody who submits a certain kind of experiment has to include certain information about that experiment. Journals are effective in applying pressure to authors, because publishing is one of the authors' motivating factors.

Regarding prizes, a couple of years ago something like Kiva19 was proposed for science: a type of micro-loan that researchers could apply for, say, $50,000 from the National Science Foundation (NSF), to try something new. Perhaps it would be $100,000 or $200,000, depending on the scope. To determine how scientists should meet the data management plan requirements, we could have a contest for a couple of years where people could try different approaches. If a technology has the right branding, then other people will discover it more easily. Referring to the "app store" example, the app store has applications that get recommended by Apple, and suddenly millions of people download them. Another option is for NSF or some other entity to hold a competition with small grants to find approaches that seem to work as widely applicable solutions. The prize would be a brand that says something like "NSF Approved."
Another similar suggestion focused on some sort of sandbox where people can experiment with different approaches. The people who would take on these kinds of projects would not take on all the ones listed earlier in this summary, but they would have various missions. Some individual investigators might accept a 5 percent tax so that technologies for data management systems could be developed.

There is already a group of people who are reusing data and making good discoveries. We have seen some policies started, such as the call for data management plans, which are helping to get people who produce data to think about managing and sharing them. One of the best incremental steps on the production side right now, other than letting some of those mandates play out and trying to extend them to other funding agencies, will be to continue to encourage people to contribute to collective databases. There are many mechanisms for doing that, such as journals requiring registry numbers. We actually know a good deal about how to implement collective databases successfully.

Another option that can be pursued is to enable innovators who are doing productive work. This can be done in different ways. One option would be to highlight the researchers

19 Kiva is a nonprofit organization with a mission to connect people through lending to alleviate poverty. Leveraging the Internet and a worldwide network of microfinance institutions, Kiva lets individuals lend as little as $25 to help create opportunity around the world. See http://www.kiva.org/about.

who are responsible for breakthroughs, as suggested earlier. Another option would be to establish a prize for the most innovative piece of data reuse of the year--not for the people who supplied the data, but for the person who had the bright idea, went after the data, and did something extraordinary with them. Yet another option is to work with people who are producing results by reusing other people's data and get them more data. If the theory is that "more data wins," let us try, in some very focused areas, to further enable the people who are most active.

The NSF data management plan can be seen as the first iteration of a network plan. That is, from the top-down perspective, the way to break through the discipline silos is to require people within each silo to say how they are going to use their funding to be good network citizens. In what way are they going to be responsible stewards of the research funds with the network in mind? Tell us how you are thinking about the network in your plan to use these resources. Will you make your database available so that it becomes a networked resource, and will you annotate it? The point is that if government agencies start requiring people to plan for using the networks, that could lead to a powerful shift in the way people think about it themselves.

If a checklist of the properties of a good network scientist citizen were compiled, not every scientist would check off all the boxes, because science is heterogeneous. If scientists checked a certain number of those boxes, however, then they would tend to move toward those goals. That would be a documentable approach. When a scientist submits a proposal for funding, he or she can point to a previous project and say, "I was a good network citizen based on this." For example, one of the boxes could be about reproducibility. If a researcher's field is not amenable to reproducibility because of the size of its datasets, that box might not apply, but it would apply to someone else.
There would be some basket of these properties that would determine how good a network citizen you are. That would be something measurable and doable.

The Department of Defense (DOD) has gone completely net-centric. It talks about everything in terms of being net-centric. It might be interesting to look at what the DOD is doing and consider how civilian science agencies might think about net-centricity. Keep in mind, however, that the DOD is very command-and-control oriented, so it can do these things top down.

A study could be conducted to look at the impact of the NIH public access policy and ask whether it should be extended to other federal research agencies. Other agencies also could follow NIH's lead in requiring that those who submit grant proposals include in their bibliographies not only lists of their papers but also lists of the places where that information is publicly available. When a scientist reports on the results of prior research, the scientist could list where those data are publicly available.

The National Library of Medicine (NLM) has developed many process metrics. These include, for example, determining the percentage of the documents produced with NIH grants that are being deposited in PubMed Central. That number has increased substantially.

At this juncture, rather than investing heavily in making data available, maybe the time is ripe to put some resources into the people who are actively trying to get the data to do some valuable scientific work. Maybe that would be a good strategy for a year or two, and maybe that would be a good thematic response.

Developing the Supporting Infrastructure

Both the NIH's National Center for Research Resources (subsequently abolished) and the Chinese government have had strategic plans for research that include infrastructure development and funding. Since there is a lot of talk about funding the enabling tools for research, one option could be a similar kind of mechanism in the U.S. government for general science, not just for biomedical science.

This discussion has identified a problem, or barrier, in the development of infrastructure that cuts across science domains. This could be investigated further at a higher level. In fact, some previous reports have noted it, and the President's Council of Advisors on Science and Technology (PCAST) report on the Federal Networking and Information Technology Research and Development (NITRD)20 Program also looked at this. This infrastructure is hard to develop, however, because it depends in part on a large research group that can support it. The community would have to determine how best to promote and cite such infrastructure to help solve some of these problems. A top-down management approach is one option, but it could also be good to think about doing it through bottom-up scientific innovation. Whatever approach is taken requires an advocate to make it happen, and it cannot be done very easily.

Technology Transfer Mechanisms

The role of the university technology transfer office in all of these processes remains a question. The typical instantiation is oriented toward the promotion of commercial interests and licensing. What is the relationship between the technology transfer office at the institutional level and the funders themselves? Could that relationship be a bridge toward addressing some of these unfunded areas?
If there is a way to follow up on this idea of a public arm of the technology transfer office, that could be a way to disseminate more broadly the technology, information, and datasets that have commercial potential, and also to do other things, such as resolve ownership issues. If academic datasets were shared more broadly, we would see certain patterns evolve, and those could become templates for how those issues--in both a legal and a citation sense--could be sorted out. The technology transfer offices could be one venue in which to build partnerships to encourage this, at least at the institutional level.

Would there be a possibility of organizing some projects? Among the universities and other entities, could a few demonstration projects be organized to show how we envision this being done? They would encounter various difficulties, but in the end they could show how it can be done.

If a project cannot get funding because it is too "applied," could the NSF have a technology transfer operation? The idea would be not to fund things that could be commercialized, but rather to fund things that would be used to advance scientific research. If there were a GenBank-like division at NSF, for example, it could encourage the application of the results of research that NSF funds, although not necessarily all the way through to commercialization. In some cases, such as drug discovery mechanisms, it could lead to commercialization, but in other cases, such as discovering stars in the sky, probably not.

20 Available at http://www.nitrd.gov/.

If NSF did have a technology transfer concept, how would it be different from a university's technology transfer function? The NSF does have the interdisciplinary Office of Cyberinfrastructure, which is supporting the development of new software systems, but it tends to be a much higher-level activity at the moment. One of the challenges is determining whether this could be done at a specific domain level, or whether we would want to do it more generally.

In addition, a 5 percent tax on research budgets is an option that could be explored further. It is not clear how many people, particularly the program officers at funding agencies, would endorse a 5 percent tax on funding, but they might. Perhaps the agreement could be that paying the tax would come with certain benefits, such as access to some repository. That is, researchers could have the carrot and the stick at the same time.

Another model within government science agencies could be to collect technologies that might be broadly used from various directorates or projects and make them available in one place, or at least have pointers to all of them. This is already being done to some extent, but informally. Technologies are created using public research funding, and if they are of value, they may be put on the Web to let others know about them. There is nothing very systematic, however, that gives NSF or other agencies credit for having funded something that resulted in a technology or a tool that can be broadly used. So one question would be: Is there a mechanism within the Office of Cyberinfrastructure that could be used to make these tools or technologies available?

There are other government entities that are not explicitly funding agencies--NITRD is a good example--that have the mission of supporting information technology (IT) research and development and promoting the use of that IT for national priorities, including science. NITRD could be a good ally in this area.
The NITRD director is very interested in data issues and data-intensive science, but the organization has no money. All it has is the right to convene groups of experts, mostly in IT hardware, but it could be beneficial for NITRD to be involved. The recent PCAST report discussed above makes the point that the issue of infrastructure needs to be moved up to a level where it is not individual agencies competing with individual scientists' budgets, but rather a level where somebody is looking at this infrastructure as a national priority for innovation and science. There is now much more discussion at the top levels, and the research agency managers are beginning to see the value of it.

It is harder to do than it sounds, however. At the University of California, for example, anything that costs more than $500 has been considered capital equipment for many years. This is not rational, because the university has enormous amounts of space. It has junkyards of technologies that cost more than $500, because anything over that amount has to be on the capital equipment inventory, and it costs more to dispose of such items than to store them. The reason for such a ridiculously low limit is that anything considered capital equipment does not get taxed for overhead on grants. The principal investigators have wanted to keep the capital equipment threshold as low as possible, because it benefits them in their grants. It sounds preposterous, but California still has a $500 threshold for capital equipment, and it has been this way for some 15 years.

Publicly funded research organizations are doing projects that overlap or are similar to each other, whether in medicine, climate change, or other areas. It would seem that they could repackage those projects to demonstrate the multi-disciplinary needs and

benefits of such work. Such projects sometimes also have international partners, in Brazil, Australia, and Europe, for example. That would allow them to demonstrate something similar on an international scale.

The Virtual Acquisition Office science advisory committee made the mistake of asking the members of the committee to come up with test projects. Some people were very opposed to limiting something like that to a small number of investigators. A regular (small) grant competition could specify that it is closer to infrastructure than research. This would have to be made extremely clear to the reviewers, because otherwise they will not understand. Grant competitions have become a grandiose vision with millions of dollars to do research. That was not what was originally intended.

Cultural Change

Another important feature of this data-intensive phenomenon is a general sense that we ought to change practices and change culture so that the new and better ways are accommodated along with the existing ways of doing things. How do we change culture by changing practices? How do you change culture? How do you change practices? It is an easy thing to say, but a very hard thing to do. It takes a lot of time and effort, and often a lot of pressure, because people's interests at the local level are not served by changing practices to which they have become accustomed. You can also turn it around to make it "change practices by changing culture."

The idea that voluntary kinds of encouragement are enough to stimulate open access has been found not to work very well. That is why we have had to resort to mandates, whether through the journals or legislation or funding mechanisms, to change both culture and practice. At the same time, so much has been added to the list of mandates that accompany proposals. We have to promote minorities, help K-12 education, and so on.
Of course, those who submit proposals claim that they are totally behind these mandates and support them, but frequently they do not do anything to advance these goals. One of the things that was proposed to the NSF leadership was letting grantees pick which goals they are going to affect and then incorporating that in the proposal. These interest groups, however, have worked for a long time to get themselves onto that mandate list. The fact that not much happens from being on the mandate list is not as important to them as being on the list itself. There is a kind of culture of incumbency just around being listed. It does not necessarily have anything to do with meaningful change.

We can push hard on this changing-practice, changing-culture approach, because it is hard to do. What can the agencies do to help change practices and cultures? Probably no amount of studies will result in changing practices. Adding something to the mandate list may not change practice either, because the mandate list is already so long that it is just an exercise.

The steps involved in influencing culture are not interchangeable, however. The way to change culture is by changing practice, and this example of NSF requiring data management plans is a good first step, although it is a very tentative step, and there is still a long way to go. Everyone now is holding workshops on how to write the correct data management plan. Maybe in 3 years there will be many more people thinking that this is a good thing to do, because they will have been made to do it and they will have started seeing the benefits of doing so. Thus, there could be some value in recommending specific things for changing practices.

One answer to this question is the old adage "you get what you measure." If we are not measuring the impact of the data plan, we are not tracking it, and we are not enforcing it, then nobody cares. Suppose, for example, researchers were asked how many minorities they supported on their last NSF grant. If the number was one-third of what was promised, then the researchers could get one-third of the requested money on the new grant. You would suddenly have many more people meeting those mandates. It is easy to talk about measurement and enforcement. It is harder to make that work. We may be coming to a point where we need to start thinking about what is going to work.

The NSF seems to be trapped in a situation where it is supposed to be getting ideas from the community about where to go with the science. Although there may be many people interested in this issue, the population of people doing science and being funded by NSF is much larger, so the view being expressed by those sufficiently interested in any particular issue is a minority view within that sourcing process. How do you change this? Where are the pressure points and the leverage that will work, given the limited resources available?

Another discussant commented on the socio-cultural attitudes that we may wish to change. Reflecting on the tradition of science, it is important to celebrate that tradition and use it as a basis to build upon. Understanding traditions is very effective in introducing change to different cultures. In working with different cultures, you discover that you need to take different approaches. If, for example, we are talking about the transmission of HIV in a culture that practices polygamy, it is irrelevant to talk about being faithful. Therefore, if you know the important traditions to select, you can build upon them and influence that change more rapidly.
Role of Libraries

In setting up a national information-organizing center that oversees and manages standards and vocabularies, it is important to remember that when data and information are created, they may have to be maintained over long periods of time and adapted as things change. Therefore, there is a certain amount of infrastructure that has to be supported from the top down. It is not possible to get universities to do this in a distributed way.

Libraries are giving much of the advice for data management plans right now. Research libraries have seized upon this as an opportunity to do outreach to the science departments at their universities and to help them figure out how to prepare the data management plans that researchers are now being required to produce. Hence, research librarians are allies on the ground at universities, and they reach out to the scientists to help disseminate cultural ideas and strategies that funders and policy makers would also like to see implemented.

One of the big questions currently circulating in the library community is what to do with print collections now that we are moving much more toward digital materials. There are a number of issues that have never been encountered before. We have been building up print collections for the past several hundred years, so we know a lot about that. We do not know much about reducing the collections, however. It is hard. It is more of an ecological approach than telling others what they should do.

A problem is that it is very easy for scientists to say that issues or concerns expressed by one community have no corresponding value in their own community. There are many ingrained practices, attitudes, and traditions that are candidates for change, but can we deal with them

sensibly? Research librarians may be able to provide assistance to scientists that would not otherwise be available, particularly at the college and university level.

One thing that may be helpful is delineating the roles of the different kinds of experts who are involved in the data management process. The funding agencies are like the data investors. The data producers are interested in their data, but not necessarily in how the data will be reused. A problem in some cases is that research funders want data producers to let other people reuse their data, but it is extra work for the producers to deposit their data someplace and to annotate the data in a way that makes them useful. Then there are the data reusers or analyzers, as well as those who are the intermediaries between the data producer and the data reuser, acting as the data translator, manager, or marketer. This latter type of person helps the data producers--who really do not want to do this kind of work or think they do not have the tools available to do it--become aware of the practices or tools that exist for doing this kind of work, and also manages the data and trains the producers in how to do that more efficiently. Research librarians do that kind of work in genomics, for example.

In addition to the NIH National Center for Research Resources that was mentioned earlier, there is another institute at NIH that deals entirely with information science and information resources: the National Library of Medicine. There is no equivalent library for the basic sciences; although NLM has done some work expanding its focus into other related areas, it does not cover the geosciences and other disciplines. In the Agricultural Research Service (ARS) there is the National Agricultural Library, and the ARS includes librarians in research project planning and data management.

Without trust among the parties, however, we cannot open up a legitimate dialogue, innovative direction, and mutual support.
How to build trust really depends on those who are involved in a research project, but without trust the parties may fight against each other.

Finally, one of the options that was suggested for changing and improving data management practices is to reduce the costs. Better leadership can direct people to what those practices need to be. One research agency has a saying concerning data management: "Should we, could we, and will we?" There is no argument about whether data should be managed, but the scientists want to know how to do that. Clear answers can foster effective leadership in different disciplines for preserving data. Some sort of basic minimum guidelines would be helpful, if not necessarily just tools or methods. Technologies have changed what we can do, but they have not changed what we do. Appropriate leadership can help us define what we do.

Communicating and Influencing Understanding of Scientific Knowledge Discovery in Open Networked Environments

It appears that there are at least four communities that are listening and thinking about these issues. One is government leaders and elites, who believe that there is something big here and are looking for guidance on how to think about it.

Then there are the people in the institutions--librarians, research leaders, and others--who are thinking about these issues from a different perspective. For example, many state universities are now focusing on what the flagship institutions of higher education should do in relation to the others. The flagship schools are ostensibly the research institutions, but the distinction that some schools are for research and some are for teaching is somewhat dated. One big question here is what the flagships should do for the others. Should they do anything?

This will have an effect on access. A lot of science and scholarship can be done with secondary sources. As those secondary sources get better, the quality of scientific scholarship will improve, and it can become ever more distributed. That is a productivity issue as well as a participation issue. Therefore, this is a second audience: the combination of locally based research entities and the elites of those locales.

A third audience is researchers who are thinking seriously about the approaching change and wondering how to respond. This is important. It is not like it was 40 years ago, when the more senior scientists were trained. This is different. What do we do about this? Perhaps that is a channel to approach some of the younger people, because they are coming up through the apprenticeship structure.

The final, fourth group that is usually not addressed comprises the creators and innovators--the Facebook and Google types and the people who are doing research in hacking centers. In short, good things can happen in this self-motivating way. Moreover, they do not have to ask for federal money or congressional approval to do it.

There is a big difference between just communicating and the process of scientific knowledge discovery. Some of what is being discussed is the need for better tools and applications to do science over Facebook and other new platforms. According to some studies, data practices are driven very much by the kinds of tools researchers use. By talking to people, we can know immediately the kinds of data practices and the kinds of research that they are doing, just by knowing the tools that they are using.

One discussant noted that he just got on a big grant with people at the Mayo Clinic whom he has never met in person. He met them only through Twitter, but he is now on their grant. His best paper recommendations come from people he follows on Twitter. Many of his own paper citations can be attributed to tweeting about his work.
He also gets invited to conferences through people who follow him on Twitter. This is an example of the value of scientific networking.

Another discussant noted that the topics of creating scientific discoveries and communicating knowledge are being conflated to some extent in this discussion. He uses software and writes code, but does so for a science that can never be practiced on a social network. For example, there are foresters who work with trees that take 50 or 60 years to grow. There is much science that cannot be done socially. At the same time, the communication of science, or even enabling the process of science, can be done with these tools. He is in user groups, and when he asks for help, people respond. He is able to read up quickly and does not have to go through a hierarchy to acquire that knowledge. He can modify the tools he uses, because they are all open source. Therefore, separating the function of creating knowledge from that of communicating knowledge is important. There may be some overlap, but they are not the same thing at all. It is helpful to discriminate between the two and to ask what we gain from doing so.

It is also useful to note that things that we think of as nailed down, such as mathematical proofs, are not settled until the mathematicians say so. Mathematics is essentially a social process. There are some results, such as the proof of Fermat's last theorem, on which not all mathematicians agree at this point. A preponderance of mathematicians think it has been proved, but there are some holdouts who are very strongly opposed to that position.

Scholarship is communication. We should not assume that geology is going to be done on the Web. Geology is probably going to be done with rocks and other things gathered onsite. But the phenomenon of geology--what people agree about or come to hold as geological fact--is going to be socially constructed over time, and that is a communication process. The idea that science and communication are two different things is not valid.
