3. How Might Open Online Knowledge Discovery Advance the Progress of Science? Technological Factors

Session Chair: Hal Abelson, MIT
Interoperability, Standards, and Linked Data
James Hendler, Rensselaer Polytechnic Institute

I will focus on the technological challenge of online knowledge discovery. I will first address interoperability issues and then talk about some promising developments. Scientists can learn a lot about data sharing just from observing what is happening in the outside world. There are many data initiatives and developments outside academia to which we should be paying more attention.

Expressive ontologies are not a scalable solution. They are necessary in certain domains, but they do not solve the interoperability problem widely. They allow a scientist to build his or her silo better, and sometimes they even let the silo get a little wider, but they are not good at breaking down silos unless the scientist can start linking ontologies, in which case he or she has to deal with the ontology interoperability issue, as well as the costs and difficulty of building ontologies and their brittleness. I am known for the slogan "A little semantics goes a long way." I have said this so often that about 10 years ago people at the Defense Advanced Research Projects Agency (DARPA) started calling it the Hendler hypothesis.

We are used to thinking that a major science problem is searching for information in papers, and we have forgotten that we also have to find and use the data underlying the science. A traditional scientific view of the world might be, "I get my data, I do my work, and then I want to share it." But sharing should be part of how we do science, and other issues such as visualization, archiving, and integration should be in that life cycle too. I will talk about these issues, and then I will discuss the kinds of technologies that are starting to mature in this area. These technologies have not yet solved the problems of science and, in fact, largely have not been applied to the problems of science.

Scientists do use extremely large databases, and many of these data are crucial to society. On the other hand, we scientists tend to be fairly pejorative about something like Facebook, because Facebook is not being used to solve the problems of the world. I wish science could get the number of minutes per day that Facebook gets, which is roughly equivalent to the entire workforce of a very large company for a year. We are also not used to thinking about Facebook as confronting a data challenge per se, but it collects, according to the published figures, 25 terabytes of logged data per day. Those sound like the numbers for large science data collections. Facebook's valuation is estimated to be well over $33 billion, which is the size of the entire National Institutes of Health budget. Not surprisingly, Facebook is able to deal with some of the data issues that are discussed here. We need to look at what these companies are doing and determine whether we could take advantage of some of their approaches.

I do not have similar numbers for Google, because they have not been published. The last good estimate I could find was from 2008, which was 20 petabytes of data processed per day, but that was 3 years ago. That also does not include the exchange or storage of YouTube data, for which I cannot find good numbers either. Google's valuation in 2011 is about $190 billion, which is roughly a third of the Department of Defense budget. Not the research budget--the entire budget. Therefore, if we think that this kind of work is expensive, yes, it is, but there are other people doing it, and they are investing large amounts of money. It is not surprising that they have been focusing on big data problems in a way different from scientists, and they have been able to explore some areas that are very difficult for us to explore. If a researcher wanted to buy 10,000 computers for data mining purposes, it would be difficult to do because of the lack of resources.

Several speakers talked about semantics in the context of annotation and finding related work in the research literature. It is an important problem, but I do not think it is the key unaddressed problem in science. In fact, I would contend that we have spent a huge amount of money on that problem, much of it on trying to reinvent things that already exist in a better form from open sources in the real world. Most companies today that want to work with natural language processing start with open-source software. For example, there is a company that is taking everything from the Twitter stream, running it through a number of natural language tools, and doing some categorization, visualization, and other similar work. They did not build any of their language processing software. I am also told that Watson, the IBM Jeopardy computer, had a large team of programmers, but that the basic language technology it used was mostly off the shelf, and mostly statistical.

Semantic MEDLINE is a project that the Department of Health and Human Services has invested in. It does a fairly good job but does not understand the literature sufficiently to find exactly what we want. No one can do that yet for any kind of literature, and Google is working on that problem as well. Hence, I do not see the point of yet another program to do that for yet another subfield or against yet another domain. Instead, we need to start thinking about how to put these kinds of technologies together.

The Web is a hard technology to monitor and track because it moves very fast. It has been moving very fast for a long time, however, and, as scientists, we need to start taking advantage of it much more than reinventing it. There are a few different tools and models on the Web that are worth thinking about. One is to move away from purely relational models, that is, away from assuming that the only way to share data is to have a single common model of the data. In other words, to put data in a database or to create a data warehouse, we need to have a model, and that is done for a particular kind of query efficiency. Sometimes, however, that efficiency is not the most important factor. Google had to move away from traditional databases to deal with the 20 petabytes of data generated every day. Nobody has a data warehouse that does that, so Google has been inventing many useful techniques; its published file system and storage designs came out of that work. We cannot easily get the details of how Google does it, but we can get published papers from Google people that will be useful in learning how to do similar work. The NoSQL community is a fairly large and growing movement of people who are saying that when you are dealing with large volumes of data there is a need for something different from the traditional data warehouse.
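To make that contrast concrete, here is a minimal sketch, in Python, of storing records without first agreeing on a single common schema and then querying whatever attributes each record happens to have. It is not any particular NoSQL product, and all the records and field names are invented:

    import json
    import sqlite3

    # One generic table of documents instead of a schema negotiated up front.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE datasets (id TEXT PRIMARY KEY, doc TEXT)")

    # Three records from three hypothetical sources; note that they do not
    # share a schema -- each describes its data with its own fields.
    records = [
        {"id": "epa-001", "topic": "air quality", "state": "AK", "year": 2010},
        {"id": "noaa-042", "subject": "sea ice", "region": "Arctic"},
        {"id": "fed-fx", "topic": "exchange rates", "coverage": "weekly"},
    ]
    for r in records:
        conn.execute("INSERT INTO datasets VALUES (?, ?)",
                     (r["id"], json.dumps(r)))

    # Querying without a shared model: scan the documents and match on
    # whatever fields the caller asks about.
    def find(**criteria):
        for (doc,) in conn.execute("SELECT doc FROM datasets"):
            d = json.loads(doc)
            if all(d.get(k) == v for k, v in criteria.items()):
                yield d

    print(list(find(topic="air quality")))  # -> the EPA record only

The price is query efficiency: a scan like this is far slower than a tuned relational schema. The benefit is the one described above: new sources can be added without renegotiating a common model.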

I had a discussion with some server experts who said, "Just give us the schemas, and we can do this work." I pointed out that we cannot always get the schemas, that some of these datasets were not even built using database technologies with schemas, and that in some cases someone else owns the schema and will not share it. We can still obtain the data, however, either through an Application Programming Interface (API) or through a data-sharing Web site.

Although Facebook's and Google's big data solutions are proprietary, the economics of applications, cell phones, and other new technologies are pushing toward much more interoperability and exchangeability for the data, which is mandating some sort of a semantics approach. Consider, for example, the "like" buttons on Facebook. Facebook basically wants its tools and content to be everywhere, not just in Facebook, which means it has to start thinking more about how to get those data, who will get them, whether it wants the data to be shared or not, and what formats and technologies to use. As a result, there are many technologies behind that "like" button. Similarly, there are a number of approaches that Google employs to find better meanings for its advertising technologies. Google recognizes that at some point it will not be able to do the work by itself, because there is a long-tailed distribution on the Web. This means that it has to move the work to where Webmasters and other people will be able to develop the semantic representations for their domains. That is happening with all the search engines, and the big issue now in that area is simple metadata and lightweight semantics.

The notion of the complex ontology is getting replaced by the notion of fairly simple descriptive terms--that is, I can probably describe the data in my dataset with 10 or 12 different terms that will be enough for me to put in a federated catalog for people to decide whether they want to read my data dictionary. The idea here is that if the data are going to be in a 200-page, carefully constructed metadata format with required field-specific types, and they have to be compliant with the standards of the Internet Engineering Task Force and the International Organization for Standardization (ISO), the vocabularies get harder to work with. It is an engineering problem that is very similar to the integration problem, and what many people are realizing is that you can arrange the data hierarchically and get good results for small investments.

Here is an example. Many governments are putting raw data on the Web now. They are not just putting visualizations of the data online; they are putting the datasets themselves. From a political point of view, the two biggest motivations are enhanced transparency and the chance to inspire innovation. Promoting innovation can proceed in two different ways. First, the governments are hoping that some people will figure out how to make useful and innovative tools using the data. More important, especially for local governments, is that they have been paying many people to build Web sites and interactive applications that they can no longer afford to build. For instance, a government agency may be able to pay one contractor to build a big system for a problem that is of high priority, but it may have 57 more priorities that it cannot afford to fund. The government needs someone to develop those applications, so if it makes the data available, at least some of those applications get built by other people.

The development of such applications is out of the agency's control, but if the work is getting done at no cost or for some small amount of money, the agency can start planning strategically for other priorities. The United States and the United Kingdom are the two leading providers in terms of organizing their data. The United States has about 300,000 datasets, most of them geodata.
If we take out the geodata, there are probably 20,000 to 30,000 datasets that have meaningful raw data. The United Kingdom has fewer, but probably the best, in the sense that, for example, you can get the street crime data for the entire country for a number of years. They are releasing very high-quality data down to a local level. Many countries, including ours, are releasing scientifically interesting data, but you would have to work to find them. Combining these data with other data can be more important than just looking at them. The entities releasing data include countries, nongovernmental organizations, cities, and other groups. For example, one of the Indian tribes in New York State has released much of its casino data.

There are groups all over the world working with these data. My group is one, and the Massachusetts Institute of Technology has a group working jointly with Southampton University, mostly on U.K. data. Many of these groups are in academia, but there are many nonacademic groups doing this kind of work.

One of my suggestions is to think about data applications. You may have a large database, and if parts of it can be made available through an application, an interface, or an API, people would be using the data in a sharable--and often an attributable--way. Getting back to science issues such as attribution and credit, we have one of our government applications in the iTunes store. I know exactly how many people downloaded it yesterday, and how many people are still using an old version.

When some students and I were in China, we discovered that China was releasing much of its economic data, so we took China's gross domestic product (GDP) data and the U.S. GDP data to do a comparison. To do that comparison, we needed to find the exchange rate. Luckily, there is a dataset from the Federal Reserve that has all of the exchange rates for the U.S. dollar weekly for the past 50 years or so. We got those data, we mashed them together, and we got a chart that looks like Figure 3-1.

FIGURE 3-1 GDP Chart. SOURCE: James Hendler
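As a rough illustration of what that mashup involves, here is a sketch in Python. The numbers are toy stand-ins and the column names are invented; the real datasets would need their own cleaning:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Toy stand-ins for the three real datasets (all values illustrative).
    us = pd.DataFrame({"year": [2008, 2009, 2010],
                       "us_gdp_usd": [14.7e12, 14.4e12, 15.0e12]})
    cn = pd.DataFrame({"year": [2008, 2009, 2010],
                       "cn_gdp_cny": [31.9e12, 34.9e12, 40.2e12]})
    fx = pd.DataFrame({"year": [2008, 2009, 2010],
                       "cny_per_usd": [6.95, 6.83, 6.77]})

    # Join on year, then convert yuan to dollars so the series are comparable.
    merged = cn.merge(fx, on="year").merge(us, on="year")
    merged["cn_gdp_usd"] = merged["cn_gdp_cny"] / merged["cny_per_usd"]

    merged.plot(x="year", y=["us_gdp_usd", "cn_gdp_usd"])
    plt.ylabel("GDP, current U.S. dollars")
    plt.show()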

We also divided the GDP by the population (data we found in Wikipedia), so that we could click a button to go between total GDP and per capita GDP. The model was built in less than 8 hours, including the conversion of the data, the Web interface, and the visualization. That is a game-changer. It would completely change the way we work if we could get this down to a few minutes. When my group started working with this kind of technology several years ago, it would take us weeks to do this kind of work. Part of the improvement is that technology that was immature has matured, part is that the tools are now available commercially, and another part is that visualization technology is now freely available on the Web. Building visualizations is labor intensive, so by moving to simple visualizations we can bring them into the work much earlier. We are also able to link government data to social media.

One interesting question that has not really been part of the discussion in the scientific community is how to find data. For example, we noticed that most of the U.S. government data were about the states, and the metadata would represent the data as being about the 50 states, but very few of the sets actually covered all the states. Some states were missing. Some databases included American Samoa, Guam, Puerto Rico, et cetera, and the District of Columbia (which is not officially a state). In this case, there is a very loose definition of a state as opposed to a territory, and no one has much problem with that. But if a researcher wants to find datasets about Alaska, it is a mistake to assume that all of the datasets that say they cover the states will actually have Alaska data. The other problem is that we cannot search for the keyword "Alaska" within the datasets. If there is a column that represents the states, it may be labeled Alaska, it may be labeled AK, it may be called state two, or it may be called S17:14B/X. The government has terabytes of data, so how do we find the data that are for Alaska, for example? We cannot just call for building a data warehouse and rationalizing the process, because these datasets are being released by different people in different agencies in different ways. Thus, metadata becomes very important. Simple and easy-to-collect metadata can allow building faceted browsers and similar tools. It is a real research challenge, however, to determine the kinds of metadata for real scientific data that are powerful enough to allow useful searches.

With the government data, one problem is that all of the foreign databases are in their own languages, and it is hard for those of us who are English speakers to figure out what is in a Chinese dataset, unless we hire a Chinese student. We cannot just use Google Translate and expect the result to be academically useful.

People are starting to consider integrating text search and data search. Figure 3-2 shows an application that one of my staff members built jointly with Elsevier's new SciVerse; it was featured on the U.S. Data.gov site. When we do a keyword search in the scientific literature, the application also looks for datasets that might be related to that same term. We are using very lightweight ontologies that mostly just use the keyword, and we are working on making it better. We think it would be good if, when someone is looking for papers, the data in or about each paper were available, along with whatever might be useful in the world's data repositories.
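A minimal sketch of what the "10 or 12 simple terms" could look like as a catalog record, and how such records answer the Alaska question, follows. Every identifier and value here is invented for illustration; this is not the Data.gov schema:

    # Lightweight catalog records: a dozen honest fields per dataset, kept
    # separate from the dataset's own cryptic column names.
    catalog = [
        {
            "id": "doe-energy-use-2009",
            "title": "State Energy Consumption, 2009",
            "publisher": "Department of Energy",
            "topic": "energy",
            "coverage": ["AK", "AL", "AZ"],  # states actually present
            "state_column": "S17:14B/X",     # the dataset's own column name
            "format": "CSV",
            "updated": "2011-03-01",
        },
        {
            "id": "doj-street-crime-2010",
            "title": "Street Crime Incidents, 2010",
            "publisher": "Department of Justice",
            "topic": "crime",
            "coverage": ["AL", "AZ"],        # says "states," but no Alaska
            "state_column": "state two",
            "format": "CSV",
            "updated": "2011-01-15",
        },
    ]

    def facet(records, **filters):
        """Return ids of records matching every filter; list-valued
        fields match on membership, scalar fields on equality."""
        hits = []
        for rec in records:
            if all((v in rec.get(k, [])) if isinstance(rec.get(k), list)
                   else rec.get(k) == v
                   for k, v in filters.items()):
                hits.append(rec["id"])
        return hits

    print(facet(catalog, coverage="AK"))  # -> ['doe-energy-use-2009']

A faceted browser built on records like these can say which datasets actually cover Alaska without anyone having to guess what S17:14B/X means.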

FIGURE 3-2 Search in SciVerse Hub on Climate Change. SOURCE: James Hendler

When we start thinking about data integration, using data, or searching for data, one thing that happens in the discovery space is that the data become part of the hypothesis formation. That means that looking at, visualizing, and exchanging the data cannot be an expensive add-on to a project. It has to become a key part of the standard scientific workflow, with appropriate tools to reduce the costs.

What is promising in this area? We have been looking at linked open data issues outside of science, and some of that promise has been explored (mostly within the ontology area). Genomics and astronomy are two fields where we have actually seen semantic Web technologies deployed, but many other science communities are still mostly thinking about their own data holdings, not about being part of a much larger data universe. It is interesting that when we talk to the Environmental Protection Agency, the National Oceanic and Atmospheric Administration, or similar agencies, they say that they are providing data to many communities, so they cannot easily use the ontologies of a particular one. Hence, we need to learn how to map between these approaches. How do I know what is in a large data store? How do I know what is in a virtual observatory? I need services, metadata, and APIs. I also need to know the rules for using the data.

Other issues we need to think about are related to policies. If I take someone's data from a paper, mix them with someone else's data, and republish them, those people would probably like to get credit for the data being reused. Or if I do some work on your data that you do not like, you might want to rectify it. How do we do that?

National Technological Needs and Issues
Deborah Crawford, Drexel University

I am currently the vice provost for research at Drexel University, but I used to work at the National Science Foundation (NSF). I spent almost 20 years at NSF and was significantly involved in the agency's cyberinfrastructure activities. My experience within NSF and now at Drexel has provided me with an interesting perspective on what data-intensive science means in a research university that aspires to be more research intensive. The main message of my talk is that there have been many advances in data-intensive science in some fields, but we have massive amounts of work still to do if data-intensive science is to realize its full potential across all of the disciplines.

There have been many reports issued over the past decade that address the importance of data-intensive science and the role of information technology in advancing science. An important one is the Atkins report that Dr. Hey referred to earlier. In that report, Dan Atkins of the University of Michigan and his committee speak of revolutionizing the conduct of science and engineering research and education through cyberinfrastructure, and they examine democratizing science and engineering. Those were tremendously powerful statements in 2003, and they stimulated a great deal of excitement within the scientific community. Since then, there have been a number of reports that have specifically addressed data-intensive science, several of which were released in the past couple of years. This is a quote from a joint workshop between the NSF and the U.K. Joint Information Systems Committee that was held in 2007: "The widespread availability of digital content creates opportunities for new kinds of research and scholarship that are qualitatively different from traditional ways of using academic publications and research data."

What we have heard so far in this workshop has come from those I would describe as the visionaries and the trailblazers in science and engineering, representing communities that were highly motivated to take advantage of information technology in order to advance their fields. I want to focus now on the long tail of science--the researchers in those fields where the immediate advantages of information technology and collaboration are not so readily apparent. There is tremendous opportunity in those communities, but we do not quite know how to realize those opportunities.

I would like to provide a snapshot of some surveys of researchers working in different communities. The main message is that computer-mediated knowledge discovery practices vary widely among scientific communities and among colleges and universities. This is certainly true in the United States and the United Kingdom, for example. Some colleges and universities know how to take advantage of their digital technologies and capabilities, while others are simply far behind the curve.

There are three fundamental issues that communities or universities must address: (1) What kinds of data and information are made open, at what stage in the research process, and
how? (2) To which groups of people are the data and information made available, and on what terms or conditions? (3) Who develops and who has access to the tools and training to leverage the power of this discovery modality? These are fundamental questions and are key to the realization of data-intensive science.

I am going to talk about two case studies, one conducted in the United Kingdom by the Digital Curation Centre, and one conducted within Yale University. Both point to some key features we need to address. The study done by the Digital Curation Centre was called "Open to All." Its purpose was to understand how the principles of digitally enabled openness are translated into practice in a range of disciplines. These are the kinds of questions we need to be asking ourselves to determine the actions that we need to take to make sure that this modality of science is accessible and advantageous to everyone.

In this study, the authors characterized a research life-cycle model and asked different communities how they were using digital technologies within the context of that life cycle to further their science. They surveyed groups in six communities: chemistry, astronomy, image bioinformatics, clinical neuroimaging, language technology, and epidemiology. The range of responses within those different communities was fascinating to see. Surprisingly, chemistry was the trailblazing community, at least among the individuals surveyed for this study. The chemists were using social networking tools, Open Notebook Science, wikis, and all the modalities of digital technologies to collaborate, and to collect, analyze, and publish their data. Everything was quite seamless, from a community that I had not anticipated would be one of the trailblazers. It was interesting to hear from the clinical neuroimaging group. They were so skeptical of the value of data-intensive science and open data sharing that in this study they refused to even disclose their identities as individuals. We therefore went from one extreme to the other, and we saw the range of practices and values within different scientific communities.

From their conversations with these communities, the group that conducted these case studies came up with a list of the perceived benefits of open data-intensive science. It included improving the efficiency of research and not duplicating previous efforts, sharing findings, sharing best practices, and increasing the quality of research and scholarly rigor. This last one was especially true for the members of the chemistry community who were surveyed. They saw a great opportunity in making available in blogs the day-to-day data that they collected--not just raw data, but derived data. They found tremendous benefits in making that open to the community to provide more scholarly rigor. Among the other perceived benefits were enhancing the visibility of and access to science, enabling researchers to ask new questions, and enhancing collaboration and community building, which all of the groups surveyed agreed was a benefit. There was also a perceived benefit of increasing the economic and social impact of research, about which the report was ambivalent. Although the economic and social impacts are each treated as a perceived benefit, the sense was that the real value could not actually be measured. So there was a question: Can we create economic value by making our data--essentially our intellectual property--much more openly accessible?

One of the perceived impediments was the lack of evidence of benefits. Researchers who felt this way were not motivated to make their data openly available. Other impediments were the lack of clear incentives to change and the values in academe not being consistent with open data sharing. For tenure and promotion and the drive to publish, the perceptions were that the only way to publish is in the open literature and that it is not in a researcher's interest to share data, because someone else may use the data to advance further than the researcher who shared them. Competitiveness is a big issue. The conflict with the culture of independence and competition is absolutely related to the lack of clear incentives to change. Other impediments were inadequate skills, a lack of time, and insufficient access to resources. Another big concern was how to train both the scientists who are practicing today and the scientists and engineers of the future. Researchers were also worried about how long it took them to prepare their data for open access, about quality, and about ethical, legal, and other restrictions to access. Those were big issues, especially in the life sciences community.

The report's recommendations called for policies and practices for managing and sharing data. Communities are desperate for guidance on these issues. What have the trailblazers learned that can be shared and applied more broadly? Contributions to the research infrastructure should be valued more--that is something we have heard often. Training and professional development should be provided, and there should be an increased awareness of the potential of open business models. That is related to the attitude among researchers that their data are their intellectual property and that they want to derive some value from them; thus, they feel that if they make their data openly available, they are giving that value away. Assessment and quality assurance mechanisms should also be developed.

The Yale University study was done by the Yale Research Data Taskforce, convened by an organization within Yale called the Office of Digital Assets and Infrastructure, which was established to accrue over time the value of the digital assets that result from research and scholarly activity within the institution. The office conducted this study to determine the requirements and components of a coherent technical infrastructure, to provide service definitions, and to recommend comprehensive policies to support the life-cycle management of research data at Yale. Given the discussion earlier about research libraries, this is interesting. Here is Yale University doing a survey that includes both the librarians and the information technology enterprise staff at the university to determine what their faculty base most needs for managing digital data. Very much like the other survey, this one found that data-sharing practices vary widely among the disciplines surveyed. The researcher has the most at stake when determining what the data-sharing practices are.

Yale is going to create an institutional repository to secure, store, and share large volumes of data. There are certainly some institutional pioneers in this area, such as the Massachusetts Institute of Technology and Indiana University, and many institutions that I would characterize as "slow followers." A major concern is how an institution can afford to create and maintain infrastructure like this. Yale University understands that it needs to develop and deliver research data curation services and tools to all of its interested parties, not only in science and engineering, but also in the humanities and in the arts. Recognizing the importance of persistent access and

addressed in the other two branches of the scientific method, the deductive and the empirical branches. In the deductive branch, in mathematics and logic, people have worked out what it means to have a proof, to really communicate the thinking behind the conclusions that are being published. Similarly, in the empirical sciences, standards have been developed, such as controlled experiments and the machinery of hypothesis testing, along with conventions for how they are communicated. In a methods section, there is a very clear way that these results are to be written for publication, designed so that other people can reproduce the thinking and the results themselves. In computational science, we are now struggling with this issue: how to communicate the innovations happening there in a way that meets the standards established in the deductive and empirical branches. My approach is to understand these issues in terms of the reproducibility of computational science results, and this gives me the imperative to share the code and the data. We have seen many interesting examples of how reuse can be facilitated and what happens when someone actually shares open data. This gives rise to a host of issues about ontologies, standards, and metadata, all framed within the context of reproducibility. The reason that we put the data and the code online is to make sure that the results can be verified and reproduced.

Here is an example. In 2007, a series of clinical trials were started at Duke University that have since been terminated, but it took a few years to terminate them. They were based on computational science results in personalized medicine that had been published in prestigious journals, such as Nature Medicine. Researchers at the M.D. Anderson Cancer Center tried to replicate the computational work that had gone into the underlying science and uncovered serious flaws undetected by peer review. The study was plagued by a myriad of issues, such as flipped drug labels in the dataset and errors of alignment between observations in treatment and control groups--errors that are simple to make. The clinical trials were canceled in late 2010, after patients had been assigned to treatment and control groups and had been given drugs. One of the principal investigators resigned from Duke at the end of 2010. The point is that we have to assume that errors are everywhere in science, and our focus should be on how we address and root out those errors.

There was a discussion earlier of how the data deluge is a larger manifestation of issues that have been seen before. Also, Dr. Hey gave the example of Brahe and Kepler and how what must have been a data deluge in their context ended up engendering significant scientific discoveries. In that sense, there is nothing fundamentally new here. We are doing the same thing in a methodological sense as we have always done, but we are doing it on a much larger scale. The scope of the questions that we are addressing has changed, and in that sense the nature of the research has changed. Dr. Hey told us that we need more skills to address this concern. Dr. Friend then said we need verifiability, a point that I have also attempted to make in this talk. This means that the infrastructure and incentives need to adapt to the research reality even though the process of science is not changing in any fundamental way. That in turn means that it will be important to develop tools for reproducibility and collaboration. For example, some presenters also talked about provenance- and workflow-tracking tools and openness in the publication of discoveries.
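To give a sense of what such provenance tracking captures, here is a minimal sketch in Python that records each analysis step's name, parameters, and a hash of its output, in call order. The workflow tools discussed below do this far more thoroughly; none of this is their actual API, and the analysis steps are invented:

    import functools
    import hashlib
    import json

    trail = []  # the recorded workflow, in execution order

    def tracked(fn):
        """Decorator: log the call and a fingerprint of its result."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            digest = hashlib.sha256(repr(result).encode()).hexdigest()[:12]
            trail.append({"step": fn.__name__,
                          "args": repr(args),
                          "kwargs": repr(kwargs),
                          "result_sha256": digest})
            return result
        return wrapper

    @tracked
    def normalize(values, scale=1.0):
        return [v * scale for v in values]

    @tracked
    def mean(values):
        return sum(values) / len(values)

    mean(normalize([2.0, 4.0, 6.0], scale=0.5))
    print(json.dumps(trail, indent=2))  # a shareable record of the run

Published alongside a figure, a trail like this lets a reader see the order of steps and the parameter settings that produced it.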

In short, there are many different efforts that are needed. The solutions in this area will not come down as an executive order after which all scientists are suddenly open with their data and code. The problems are much too granular, so the solutions must emerge from within the communities and within the different research and subject areas.

VisTrails is a scientific workflow management system. It was developed by a team from the University of Utah that is now moving to New York University. It tracks the order in which scientists call their functions and the parameter settings they use when generating the results. These workflows can then be shared as part of the publication, and they can be regenerated as necessary. VisTrails also promotes collaboration.

Some recent work by David Donoho and Matan Gavish was presented for the first time in a symposium on reproducibility and computational science held at the American Association for the Advancement of Science annual meeting. They have developed a system for automatically storing results in the cloud in a verifiable way as they are generated, creating an identifier that is associated with each of the published results. For example, if a paper contains a figure, we would be able to click on it, see how it was done, and reproduce the results, because the means to do so are automatically in the cloud.

Another useful tool is colwiz, a name derived from "collective wisdom." Its purpose is to manage the research process for scientists. One of the major problems with reproducibility is that, unless a scientist is using these specialized tools, there is no automatic mechanism for researchers to save their steps as they advance. After they have finished an experiment and written the paper, they may find that going back and reproducing the experiment is even more painful than going through it the first time. Thus, tools like colwiz could help both with communicating scientific discovery and with reproducibility.

These issues are all related to the production of scientific data and results. There are also some aspects related to publication and peer review. It is a lot of work to ask reviewers, who are already overworked, to review code or data and incorporate them into the review process. Maybe we will get there one day for computational work, but certainly not now. The journals Molecular Systems Biology and The EMBO Journal are publishing the correspondence between the reviewers and the authors, anonymously but openly, along with the actual published results. This is one approach being tried to be more open and transparent. One of the reasons for this practice is that there is a great inequality in the power of the reviewers and the authors. In particular, reviewers can ask for additional experiments and more exploration of data from the person who is trying to get the paper published. Particularly at prestigious journals, reviewers have a lot of power. These journals are trying to balance this power by allowing readers to see the dialogue that took place between reviewers, editors, and authors before publication. Furthermore, many journals now have supplemental materials sections in which datasets and code can be housed and made available for experimentation. The sections are not reviewed and so far have had varying amounts of success.

In the February 11, 2011, issue of Science, there was an editorial emphasizing that, in addition to data, Science is now requiring that code be available for published computational results. That is extremely forward thinking on the part of Science. Science is folding this into its existing policy, which is that if someone contacts an author after publication and asks for the data--and now the code--the author must make it available.

An approach that other journals have taken is to employ an associate editor for reproducibility. Biostatistics and the Biometrical Journal do that. The associate editor for reproducibility will regenerate and verify the results, and if the editor can produce the same results that are in the manuscript, the journal will kitemark the published article with a "D" for data or a "C" for code. In this case the journal can advertise that readers can have confidence in the results, because they have been independently verified.

There are also new journals that are trying to address the lack of credit authors get for releasing and putting effort into code or data or for attaching metadata. They are trying to address the issue of incentives, and their focus is on open code and open data. Open Research Computation offers data notes, for example, and BioMed Central has research notes. PubMed Central and open access are older concepts embedded within the infrastructure of the National Institutes of Health (NIH). But why does it stop with NIH? Could we have a "Pub Central" for all areas and allow people to deposit their publications when they publish, similar to the NIH policy?

There has been much discussion about the peer-reviewed data management plans now required by the National Science Foundation (NSF). This is a very important step even though it has been called an unfunded mandate. It is an important experiment in that it creates the possibility of gathering information about how much data management will cost and how data should be managed. It is, in a way, a survey of researchers on how they are conceiving of these issues. Maybe the costs are less than NSF worries about, or maybe they are more, but at least we will be able to get a sense of this. One report that I was involved with, along with John King, was for the NSF Office of Cyberinfrastructure on virtual communities. We advocated reproducibility as part of the way forward for the collaborative, very high-powered computing problems that we are addressing.

As part of the fallout from the problem I mentioned with the Duke University clinical trials, the Institute of Medicine convened a committee to review omics-based tests for predicting patient outcomes in clinical trials. "Omics" refers to genomics, proteomics, and so on. The committee is chaired by Gil Omenn, and part of its mandate concerns issues of replicability and repeatability and how the articles published in Nature Medicine that led to the clinical trials could have gotten through with what were, in hindsight, such glaring errors.

There seems to be a hesitation on the part of some funding agencies to fund the software development or infrastructure necessary to address reproducibility and many of the other issues that we have discussed so far. Let me give an example from an e-mail that was sent to a mailing list. The author was describing how his group develops open-source software for research. He is a prominent researcher, well known and very influential. His group develops open-source tools, and it is very difficult for him to get funding even when applying to NIH programs that are targeted at promoting software development and maintenance. In particular, he said, "My group developed an open-source tool written in Java.
We started with microarrays and extended the tool to other data. There were 25,000 downloads of this tool last year. So we submitted a grant proposal. Two reviewers loved it. The third one did not because he or she felt it was not innovative enough. We proposed three releases per year, mapped out the methods we would add, included user surveys, user support, and instructional workshops. We had 100 letters of support." The following is from the negative review: "This is not science. This is software development. This should be done by a company." We can see that there seems to be a bifurcation in understanding the role that software plays in the development of science.

One idea for the funders of research might be: Why not fund a few projects to be fully reproducible to see what barriers they run into? Is the problem that they do not have repositories where they can deposit their data? Is the problem that they encounter issues of maintaining the software? Where are the problems? Let us do a few experiments to see the stumbling blocks that they encounter.

On the subject of citation and contributions, as we incorporate databases and code, we need to think about how to reward these contributions. Many contributions to databases now are very small, and there are databases where 25,000 people have contributed annotations. Hence, there are questions about how to cite and how to reward this work. What is the relative worth of, for example, a typical article with a scientific idea versus software development versus maintaining the databases? Typically the last two have not been well rewarded, and our discussion here calls that practice into question.

I will end with a figure from a survey I did of the machine-learning community (Figure 3-11). These are the barriers that people said they faced most acutely when sharing code and data.

FIGURE 3-11 Barriers to data and code sharing in computational science. SOURCE: Victoria Stodden

A Government Perspective
Walter L. Warnick, Department of Energy

I am the director of the Office of Scientific and Technical Information, which manages many of the scientific and technical information operations of the Department of Energy (DOE). Our goal is to maximize the value of the research and development (R&D) results coming from the department. To put this into perspective, each government agency has an organization that manages information. Those organizations have gotten together and formed an interagency group called CENDI. Bonnie Carroll is the executive secretary of CENDI. The National Science Foundation is represented in CENDI by Phil Bogden, and the National Institutes of Health (NIH) is represented by Jerry Sheehan. I represent the DOE, and all the other agencies have representatives too, including the Library of Congress, the Government Printing Office, the Department of Agriculture, the Department of the Interior, and practically every other organization that has a big R&D program. Ninety-eight percent of the government's R&D is represented, and several organizations that do not have R&D programs are also in CENDI.

The results of the U.S. government R&D investment, which amounts to about $140 billion a year, are shared via different formats. There is the text format, which includes journals, e-prints, technical reports, conference proceedings, and more. There are nontext data, which include numeric datasets, visualizations such as geographic information systems, nonnumeric data such as genome sequences, and much more. And there are other formats, including video and audio. Each format is in a state of change and presents its own set of challenges.

With journal articles, for example, the big issue within the government is public open access versus proprietary access. I think we all agree that the gold standard for text-based scientific and technical information is the peer-reviewed journal article. Many highly respected journals are available only by proprietary subscription access. NIH has pioneered a transition to make journal literature publicly accessible. The effort has attracted a lot of attention, and it is still getting a lot of attention within the government. Principal investigators are asked to submit journal-accepted manuscripts for posting in the NIH public-access tool, PubMed Central. What is significant now is that the America COMPETES Reauthorization Act, which became law in December 2010, calls upon the Office of Science and Technology Policy (OSTP) to initiate a process to determine whether public access to journal literature sponsored by the government should be expanded. For now, NIH is the only agency that requires public access to journal articles. The DOE and other agencies are already empowered by law to adopt that requirement, but we do not have to adopt it, and as a matter of practice we do not. I think that OSTP will soon formulate a committee, which the COMPETES Act calls for, to get input from stakeholders, consider the issues, and develop some recommendations.

Beyond the journal literature, there are gray literature issues, and integrating gray literature with journal literature is important. Gray literature includes technical reports, conference proceedings, and e-prints. It is typically open access, but not all of it is. All of the DOE's
classified research is reported in gray literature, and of course that is closely held, but I am talking here about the open-access part of DOE's offerings. Gray literature is often organized into single-purpose databases. For example, in the DOE we have something called the Information Bridge, which has 300,000 technical reports produced by DOE researchers from 1991 to the present. It is all full-text searchable. The average report is 60 pages long, so they are fairly detailed. Many other agencies have similar resources. Other databases handle e-prints, conference proceedings, patents, and more.

DOE pioneered novel and inexpensive ways to make multiple databases act as if they are an integrated whole, one example of which is Science.gov. Science.gov posts the publications of all the agencies that are in CENDI, so it is a very large virtual collection of databases. It is all searchable, and a single query brings back results ranked by relevance. It has won awards for being easy to use and is an example of transparency in government. The DOE's largest virtual collection that integrates gray literature, and some journal literature, is WorldwideScience.org, which is a computer-mediated scientific knowledge discovery tool. WorldwideScience.org makes the knowledge published by, or on behalf of, the governments of 74 countries, including the United States, all searchable by a single query. The amount of content is huge, about 400 million pages. A user can enter a query in any one of nine languages, and the system will search each database in the language of that database and then bring back the list of hits in the language requested. It is new, and it is growing very rapidly under the supervision of the international WorldWideScience Alliance.

We also manage nontext sources--the numeric data, genome sequences, and so forth. The main questions are to what extent such sources should be made accessible and for how long. Some agencies are grappling with the issue by formulating data policies. Some agencies require principal investigators to propose data-management plans. The America COMPETES Reauthorization Act calls upon OSTP to initiate a process to encourage agencies to consider these issues, in the same part of the act that I mentioned previously in connection with journal literature. Hence, committees stemming from the act are handling both text items and nontext items.

Everything we do entails cost. Whether it is just sharing information or doing analysis of the information, there is always a cost. Here is a way that I talk to my funding sources about cost. Imagine a graph whose vertical axis is the pace of scientific discovery and whose horizontal axis is the percentage of funding for sharing of scientific knowledge (see Figure 3-12). I think everybody agrees that science advances only if knowledge is shared. Therefore, let us postulate an imaginary situation in which no one shared any knowledge. That would take us to the origin of this graph, because there would be no funds expended for the sharing, but there would be no real advance in science either. At the other end of the x-axis, at the 100 percent mark, if we spent all our money sharing and none of it doing bench research, the pace of scientific discovery would soon draw down close to zero too.

FIGURE 3-12 Knowledge investment curve. SOURCE: Walter Warnick

We have two data points on this graph, both lying on the x-axis. In between the two points there is a curve, which we call the Knowledge Investment Curve. We do not know the shape of the curve, but it is likely to have a maximum. The point is that decision makers affect the pace of discovery when they determine the fraction of R&D funding dedicated to sharing. That is the argument I make to my funders. The point of the Knowledge Investment Curve is to make funders realize that while they can dedicate funds to buy computers, hire more researchers, or build a new facility, they should also weigh that investment against the benefits of getting information out better, sharing it with more people, making searches more precise, and doing the kinds of analyses and data mining we have talked about here today. It would require a significant research program to calculate the shape of this curve and where the optimum is, but we know that such an optimum exists somewhere. Furthermore, the optimum is not the minimum, which is another message I give to the funding sources. If we think that the purpose of an information organization is to be a repository where information goes in, seldom comes back out, and seldom sees the light of day, that is not the optimal expenditure for sharing.
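One way to make the argument concrete is to posit an illustrative functional form for the curve. The shape below is an assumption for illustration only; the talk does not specify one:

    % Let f be the fraction of R&D funding spent on sharing, and D(f) the
    % pace of discovery. Any form that is zero at both endpoints with a
    % single interior maximum captures the argument, for example
    \[
      D(f) = A\, f^{\alpha} (1 - f)^{\beta},
      \qquad 0 \le f \le 1, \quad A, \alpha, \beta > 0 .
    \]
    % Setting D'(f) = 0 gives the optimum
    \[
      f^{*} = \frac{\alpha}{\alpha + \beta},
    \]
    % which is strictly between 0 and 1: the optimal expenditure on sharing
    % is neither nothing nor everything.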

Discussion

DISCUSSANT: My question to Dr. Warnick and Dr. Stodden is related to the knowledge investment curve presented by Dr. Warnick. In a sense, the peak of that graph is the amount of funding spent on infrastructure that enables research versus the amount of funding spent on research. Where do you think that peak should fall for any example you choose?

DR. WARNICK: We would probably reach a consensus that there is not enough money being spent on, for example, sharing of knowledge and analysis, and the development of some of the tools that were discussed earlier. As to how much below the optimum it is, let me give an example. Consider the National Library of Medicine, an excellent example of an information management organization. The funding for that organization exceeds the funding for all the other information organizations combined. I am not suggesting that the National Library of Medicine is overfunded, but I will say that the others are underfunded.

DR. STODDEN: I absolutely agree. I think it has become much harder than it traditionally was to share our work. As our science becomes more data intensive and involves code, those two areas add extra sharing expenses that are not wholly covered in funder budgets. The science itself, through technology, has changed, and that is making ripple effects through our funding agencies, which have not quite caught up yet, I think.

DISCUSSANT: Dr. Hey talked about the new data-intensive work as a new paradigm, yet so much of the discussion has been about things like reproducibility in the traditional sense, but with code and data added, or sharing in the traditional sense, but with code and data added. So where does thinking about new paradigms or new ways of doing things come in? Where do you see that falling in the spectrum of who is responsible and how that affects this whole question?

DR. WARNICK: Even the sharing part is being subjected to new paradigms. Just to make a point, the infrastructure behind Science.gov and WorldwideScience.org is something that we see very rarely in everyday experience on the Web, and it was developed and matured as a result of some government investment. To take your point directly about the other kinds of analysis that were discussed earlier, however, since the government is providing $140 billion of funding for research, the analysis that gets more mileage out of that research ought to be funded by the government too. Of course, the government always welcomes the idea that the private sector can take and utilize these results, but it must be the funding agencies that do the initial work. I think the reason people have not heard the Department of Energy mentioned in this discussion before now is that we have been doing very little in this regard compared to the National Institutes of Health (NIH) and the National Science Foundation, and that ought to change.

DR. STODDEN: My perspective within academia is that processes are changing for hiring, promotion, and committee work. The scientists who are clued into these issues of reproducibility, open code, and open data seem to be a little more interested than people who are carrying on in a different paradigm. Academia is conservative in the sense that things change slowly. Therefore, it takes time for these practices to percolate through.

DR. BERMAN: The issue of the gap really intrigues and concerns me, because in our real lives, if we want to find a restaurant, we can go to Yelp. We can find which restaurants are near us, what is available, who likes them, and so on. There is an application for that. We can get this information on an iPhone. Consider taking all of our scientific products and putting them in that world. Is there a place where we can find scientific databases and see who liked them? Can we see who added metadata to them in a very user-friendly way? Can we access them easily? We are starting to see crossover between the academic world and the world of commercial applications. Phil Bourne has a project called SciVee, where we can show how to do different kinds of experiments or give a talk about data, YouTube-style. We can imagine using many of these commercial types of applications and technologies in academia. Some interesting questions arise: What does it mean if we have a data collection and many people like it? Does that mean it is a good data collection? Does that mean it is an economically sustainable data collection? We should not utilize one set of tools for our academic work and another set of tools for applications in the real world without bridging the gap.

MR. UHLIR: There is a proposed act in Congress, the Federal Research Public Access Act, that broadens the NIH PubMed Central grantee deposit policy to include other agencies with annual research and development budgets of $100 million or more, although I do not know if it is actually going to become law. Also, in the list of peer-review models presented earlier, there is one other model that was missing: postpublication review. It is not a traditional peer review. It is an open peer review, it is moderated, and it is ongoing. There are two kinds of this model. One is just commentary, and one is papers generated in response to a big paper. The model I am thinking of is Atmospheric Chemistry and Physics, from the European Geosciences Union in Munich. I do not know how many other journals do that, but it is yet another model for review. Even if the code and the data are not available, people can ask very pointed questions that can be answered by the researchers. That is a potentially valuable way of reviewing results.

In response to Dr. Hendler's comment, several people have made some intermediate suggestions, such as Dr. Friend's proposal of a journal of models. But the fundamental problem is that we have moved all the print journals wholesale onto the Web with hardly any change to the model. To obtain good models, one would deconstruct the scholarly communication process used in the print paradigm and reconstruct it in a way that makes sense on the Web. Instead, we are repeating everything we have done before and not really thinking about what the Web allows us to change in order to achieve greater efficiency. I think the print journal system itself is an outmoded way of communicating. I am sure there are all kinds of new paradigms, but I will leave it at that.
