2. Experiences with Developing Open Scientific Knowledge Discovery in Research and Applications

Case Studies

International Online Astronomy Research

Alberto Conti
-Space Telescope Science Institute-

I am the archive scientist at the Optical UV Archive at the Space Telescope Science Institute, which is the place that operates the Hubble Space Telescope for the National Aeronautics and Space Administration (NASA) and which is going to operate the successor of Hubble, the James Webb Space Telescope. As archive scientists, we have to do a lot of work to bring data to the users, but we focus mainly on three areas. We like to optimize the science for the community by not just storing the data that we get from users but also trying to deliver value-added data. We also try to develop and nurture innovation in space astronomy, particularly for Hubble and its successor. We are good at that because we have been doing it for 20 years. And we like to collaborate with the next generation of space telescopes as well as with ground-based telescopes, because that field is quite busy.

Figure 2-1 shows some key astronomical missions, providing an idea of the kinds of astronomy and astrophysics data that exist and are being developed.


FIGURE 2-1 Astronomy project timeline.
SOURCE: Space Telescope Science Institute



The Kepler mission, for example, is a planet-finding mission. The missions along the top of the timeline are space based, and those along the bottom are ground based; some of them are going to produce very large amounts of data. At my institution, we collaborate directly or indirectly with most of these missions and deal in some way with the data they produce. How are we going to manage the large amounts of data from these missions?

Over the past 25 years, astronomy has been changing quite radically. Astronomers have been very good at building large telescopes to collect more data. We have been much better at building larger mirrors. But we have been a hundred times better than that at building detectors that allow us to collect extremely large amounts of data. Because the detectors roughly follow Moore's Law, our data collection doubles every year or so, and this raises many important issues.

We realize that we are not alone in this area. We know that fields from earth science and biology to economics are dealing with massive amounts of data that must be handled. I am not particularly fond of using the word "e-science" to describe this field; I prefer to speak of "data-intensive scientific discovery," because that is exactly the field we should be moving toward: we are going to be driven by data. While astronomy is similar to other fields in managing big volumes of data, it is special in one respect, not because the data are intrinsically different, but because they have no commercial value. They belong to every one of us, so they are an ideal test bed for trying out complex algorithms on data of very high dimensionality. There are currently missions whose datasets have in excess of 300 dimensions, which can be very useful if a scientist wants a many-dimensional dataset to analyze.

We also have to be aware that things have been changing, for example, in the geographic information system world, where our perception of our planet has been changed by such tools as TerraServer, Microsoft's Virtual Earth, Google Earth, and Google Maps. The way we interact with our planet is now different than it was, and as we use these tools we have no concept of what is really going on underneath, because we are focusing on the discoveries. Some of us are trying to do the same thing for data in other areas.

What is this new size paradigm? Why is so much data a problem? In Figure 2-2, the red curve represents how much data we collect in our systems as a function of time. The big spike in the middle corresponds to the installation of new instruments on Hubble, for example. This is a trend, but it is not a problem. The real problem is the curve above it, which is the amount of data we serve to the community. That is a large multiple of the data we collect, so it potentially can be a problem.

FIGURE 2-2 New science paradigm for astronomy.
SOURCE: Space Telescope Science Institute

This was a concern, at least until a few years ago, because of how we used to work with the data. As a researcher with the Space Telescope Science Institute, I can use a few protocols to access very different data centers, but every time I interface with a different data center, I have to learn the jargon, the language of that data center. I have to learn all the idiosyncrasies of the data in that particular data center. After I learn all this, my only option is to download the data to my local machine and spend a long time filtering and analyzing them. In astronomy this limits us to small observations of very narrowly selected samples.

In 2001, scientists realized that this was not a good model, so the National Science Foundation (NSF) funded the National Virtual Observatory for astronomy. The goal was to lower the barrier to access across all the different data centers so that scientists did not have to worry about the complexities behind the data centers; they just had to worry about how to get data and do their science. The National Virtual Observatory was established in 2001, and for several years it was in the development phase. It moved into the operational phase with funding from NSF and NASA in 2010, by which time it was called the Virtual Astronomical Observatory.

At the same time these developments were taking place in the United States, people across the planet were building their own standards, their own protocols, and their own data centers. At some point it became evident that we could not ignore what was happening around the world, because many of the ideas being proposed to deal with this deluge of data were very worthwhile and could be integrated into the thought process that goes on inside any astronomical archive. Hence, in the middle of developing the virtual observatory, a collaboration called the International Virtual Observatory Alliance (IVOA) was begun with the goal of lowering the barrier to access at all of the data centers across the world.

Today the IVOA allows scientists to access all these different missions and to become more knowledgeable about what each mission does and how the missions differ from each other. It also has helped scientists identify where data are stored and what they are called. This is a major leap, but it has come at a cost: we have developed very ad hoc data standards for storing and sharing the data. After a long development process, we adopted protocols that turned out not to be very effective at certain times. Furthermore, after that many years of development, we do not have very effective tools, because doing the data mining is very difficult. We are still in the mode where most of the data have to be downloaded and filtered locally, which is not what we would like to do.

Although we have standards and protocols, the IVOA has come under some pressure lately because it is not changing as fast as some would like. To some people's credit, particularly Dr. Alex Szalay of Johns Hopkins University, one of the founding fathers of the virtual observatory idea, it was predicted that this transition would be chaotic, and it has been chaotic. It is very hard to integrate very large data centers across the planet into a seamless process of discovery, but some people are trying to do that. The chaotic part has come to pass. Unfortunately we do not yet have a uniformly federated and distributed regime of sources across the planet that we can access in a seamless manner, so this is part of our existing problem.

In the future, I think the IVOA will still be at the core of developing such a distributed system. Some people, like me, believe that the IVOA should have a much smaller role, limited to defining the requirements and standards for a distributed system that allows people to access and process data on a cloud-like system. The IVOA should try to build standards that are not ad hoc, but rather are based on industry standards. Metadata is another important frontier that we must explore so that we can characterize these data and mine them appropriately. The idea is that we want to enable new science and new discoveries.

Here is what I see as the challenges we are facing. We capture data very efficiently, but we need to reduce the obstacles to organizing and analyzing the data. We are also not very good at summarizing, visualizing, and curating these large amounts of data. Another issue is that most of the time the algorithms that produce the papers we see are never associated with the data. We should consider this a problem: a scientist should be able to reproduce a research result exactly once he or she finds the data. Semantic technology is key to dealing with many of these obstacles, and we have to be prepared, because it will introduce a fundamental change in the way astronomical research is conducted. Another issue is that we do not have an intensive data-mining infrastructure available to us, and if we are to build one ourselves, that will be very expensive.
Many of these efforts in astronomy are not funded at a level that would allow us to build our own. A solution for handling very large datasets is emerging, but such solutions are not yet pervasive in astronomy, which is also a problem. As for hosting, if we host our large datasets on the cloud for redundancy, for performance, and so on, that will come at a cost, and some archives might not be able to sustain that cost even if it does not seem very large.

Finally, I think we have the opportunity to do a few things. We have to be a little humble and accept that we have to ask for help from computer science, statistics, applied mathematics, and from others who can use our high-dimensional data for their purposes and apply their useful algorithms. We should leverage partnerships with those who are already doing this work at Web scale. There are companies, such as Microsoft, Google, and Amazon, among many others, that are handling and trying to solve these problems, and we should definitely partner with them. They may not have the solution for us, but they will have a big part of some solution.
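As a concrete illustration of the kind of uniform access the IVOA protocols are meant to provide, here is a minimal sketch of an IVOA Simple Cone Search request in Python. The service URL is a placeholder rather than a real archive endpoint, and the coordinates are arbitrary; the point is only that a single, standard query pattern (RA, DEC, and a search radius, returned as a VOTable) can be reused against any compliant data center.

    # Minimal sketch (not an STScI or IVOA reference implementation) of the
    # Simple Cone Search protocol: an HTTP GET with RA, DEC (decimal degrees)
    # and SR (search radius in degrees) that returns an XML VOTable of sources.
    # The endpoint URL below is a placeholder, not a real service.
    import requests

    SCS_URL = "https://example-archive.org/scs"  # hypothetical SCS endpoint

    def cone_search(ra_deg, dec_deg, radius_deg):
        """Return the raw VOTable text for sources inside the given cone."""
        params = {"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg}
        resp = requests.get(SCS_URL, params=params, timeout=60)
        resp.raise_for_status()
        return resp.text  # parse with astropy.io.votable for structured access

    if __name__ == "__main__":
        votable_xml = cone_search(ra_deg=202.48, dec_deg=47.23, radius_deg=0.1)
        print(votable_xml[:500])  # peek at the start of the returned VOTable

Because every compliant archive answers the same query pattern, the same few lines work against any registered cone search service, which is exactly the lowering of barriers described in the talk.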

Integrative Genomic Analysis

Stephen Friend
-Sage Bionetworks-

There are three points that I would like to focus on today. One concerns the current paradigm shift in medicine, and I am going to make a case for why we need better maps of diseases. The second, and probably most important, point is that it is not about what we do, but how we do it: it is all about behavior. Third, I want to discuss some aspects of building a precompetitive commons. When we are working with commercially valuable information, a number of issues arise; unlike the case with astronomy, there is serious money involved with information in the medical area.

The cost of drug discovery, which sits at the intersection between knowledge and the ability to affect people's lives, is around $1 billion to $2 billion per molecule. About $60 billion is being spent every year to make drugs, yet fewer than 30 new drugs are being approved. People are beginning to realize this lack of efficiency, however, and the good news is that there is information coming along that makes discovery easier. To put it in perspective, even after drugs are approved, their rate of effectiveness is low. This is true for cancer drugs and also for the statins used to treat high cholesterol levels. Only 40 percent of patients who take statins have their cholesterol lowered, and of those who do, less than half see any benefit in decreased occurrence of heart attacks; yet we still give these drugs and say everybody has got to be on a statin.

The problem is that when trying to figure out what is going on inside cells and to understand diseases in terms of pathways, we are using maps that are the biological equivalent of pre-Kepler models. People get very excited about understanding pathways and believe that if they could just understand the pathway of a disease, they could make sense of it. Almost every single protein has a drug-discovery effort associated with it, and all of these researchers are dreaming that they will be able to make a drug against this or that disease without understanding the context and the circuitry that underlie it.

We have gone through such transitions before. We got beyond the idea that God throws lightning bolts, and we got beyond the concept of Earth being at the center of the universe. In the next 10 to 15 years, we are going to have a change that is similarly dramatic and will be remembered for centuries. It is going to be driven by a recognition that the classical way of thinking about pathways is a narrative approach, one we favor because our minds are wired for narrative. The reason this change will happen is that technologies analogous to telescopes are developing that will allow us to look inside cells. In the last 10 or 15 years, we have gone from the first sequencing of a whole genome to the point where it will soon be cheaper to get your genome sequenced than it is to get a CT scan or a blood test. Think of that in terms of the amount of information associated with it. It will take a few hours to get your whole genome sequenced, and it will cost under $1,000. The cost of CT scans is going up, while the cost of sequencing genomes is going down.

I know of a project in China that is planning to do whole genome sequencing of everyone in a village with a population of 1 million people. They believe they will get whole genome sequences on the village population in the next 8 years. The problem we have is that most people are still looking at altered component lists, and astronomers have already gone through this: altered component lists do not equal knowledge of what is happening. While people are excited about the sequencing of genomes and understanding the specifics, the reality is that biological scientists are still generating one-dimensional slices of information, without recognizing that those altered one-dimensional slices will not add up to an understanding of altered function, which is ultimately what you need. The reason this is important is that the circuitry in the cell has, over a billion years, been hardwired to buffer any type of change, and the complexity that lets the system tolerate a modification is hardwired within one cell, within multiple cells, within organs, and so on.

A project started in 2003 at Rosetta in Seattle asked how much it would cost to build a predictive model of disease that used integrated layers of information. By integrated, I do not mean raw sequence reads leading through to a genome; I am talking about integrating different layers of information. An experiment done by Eric Schadt represented a fundamental seismic shift in how we think of this science when he asked, "Why not use every DNA variation in each of us as a perturbation experiment?" Imagine that we are each living experiments and that, if we knew all of those variations, we could take a top-down approach similar to the error-reporting function on your computer that feeds any error through to Microsoft or other places.

Imagine a way in which we could build knowledge not from looking at publications. This would be very important. I will argue that publications are a derivative, and while they are helpful, that is not where we want to be focused. We should be looking at the raw data and building those up. Imagine that we could look at putative causal genes, that we could look at disease traits, and that we could zoom across thousands of individuals, asking what happens at intermediate layers and what happens with causal traits. The central dogma in biology, that DNA produces RNA, which in turn produces proteins, allows us to build causal maps that enable us to look collectively at changes in DNA, changes in RNA, and effects on behavior, and then begin to ask, "What was responsible for doing what to what?"

There are many mathematical models, from causal inference to coexpression networks, but it does not matter which statistical tools we use to do top-down systems biology. The result is that we do such tasks as taking brain samples and expression data, building correlation matrices and connection matrices, then clustering them and building network models that identify a key driver of a particular effect, not because it was in the literature, but because it emerged from experiments across different databases.

A key paper that used this approach was published in Nature Genetics in 2005. It was looking for the key drivers of diabetes, and the idea behind the paper was to rank-order the 10 genes most likely to be responsible for going from a normal state to diabetes and from diabetes back to normal. Over the next 4 years, that work was validated by the community, which found that 90 percent of the genes on that rank-ordered list were correct. That is unheard of. Those trivial early maps did not have nearly as much power as what is coming, but they were good enough to support that type of analysis.
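For readers unfamiliar with the coexpression step described above, here is a minimal sketch in Python of the basic pattern: compute a gene-by-gene correlation matrix from an expression table and cluster it to propose candidate co-regulated modules. The random matrix stands in for real expression profiles, and the gene count, distance threshold, and linkage method are illustrative choices, not those used in the work Dr. Friend cites.

    # Toy coexpression analysis: correlation matrix plus hierarchical clustering.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    n_genes, n_samples = 50, 30
    expression = rng.normal(size=(n_genes, n_samples))  # rows: genes, columns: samples

    # Pearson correlation between every pair of genes (the coexpression matrix).
    coexpr = np.corrcoef(expression)

    # Turn correlation into a distance (1 - |r|) and cluster hierarchically.
    dist = 1.0 - np.abs(coexpr)
    condensed = dist[np.triu_indices(n_genes, k=1)]   # condensed distance vector
    tree = linkage(condensed, method="average")
    modules = fcluster(tree, t=0.8, criterion="distance")

    print("module assignment for each gene:", modules)

Real analyses of the kind described in the talk add causal inference on top of this, for example by using DNA variation to orient edges, but the correlation-and-clustering core is the same.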

With such an approach, we could look at drugs that are on the market and drugs that are in development and try to identify those that are going to be responsible for certain subtypes of effects on behavior or on cardiovascular function, or we can look for toxicity. This is the type of shift that is emerging in the medical world for understanding compounds. There are already about 80 or 90 papers that have been published in different disease areas. The laboratory of Atul Butte at Stanford is one of the 10 labs that have made an art of exploring human diseases by going into public datasets and building these types of maps.

The first take-home message is that we have been living in a world where we think in narrative descriptions of diseases: this disease is caused by this one agent. At Sage Bionetworks, we now use maps that are not hardwired narratives, but that capture statistical probabilities and the real complexity. For example, imagine that some components in a cell talk to each other and need to listen to each other to decide what physiologic module comes out of that part of the cell, which then talks to other parts of the cell. What we are finding is that key proteins aggregate into key regulatory drivers that are able to say, "This is what I need to do." This is the command language. We are not looking at the genetic language, but rather at a molecular language that determines what is going on within a cell, that says, "I have enough of this," and that interacts with cells in different parts of the body. The complexity of this approach is outstripping our ability to visualize it and work with it, but the data are there. Those key ingredients are coming. Maps of diseases will become accurate guides to who should get which drugs and which drugs can be developed. This is going to totally revolutionize the way we think of disease. Disease today is still stuck in a symptom-based approach: if you are depressed, you have depression, and you try to find a drug as if all depression were the same. The same is true for diabetes or cancer. The level of molecular analysis that we will soon be able to do is going to be important, but substantial progress will require significant resources and decades to complete.

Those maps are almost like thin peel-off slices of what is happening, and we realized that the group doing this work could not be one institution or one company. For that reason, a year and a half ago I left the job I had running the oncology division at Merck and started a not-for-profit foundation to do a proof of concept. This foundation, which has about 30 people and a budget of about $30 million, is asking how to do for genomics the virtual equivalent of what has been done in astronomy. The people who are driving it are Lee Hartwell; Hans Wigzell of the Karolinska Institute; Wang Jun, the visionary who is sequencing the million genomes in China; and Jeff Hammerbacher, who was the chief architect of Facebook. We brought Hammerbacher in because we think that this is not strictly about science. We need to ask why a young investigator would want to share his or her data, and why that investigator would allow someone else to work on them. We think the social aspects of this are very important. I will offer four quick examples.

In breast cancer research, we are working on ways to identify key drivers by using publicly available datasets and running coexpression and Bayesian networks. We are going to publish that work soon.

The second example concerns cancer drugs. It turns out that virtually all of the cancer drugs that are the standard of care, the first line of drugs that a patient gets, are usually not very effective. We are looking at cisplatin doublet therapies for ovarian cancer, Avastin for breast cancer, and Sutent for renal cell carcinoma. We realized that we should identify who is not responding to those drugs so that these patients can be given something else. It has been very difficult to get the National Institutes of Health and the Food and Drug Administration interested, so I had to go to China for funding.

The third example involves getting data from pharmaceutical companies. The richest trove of data is inside these companies, because they often collect genomic and clinical information when assembling information for a drug filing; but they lock all these data up, because they do not want to share what it was that gave them their advantage. Accordingly, we took a different approach: we asked to use the data from the comparator arm rather than the data for the investigational drug. This helped, and all of these companies have said they will give us these data. That brings up the point that Dr. Conti was making earlier, that when we have all these data, we must worry about the annotation and curation. That is what we are doing. We are saying, "You give us the data, and we will annotate, curate, and host them and make them available to anyone." There is a very big difference between accessible and usable data. We tell the pharmaceutical companies that if they make the data accessible, we will make them usable. I am sure we are going to do it in a bad way that has to be redone, but at least we will get it started.

The fourth example is a project in which we took five labs that are supercompetitive with each other and linked them together so that they could share their data, models, and tools. This type of linking opens up new opportunities, and we found that it breaks the siloed principal investigator mentality of science. This is part of that fourth dimension. Science is presently practiced like a feudal system: "I have the money, those are my postdocs, and this is my data." But imagine a world in which scientists could go anywhere they wanted to get anyone's data before they are published, in the same way that astronomers work. We have found that it is the scientists under 35 years old who know what to do. We have been making datasets, models, and tools openly available globally.

I want to describe the behavior we have run into. First, clinical genomic data are accessible but minimally usable, because there is no incentive to annotate and curate the data. Second, people who get the data think they own the data; if someone pays them to generate the data, somehow they think they own them. We are living in a hunter-gatherer community, which is preventing data sharing, and, for the most part, the principal investigators have no interest in changing that situation. This needs to change fundamentally. Just as Eisenhower in the early 1960s talked about the military-industrial complex, I think we now have a medical-industrial complex in which the goals of individual researchers to get tenure, the goals of institutions to improve their reputation, and the goals of the pharmaceutical industry to make money create a situation in which the patients are not benefiting. There is a time for intellectual property and a time for innovation, but the current situation is crippling our ability to share the data. The system is not set up to make it easy to reproduce results, as it is in astronomy. We do not hold people accountable, when they publish a paper, for explaining what they did so that it can be reproduced. We have to figure all of that out. Finally, I want to ask why we do not use what the software industry has already figured out for tracking workflows and versioning. We have to change the biologic world so that the evolution of a project is as clear as that of a software project, with a code repository, branches, and releases.
What we are doing now at Sage Bionetworks is determining the details of the repositories, collaboration, workflows, and cloud computing. We find that the world is saturated with early efforts, so it would be absurd to think that we need to do all of this ourselves. We are interested in such efforts as Taverna, Amalga, and work at Google, and we are identifying who has what so that we can stitch this together. The hardest thing is that we do not yet have the support of patients in this area. We have to bring the patients and citizens in; otherwise the data-sharing, privacy, and access issues are going to tear this approach apart.
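As an aside on the software-industry practices mentioned above, here is a minimal sketch, in Python rather than any particular workflow tool, of what release-style versioning of a dataset can look like: record a content hash and a tag for each data release so that any later analysis can state exactly which release it used. The file names and manifest format are invented for illustration and are not Sage Bionetworks' actual system.

    # Toy dataset "release" registry: content hash + tag + timestamp per release.
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def register_release(data_path, tag, manifest_path="manifest.json"):
        """Append a (file, tag, sha256, timestamp) record to a JSON manifest."""
        digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
        manifest = []
        if Path(manifest_path).exists():
            manifest = json.loads(Path(manifest_path).read_text())
        manifest.append({
            "file": str(data_path),
            "tag": tag,
            "sha256": digest,
            "registered": datetime.now(timezone.utc).isoformat(),
        })
        Path(manifest_path).write_text(json.dumps(manifest, indent=2))
        return digest

    # Example (hypothetical file name):
    # register_release("expression_matrix.tsv", tag="v1.0")

Branches and releases in a real repository add history and provenance on top of this, but even a hash-plus-tag manifest makes a result citable against a specific version of the data.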

... to make the data available through the Earth System Grid Portal, using the NetCDF-CF format. This again raises the question of how we are going to process all of the data.

The other type of ensemble work I did previously was making ensemble predictions for hurricanes and land-falling tropical cyclones. In this work, a scientist can create ensemble members by varying the initial conditions, boundary conditions, model physics, sea surface temperatures, and whatever other external forcings the prediction system allows. This particular system was generating 150 to 200 runs, considering the different permutations and combinations, and we needed an efficient end-to-end workflow system, as well as an informatics system providing seamless access to all of the data and information that was created. Therefore, we need to be thinking about geoinformatics systems that are capable of facilitating these things and delivering information.

The hurricane research community has proposed a framework for improving hurricane predictions, and, as part of its metrics, the group is attempting to reduce both average track error and intensity error by 50 percent for 1-to-5-day forecasts. Given that the intensity prediction for tropical cyclones and hurricanes has not improved by even 10 percent in the last 20 years, this would be a gigantic leap, and they are expecting to do very high-resolution predictions as well as ensemble-mode predictions to make that possible. All of the data from this endeavor will be openly available. That was mandated by the authors of the 2008 "Proposed Framework for Addressing the National Hurricane Research and Forecast Improvement Initiatives," because the group figured this is the only way it is going to realize these results and the metrics it is aiming for.

I also want to talk about data visualization as a way of enabling new insights and discoveries. I will focus on one of the software programs that Unidata has developed, the Integrated Data Viewer (IDV). I am not a geophysicist, but I know that the IDV was used by a group called Unavco in the GEON (GEOscience Network) project, which was another large NSF Information Technology Research project, like the Linked Environments for Atmospheric Discovery (LEAD) project described in the previous talk by Dr. Graves, and which got started about 8 years ago. Unavco adopted the Integrated Data Viewer as the framework for visualizing data from the solid earth community. Figure 2-24 shows the IDV integrating geophysical data; we are looking at seismic plate activity along the west coast.

FIGURE 2-24 Integrated Data Viewer (IDV).
SOURCE: Unavco

At the bottom left are stress vectors and fluids, the mantle convection assimilation, and the seismic activity underneath Mount St. Helens. The people who were doing this work told us that they could never see these aspects before this three-dimensional IDV visualization tool became available. It opened their eyes to how the Earth works beneath the surface through plate movements and mantle convection.

I also want to mention geographic information system (GIS) integration, because I work in a field that is focused on geospatial data. We need geospatially enabled geoinformatics cyberinfrastructure so that we can integrate location-based information, because many of these processes and events are very much tied to geography. We should not be thinking of GIS as an afterthought. The historical way of thinking about GIS is that atmospheric scientists would do their work and then eventually somebody else would take over, putting the information into some kind of GIS framework. Unidata is working to enable NetCDF data models to interface directly with GIS capabilities through the Open Geospatial Consortium standards for Web Coverage Service, Web Feature Service, and Web Map Service. As a result, when we have a NetCDF data cube, that data cube becomes available as a Web GIS service, and we can produce maps like the one to the right in Figure 2-24 from that capability.

I cannot conclude this talk without talking about data citation, because it has already come up in almost every presentation so far. To me this is the next frontier. I want to give an anecdotal story to demonstrate that this is not a technical challenge, but a cultural and social challenge. This particular topic came up at the American Meteorological Society. The publications commissioner, who is in charge of all scientific publications of the American Meteorological Society, took up a proposal modeled after a Royal Meteorological Society proposal, which was intended to encourage data transparency. He went into the meeting and said that the condition for publication should be that authors be required to agree to make data and software techniques available promptly to readers upon request and that the assets must be available to editors and peer reviewers, if required, at the time of review. He was basically told that his proposal was a nonstarter, because we will never get the cooperation of the authors. This shows that we still have a lot of work to do.

I also want to talk about globally networked science. The IPCC activity is the gold standard for that, but I also want to mention one other activity that NCAR and Unidata have been involved with, the THORPEX (The Observing System Research and Predictability Experiment) Interactive Grand Global Ensemble project. It is generating 500 gigabytes a day, and the archive is now up to almost half a petabyte. It brings in data from 10 different modeling centers twice a day. We need to be able to create geoinformatics for these kinds of databases. These are some of the challenges.

I want to come back to the point that was made earlier about interfacing with social networking systems. As a geoscientist, I have seen GPS, coupled with mobile sensors, revolutionize the geosciences in a major way. I think there are some incredible opportunities, not just for citizens and science, but also for providing workforce development, geospatial awareness, and education for our students at all levels. This is very important, and how we use social networking tools like Facebook to get the scientific community to provide commentaries, share information and data, and work in a collaborative way is a real opportunity and definitely a challenge.

In closing, I would like to emphasize that we live in an exciting era in which advances in computing and communication technologies, coupled with a new generation of geoinformatics, are accelerating scientific research, creating new knowledge, and leading to new discoveries at an unprecedented rate.
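To make the NetCDF-to-GIS idea above concrete, here is a minimal sketch in Python of the two access paths described: opening a CF-compliant NetCDF data cube locally with xarray, and requesting the same kind of field as a rendered map from an OGC Web Map Service using a standard GetMap query. The file name, variable name, layer name, and service URL are placeholders chosen for illustration, not references to an actual Unidata deployment.

    # Two views of one data cube: local NetCDF-CF access and an OGC WMS GetMap call.
    import requests
    import xarray as xr

    # 1. Local access: open a (hypothetical) CF-compliant NetCDF file and take a slice.
    ds = xr.open_dataset("forecast.nc")            # placeholder file name
    temps = ds["air_temperature"].isel(time=0)     # placeholder variable name
    print(temps.dims, float(temps.mean()))

    # 2. Service access: the same cube, published through a WMS endpoint, can be
    #    rendered server-side with a standard GetMap request.
    wms_params = {
        "SERVICE": "WMS", "VERSION": "1.3.0", "REQUEST": "GetMap",
        "LAYERS": "air_temperature", "STYLES": "",
        "CRS": "CRS:84", "BBOX": "-130,20,-60,55",  # lon/lat box over the United States
        "WIDTH": 800, "HEIGHT": 400, "FORMAT": "image/png",
    }
    resp = requests.get("https://example.org/wms", params=wms_params, timeout=60)
    with open("air_temperature.png", "wb") as f:
        f.write(resp.content)

The point of the second path is the one made in the talk: once the cube is exposed through the OGC standards, any GIS client that speaks WMS can consume it without knowing anything about NetCDF.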

Discussion

DISCUSSANT: Mr. Dudley, in your chart of supporters, I see some National Institutes of Health (NIH) institutes, but none from the National Science Foundation (NSF). Yet you were working at the intersection of what NIH and what NSF would fund, and I find that curious. When I was at NSF 10 years ago, I discussed with NIH colleagues the cultural differences, such as how NSF will fund soft-money research and how NSF likes proposals to be more exploratory, while NIH likes them to be more incremental. We did not come to an agreement, though. There was a division director at NSF who liked to encourage joint work between computer science and medicine. Now that you are doing this kind of work, what organizational steps would you take to help in that area?

MR. DUDLEY: Our funding strategy has always been to involve the computer scientists in whatever we are doing. One way we are trying to solve it at Stanford University is by mixing the computer scientists and the biologists more, letting the computer science experts figure out what the important problems are in medicine and how they can engage in those problems, especially the algorithmic issues. We have been struggling with this problem for a long time, and we have just been pragmatic about it and have gone where we can get the money most easily.

DISCUSSANT: If I am collaborating with the doctor, he writes the NIH proposals and hires my students. Who writes the NSF proposals?

MR. DUDLEY: At Stanford, we now see a lot of rebranding. For example, computer science people are rebranding themselves as bioinformatics experts, because they need funding. Instead of trying to deal with the problem, they rebranded themselves and went to NIH.

DISCUSSANT: As a lawyer, I sometimes think of myself as a social engineer. Most speakers have described a series of social engineering problems, such as institutional culture affecting funding and institutional culture within the disciplines, so it is a big problem. Imagine this scenario: I am at a midtier institution, and what the presentations showed me is that I am the "sharecropper." You have both the brain power and the computing power to utilize all the data, process them, and do different kinds of work with them that I will never be able to do. Therefore, my only competitive advantage against you for funding is to hide the data. How do you respond to that? If you are going to break the logjam, what are the incentives for researchers to share their data with you?

MR. DUDLEY: You are exactly right about that, and that was the point I was trying to make when I said there are computational elites. It should not be that way, however. I think that if we can get some good tools and the cloud, we can level the playing field for costs and accessibility for many researchers at midtier institutions. In fact, we wrote a grant proposal to NIH to build such a system, which was rejected.

DR. FRIEND: I think the point you made is important. I want to separate researchers into data gatherers and data analyzers. Right now the way it works is that the person who collects the data expects to have the right to take them all the way through to the insight, and then the next person should go back and do it again. That system is wrong. It is just inefficient. I think that having core data zones where people who did not generate the data can mine the data and work with them is important enough that funding institutions and journal publishers need to determine a way to make it happen. A second issue is the matter of humility. It is very hard to get the "Titans on Olympus" to share their data even among themselves.

DR. HEY: We work with some members of the HIV committee. The moment they send us their data, we send them the results, produced with some tools that we developed. There is no real reason we could not make those tools available in the cloud, and we are planning to do that. I do think the cloud can empower scientists in certain cases. We are also working with the AmeriFlux community. They have about 150 sensor towers in fields all over America. Typically a tower is owned by a principal investigator, who takes the data and publishes them with his or her graduate students and postdoctoral researchers, so it is generally one professor and two students. By putting the data into a central data server, like the SkyServer, the community is gradually realizing that there is much more value to be gained by publishing not just one tower's data but several. That is a sociocultural change, because instead of one professor and two students we now can have 10 professors and 20 students, which makes it a 30-author paper. For that particular community, this is a big change in culture, and it takes time.

DR. GOODMAN: In astronomy, the situation is very different, and those of us who like the astroinformatics approach are lucky if we have tenure, because people still do not appreciate that kind of approach. Getting the last three photons out of a Keck telescope is what gets someone the "Olympian" status these days. That is changing slowly, but it is a big battle.

MR. UHLIR: I would like to follow up with a comment and a question. The comment has to do with the issue of incentives. Many people talk about the importance of data citation and its role in giving people incentives to share data, through recognition and rewards, and in changing the sociology of how people receive and share data. The Board on Research Data and Information is about to start a 2-year project in collaboration with some other groups, in particular the international Committee on Data for Science and Technology (CODATA), on data attribution and citation. We will start with a symposium in Berkeley, California, in August 2011.

The other issue, which most of you have largely avoided, is the human infrastructure element. Obviously this symposium is about automated knowledge discovery, so you have appropriately focused on the technical infrastructure and automated knowledge tools, but there seem to be two schools of thought about how this is going to be managed and promoted in the future. One is that the technology will disintermediate many professionals in the libraries of science or in information management, because everything is going to be automated and work perfectly, and we will live in a wonderful automated information age. The other school of thought is that the deluge of data will require not only more people to manage it, but also a retraining of the information managers and a reorientation of the traditional library and information schools. So my question is, which school of thought is correct? Or perhaps they both are. Also, what do the speakers feel needs to be done to develop this kind of new expertise, because a lot of it seems to be done by people mostly in the middle of their careers? Is there a new approach to education, higher education and training at the outset of careers, that is being implemented, and if not, what would you suggest?

DR. GRAVES: There are new programs starting in data science and informatics. We are initiating one at the University of Alabama in Huntsville. I know Rensselaer Polytechnic Institute (RPI) is also starting some programs, as are several other universities, and they are attempting to be interdisciplinary.

DR. RAMAMURTHY: The creation of Unidata was intended to address the concern that you just expressed. Atmospheric science has been at the forefront of data sharing and of bringing data into the classroom, giving students at all levels, whether in an introductory atmospheric science class or in master's and Ph.D. programs, access to the same kinds of data and tools used by professionals in operational forecasting or in research. Having that facility, which has been sustained for 26 years, has made access to atmospheric science information and data systems almost a nonissue from a workforce development standpoint, because literally hundreds and thousands of students go through their labs and classrooms and use state-of-the-art software and data technologies in their daily work. The key is having that kind of sustained infrastructure that will provide the services needed for training the next generation of students and for creating a workforce that will be highly skilled in dealing with the geoinformatics of the future.

DR. CONTI: There is a need for a paradigm shift in astronomy. The vision that the younger generations bring to this area is very refreshing. They suggest that we can solve this problem from a completely different point of view. At the same time, many of them bring very little knowledge of astronomy combined with an extremely high knowledge of computer science, and they have the ability to solve problems much more quickly and therefore to ask many more questions. Therefore, in our programs we try to find a good balance between training people as astronomers and having them also focus on computer science, because the future is going to depend on how they manage data and how they can ask new questions that were not asked before. The other issue is that once we have all these data, our ability to mine them is extremely important, in the sense that we need to be able to understand how the data can be manipulated. Therefore, it is important to change the curriculum to reflect the fact that computer science has to be part of our daily life. We find that younger students are used to producing and looking at astronomy through the WorldWide Telescope. They do not know the inner workings or what it takes to produce plenty of data, but they know what they want to do. They know how to discover new pathways through the data. There is a lot of resistance among senior researchers, but perhaps we just need to learn to say, "I do not really understand how you got to this, but show me how, because we can open a new frontier."

DR. FRIEND: I want to make three points. The first concerns what we have found to be necessary to bridge the gap between the data mining and the understanding of biology. It is very rare that a single individual does both, and so we split it into two parts. We have people we call network biologists, who may be mathematical physicists, coming toward the models from one direction, and then we have systems biologists, who understand enough of the math and understand the biology. The people who can make that sort of polymath linkage are rare, and so the education and career paths are split, in a way that is similar to what happened previously when someone chose to become either a molecular biologist or an enzymologist. It is important not to expect someone to bridge the entire gap.

The second point is that Dr. Eric Schadt, who drove much of this work, and several others have just started a new journal. The journal will have a section that is just for publishing models that are not validated. Validation often comes later, and the point is to make the model available so that people can work on it.

The third issue relates to the discussion about structuring rewards. In medicine, we are convinced that the patients and citizens themselves must break the logjam between the scientists who do not want to provide the data and those who are trying to get access to large amounts of data. If there are electronically generated data from a patient, the patient can get them back. This is going to be a very fundamental change, a shift in how data flow. The development of the law and the role of the citizen are going to be very important in getting the data flowing, because academic institutions are not necessarily going to do it on their own.

DR. HEY: I have three brief comments. First, I do think that research libraries are in danger of being disintermediated. Librarians are caught in a difficult trap: subscription costs are clearly a crisis, because budgets are not increasing at the same rate. There is an interesting move toward iSchools, but the question remains: What should they train the next generation of librarians to do? Second, training will be important in changing the culture of science. We should work with graduate students in each science, because it is clear we are not going to convince the faculty. I think we should train people to have depth in at least a couple of disciplines. Universities change very slowly, however, so my worry is that we will produce bright graduate students with nowhere to go. I think that is a real issue, although some universities are doing things better. Third, there are three communities: the scientists doing the real research and the computer scientists who know the technologies, but we also need information technology experts, who can turn a research prototype into a tool that people can actually use.

DISCUSSANT: I see the success of bioinformatics and geoinformatics as disciplines, and I see communities of scientists working in multidisciplinary ways with other communities, such as statisticians and computer scientists, but we do not see that as much in astronomy. One difference is that a tipping point was reached, certainly in the medical community, where so many papers are being published that it is impossible for anybody to keep up with what is happening. You therefore need some kind of summarization or aggregation and an informatics approach to give you the high-level picture. I like the historical example at the University of California, Berkeley, where you can move from a higher-level view down to the exact specific event. Imagine doing the same thing with our databases. Maybe in astronomy we have not quite reached that tipping point. We have learned with the virtual observatory that if it is built, users will not necessarily come, and so many astronomers are not using this infrastructure. Some are using it, but not as many as we would like.

Then, as I was listening to this education discussion, I thought that perhaps the tipping point could be completely different from the one I was thinking of. It could be the education system itself, with the younger people who are familiar with social networking and computer-based tools, whether we call it Web 2.0 or Science 2.0. I would like to call it citizen science, that is, getting people in touch with the science and inspiring them to do science, knowing that they approach it with either the skills to use these tools or at least an interest in learning to work with them, because that is the way their lives are, and now they discover they can do science with it.

My university, George Mason University, is one of those places, like RPI and a few others, that are offering this sort of education in the informatics space. We have found that it is hard to draw students into the program, especially the undergraduates. Our graduate students already understand it, so the graduate program has a good number of students, but our undergraduates still do not. We talk to students and ask what they are interested in, and they say, "I like biology, but I want to do something with computers." When we finally get these students into our classes, their eyes open up and they say, "I did not know I could do science with computers." So, in a sense, the younger generation is reaching that tipping point. They are discovering that they can do what they enjoy doing, such as social networking and online experiences, and do science at the same time. What I would like to see as a goal is not only doing this kind of work at the graduate level, but moving it to the undergraduate level, as we are doing now, and then moving it even further down to high school students. I would like to teach data mining to schoolchildren. It would not be so much the algorithms of data mining as evidence-based reasoning and evidence-based discovery, basically detective stories, and in this way we could inspire a generation of new scientists just by using a data-oriented approach to the way we teach these topics.

DR. GOODMAN: Perhaps the core, or at least part, of the problem is the scientific paper as the unit of publication. I will give an example. Seven first-year graduate students in our department started their own project called Astrobites. They go to astro-ph, the arXiv preprint server for astronomy, select the articles that they like, and review them. This is now being read all over the world. It is searchable by any search engine, and readers can discover useful materials there. It is a kind of junior version of a blog called AstroBetter, which is produced worldwide by astronomers who are of the mindset that the last discussant was talking about, a kind of astroinformatics community that is evolving in a mostly 30-something world. Those are the same people who would like to take their model or dataset and make it available for people to experiment with. Right now there is an interesting concept of the data paper in astronomy. I did not even know what this was, but when we finished a very big survey about 5 or 6 years ago, I had a postdoctoral researcher who said, "We have to publish a data paper now," and I said, "What is a data paper?" It is basically a vacuous paper that says, "Here are my data. Cite me." She has gotten hundreds or thousands of citations because of that data paper. It should not be this way. Going back to Mr. Uhlir's point about people and training, there is a whole new system emerging. I do not know exactly what it looks like, but it does not involve 12-page papers as the currency of everything, and it does not even involve looking at those papers to find the data. Such papers are just one of many resources. How we develop the resources contributed by all these other new people, I do not know.

DISCUSSANT: Part of what we are seeing is the realization that a citation count has become an outmoded way of thinking, and yet it is still the H index that gets people into the academy. Take, for example, Danah Boyd. She has 50,000 Twitter followers. She is not too worried about her citation counts; she does fine with her papers. But, again, what does it mean and why does that work?

DISCUSSANT: The right solution is for those of us who are thinking about this area to engage in that conversation, to look for ways to balance the traditional measures with some new kinds of metrics. We should look for impact metrics while still sustaining papers. People should still write papers, but we need a better mix that looks across a much broader range.

DR. GOODMAN: Where does the data scientist, who may not typically get included in a publication, come in?

DR. CONTI: Most of the time we are penalized for being data scientists. Perhaps one scientist does not have as many citations as some others and does not care to publish that many data papers; as a result, the tenure-track path is harder.

DISCUSSANT: One key is that those of us who are involved in other people's tenure decisions have to rethink what we examine, rather than getting the junior faculty to rethink what they are publishing.

DR. FRIEND: Some people who have been looking at this have noted that the music industry is fairly advanced in figuring out microattribution and that there are ways to borrow from that experience, by looking at impacts without using citations. We should think about using tools that others have developed to determine impact.

DISCUSSANT: I work for the National Ecological Observatory Network, or NEON. It is funded by NSF to build a network of 60 sites across the United States to collect data on the effects of climate change and land use on ecosystems. We will be measuring something like 500-plus variables, and we will be delivering these measurements freely to the community. The measurements range from precipitation and soil pH all the way to things like productivity. We have talked about data-intensive science bringing a new era of how scientists can work, but what about data-intensive science informing our policy and how we design our cities, for example? From NEON's perspective, we will be one of many data providers, along with the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), and other institutions that will be providing similar data, but somebody else has to do the aggregation of those data into knowledge. Those people will need tools, so who is going to build the tools? And will these tools be used by federal agencies, resource managers, and policy makers to craft how they will plan the nation's natural resources?

DR. HEY: We have a couple of projects that show there is some hope. One is a project called Healthy Waterways in Brisbane. They had a big drought; now they have too much water, but the system is relevant in either case. They have a system for producing ecology reports on their water supplies, and it used to take several people several days to get the weekly report written and distributed. By putting in some serious information technology, but nothing research grade, one person can now do it in an hour or so, and that person can handle a much greater flow of information, so ecologists have a much better evidence base with which to take emergency measures, for example. The other project is an urbanization project in China, where they are building cities of 10 million people. They will need to worry about all the environmental impact issues, as well as designing the city. I think there is hope, but I agree it is a challenge. I think that more case studies are needed.

MR. DUDLEY: Arizona State University has a building called the Decision Theater. I do not know how successful it has been, but this building has liquid crystal displays around the wall, 360 degrees, and it was built specifically for policy makers to develop evidence-based views of policy making. I heard rumors that the state legislature wants to defund the entire Arizona State University, so maybe it was not too successful, but it is something to look into.

DR. RAMAMURTHY: As atmospheric scientists, geoscientists, and geoinformatics people, almost all of the data we produce have societal relevance. It is important not to stop with scientific informatics, but to try to determine a way to interface it with decision-support systems and other kinds of policy tools. In the climate change arena, everyone is focusing on mitigation and adaptation, because it is not just whether the globe is going to be two or four degrees warmer in the next 50 or 100 years, but what that means for watersheds, for agriculture, for urbanization, for coastal communities, and so forth. It is absolutely critical that the scientific information systems and geoinformatics systems can interface with those other tools and systems. We do not build those systems in my program, but we do think about interfacing with geographic information systems, such as determining what needs to be extracted from climate-change models to help people talk about what the agricultural picture is going to look like 50 years from now, whether for growing wheat or barley or something else.

DR. GOODMAN: I have a question for the people from the funding agencies. Dr. Hey said "some serious information technology, but nothing research grade." This is a huge problem. Here is a true story. At a meeting I attended yesterday, an undergraduate researcher who works with a postdoctoral fellow on a project said, "I do not know where we are going to put these data, because we need a couple of terabytes, and the computation facility has a lot of security systems, so they will not let us run the right software," and so on. I knew that my assistant had some extra money from the division and that she had bought a stack of 40 boxes of terabyte drives, so I gave a terabyte drive to each of the researchers, and I solved the problem. The point is that there was an obstacle, and there was no person whose job it was to facilitate this and to make the data usable and streamlined for the future. So my question is, How are we going to pay those people in the future? Where are we going to get those people?

DR. BERMAN: If we think about the impact of digital information (what we need to store, how long we need to store it, how we need to curate it, and how we need to respond to data management policies), the university libraries are reconceptualizing themselves and can step into that role, but they need some help. So the question is, How can we jumpstart the system, so that when universities are thinking about infrastructure costs, they consider the data bill along with the water bill, the liability insurance, and so on?
