This chapter describes presentations at the workshop that focused on why data sharing is important, especially in the context of the current state of environmental health sciences research. The relevant discussion from the workshop is summarized in the final section of this chapter.
Francesca Dominici, professor of biostatistics and senior associate dean for research at the Harvard University School of Public Health, discussed the importance of advancing science through reproducible research. She explained that a spectrum exists in data sharing: publication is at the low end, the code and data underlying the published work are in the middle, and full replication of a study is at the high end. “I think it is extremely hard to achieve a full replication in environmental health research for all kinds of reasons.” For example, she said, there is confusion around what is meant by data. Additionally, raw data may be confidential, or some investigators may have built their entire careers on the raw data. However, reproducible research is still achievable, and much progress in this area can be made by focusing on the most important data and the most important results.
Dominici made a distinction between reproducibility and replication, noting that “replication is not always strictly necessary” and often requires tremendous investments. One aim of the reproducibility standard is to fill the gap in the scientific evidence-generating process between full replication of a study and no replication. Using the tools available, investigators can reproduce the results (and verify the quality of scientific claims) when some data and analysis code from the original
research are made available. A researcher does not need access to the raw data because statistical analyses can be performed on the aggregate data and the methods and results can be shared.
Reproducible research still requires funding and energy to maintain access to the data and software, and this burden often falls on the original investigators. Dominici highlighted the need for an established system to promote reproducible research that could be sustained with support from government agencies.
One of the major factors behind the push to share environmental health data is the widespread use of such data to develop environmental regulations. For environmental policy making to be legitimate, the scientific reasoning behind a given decision—including the data supporting it—must be transparent. Debates over policy can become quite adversarial, and that adversarial nature can extend to debates over the data and their interpretation.
There is widespread concern about the trustworthiness of scientific findings, particularly in controversial fields, noted Anthony Cox, chief science officer of NextHealth Technologies. “I think that this issue of whether we should share data and how we should share [them], two different issues, should be framed in the context of where we are right now,” he said, “and where we are right now is that there is an epidemic of false-positive and irreproducible results in the peer-review literature.” To illustrate, he referenced three articles, each focused on perceived problems in science today and the errors often found in published results, that caused much hand-wringing:
- “Why Most Published Research Findings Are False” (Ioannidis, 2005),
- “Trials and Errors: Why Science Is Failing Us” (Lehrer, 2011), and
- “Beware the Creeping Cracks of Bias” (Sarewitz, 2012).
There are many reasons why different investigators might reach different conclusions from the same data, Cox said, so it is important not only to share data but also to share the reasoning that led to a particular conclusion to assess if any mistakes were made. For example, he said, “I would say that if a regulatory agency says we are going to do this because
the relevant risk is 2.2, the relevant data are the complete audit trail that ends with 2.2 and goes back through the chain of calculations ultimately to raw data: What surveys were handed out and what were the results in the survey? If you cannot audit the result, you do not know that it is correct.”
In short, he said, it is only by sharing data—and “data” in the broad sense of everything that went into a conclusion—that the ideal of scientific knowledge being testable can be achieved. It goes back to the basic idea of science that people learn in high school or perhaps kindergarten, he said: its results flow from “publicly available, reproducible, everybody-can-stand-around-and-look-at-it data.”
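Cox’s notion of a “complete audit trail” from raw data to a final risk figure can be sketched as a provenance log in which every calculation step is recorded alongside its inputs, so a reviewer can re-derive the result. The function names, survey counts, and the illustrative relative risk of 2.2 below are hypothetical, chosen only to echo his example.

```python
# A minimal sketch of an auditable chain of calculations: every step
# from raw survey responses to the final risk figure is logged, so the
# result can be traced back and re-derived. All names and numbers are
# hypothetical illustrations, not values from the workshop.

audit_trail = []

def step(name, inputs, fn):
    """Apply fn to inputs and log the step so the result is auditable."""
    result = fn(*inputs)
    audit_trail.append({"step": name, "inputs": inputs, "result": result})
    return result

# Hypothetical raw data: case counts from exposed and control surveys.
exposed_cases, exposed_total = 22, 1000
control_cases, control_total = 10, 1000

exposed_rate = step("exposed rate", (exposed_cases, exposed_total),
                    lambda c, n: c / n)
control_rate = step("control rate", (control_cases, control_total),
                    lambda c, n: c / n)
relative_risk = step("relative risk", (exposed_rate, control_rate),
                     lambda a, b: round(a / b, 1))

print(relative_risk)            # 2.2
for entry in audit_trail:       # the chain of calculations back to raw data
    print(entry["step"], "->", entry["result"])
```

The point of the sketch is that the final number is never reported alone: anyone auditing it can replay each logged step against the raw inputs.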
He said that in his field—risk analysis—there is a preference for seeing raw data. When members of professional societies engaged in risk analysis were surveyed, 69 percent said that it was very important to have access to the underlying raw data so that they could form their own conclusions. However, he added, only 36 percent reported that they typically had such access to raw data. Thus, “there is a perception of a big gap between what you would want [in order] to do a responsible job of coming to conclusions and what we currently have.”
Whatever the merits of reanalysis, there are other ways in which sharing data improves and advances the practice of science, Cox said. For example, he pointed out that if researchers know that other scientists may reanalyze their data, those original researchers are likely to do better science because of the possibility that someone will be looking over their shoulders. “Open access to key data and to models will encourage greater scrutiny,” he said. “Other people will say, ‘Let me see if I can come up with the same answer or ... suppose I make some other assumptions, would I get a very different answer? Let’s go find out.’”
More careful scrutiny will encourage more careful research, Cox said, which in turn should result in a lower prevalence of false positives. The problem of false positives is a very serious one, he said, so more careful research would be a valuable benefit. Furthermore, more careful research will be accompanied by more careful interpretation of the data. If scientists refrain from overinterpreting the data—for example, by drawing causal conclusions from associations that are not necessarily causal—it will increase the trustworthiness of published results. “We
need greater trustworthiness in published results and in actions taken based on them. That is one class of benefits.”
A related benefit, Cox said, is that the sharing of data lowers the “barriers to entry” to reanalysis or alternative analyses, which in turn encourages better and more frequent follow-ups. “We increase the value from the investments that have already been made in research,” he said. “It is very costly to assemble a good database, so let’s have a lot of people exploit that database to get as much juice out of it as possible.”
Cox went on to suggest that researchers—particularly junior researchers—need to think of sharing data with others as part of science as usual. “Suppose a graduate student says to me, ‘But, Tony, you don’t understand. I plan to base my future career on mining [these] data that I have so painstakingly collected. I am certainly not going to share [them] with other people. There goes my career.’ What I would say is, ‘You have not paid the price of entry. The price of entry to being a real scientist is you do things in the open light. You do not monopolize them.’” He noted that researchers should expect that their results will be scrutinized and their data will be scrutinized. “That should be the price of entry to doing good science.” He also suggested that concerns about advocates doing sloppy or skewed analysis to distort public policy are no reason to not share data. “I think that is part of the price of democracy. We should have people arguing about whether analysis is good. That is a perfect place to focus.” He closed by suggesting, “Sharing data upon request should be the rule, not the exception. That should be the general expectation of good science.”
The benefit of having access to tens of thousands of data sets at one time was highlighted by George Daston, a Victor Mills Society Research Fellow at Procter & Gamble, in Session 4 of the workshop. Such “megadata” make it possible to do a variety of analyses that would not be feasible with the amounts of data available from one or a few data sets.
Daston introduced the topic by discussing the recent advances in information storage and analysis. “As a life scientist,” he said, “I believe that we are entering the third great age of life science research—the first age being one of description and classification that started with Aristotle and went through Linnaeus and the second being reductionist, mechanistic biology [that led to] understanding at a deep molecular level of how things work and which started with Darwin and ended in about
2000.” In this third age, biologists are beginning to put all of that reductionist information together in an effort to understand emergent properties. “This is the systems biology era,” he said. “I think a lot of people are under the impression that the systems biology era has been driven by biotechnology—and it has been—but it has even more so been driven by the revolution in computational power.”
In short, Daston said, the rapidly increasing power of computers has the potential to revolutionize environmental health and toxicology. In particular, he listed several old questions that increasing computational power may make it possible to answer:
- What is the relationship between chemical structure and toxicity?
- What does the universe of toxicity modes of action look like?
- How can we predict adverse outcome from initial molecular events?
But answering these questions will require huge amounts of data. “One of the things that we have done over the past 10 to 12 years,” Daston said, “is to try to see what we can do on a large scale to make sense of all of the individual toxicology studies that have ever been conducted. What we have done is put together a relational database that can be searched by chemical structure or chemical substructure so that we can do things like propose an understanding of the toxicity of a new chemical based on analogy to chemicals that have already been tested and try to get a handle on the universe of modes of action or the universe of structural features of chemicals that convey some sort of hazard.” Up to now, he said, these questions have not been answerable, and “because the boundaries of this are dark and fuzzy, we have always just thought there is no alternative other than to continue to test, and test chemical by chemical, and live with the uncertainties that they have.”
Daston then displayed a slide that showed the numbers of traditional toxicology studies that he and his colleagues have assembled in their database, listed by the particular toxicity endpoint (see Box 3-1). For example, the database contains 69,000 studies of acute toxicity, 14,756 studies of carcinogenicity, and 11,923 studies of developmental and reproductive toxicity.
“You can see that there is actually a wealth of information out there,” he said. “This is both from the published literature and what we can get out of the gray literature,” with studies published in the gray literature (reports of research that has been performed by associations or companies but that
has not gone through peer review) coming from various government agencies, such as studies that used data from a database of the U.S. Environmental Protection Agency (EPA) or the REACH database operated by the European Chemicals Agency. In total, the database contains nearly 160,000 studies.
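The kind of relational database Daston describes—studies indexed by chemical and by structural feature, so that an untested chemical can be assessed by analogy (“read-across”) to tested chemicals sharing a substructure—can be sketched with a toy schema. The tables, chemicals, and features below are invented for illustration and do not reflect the actual database he presented.

```python
import sqlite3

# Toy relational database of toxicology studies, searchable by
# structural feature. Schema and data are hypothetical illustrations.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE chemicals (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE features  (chem_id INTEGER, feature TEXT);
CREATE TABLE studies   (chem_id INTEGER, endpoint TEXT, outcome TEXT);
""")
db.executemany("INSERT INTO chemicals VALUES (?, ?)",
               [(1, "chemical A"), (2, "chemical B")])
db.executemany("INSERT INTO features VALUES (?, ?)",
               [(1, "aromatic ring"), (1, "nitro group"),
                (2, "aromatic ring")])
db.executemany("INSERT INTO studies VALUES (?, ?, ?)",
               [(1, "developmental toxicity", "positive"),
                (2, "acute toxicity", "negative")])

# Read-across query: for an untested chemical with an aromatic ring,
# retrieve every study of a chemical sharing that structural feature.
rows = db.execute("""
    SELECT c.name, s.endpoint, s.outcome
    FROM features f
    JOIN chemicals c ON c.id = f.chem_id
    JOIN studies   s ON s.chem_id = f.chem_id
    WHERE f.feature = 'aromatic ring'
""").fetchall()
for name, endpoint, outcome in rows:
    print(name, endpoint, outcome)
```

In a real system the structural features would come from cheminformatics substructure matching rather than hand-entered labels, but the query pattern—join studies to chemicals through shared features—is the same.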
Various things can be done with this type of information, Daston said. For instance, it is possible to put together a decision tree to help people decide whether a new chemical with a particular chemical structure is likely to have a certain type of toxicity. Daston showed an example of such a decision tree that he and his colleagues put together using data on about 800 developmental toxicants. The decision tree was largely based on chemical structural features, he said, but it could also be based on the putative modes of action for toxicity for reproductive or developmental toxicants.
“The simplicity of this is that for a toxicity as complicated as this one, we can break this down into essentially 25 groups of chemical features,” he said. “What you can see is that with these computational approaches, we can start to get our arms around things that were not feasible before.”
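A structure-based decision tree of the sort Daston describes routes each chemical, represented as a set of structural features, to a toxicity category. The hand-built rules, feature names, and category labels below are invented for illustration; the real tree was derived empirically from data on about 800 developmental toxicants.

```python
# Hand-built sketch of a structure-based decision tree. Features,
# branch order, and labels are hypothetical illustrations only.

def classify(features):
    """Route a chemical (a set of structural-feature names) to a
    hypothetical developmental-toxicity category."""
    if "phthalate ester" in features:
        return "likely developmental toxicant"
    if "aromatic ring" in features:
        if "halogen substituent" in features:
            return "possible developmental toxicant"
        return "low concern"
    return "low concern"

print(classify({"phthalate ester", "aromatic ring"}))
print(classify({"aromatic ring", "halogen substituent"}))
print(classify({"aliphatic chain"}))
```

Grouping chemicals under roughly 25 feature-based branches, as Daston notes, keeps such a tree small enough to inspect and explain, which is part of its appeal for prioritization.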
One thing that can be done is to determine whether, for a given chemical, there is any evidence in the literature that a chemical with a similar structure and chemistry is a developmental or reproductive toxicant. This is a valuable piece of information. “But more importantly,” Daston said, “we can start to put together ontologies for modes of action of
toxicity, which [organize] the field in a way that hasn’t been done before.” This in turn will allow regulatory agencies to start to understand which of the tens of thousands of chemicals about which there is relatively little information should be prioritized for additional study and which ones are probably not of great concern.
But this sort of approach demands a great deal of data to work, as do various other promising approaches, such as toxicogenomic analyses of modes of action. “These are really data-hungry exercises,” Daston said. “For this kind of information, what we really need are data summaries,” he said. “A lot of this is generated with a common protocol. The data are analyzed in essentially the same sorts of statistical ways. Summary statistics are probably the most important.” It would be essentially impossible for him to do his work if he had to start with the raw data, he said. “Having gone through the process of compiling the database, [I can tell you] that this takes man-years worth of effort. Doing this ontology and doing this decision tree that you see here also take man-years worth of effort. That effort is only compounded if we would have had to start from raw data rather than some sort of summarized data.”
Thus, he concluded, “For these purposes—and I think as a general principle—what you want to do is start with the most refined and processed data that you can use, that you can live with, because it saves you time. Think about what the question is.”
Of course, he continued, there are indeed times when access to the raw data is important. “For other purposes, like genomic analyses, where we are still feeling our way through what the protocol should be for those things and how to analyze data, how to normalize data, in those sorts of situations, I think that raw data [are] probably also valuable to share, but probably not in exclusion to the refined data.” And, indeed, he added, it has become the norm in his field to require the raw data to be submitted to a public repository before a paper is considered for publication. This is the case for most journals in his field, he said.
“I guess my point is that sharing of data—at least from where I sit—is more valuable than retaining the data,” Daston concluded. “I realize that others have a different calculus from where they sit.”
Several points related to the importance of data sharing were raised during the workshop. Workshop speakers and participants provided individual remarks that are summarized in this section.
Benefits of Data Sharing
During the discussion after Session 2, Dominici brought up some additional benefits of data sharing. “One of the main benefits [for] reproducible research is something that we saw happening by making the National Mortality, Morbidity, and Air Pollution Study fully reproducible,” she said. Making the data behind the study accessible to other researchers elevated the level of discussion and the types of criticisms that were made about the study. Many of the smaller issues that other researchers had with the study were addressed just by making the data available on a website for those researchers to download and study, and once they had those data, those other researchers could interact with the scientists behind the study on a much more substantive basis. “That immediately elevated the scientific rigor of the discussions that we were having with other investigators,” Dominici said. “I think that that is one of the main benefits because then we are talking about important issues about whether or not these results are reproducible and valid. Clearly, we are not analyzing a two-by-two table. We are analyzing a tremendous amount of data.”
Another advantage of sharing data, Dominici said, is that it keeps researchers from having to “reinvent the wheel”—that is, from having to repeat the work that previous investigators have already done. “It seems silly to me that another investigator is going to start from what I started at 6 years ago or 10 years ago when he or she can start from where I left off,” she said. With all of the technology in place to share data among researchers, she said, such sharing should really increase the speed of scientific discovery.
In her opening and stage-setting remarks, Lynn Goldman, dean of the Milken Institute School of Public Health at George Washington University, offered another scientific benefit of sharing of data: it can be crucial in carrying out systematic reviews in a particular field. Often, she said, in a systematic review it is necessary to reanalyze at least some of the data in the studies being reviewed, and that is possible only if the original researchers make their data available to other scientists.
Gwen Collman, director of the Division of Extramural Research and Training at the National Institute of Environmental Health Sciences, described another class of benefits. “One of the benefits of sharing data broadly is that you can bring more intellectual power and people from different disciplines and different perspectives into analyzing data,” she said. “We are not only looking at data to redo or rethink or refute or
confirm somebody’s analysis, but with so [many] data being generated so quickly and in the hands of so few, there really are a lot of people who could be brought into the whole scientific enterprise. New discoveries and new directions can come from looking at data in fresh and new ways. I do not want us to lose the thought of that as we continue our discussions for the rest of the day because I think it is a very important part of data sharing.”
When Is It Valuable to Have Access to Raw Data?
During the discussion after Session 3 of the workshop, Goldman described a situation in which it is valuable to have access to the raw data. At one point in her career, she said, she was an assistant administrator at EPA in charge of the office that regulates chemicals and pesticides. The agency receives the raw data underlying studies of toxicity and reviews those data, and she found that there were always interesting things to be learned from the raw data. “What it has to do with sometimes are observations that are recorded in the margins about things that are going on with the animals that then have led to new discoveries about some toxicities,” she said.
In particular, she spoke of an experience with the fungicide vinclozolin. The Organisation for Economic Co-operation and Development (OECD) testing protocol had not been looking for antiandrogen effects from the fungicide, but the observations that were recorded indicated that it had the potential for such effects, which led to laboratory investigations that were quite different from the original OECD studies. “And none of that would have been detected if EPA scientists had not been looking at the raw data,” she said, “because, of course, the contract lab for the company was not looking for things like that. And if EPA had only had the data summaries, none of that would have been discovered.” In short, it can sometimes be quite valuable for the scientific community to have access to the raw data.
Documenting Data Sharing
Dominici noted during the discussion after Session 2 of the workshop that asking or requiring investigators to make their data—and often their software—available to other researchers puts a tremendous burden on the investigators. Therefore, she said, some sort of incentive should be provided to those investigators, particularly junior faculty members.
For example, she said, “if you publish a paper that is fully reproducible, that could be an additional comment [in the journal of publication], and it could be something that really improves and plays an important part in your career. Deans, chairs of the department, journal editors, government agencies—if they are all united, they could definitely make this possible with the right incentives, which right now are not in place.”
Bernard Lo, president and chief executive officer of The Greenwall Foundation, had some similar thoughts. He suggested that a researcher’s standing as a scientist and, specifically, as a faculty member should depend not just on gathering data in innovative ways and developing new ways of analyzing them but also on sharing these data in a way that generates even more knowledge. It would be necessary to figure out a way to track that, he acknowledged. “It is not just how many people use your data, but how many really interesting additional studies came out of them.” There should be metrics not just for sharing data but for sharing data in really positive, creative ways that lead to new science, he said. He explained that perhaps this kind of productive data sharing should be an expectation in faculty promotion reviews.
Daston, G. 2014. Data uses and data requirements. Presentation at the workshop Principles and Obstacles for Sharing Data from Environmental Health Research, Washington, DC.
Ioannidis, J. P. A. 2005. Why most published research findings are false. PLoS Medicine 2(8):e124.
Lehrer, J. 2011. Trials and errors: Why science is failing us. Wired. Available at http://www.wired.com/2011/12/ff_causation/all/1 (accessed October 27, 2015).
Sarewitz, D. 2012. Beware the creeping cracks of bias. Nature 485(7397):149.