Questions about the reproducibility of scientific research have been raised in numerous settings and have gained visibility through several high-profile journal and popular press articles. Quantitative issues contributing to reproducibility challenges have been considered (including improper data management and analysis, inadequate statistical expertise, and incomplete data, among others), but there is no clear consensus on how best to address or minimize these problems.
This is an issue across all scientific domains. One study found that 65 percent of medical studies were inconsistent when retested and that only 6 percent were completely reproducible (Prinz et al., 2011). The following year, a commentary published in Nature reported that the findings of 47 of 53 landmark cancer research papers could not be reproduced (Begley and Ellis, 2012). A subsequent survey published in the journal PLOS ONE echoed this conclusion, finding that a majority of the cancer researchers surveyed had at some point been unable to reproduce a published result.
A lack of reproducibility of scientific results has created some distrust in scientific findings among the general public, scientists, funding agencies, and industries. For example, the pharmaceutical and biotechnology industries depend on the validity of published findings from academic investigators prior to initiating programs to develop new diagnostic and therapeutic agents that benefit cancer patients. But that validity has come into question recently as investigators from companies have noted poor reproducibility of published results from academic laboratories, which limits the ability to transfer findings from the laboratory to the clinic (Mobley et al., 2013).
While studies fail for a variety of reasons, many factors contribute to the lack of perfect reproducibility, including insufficient training in experimental design, misaligned incentives for publication and their implications for university tenure, intentional manipulation, poor data management and analysis, and inappropriate application of statistical inference. The workshop summarized in this report was designed not to address the social and experimental challenges but instead to focus on the quantitative issues: improper data management and analysis, inadequate statistical expertise, incomplete data, and difficulties in applying sound statistical inference to the available data.
As part of its core support of the Committee on Applied and Theoretical Statistics (CATS), the National Science Foundation (NSF) Division of Mathematical Sciences requested that CATS hold a workshop on a topic of particular importance to the mathematical and statistical community. CATS selected the topic of statistical challenges in assessing and fostering the reproducibility of scientific results.
On February 26-27, 2015, the National Academies of Sciences, Engineering, and Medicine convened a workshop of experts from diverse communities to examine this topic. Many efforts have emerged over recent years to draw attention to and improve reproducibility of scientific work. This workshop uniquely focused on the statistical perspective of three issues: the extent of reproducibility, the causes of reproducibility failures, and the potential remedies for these failures. CATS established a planning committee (see p. v) to identify specific workshop topics, invite speakers, and plan the agenda. A complete statement of task is shown in Box 1.1.
The workshop, sponsored by NSF, was held at the National Academy of Sciences building in Washington, D.C. Approximately 75 people, including speakers, members of the planning committee and CATS, invited guests, and members of the public, participated in the 2-day workshop. The workshop was also webcast live to nearly 300 online participants.
This report has been prepared by the workshop rapporteur as a factual summary of what occurred at the workshop. The planning committee’s role was limited to organizing and convening the workshop. The views contained in the report are those of individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.
In addition to the summary provided here, materials related to the workshop can be found online at the website of the Board on Mathematical Sciences and Their Applications (http://www.nas.edu/bmsa), including the agenda, speaker presentations, archived webcasts of the presentations and discussions, and other background materials.
Over the course of the workshop, speakers discussed possible reasons why studies may lack reproducibility. The following topics were discussed repeatedly throughout the workshop: clarifying definitions of reproducibility and associated terms, improving scientific discovery, tightening the accepted threshold for statistical significance, enhancing and clarifying protocols, uniting the broad scientific community in reproducibility efforts, changing research incentives, increasing the sharing of research materials, and enhancing education and training. The discussions around each of these areas are summarized in this section.
Throughout the workshop, presenters (Yoav Benjamini, Ronald Boisvert, Steven Goodman, Xiaoming Huo, Randy LeVeque, Giovanni Parmigiani, Victoria Stodden, and Justin Wolfers) and participants referenced the confusion in the terminology associated with reproducibility. Below are some of the terms and definitions that were offered:
- Reproducibility. “The ability of a researcher to duplicate the results of a prior study using the same materials . . . as were used by the original investigator. . . . A second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis . . . [in an attempt to] yield the same results. . . . If the same results were not obtained, the discrepancy could be due to differences in processing of the data, differences in the application of statistical tools, differences in the operations performed by the statistical tools, accidental errors by an investigator, and other factors. . . . Reproducibility is a minimum necessary condition for a finding to be believable and informative.” (NSF, 2015 [as identified by Steven Goodman])
- Repeatability (also referred to as empirical reproducibility). The ability to see the data, run the code, and follow the specified steps, protocols, and designs as described in a publication. (Steven Goodman and Victoria Stodden)
- Replicability. “The ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected. . . . A failure to replicate a scientific finding is commonly thought to occur when one study documents [statistically significant] relations between two or more variables and a subsequent attempt to implement the same operations fails to yield the same [statistically significant] relations.” (NSF, 2015 [as identified by Steven Goodman])
- Robustness. The resistance of the quantitative findings or qualitative conclusions to (minor or moderate) changes in the experimental or analytic procedures and assumptions. (Steven Goodman)
- Statistical reproducibility. The notion of how statistics and statistical methods contribute to the likelihood that a scientific result is reproducible and to the study and measurement of reproducibility. (Victoria Stodden)
- Computational reproducibility. Reproducibility issues arising from the involvement of a computer anywhere in the research process, whether a researcher doing bench work and analyzing data in a spreadsheet or a researcher running an enormous amount of code and software on large computing systems. (Victoria Stodden)
Improving Scientific Discovery
Several speakers discussed the importance of enhancing reproducibility to improve scientific discovery (Micah Altman, Steven Goodman, Randy LeVeque, Giovanni Parmigiani, and Marc Suchard). In order to improve discovery, evidence must be generated to help the scientific community reach consensus about a question of interest. While it is rare that a single study will provide sufficient evidence to yield a consensus, a key step in this process is replication—generating evidence under different experimental settings and across different populations. The accumulation and weighing of such evidence informs the process by which the scientific community reaches consensus about the question of interest. A central component of this process includes the systematic elimination of alternative explanations for observed associations and the explicit acknowledgement that as new evidence arises, the consensus in the scientific community might change.
Presenter Steven Goodman discussed two additional advantages of strengthening replication: (1) increased understanding of the robustness of results, including their resistance to (minor or moderate) changes in the experimental or analytic procedures and assumptions; and (2) increased understanding of the generalizability (i.e., transportability) of the results, including the truth of the findings outside the experimental frame or in a not-yet-tested situation. Goodman added that the border between robustness and generalizability is indistinct because all scientific findings must have some degree of generalizability.
Tightening the Threshold for Statistical Significance
Several speakers (Dennis Boos, Andreas Buja, Steven Goodman, Valen Johnson, and Victoria Stodden) and participants discussed the inadequacy of the current p-value standard of 0.05 for demonstrating statistical significance. Alternative proposals included reducing the standard p-value threshold (Buja), reducing it by at least an order of magnitude (Johnson), switching to a p-value range (Boos), and switching to a Bayes factor equivalent (Johnson). There was some opposition to changing the standard, chiefly because a more stringent threshold would demand larger sample sizes and thus additional resources.
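To give a sense of the magnitudes involved, the following sketch (an illustration, not material presented at the workshop) uses the well-known Sellke-Bayarri-Berger calibration, which bounds the Bayes factor against the null hypothesis that a given p-value below 1/e can support:

```python
import math

def bf_bound_against_null(p):
    """Upper bound on the Bayes factor against the null implied by a
    p-value, via the Sellke-Bayarri-Berger calibration -e * p * ln(p).
    Valid only for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("calibration requires 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: evidence against the null is at most "
          f"{bf_bound_against_null(p):.1f} to 1")
# p = 0.05 corresponds to odds of at most about 2.5 to 1;
# p = 0.005 to at most about 13.9 to 1.
```

Under this calibration, p = 0.05 caps the evidence against the null at roughly 2.5 to 1, which helps explain proposals to lower the threshold by an order of magnitude; Johnson's argument proceeds via uniformly most powerful Bayesian tests rather than this bound, but arrives at comparable magnitudes.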
Enhancing and Clarifying Protocols
Multiple speakers (Micah Altman, Andreas Buja, Marcia McNutt, Joelle Lomax, Marc Suchard, and Victoria Stodden) discussed the importance of protocols throughout the workshop, including experimental methodology, data-analysis decisions (e.g., model selection and tuning), and coding decisions (e.g., instrumentation design and software).
Uniting the Community in Reproducibility Efforts
The need for a unified, multifaceted approach for dealing with reproducibility was emphasized by multiple speakers (Chaitan Baru, Philip Bourne, Steven Goodman, Mark Liberman, Marcia McNutt, Victoria Stodden, Irene Qualters, and Lawrence Tabak). They argued that this effort must include all stakeholders, including funding agencies, journals, universities, industries, and researchers.
Changing Research Incentives
Community incentives for reproducibility are misaligned, according to several speakers (Micah Altman, Lida Anestidou, Andreas Buja, Tim Errington, Irene Qualters, Victoria Stodden, Lawrence Tabak, and Justin Wolfers) and participants. Some of these concerns involve the conflicting messages given to researchers about whether reproducibility research is valued within the community. Researchers are often told that replication studies are essential to the health of their scientific community, yet most journals do not publish replication papers, most funding agencies do not financially support such work, and researchers who conduct replication studies can face unpleasant and time-consuming resistance from the community.
Increasing Sharing of Research Material
Several speakers (Micah Altman, Ronald Boisvert, Philip Bourne, Tim Errington, Steven Goodman, John Ioannidis, Randy LeVeque, Mark Liberman, Gianluca Setti, Courtney Soderberg, Victoria Stodden, and Justin Wolfers) and participants highlighted the value of increasing access to supplementary research materials, such as data, code, and software, along with expanded descriptions of research methodology, for enhancing reproducibility. Some journals and funding agencies require data sharing as a condition of publication or funding, but many do not. Furthermore, the material provided in response to these requirements is often incomplete, incorrect, or otherwise unusable.
Enhancing Education and Training
Many speakers (Micah Altman, Chaitan Baru, Yoav Benjamini, Philip Bourne, Xiaoming Huo, and Rafael Irizarry) called for enhanced data science training for people at all levels, including undergraduate and graduate students, beginning
and established researchers, and senior policy leaders. While some of these training courses currently exist and others are being funded by agencies such as the National Institutes of Health, they currently do not sufficiently cover the education landscape; more work needs to be done to identify and fill gaps (Bourne).
Subsequent chapters of this report summarize the workshop presentations and discussion in sequential order. Chapter 2 provides an overview of the importance of reproducibility and discusses two relevant case studies. Chapter 3 focuses on conceptualizing, measuring, and studying reproducibility. Chapter 4 discusses the way forward by using statistics to achieve reproducibility. Finally, Appendix A lists the registered workshop participants, Appendix B shows the workshop agenda, and Appendix C defines acronyms used throughout this report.