The third session of the workshop consisted of three panels discussing how to move forward using statistics to improve reproducibility. The first panel on open problems, needs, and opportunities for methodologic research was moderated by Giovanni Parmigiani (Dana-Farber Cancer Institute and workshop planning committee co-chair) and included Lida Anestidou (National Academies of Sciences, Engineering, and Medicine), Tim Errington (Center for Open Science), Xiaoming Huo (National Science Foundation), and Roger Peng (Johns Hopkins Bloomberg School of Public Health). The second panel on reporting scientific results and sharing scientific study data was moderated by Victoria Stodden (University of Illinois, Urbana-Champaign) and included the following panelists: Keith Baggerly (MD Anderson Cancer Center), Ronald Boisvert (Association for Computing Machinery and National Institute of Standards and Technology), Randy LeVeque (Society for Industrial and Applied Mathematics and University of Washington), and Marcia McNutt (Science magazine). The final panel discussion on research as the way forward from the data sciences perspective was moderated by Constantine Gatsonis (Brown University, planning committee co-chair, and chair of the Committee on Applied and Theoretical Statistics) and included Chaitan Baru (National Science Foundation), Philip Bourne (National Institutes of Health), Rafael Irizarry (Harvard University), and Jeff Leek (Johns Hopkins University).
In addition to the references cited in this chapter, the planning committee would like to highlight the following background references: Bossuyt et al. (2003); Couzin-Frankel (2015); Donoho and Huo (2004); Heller et al. (2014); Karr (2014); Laine et al. (2007); Leek and Peng (2015a); LeVeque et al. (2012); Motulsky (2014); Nosek
Giovanni Parmigiani began the first panel discussion by noting that there should be a more integrated approach to several issues, beginning with terminology. While he commented that many believe the confusion and reversal of terminology across fields is too established to correct at this point, Goodman stated that the underlying conceptual construct behind the terms is shared. Parmigiani sees room to build upon this commonality and identify some construct everyone should refer to in trying to devise reproducibility solutions. He observed that Yoav Benjamini’s definitions of and distinction between single-study and multistudy problems reemerges in various attempts at defining terminologies. The distinction between meta-analysis and problems of reproducibility is related to the issue of how to accumulate evidence as it accrues versus how to quantify the extent to which it disagrees. Parmigiani speculated that building on this type of concept is a step that could help the terminology to converge and thus be more useful and conducive to scientific discourse.
He also noted several recurring themes of statistical issues that emerge across fields. He, as well as many other researchers, believes that selection bias is one of the most important of these issues. One aspect of selection bias is hunting for models that provide the desired answer. He suggested a systematic exploration of robustness both in models and experiments as an approach to make headway in this area across fields. He also noted that the ongoing frequentist versus Bayesian debate seems to be dissolving, which may imply that the community is reaching a compromise that could be useful for further progress. A place to start may be an agreement on how to report results (either Bayesian or frequentist) and how to better assess the meaning and significance of a study’s results.
Some steps can be taken immediately to identify areas of future work that could benefit multiple fields, according to Parmigiani. He said the statistics community should pay more attention to the issue of reproducibility of prediction across studies, contexts, and data sources. This would allow the scientific community to shift from an abstract definition of truth to a paradigm that can be measured more practically and objectively and tied more directly to the decision- and policy-making consequences of studies. For example, pharmacoeconomics, as discussed by Marc Suchard, highlights an arena where it would be possible to have competitions in which research groups, working with given data sets, would be challenged to identify significant associations and interactions, predicting the number of people who have adverse effects over a certain period of time if they take a certain drug. This exemplifies a prediction question within the context of which reproducibility could be quantified and monitored over time.
Parmigiani noted that this is only one aspect of going beyond statistics’ somewhat overused approach of defining tools framed in terms of hypothesis testing. However, he admitted that the community has a lot of work to do to develop tools that are as effective for more complex problems. He also commented that exporting this understanding of reproducibility across sources and subsets of data to the world of Big Data, where data sets are so large that the p-values become meaningless, can be fruitful.
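Parmigiani's idea of treating cross-study prediction as a measurable target can be made concrete with a small sketch (the data, model, and numbers below are invented for illustration and do not come from the workshop): a predictor is fit on one "study," and its error is then measured on a second study drawn from a shifted context.

```python
import random

random.seed(1)

# Hypothetical setup: the same linear relationship generates both "studies,"
# but study B's covariate range and noise level differ, mimicking a new
# context or data source.
def make_study(n, x_lo, x_hi, noise_sd):
    xs = [random.uniform(x_lo, x_hi) for _ in range(n)]
    ys = [2.0 * x + 1.0 + random.gauss(0, noise_sd) for x in xs]
    return xs, ys

xa, ya = make_study(500, 0, 1, 0.1)   # study A: training data
xb, yb = make_study(500, 2, 3, 0.5)   # study B: new context

# Ordinary least squares fit on study A (closed form, single predictor).
n = len(xa)
mx, my = sum(xa) / n, sum(ya) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xa, ya)) / sum(
    (x - mx) ** 2 for x in xa
)
intercept = my - slope * mx

def mse(xs, ys):
    """Mean squared error of the study-A model on any data set."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# The gap between within-study and cross-study error is one concrete,
# monitorable notion of how well a prediction "reproduces" elsewhere.
print(f"MSE on study A: {mse(xa, ya):.3f}")
print(f"MSE on study B: {mse(xb, yb):.3f}")
```

In practice the two studies would be real data sets from different sources, and the model would be whatever the original analysis used; the point is only that predictive reproducibility is a quantity one can compute and track over time.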
Lida Anestidou, National Academies of Sciences, Engineering, and Medicine
In June 2014, the National Academies of Sciences, Engineering, and Medicine’s Institute for Laboratory Animal Research (ILAR), which offers guidance on the use of laboratory animals both in the United States and around the world, convened a workshop on science and welfare in laboratory animal use (NASEM, 2015). Anestidou, who directs ILAR and coordinated that workshop, explained that the purpose of ILAR is to bring the diverse voices in laboratory animal fields together, including members of the public, researchers, veterinarians, laboratory animal facility staff, and committees who have oversight about animal use protocols.
She noted that the use of statistics within the animal research community is diverse and each community member’s unique understanding of statistics plays a role in the way reproducibility and other methodologic issues are understood and can be improved. The 2014 workshop discussed fundamental aspects of experimental design of research using animals and animal models, with the goal of improving reproducibility. According to Anestidou, four key themes arose at that workshop:
- Transformation of the research enterprise, specifically systemic issues, scientific training and culture, public perceptions, and incentives for research integrity;
- Interactive assessment of published research;
- Improvement in the reliability of published results; and
- Enhanced understanding of animals and animal models, specifically from clinical research, and proactive planning in preclinical research. This includes reproducibility and the “3Rs” (reduce the number of animals used, refine the methodology, and replace animal models with in vitro and in silico approaches), as well as animal welfare considerations.
An irreproducible study violates the community’s notion of ethics and animal welfare, Anestidou explained, because animals are affected and time and money are
wasted. More animals may be needed to repeat a study and animals may not be used appropriately. These are important issues to the laboratory animal community.
She summarized some key points identified by speakers during the ILAR workshop:
- A lack of reproducibility is generally not intentional fraud, and many issues can be linked to flawed experimental design, including statistics and experimental planning.
- C. Glenn Begley defined the following criteria to evaluate journal papers:
- Is the study blinded?
- Is a complete set of results shown?
- Can experiments be replicated?
- Are positive and negative controls shown?
- Are statistical tests used appropriately?
- Are reagents validated?
- Animal models are not poor predictors, and the use of such models does not a priori contribute to reproducibility problems. Rather, speakers at the 2014 workshop identified issues such as small sample sizes, genetic variation among species, and inbred versus outbred strains as leading to reproducibility issues.
Anestidou suggested the following steps to address these issues within the animal models community:
- Educate Institutional Animal Care and Use Committees (IACUCs) so they are able to (re)train investigators on the basics of proper experimental design for animal protocols;
- Design and track metrics of reproducibility involving animal experiments
- To compare outcomes and trends (i.e., evidence) in association with specific interventions (what about the systemic issues?), and
- To identify those interventions that appear to be more effective and understand how they may be applied and taught broadly; and
- Energize and interest the broader U.S. research community and involve the laboratory animal veterinary community in the reproducibility conversation.
Tim Errington, Center for Open Science
Tim Errington began by describing some reproducibility issues that arise from researchers’ degrees of freedom and explaining how they can essentially short-circuit the scientific process, including a lack of replication (Makel et al., 2012). He also described studies that were designed with low statistical power (Cohen, 1962; Sedlmeier and Gigerenzer, 1989; Bezeau and Graves, 2001); p-hacking (John
Errington said the research community could take several steps to address these problems. The first is changing the publication process so that review occurs prior to the data collection; this would shift the incentive structure by emphasizing the importance of a study’s questions and the quality of its research plans, while lessening pressure to find highly significant or surprising results. The second is study preregistration, which would distinguish between exploratory and confirmatory analysis by requiring information about what data are going to be collected, how the data will be collected, and how the analysis is going to be done. These steps would lead to increased accuracy of reporting, expanded publication of negative results, improved replication research, and enhanced peer review that would focus on the methods and approaches instead of the final result. A handful of journals have already adopted this approach, Errington noted, and the Many Labs and Science Exchange project offers examples of what can be done.1
However, Errington explained that adjusting incentives in this way is not enough. He said more tools and technology are needed to couple with the underlying data and methods. Better training is also important, specifically on methodologies that strengthen reproducible statistical analysis and reproducible practices in general, as is increased transparency. In conclusion, he summarized that to improve the scientific ecosystem, technology should enable change, training should enact change, and incentives should embrace change.
Xiaoming Huo, National Science Foundation
Xiaoming Huo began by speaking about the WaveLab project at Stanford University, which began more than 20 years ago and aimed to develop a toolbox to reproduce most of the algorithms available at that time for working with wavelets (Buckheit and Donoho, 1995). He noted that this project showed how important reproducibility is while also providing meaningful workforce development, especially with regard to graduate student training. He emphasized that a focus on reproducibility is a good way to drive stronger methodologic research. For example, he suggested that publications that partially explain how to reproduce software or previously published analysis methods significantly lower the barrier for others to use those methods. However, conducting research into the reproducibility of published work is often viewed as time intensive and outmoded. Because of this, reproducibility work is often not rewarded and may be harmful to those developing academic careers.
Huo stressed that reproducibility is not only about confirming the work that has been done by someone else; it can also contribute to readability and comprehensibility, especially helping to improve the accessibility of software and methods. Ultimately, Huo explained that the goal of disseminating knowledge is more likely achieved with the use of common terminology.
Huo discussed some National Science Foundation (NSF) programs that help to improve reproducibility. The Advanced Cyberinfrastructure (ACI) Division2 supports and coordinates the development, acquisition, and provision of state-of-the-art cyberinfrastructure resources, tools, and services essential to the advancement and transformation of science and engineering. In pursuit of this mission, ACI supports a wide range of cyberinfrastructure technologies. In these efforts, ACI collaborates with all NSF Offices and Directorates to develop models, prototypes, and common approaches to cyberinfrastructure.
The Computational and Data-Enabled Science and Engineering program3 aims to identify and capitalize on opportunities for new computational and data analysis approaches that could enable major scientific and engineering breakthroughs. Research funded under this program relies on the development, adaption, and utilization of one or more of the capabilities offered by advancing research or infrastructure in computation and data, either through cross-cutting or disciplinary programs. Huo noted that the effort’s focus on computation and data has a strong connection with reproducibility.
The NSF solicitation for Critical Techniques and Technologies for Advancing Foundations and Applications of Big Data Science and Engineering (BIGDATA) was released in February 2015.4 According to Huo, this program seeks to fund novel approaches in computer science, statistics, computational science, and mathematics, along with innovative applications in scientific domains, which will enrich the future development of the interdisciplinary field of data science. In conclusion, Huo noted that NSF program officers are always open to hearing new ideas, and he encouraged researchers to reach out directly to discuss potential proposals.
Roger Peng, Johns Hopkins Bloomberg School of Public Health
Roger Peng began by discussing a National Research Council workshop on the Future of Statistical Software (NRC, 1991). At that workshop, Daryl Pregibon set the stage with the observation that data analysis is a combination of many things that are put together, but that the process as a whole is poorly understood.

3 The NSF’s Computational and Data-Enabled Science and Engineering program website is https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504813, accessed January 12, 2016.

4 The NSF’s BIGDATA website is http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504767, accessed January 12, 2016.
Peng noted that he has been writing about reproducibility for about 10 years, and that over this time there has been tremendous progress in this area, both in the cultures of various communities and in the tools available. For example, he observed that many journals now require data and code when papers are published, and there are entire fields where it has become standard to make code and data available. Tools such as IPython Notebook5 and Galaxy6 have been developed to facilitate reproducibility work.
While Peng is encouraged by the developments of recent years, he noted that they often do not address researchers’ primary hope: that data analysis be more trustworthy and executed properly. Making data and code available is a good step, according to Peng, because other researchers are then able to correct a broken analysis, but that degree of reproducibility does not prevent incorrect analysis. He said this is analogous to telling someone not to worry if he develops asthma because there are great drugs to control it. What if the asthma could be prevented in the first place?
There are many mistakes in the literature (such as poor experimental design) for which researchers know the solutions, according to Peng. He emphasized the need to better disseminate knowledge that is already available. Many of the reproducibility issues that have arisen over the past few years would be best addressed using preventative measures (Leek and Peng, 2015b). In terms of opportunities for statisticians in particular, Peng suggested that poor data analysis should be proactively prevented rather than caught after the fact. It is not enough to do peer review or reproducibility work after a bad analysis has been done. Instead, he encourages statisticians to think of the data analysis process more broadly and consider all of it—not just the development of a model—to be a part of statistics. He elaborated that, while experimental design and model development are typically thought of as statistics, the part in between has grown massively over the past 10 years. While there is not uniform agreement among statisticians over what parts of the process they should be involved in, Peng urges statisticians to be involved in all of it. He asserted that statisticians need to study the process more carefully so they can make recommendations and develop guidelines for how to analyze data appropriately in certain situations, domains, and disciplines. He argued that to take on this new role, the statistics toolbox might need to expand into experimentation on the analysis process itself (i.e., how people do data analysis and what works robustly).
An important step in disseminating information and best practices relates to teaching these techniques and ideas in the simplest possible way. Peng said that
most researchers need to understand statistics because they are analyzing data, and improving their understanding can help curb some of the poor analyses. He commented that this is a big opportunity for statisticians to embrace the analytic and scientific pipeline and uncover ways to prevent the same reproducibility problems from recurring.
A participant noted that in the life sciences, Institutional Review Boards (IRBs) at many academic institutions review research proposals before or after funding is received, and these IRBs typically have statistical committees. Instead of creating new structures, could IRBs and funding agencies develop a framework for reducing errors in research, which in turn would help increase reproducibility? Errington suggested that tying more of these processes together would increase understanding of what the research will be and could lead to improved reproducibility.
Anestidou noted that the animal care and use committees could incorporate statistical subpanels, but there is discussion within the research community about increased regulatory burden, oversight, and paperwork. She wondered how this additional step would fit within the current paradigm. She suggested that the solutions should come from the community instead of being pushed as a top-down regulation.
A participant suggested that government agencies interested in national security analysis need to know whether their analyses are reliable, consistent, and repeatable across analysts. Perhaps reproducibility needs a new framework of analytic engineering to be able to describe how an analysis could be performed and explained to others. Huo noted that in the broader scientific community, the issue of being able to trust an analysis (as Roger Peng described) is important. There are several approaches to this, including increased government regulations or required review, but perhaps a community-based free market model similar to what has been done in some computational communities would be helpful. In such a community, once a method is developed, a paper and the software used are both published. Another approach, Huo noted, is to employ search engines to identify software (using associated comments and reviews) that could potentially be used for a comparison. He suggested that the community could do this sort of work, and that researchers who put more emphasis on reproducibility are more likely to see their papers achieve high impact and citation rates. He stated that this model is more efficient than having someone else impose regulations on the research work.
A participant asked how free statistical consulting relates to the prevention of poor analysis. Errington explained that this type of consulting does not typically carry out a statistical analysis for a researcher; rather, it fosters training and helps
identify methods that would work for a particular research question. He explained that these sorts of interactions aim to help researchers understand that the entire research process, from the way an experiment is designed all the way through to the analysis, is linked and that the entire process needs to be considered holistically. Such services offer advice to help researchers understand what approaches can be used and what resources can help understand the context. The advisors are both in-house and community-driven.
A participant echoed Peng’s comment that the scientific community seems not to have absorbed many of the things that statisticians and other designers of research have known for a long time. For example, good scientific practices (such as having a control group or a larger sample size) are well known but not always used. The participant questioned why behavior that researchers know is not optimal is allowed to continue within many communities. The participant then suggested that statisticians play a role here, but they have to partner with people within the disciplines because each culture can only reform itself. The cultures of the disciplines need to change their value systems and understand that every choice made in an analysis is fundamentally an issue of scientific norms and integrity, as opposed to simply moving the dial up or down on the error rate.
Peng agreed that many scientists in many areas know the basics, but his view is that data analysis can quickly get complicated. He also agreed that there is a cultural resistance to accepting this knowledge. However, he suggested that statisticians should bear much of the responsibility to take on this problem and work in the communities to do what is necessary to get them to change. Anestidou agreed that this is an issue of integrity and how to “do” science; she noted that the American Statistical Association has had guidelines about how to “do” statistics for more than a decade. She recalled a statistics professor in her first year of graduate school saying that the methods of statistical analysis need to be chosen before the methodology is set up and work begins. However, she does not see that happening in most cases, which is leading to flawed results. The prevention should start with training because doing better and more reliable analysis is a much larger issue than focusing on the analysis of data that are already collected. Errington commented that a solution to this requires all of the stakeholders to get involved because training and the incentive structure need to be aligned.
A participant noted that the International Initiative for Impact Evaluation7 funds impact evaluation and systematic reviews that generate high-quality evidence on what works and why. The participant noted that research transparency and better training are almost uniformly agreed upon, but what about all the studies that have already been published? How can incentives truly be changed? How can
researchers be encouraged to do replication studies? Errington agreed that there is value and knowledge to be gained from doing replication research but reiterated that the incentive structure needs to promote it.
A participant suggested that many problems relating to reproducibility could be addressed by adding requirements in the government grant and contracting process. Huo said NSF has been considering how to enhance or impose reproducibility. However, he noted that the funding reality is such that supporting work that attempts to reproduce existing results directly competes with funding other research; this decision, then, has the potential to impact the nation’s other science and engineering goals negatively.
A participant then asked how communities might pursue and incentivize improvements to data analysis. Peng said that, at Johns Hopkins, staff members are trying to teach statistics to as many people as possible. And on an individual basis, statisticians there work closely with scientists in their laboratories and environments to improve quality across the board.
A participant stated that prereview of research plans, as would occur with Errington’s proposal to revise the publication process, does not allow science to innovate freely. However, the existing IRB and IACUC systems are places where the improvements to analysis could be identified. These committees should include statisticians to evaluate the design of the experiment. The participant also stated that replicability studies might conflict with the 3Rs outlined by Anestidou. Replication is needed as the first step in continued research and should be publishable, but there are ethical concerns of replicating research that may not be of interest to future researchers.
A participant commented that it is encouraging that there is a strong voice emerging from the statistical and data processing community with respect to standards of replicability and improved methodology. However, the funding structure limits what can be accomplished because most investigators are under enormous pressure to get results out quickly in order to demonstrate their productivity and thus qualify for the next increment of funding. There are many instances in which analyses are done prematurely against the advice of statisticians, and researchers shift outcomes or fail to define outcomes adequately at the onset so as to look for an outcome that produces a significant result. The participant noted that it is hard to resist the pressure because statisticians usually work for the investigator and an investigator can look for other statisticians whose recommended adjustments are less burdensome. One possible solution that would address some of these problems is to expand the scope of clinicaltrials.gov, which requires a declaration of the methodology, and to expand this approach to observational studies.
Victoria Stodden began the second panel of the day by noting that the topic of reporting scientific results and sharing scientific data should include sharing scientific code as well. She noted that the topic of dissemination had already come up repeatedly during the workshop, including a discussion of the deliverables that are important when publishing scientific results and the implications this has for dissemination, as well as the methods used to complete studies and report results. She also commented that it is important for the community to consider the public perception of reproducibility.
Keith Baggerly, MD Anderson Cancer Center
Keith Baggerly explained that he has been associated with reproducibility efforts for a few years, motivated by a number of cases where he encountered process failures. This, at times, has led him to explore the development of forensic bioinformatics, which is the art of taking reported results and raw data and inferring what methods were used. He commented that, while this is a useful and often informative art, it does not scale and cannot be used system-wide.
He offered the following as a summary of major takeaway messages he had heard in the workshop:
- The statistical community needs to figure out specific steps that it could contribute to the reproducibility effort.
- The strength of the evidence for a claim presented goes from (a) same results from same source data to (b) same results from new data to (c) aggregate results from lots of data. Baggerly noted that the latter is the goal but that merely getting the same results from the same data can be immensely complicated.
- Research communities need a clearer understanding of the significance cutoff that is acceptable.
Baggerly noted that the case studies discussed during the workshop highlight some notable issues affecting reproducibility, particularly the complications in drawing inferences from large-scale data sets. When utilizing large-scale data sets, Baggerly warned, it is important not to focus on small variations that can be caused by batch effects, and he reminded the audience that Benjamini had discussed some ways to account for this. A related idea is to look for big effects, particularly in genomics (Zilliox and Irizarry, 2007). Large databases allow the scale of the noise in the data to be estimated, which in turn can be used to identify large effects.
Baggerly commented that the community is fortunate to have these large data sets in part because they make it easier to identify real results. In the case of the Cancer Genome Atlas,8 he explained that the 10,000 samples across 30 different tissues could be viewed as different replications with different disease types, which can help identify where defects are or which tissues have extremely high expression of a gene. He cautioned that it is important not to focus solely on p-values for large data sets, because those values tend to be small. Rather, he suggested that the effect size also be quantified to see if it is big enough to be of practical relevance. Baggerly cautioned that there are still flaws in data processing that come up in the case examples, and these highlighted some of the reasons statisticians need to be involved.
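Baggerly's caution about p-values in large data sets can be illustrated with a brief simulation (a hypothetical sketch, not an analysis from the workshop): with enough samples, even a negligible group difference yields an extremely small p-value, so an effect-size measure such as Cohen's d is needed to judge practical relevance.

```python
import math
import random

random.seed(0)

# Two groups of 100,000 samples; group B's true mean is shifted by only
# 0.05 standard deviations -- a practically negligible difference.
n = 100_000
a = [random.gauss(0.0, 1.0) for _ in range(n)]
b = [random.gauss(0.05, 1.0) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

ma, mb = mean(a), mean(b)
va, vb = var(a), var(b)

# Welch t statistic and a two-sided p-value (normal approximation,
# which is accurate at this sample size).
t = (mb - ma) / math.sqrt(va / n + vb / n)
p = math.erfc(abs(t) / math.sqrt(2))

# Cohen's d: the mean difference in units of pooled standard deviation.
d = (mb - ma) / math.sqrt((va + vb) / 2)

print(f"p-value ~ {p:.2e}, effect size d = {d:.3f}")
# The p-value is astronomically small, yet d stays near 0.05 -- far too
# small to matter in most practical settings.
```

Reporting the effect size alongside the p-value, as Baggerly suggests, makes such cases immediately visible.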
He explained that his main recommendation in terms of reporting is to include sanity checks because multidisciplinary big data studies magnify the chances for inadvertent errors. He elaborated that there are some ways to avoid this, in part through the process of pointing out this possibility to collaborators and soliciting their explanations, plotting by run date, and prespecifying positive and negative controls. As an example, he encourages genomics researchers to write down, before analyzing their data, a short list of the genes that they expect will be changed in response to treatment and the directions in which they should change. This step forces researchers to think about what the results should be before the analysis is performed, thus giving both the analysis team and the data suppliers a way of checking and calibrating results. The positive and negative controls come down to considering (1) what should be seen after the analysis, and (2) what results would indicate that the treatment resulted in no significant differences.
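The prespecified-controls idea Baggerly describes can be sketched as a simple programmatic check (the gene names, threshold, and fold-change values here are hypothetical, for illustration only): the expected behavior of positive and negative control genes is written down before the analysis, and the pipeline's output is compared against that list afterward.

```python
# Written down BEFORE the analysis: genes expected to move, and how.
expected = {
    "GENE_UP_1": "up",         # positive controls: should respond to treatment
    "GENE_UP_2": "up",
    "GENE_DOWN_1": "down",
    "HOUSEKEEPING_1": "flat",  # negative controls: should not move
    "HOUSEKEEPING_2": "flat",
}

# Produced AFTER the analysis pipeline runs: log2 fold changes per gene.
observed_log2fc = {
    "GENE_UP_1": 2.1,
    "GENE_UP_2": 1.4,
    "GENE_DOWN_1": -1.8,
    "HOUSEKEEPING_1": 0.05,
    "HOUSEKEEPING_2": -1.6,    # a negative control that moved: red flag
}

def check_controls(expected, observed, threshold=1.0):
    """Compare prespecified control genes against analysis output.

    Returns a list of (gene, expectation, observed_fc) mismatches.
    A non-empty list suggests a processing problem (for example, a
    batch effect or mislabeled samples), not necessarily a finding.
    """
    failures = []
    for gene, direction in expected.items():
        fc = observed[gene]
        ok = (
            (direction == "up" and fc >= threshold)
            or (direction == "down" and fc <= -threshold)
            or (direction == "flat" and abs(fc) < threshold)
        )
        if not ok:
            failures.append((gene, direction, fc))
    return failures

for gene, direction, fc in check_controls(expected, observed_log2fc):
    print(f"sanity check FAILED: {gene} expected {direction}, log2FC={fc}")
```

A failing negative control, as in this toy example, gives both the analysis team and the data suppliers an immediate, calibrated signal that something upstream deserves scrutiny before any biological conclusion is drawn.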
Ronald Boisvert, Association for Computing Machinery and
National Institute of Standards and Technology
Ronald Boisvert discussed some of the efforts he has been involved with in the course of his position as a member and former co-chair of the Association for Computing Machinery (ACM) Publications Board, where issues related to reproducibility and data sharing are currently being considered. The ACM is the world’s largest scientific and educational society in computing with a substantial publication program (45 research journals, 8 magazines, 26 newsletters, and approximately 450 annual conference proceedings) and an extensive digital library (with more than 430,000 full-text articles and more than 2,300,000 bibliographic records covering the entire computing field).
He explained that publishers could support reproducibility through journal policy mandates as well as by establishing incentives that encourage investigators
to give greater attention to reproducibility. Publishers can also provide platforms for archiving supplementary material such as data and codes.
Boisvert commented that, on the surface, reproducibility in computing research should be easier to achieve than in other areas of science because studies are typically carried out computationally, and computational experiments are more easily portable than physical experiments. He noted, though, that this is less true when the research is in areas such as hardware and human-computer interaction.
An early success for the computing community in evaluation and distribution of research software was the ACM Transactions on Mathematical Software (TOMS), which publishes research on the implementation of algorithms for solving standard mathematical problems such as systems of linear equations and partial differential equations. These implementations are packaged into reusable software bundles, which are refereed at the same time that the text of the paper is refereed. The referee gets the software, tries to run it, inspects it, and decides whether it is a useful contribution. Evaluation criteria include aspects such as code structure, usability, documentation, efficiency, and portability. TOMS has published more than 450 such papers since 1975, representing about one-third of the papers published in the journal. The software is made available in the ACM Digital Library as supplementary material associated with the paper. The capability of archiving such supplementary material has been available to all ACM publications since 1998, although it is not well promoted and the uptake of data is relatively small. Nevertheless, within the smaller mathematical software community, the desire of researchers to have their code used by others, along with the seal of approval coming from the refereeing process, has sustained the flow of ACM’s “Collected Algorithms” for 40 years.
Boisvert noted that other ACM journals have tried without success to replicate what TOMS has done. These other journals include the Journal of Experimental Algorithms, which has since morphed into a traditional publication, and the Journal of Educational Resources in Computing, which is now defunct.
ACM currently encourages pilot efforts to strengthen the reproducibility of papers in its journals and conferences. For example, since 2008 the Special Interest Group on the Management of Data's (SIGMOD's) main conference for database research has had a voluntary process for accepted papers to undergo reproducibility reviews by a committee. Authors submit the software and execution instructions, and these materials are judged on criteria such as sharability, coverage, and flexibility. Papers that pass the review get a "reproducible label"9 to indicate that the paper was carefully done in a certain sense. Over the years, the standards and procedures for doing the review and the terminologies have changed as the community has gained experience in this practice. The acceptance of this certification process has been fairly high. For example, 35 of the 88 papers accepted in 2011 participated and 24 were confirmed "repeatable" based on a number of criteria (Freire et al., 2012).
Within the programming languages and software engineering community, the issue of reproducibility has been taken on directly, according to Boisvert. There are 11 major conferences on programming language and software engineering that carry out a process known as artifact evaluation,10 with more than 13 conference sessions participating since 2011. He noted that the optional committee-based evaluation process for accepted papers has two evaluators per artifact, typically graduate students and postdoctoral fellows. The criteria are similar to those used by SIGMOD in that they look at the packaging, reproducibility, implementation, and usability. This step is beginning to take off in the community: for one particular conference in 2014, 20 out of 52 accepted papers volunteered for this evaluation and 12 passed. While not a requirement, in many cases the artifacts are subsequently made available for download.
The ACM Transactions on Mathematical Software recently extended its replicability review process to the two-thirds of its papers in which software is not submitted for review and distribution.11 Papers in this journal typically present new algorithms and compare them to existing methods via some form of computational experiment, according to Boisvert. Authors of papers that have been accepted subject to minor revisions can opt for an additional “replicability review.” In that process, a single reviewer works collaboratively with the authors to replicate the computational results that contribute to the main conclusions of the paper. The reviewer then writes a short paper on the experience, which is published along with the original paper. Authors are incentivized by having the label “Replicated Computational Result” affixed to their papers, while reviewers are incentivized by having a publication of their own.
Boisvert emphasized that while the ACM Publications Board would like to propagate these practices throughout all of its publications, the society understands that success depends on subcommunity acceptance. And before each subcommunity can develop its own procedures for the review process, uniform terminology tied to baseline review standards needs to be developed in order to enable meaningful labeling of papers that have undergone some form of replicability review.
11 ACM Digital Library, “Editorial: ACM TOMS Replicated Computational Results Initiative,” http://dl.acm.org/citation.cfm?doid=2786970.2743015, accessed January 12, 2016.
Randy LeVeque, Society for Industrial and Applied Mathematics and
University of Washington
Randy LeVeque provided some examples of reproducibility work he has been doing over the last 20 years. For example, he co-developed an open-source software package for solving wave propagation problems through his numerical analysis and scientific computing research, studied software applications such as tsunami modeling and hazard assessment, and—as chair of the Society for Industrial and Applied Mathematics (SIAM) Journals Committee—advised the SIAM vice president for publications on issues related to journals. LeVeque also discussed his involvement with the eScience Institute at the University of Washington through the Reproducibility and Open Science Working Group, which aims to change the culture and the way data science is done. This group has regular monthly seminars on reproducibility and other open science issues. This effort began as a single-campus reproducibility effort, where researchers could submit something that they planned to publish and ask other people to download it, run the code, and evaluate the clarity of the instructions. The current goal is to increase the scale of this resource.
The SIAM Journal program has 15 high-impact research journals in applied mathematics. Traditionally, LeVeque explained, supplementary materials had not been published with these journals, but beginning in 2013, editorial boards could determine whether or not they wanted to support additional materials. He emphasized that the idea of supplementary materials was foreign to many researchers in applied mathematics. For example, of the approximately 1,500 articles SIAM publishes each year, only 38 articles have published unrefereed supplementary materials. Two SIAM journals have a longer history of having refereed materials associated with them: the Journal of Dynamical Systems uses the DSWeb12 and the Journal on Imaging Science partners with Image Processing On Line.13
Several other SIAM journals focus on publishing software,14 and LeVeque is interested in ensuring people get credit for working on software because it often requires a large investment of time and represents an encapsulation of algorithms and knowledge. He observed, though, that publishing research code for processing, analyzing, and visualizing data; for testing new methods or algorithms; and for computational science or engineering problems is rare in applied mathematics and computation science and engineering.
14 Other journals focused on publishing software include the ACM Transactions on Mathematical Software, SIAM Journal on Scientific Computing (Software Section), Journal of Open Research Software, Open Research Computation, Journal of Statistical Software, Geoscientific Model Development, and PeerJ Computer Science, among others.
Making supplementary materials more widely available, both online as well as permanently archived, would advance the field, according to LeVeque. He mentioned two sites, Zenodo15 and Figshare,16 that archive material, assign it a digital object identifier, and allow it to link automatically to GitHub.17 LeVeque recommends that researchers be encouraged to use one of these options for sharing the code that goes along with their research papers.
The culture needs to change, according to LeVeque, and there are still questions about incentives versus requirements. From his perspective, journal publications continue to be valued more highly than software and data sets, and writing another paper is rewarded more than making existing code more reliable and sharable. In LeVeque's view, code and data need to be "first-class objects" in research; he suggested that the broader scientific community needs to imagine a new world in which all data are freely available online and the incentive to hoard data is eliminated.
LeVeque noted that institutional roles concerning code and data sharing are important, particularly regarding whether the institution or the researcher owns the software developed at a given university and what that means for making code open source. He also commented that some curricular changes are needed in computational science, starting at a very basic level with early programming, statistics, and numerical analysis courses. He argued that topics such as version control, code review, and general data hygiene (such as management, metadata, and posting) should be taught early.
He commented that computational mathematicians write papers about numerical methods, often containing a new algorithm, and spend weeks cleaning up the theorems in the paper, but they do not want to spend any time at all cleaning up the code and making it available to others. Traditional mathematics does not struggle as much with reproducibility, according to LeVeque, because proofs are required to publish a theorem. According to David Hume (1738): “There is no . . . mathematician so expert in his sciences, as to place entire confidence in any truth immediately upon his discovery of it. . . . Every time he runs over his proofs, his confidence increases; but still more by the approbation of his friends; and is raised to its utmost perfection by the universal assent and applauses of the learned world.” LeVeque argued that computational mathematics should embrace this approach more because it is difficult to evaluate the accuracy of the programs if they have not been cleaned up, published, and peer reviewed.
In conclusion, he noted that many of the arguments against publishing code seem ludicrous when applied to proofs (LeVeque, 2013):
- The proof is too ugly to show anyone else.
- I didn’t work out all the details.
- I didn’t actually prove the theorem—my student did.
- Giving the proof to my competitors would be unfair to me.
- The proof is valuable intellectual property.
- Including proofs would make math papers much longer.
- Referees would never agree to check proofs.
- The proof uses sophisticated mathematical machinery that most readers/referees don’t know.
- My proof invokes other theorems with unpublished (proprietary) proofs. So it won’t help to publish my proof—readers still will not be able to fully verify its correctness.
- Readers who have access to my proof will want user support.
He hopes that 300 years from now, people will look back and see this as a transition time when science moved into doing things very differently.
Marcia McNutt, Science Magazine
Marcia McNutt began by noting that the scientific community is embracing the concept of reproducibility quickly. She believes that, in the future, the past couple of years will be identified as the period of time in which the community (i.e., funding agencies, journals, universities, and researchers) recognized it had to take reproducibility seriously and come up with better practices and solutions across all disciplines for the sake of the reputation and quality of science.
She noted that the spectrum of reproducibility (Ioannidis and Khoury, 2011) includes the low end (minimum standard) of repeatability—where another group can access the data, analyze them using the same methodology, and obtain the same result—and the high end (gold standard) of replication—where the study is repeated start to finish, including new data collection and analysis, using fresh materials and reagents, and obtains the same result. For some fields of science, McNutt noted that true replication is not possible. For example, an earthquake cannot be repeated, and forests evolve, so whenever time is a vector in an analysis, exact replication is impossible and the best that can be done is to take the data and analyze them again. There is a certain degree to which you might be able to repeat or generate new data, but it is never going to be an exact repeat.
McNutt explained that the approach at Science has been to acknowledge that the differences in fields and communities lead to different reproducibility issues. The journal, then, needs to work with these fields and communities to find the best practices, procedures, and policies that raise the standards for transparency and promote reproducibility. Science started with the assumption that a study’s
reproducibility or lack of reproducibility does not necessarily mean it is right or wrong, respectively.
She explained that there are many examples of this, including one in which three top laboratories took a global data set of earthquake sources and receivers and analyzed the data to show that there were bumps on the boundary between the Earth’s core and mantle. This very reproducible result was widely shown on covers of famous journals and was a source for speculation on the creation of Earth’s geodynamo and the coupling mechanism between the differential rotations of the Earth’s core and mantle. McNutt emphasized that this was a very solid, reproducible result with major geodynamic repercussions for how Earth behaves. However, the result was fundamentally wrong. All the analyses were preconditioned on the earthquake sources being located in the major subduction zones and the earthquake seismometers being located on the continents. This bias in source receiver locations, when put in a spherical harmonic representation, led to the artifact of bumps on the core/mantle boundary.
Following a workshop on the topic of promoting reproducibility in the preclinical sciences, Science published an editorial recommending best practices for transparency in those sciences (McNutt, 2014). That editorial, signed by representatives of 120 journals, recommended that researchers discuss the following information in order to publish:
- Power analysis for how many samples are required to resolve the identified effect,
- Random assignment of samples to treatment and controls,
- Blinding of experimenter to which samples were in the treatment and which were in the controls, and
- Data availability.
A goal was that improved transparency of these four experimental protocols would allow reviewers and readers to gain a level of confidence in the results. McNutt noted that authors are not required to follow these protocols; they are only required to state whether or not they did so.
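The first of these protocols, power analysis, can be sketched with the standard normal-approximation formula for a two-sided, two-sample comparison of means. The function name and defaults below are illustrative, and the normal approximation slightly understates the exact t-based sample size:

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    effect (Cohen's d) in a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# A "medium" effect (d = 0.5) at alpha = 0.05 and 80% power needs roughly
# 63 samples per group; halving the effect size quadruples the requirement.
print(samples_per_group(0.5))
print(samples_per_group(0.25))
```

Writing down such a calculation before data collection is precisely the kind of transparency the editorial asks authors to report.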
A follow-on workshop focused on the social and behavioral sciences and resulted in a general document that could be applied more broadly beyond those fields, McNutt stated. That document includes a number of guidelines (Nosek et al., 2015) that journals can choose to follow:
- Tier 1: Asking author to declare what was done,
- Tier 2: Conforming to a community standard, or
- Tier 3: Verifying that the standard was followed.
An upcoming third workshop focusing largely on the availability of data and sample metadata as they pertain to reproducibility in the field sciences will include representatives from journals, data repositories, funders, and the scientific community. A fourth workshop is being planned to focus on the computational sciences.
In conclusion, McNutt noted that Science added several statisticians to its board of reviewing editors to help screen and identify papers that may need extra scrutiny for the use of statistics or numerical analysis. She said this addition has raised the journal’s standards.
Victoria Stodden began the panel discussion by asking each of the panelists to give one or two concrete recommendations for improving how science is conducted, reported, disseminated, or viewed by the public. The following recommendations were offered:
- Establish publication requirements for open data and code. Journal editors and referees should confirm that data and code are linked and accessible before a paper is published. (Keith Baggerly)
- Clarify strength of evidence for findings. The strength of evidence should be clearly stated for theories and results (in publications, press releases, etc.) to ensure that initial explorations are not misrepresented as being more conclusive than they actually are. (Keith Baggerly)
- Align incentives. Communities need to examine how to build a culture that rewards researchers who put effort into verifying their own results rather than quickly rushing to publication. (Marcia McNutt)
- Institutions need to make extra efforts to instill students with an ethos of care and reproducibility. (Marcia McNutt)
- Universities need to change the curriculum to incorporate topics such as version control, code review, and general data management, and communities need to revise their incentives to improve the chances of reproducible, trustworthy research in the future. Steps to improve the future workforce are necessary to keep the public trust of science. (Randy LeVeque)
- Many graduates are well steeped in open-source software norms and ethics, and they are used to this as a normal way of operating. However, they come into a scientific research setting where codes are not shared, transparent, or open; instead, codes are being built or constructed in a way that feels haphazard to them. This training disconnect can interfere with mentorship and with their continuation in science. Better understanding of these norms is needed at all levels of research. (Victoria Stodden)
- Prevention and motivation need to be components of instilling the proper ethos. This could be part of National Institutes of Health (NIH)-mandated ethics courses. (Keith Baggerly)
- Clarify terminology. A clearer set of terms is needed, especially for teaching students and creating guidelines and best practices. Some examples of how to do this can be found within the uncertainty quantification community, which successfully clarified the terms verification and validation that were used almost synonymously 10-15 years ago. (Ronald Boisvert)
Regarding the problem of rushing research into publication without consideration of the effects if it is not replicable, a participant stated that there should be mechanisms available to make replications (both positive and negative) better known. McNutt noted that eLife, for the biomedical sciences, and the Center for Open Science, for the social sciences, are already making such efforts.
A participant noted that many of the journal-sponsored workshops on reproducibility focus on operational issues, such as having transparency, making data available, cataloging, and developing computing infrastructure that allow for the data to become available. That focus overlooks other critical questions: What constitutes evidence of reproducibility (which requires a conceptual framework)? How is reproducibility defined? Who decides whether something is reproducible? How can reproducibility be assessed on an evidentiary basis? The participant wondered how the current machinery is helpful in those efforts and how to make it more clear to researchers what they and the community should be looking for when checking for reproducibility. The participant stated that more development on the conceptual and evidence-base level is needed. LeVeque added that there is still a lot of uncertainty about what exactly should be expected in computational reproducibility even with respect to terminology; for example, there is not a uniform agreement on the terms reproducible, repeatable, and replicable. He noted that the first step in defining an evidence base is having a clear understanding of the terms.
Another participant suggested that one of the ways to change incentives is to make replication research more broadly publishable, possibly through the use of short replication study papers. Boisvert agreed that allowing these replication papers to be published is important for the incentive structure. He referred back to his discussion of how the ACM Transactions on Mathematical Software checks for reproducibility as part of the review process, which in part allows reviewers to publish their experiences in doing the replication. Boisvert pointed out that the journal does not currently have a policy of what to do if the replication work fails to reproduce the study findings. Baggerly and a participant agreed that many
organizations and journals are struggling with the same question of how to handle irreproducibility.
A participant commented that researchers at Stanford are beginning to develop a short online module on proper training regarding issues of reproducibility, which is intended to be added to the responsible conduct of research course that students are required to take.
A participant asked how science journalists can help advance reproducibility itself, as well as the public's perception of reproducibility. McNutt suggested that journalists avoid overstating the results of a study (which is a common problem with university press releases) and clarify the caveats and limitations associated with new findings, as well as what might have led the new work to a conclusion different from previous findings. Baggerly suggested following up on novel findings from the recent past, assessing their implications, and evaluating how well the findings have held up. Stodden commented that journalists can be somewhat hesitant about interacting with scientists after a story is drafted. Acknowledging the time constraints, she suggested it would be helpful if there were a consistent ethos of having scientists sign off on all quotes. The participant noted that some of the publications she has worked for ask reporters to look through the original papers to assess whether the data analysis looks like it was done appropriately. The publications do not want to report on results that end up being undermined by bad data analysis and later criticized on statistics blogs. She agreed that a better system with more open lines of communication is needed.
A participant commented that sensitivity analysis is essential to reproducibility in that the analytical methods used must be assessed to see which are the most sensitive to noise in the study process and how to make them less sensitive. This is at the core of figuring out why something is not reproducible. He wondered if there is a way to get a better scientific infrastructure beyond just journal publishing such as active, open-use databases. Such a system would foster a good interchange among disciplines with regard to how results are reported in various disciplines so that best practices can be accepted and improved upon by other fields. Boisvert noted that uncertainty quantification in computational science, which is related to sensitivity analysis, is a very important consideration to which few people dedicate time. Within the applied mathematics community, there is a large effort in understanding how to do uncertainty quantification of models and simulations, which leads directly to understanding whether results are reproducible. Baggerly agreed that more sensitivity analyses would be helpful and that there needs to be better training in this area. He noted that assessing variation in larger databases is a form of sensitivity analysis and may be about as good as can be done in those cases. McNutt noted that there is new laboratory software entering beta testing that can track laboratory results to reveal systemic issues, such as equipment degradation, and help identify sources of bias and error in results (Gardner, 2014). The participant suggested that the community
could evolve with the help of a standards development organization that conducts interlaboratory studies and makes results available to the public so other researchers can identify similar sensitivities in their own laboratories. Baggerly agreed that having a built-in system to spot interlaboratory variations would be ideal.
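A basic form of the sensitivity analysis discussed above can be sketched as follows: re-run an analysis on noise-perturbed copies of the data and measure how much the headline estimate moves. Here the "analysis" is just a mean and the data and noise levels are invented; in practice the full pipeline would be re-run:

```python
import random
import statistics

def analysis(data):
    # Stand-in for a real analysis pipeline; here, just the sample mean.
    return statistics.mean(data)

def sensitivity_to_noise(data, noise_sd, n_trials=200, seed=0):
    """Standard deviation of the analysis output under additive Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sketch itself is reproducible
    results = [
        analysis([x + rng.gauss(0, noise_sd) for x in data])
        for _ in range(n_trials)
    ]
    return statistics.stdev(results)

data = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 9.7]
# The spread of the output grows with the noise level; a result whose
# conclusions flip under small perturbations is a reproducibility warning sign.
print(sensitivity_to_noise(data, noise_sd=0.1))
print(sensitivity_to_noise(data, noise_sd=1.0))
```

Comparing the output spread across plausible noise levels indicates which steps of a study process are most fragile, which is the diagnostic role the participant described.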
Constantine Gatsonis opened the final workshop panel by highlighting some of the previously identified themes. The first relates to statistical thinking and determining evidence of reproducibility. He echoed previous speakers in stating that the discussion of reproducibility is at a critical point and the issue is well recognized across the scientific and policy domains. In some specific areas, approaches toward assessing evidence for reproducibility have been developed that are applicable to a particular area of research. However, he emphasized that there is not a broadly accepted framework for conceptualizing and assessing reproducibility. Some key questions that need to be addressed in such a framework include the following, according to Gatsonis:
- What is meant by reproducibility?
- What evidence is needed to support reproducibility?
- How should experiments be designed?
- What is the role of publishing in supporting that enterprise?
- How stringent should the evidence be before a result is declared reliable from a reproducibility perspective?
- What is the right p-value or Bayes factor?
Gatsonis stressed that these are important issues about evidence that highlight the lack of consensus among scientific communities. Individual scientific communities are developing solutions to portions of the challenge, and certain areas are evolving quickly, such as policy, computing approaches, and IT. However, a more conceptual framework still needs to be developed. The National Academies’ Committee on Applied and Theoretical Statistics, which organized this workshop, is looking for ways to move this forward.
Another open issue identified by Gatsonis is what researchers and students should be taught about reproducibility. He emphasized that there needs to be explicit curricula with courses that address reproducibility directly. However, he argued that before anyone could develop this curriculum, a broadly accepted framework for what constitutes strong evidence for reproducibility is needed. He noted that different scientific communities are at the stage of developing structures and processes, but that the basic scientific consensus is still evolving.
Philip Bourne, National Institutes of Health
Philip Bourne explained that he would build on Lawrence Tabak’s ideas from earlier in the workshop and go into more detail with respect to data. He observed that NIH’s data science strategy incorporates statistical rigor, replication, and reproducibility by focusing on community, policy, and infrastructure, most of which is being done through the Big Data to Knowledge (BD2K) initiative.
Bourne noted a few key issues, the first of which is the significant time required to reproduce research (Garijo et al., 2013). The second is the insufficient reporting of data and a lack of negative data. He suggested that the way to address this gap is through the use of a Commons, which is essentially a shared space where research objects are posted (e.g., data, software associated with analyzing those data, statistical analysis, narratives, and final publications). A third issue is p-hacking and the need for robust research training in the best use of statistics and analytics. NIH is creating a training coordination center to begin collecting and recording courses offered and materials available (both physical and virtual), as well as cataloging the analysis training landscape and identifying gaps that might require additional funding.
Incentives, Bourne asserted, are a key aspect of encouraging reproducibility and statistical rigor in research. NIH’s policies are changing with respect to data sharing. Currently, the Office of Science and Technology Policy directs that any grant over $500,000 must have a data-sharing plan, but soon this requirement will be extended to all grants. Bourne commented that while some of the incentives come from funders, many come from the community. For example, he does not believe data are regarded highly enough in the realms of scholarship. Perhaps endorsing data citations in new ways would be helpful, as is being done through the National Library of Medicine’s PubMed and PubMed Central.
Chaitan Baru, National Science Foundation
Chaitan Baru affirmed that the issue of data is important to NSF; the effectiveness of its current data management plan will be evaluated over time. NSF funds individual research proposals in this area, as well as a community group that studies ethics concerns. Baru explained that there are three primary areas that make up data science: computer science, statistics, and ethics and social issues. He discussed the 2014 Big Data Strategic Initiative18 workshop that brought together federal agencies, academia, and industry to discuss agencies' strategies for dealing with data and data analysis. An important theme that emerged from that workshop
was education and training for current researchers and for the next generation. As a result, his office intends to run another workshop on how data science curricula should be designed. He emphasized that the concepts of reproducibility and repeatability would be essential elements within a data science curriculum.
Baru concluded by recognizing the ACM database conference, which instituted the notion of looking at the repeatability of results in the papers submitted, and stressing that more such work would help the community. He commented that it is difficult for a funding agency to advance a cultural change that is not already occurring in a community. If the norm develops, then internal pressure and community behavior begin to move things in the right direction. For example, as a community, ACM SIGMOD generated the notion of a test-of-time award, in which conference proceedings and papers from 10 years ago are evaluated and an award is given to the paper that had the most impact. A participant later commented that there are branches of ethnography that study how people collaborate with different branches of science and how people and cultures change; such work may provide some insight into how to change aspects of the scientific cultures relating to reproducibility.
Rafael Irizarry, Harvard University
Rafael Irizarry echoed Philip Bourne’s message about the importance of education: better training is the best way to prevent errors in methods and analysis and thus improve reproducibility. He elaborated that improved education is particularly important now as many fields are transitioning from a data-poor to a data-driven state, and many researchers are becoming data analysts out of necessity. He commented that while he is not an expert in reproducibility, he has been working in biomedical data science for 20 years, helping to manage data and make discoveries. During this time, biomedical sciences have become data intensive and many researchers must now be proficient in data management, data wrangling, computer algorithm optimization, and software development to implement methods. Irizarry noted that a relatively small investment of time and resources at the beginning of a project has the potential to improve reproducibility and save a lot of time in the end.
He highlighted a few readily available tools created by data analysts to improve reproducibility, including Bioconductor,19 R,20 Subversion,21 and GitHub.22 All of these tools were developed from the bottom up. For example, as researchers were analyzing data, they identified gaps among the available tools and went on to create R and Bioconductor to facilitate their work.
Irizarry stated that it is important to assess if irreproducibility is truly a crisis and if there is a difference now compared to how science was done 50 years ago. For example, the published estimates of the speed of light from 1900 to 1960 were regularly refined, and error bars narrowed (Youden, 1972), which illustrates that the community found a way to continue improving despite the problems that have always surrounded science.
Irizarry was not optimistic about incentive changes such as top-down measures, rules, and regulations, although he agreed with Bourne that data sharing should be incentivized. In addition to enhancing reproducibility, it will encourage more researchers to look at more data and potentially make additional discoveries. However, he cautioned that some policies could be used in unintended ways, and adding hurdles to publication can slow progress in a number of ways.
He noted that although he does not see clear evidence of drastic change in the rate of irreproducibility in the biomedical field, one remarkable change over the past 50 years has been the attention given to press releases, with more emphasis now on getting results in the top newspapers. He also commented that with the quick biomedical transformation from data poor to data driven, much of the infrastructure (people in leadership positions, journal editors, and training programs) has not changed even though the nature of the work has changed dramatically.
To try to help, Irizarry collaborates with as many researchers as possible. He and several of his colleagues have grants funded by NIH’s BD2K initiative to create massive open online courses to improve statistics and data analysis among researchers who did not have that as part of their training. Efforts such as these are important in the biomedical sciences and also in other fields that are moving from data poor to data driven.
His final point was that statisticians should not shy away from teaching students how to do applied statistics. This goes beyond teaching methods and theories: it includes showing students how to clean and then analyze data, how to check and explore the results, and how to be skeptical and critical of data analyses. He emphasized that improving the situation requires educating researchers who did not have statistics, computing, and reproducibility as part of their formal training.
Jeff Leek, Johns Hopkins University
In his discussion of evidence-based data analysis, Jeff Leek stated that for small to medium-sized problems, reproducibility (if defined as repeatability of analysis) is a solved problem. The tools and ability exist, so it is possible to achieve. Leek noted that the question that remains is why people are not doing it. The main reason he offered is that researchers are not rewarded for it. If senior leadership in the communities believes this work should be done, it needs to find a way to communicate these ideas and create a suitable environment.
Leek mentioned that many openly available data analysis tools for reproducible research already exist and are being used, such as R Markdown,23 Galaxy,24 IPython Notebook,25 and GitHub.26 He referenced a recent study he and a collaborator published about the rate at which discoveries are false in science (Jager and Leek, 2014), which ended up stimulating a debate in the scientific literature. Researchers wrote positive and negative responses, reproduced the analysis using the available data and code, and built and improved upon it. However, he noted that reproducibility and replicability work are often unfairly criticized and held to a higher quality standard than original research; this can be a disincentive for researchers interested in conducting this work.
Leek echoed previous speakers in noting that an analysis can be fully reproducible and yet still be wrong (Baggerly and Coombes, 2009). He emphasized that many communities are moving beyond reproducibility to assess whether reproducible results are trustworthy. He also agreed that training is often more important than tools yet is often ignored. He commented that he and most of his colleagues receive more requests than they can accept to act as statistical referees for papers. Because there are not currently enough trained researchers to fill this role, Leek suggested that the statistics community think about prevention and about ways to (re)train students and researchers quickly. One example of training being scaled up is the Johns Hopkins University Data Science program,27 which includes a class on reproducible research. This program has trained more than 100,000 people in reproducible research, with lessons on data collection and cleaning, exploratory data analysis, and version control using GitHub. The program is designed specifically for the modern data scientist, who Leek noted is in high demand. It is also important to make clear what kinds of questions researchers are asking, such as whether a data set is analyzed with descriptive, exploratory, or causal inference methods (Leek and Peng, 2015b). Enforcing any statistical procedure, including p-values, across all of science would likely result in resentment and mistakes in implementation.
27 The Johns Hopkins University Data Science Program website is https://www.coursera.org/specialization/jhudatascience/1, accessed January 12, 2016.
Constantine Gatsonis began by asking the panel to comment on the three types of reproducibility, as explained by Victoria Stodden: empirical reproducibility, computational reproducibility, and statistical reproducibility. In particular, he suggested that the concepts and challenges of statistical reproducibility (namely, what kinds of evidence are needed to assess reproducibility) have not yet matured as much as they have in discussions of empirical and computational reproducibility. Rafael Irizarry noted that statistical aspects of reproducibility can get very complicated, as was illustrated in some of the case studies and examples discussed throughout the workshop. Much of the understanding of how best to use statistics comes from experience, and often reproducibility is not ensured simply by documenting researchers' methods. He stressed that researchers need to learn from experience how to evaluate and be critical of data analyses. Gatsonis mentioned that at some point, experience becomes quantified in a set of assumptions, such as what p-value is acceptable. However, there is debate now over whether the standard p-value threshold of 0.05 is stringent enough to trust a result. Irizarry commented that there is a trade-off between keeping false-positive rates low and overlooking important discoveries; since both matter, he was reluctant to sacrifice one for the other. A participant countered that lower standards do not increase the rate of discoveries; rather, the scientific community wants to be sure that the discoveries being reported are actually true. He stated that there are serious costs for false discoveries, because people will follow up on misleading results and thereby waste resources, and that increasing standards for publication is not going to slow the rate of true discoveries. Irizarry reiterated his assertion that true positives decrease under more conservative research standards. Leek commented that much of the discussion around statistical reproducibility involves shifting the p-value threshold up or down or choosing one test statistic over another.
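The trade-off debated here can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the workshop; all numbers (1,000 tests, 10 percent real effects, and the assumed drop in statistical power at a stricter threshold) are hypothetical, chosen only to show how a tighter p-value cutoff reduces false positives while also costing some true discoveries.

```python
# Hypothetical sketch of the threshold trade-off discussed by the panel.
# All numbers below are invented for illustration.

def expected_outcomes(n_tests, frac_true, alpha, power):
    """Expected true and false positives among n_tests hypotheses."""
    n_true = n_tests * frac_true   # hypotheses with a real effect
    n_null = n_tests - n_true      # hypotheses with no real effect
    true_pos = n_true * power      # real effects that are detected
    false_pos = n_null * alpha     # nulls that cross the threshold
    return true_pos, false_pos

# 1,000 tests, 10% real effects; assume the stricter threshold lowers power.
for alpha, power in [(0.05, 0.8), (0.005, 0.5)]:
    tp, fp = expected_outcomes(1000, 0.10, alpha, power)
    fdr = fp / (tp + fp)  # expected fraction of reported findings that are false
    print(f"alpha={alpha}: {tp:.0f} true, {fp:.1f} false, FDR={fdr:.2f}")
```

Under these made-up assumptions, the stricter threshold sharply cuts the fraction of false findings but also forfeits a substantial number of true discoveries, which is the tension Irizarry and the participant were debating.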
He echoed Irizarry's point that the only way to learn how to do good data analysis is to do it for a while and figure out what works and what does not. He suspected that data analysis needs to become an empirical discipline whose efficacy can be studied. Bourne agreed that training is an essential component, because data science is accelerating what has been going on in computational biology and bioinformatics for some time. A problem he identified is a propagation effect in which people without sufficient statistical knowledge apply methodology incorrectly and low-quality analysis proliferates; education, he said, is the only way to curb this.
A participant noted the existence of a generational problem: new data scientists are being trained, but there is no mechanism for current researchers to update their existing training. Irizarry and Bourne both agreed that ongoing professional development is needed and wanted across fields, and NIH is funding initiatives to support this development. He noted that many of the courses available online are advanced and could be used to fill this need. An example is a program that affords researchers an opportunity to take sabbaticals at highly analytical laboratories to learn techniques that can be applied back in their own laboratories. Leek noted that the NIH-funded course he helped develop is designed for current researchers. Bourne mentioned that there is also a need for research administrators to gain a better awareness of how things are changing and of the importance of this kind of work. Baru added that in his work overseeing NSF's multidirectorate big data program, he has seen that approaches for teaching data management to a geoscientist, for example, differ from those for teaching it to a biologist, a psychologist, or a molecular biologist. He has found that being able to tune curriculum to the audience is important.
A participant commented that the issue of selective inference is the number one problem that hampers statistical replicability. Irizarry agreed and noted that using statistics more appropriately improves results. Bourne also agreed but added that the interdisciplinary nature of scientific research is changing and, while that is not a statistical issue, issues regarding communities and collaboration among them need to be considered.
A participant wondered if there is any information about the backgrounds of the people who are taking the online data science courses, such as how many are nonstatistical domain scientists, since much of the material presented in data science master's programs is applicable and important for researchers who would not identify themselves as data scientists. Leek said some data exist through surveys of participants; these surveys indicate a broad community interest in data science, with programs drawing participants from business, economics, and other disciplines. Bourne added that the University of California, San Diego, held a data science workshop that attracted more faculty than any other program at the university. He said that data science is a catalyst to bring together people from diverse disciplines and foster collaborations.
A participant wondered how many years it would take for the community to fully understand reproducibility, especially as it relates to big data. Gatsonis noted that many statistical tools break down in the big data context, and researchers need to think in fresh ways about how to do these types of analyses with large data.
A participant commented that NIH's Gene Expression Omnibus28 has been a remarkable feat of data sharing: a majority of microarray experiments performed have been uploaded, and there is strong buy-in from authors and journals. He wondered if biomedical advances might be slowed by privacy concerns when working with sequencing data. Bourne commented that the privacy concerns of sharing data are being worked through and discussed. He said that recent policies begin to address some of this, but the issue needs further immediate attention.
A participant noted that the default in large genomic data sets is to apply multiple hypothesis testing corrections, which lead to extremely small p-value thresholds, and wondered whether that is a reasonable thing to do.
Leek responded that correcting for multiple testing is a good idea, particularly using measures such as the false discovery rate or other error rates, but there are tricky issues in higher dimensions involving dependence among tests, deciding when to do multiple hypothesis tests, p-value hacking, and selective inference. There are many ways to get things wrong even after correcting for multiple testing, but the correction is generally recommended.
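The false discovery rate control Leek mentions is commonly implemented with the Benjamini-Hochberg procedure; a minimal sketch follows, with the p-values invented purely for illustration.

```python
# Minimal sketch of the Benjamini-Hochberg false discovery rate procedure.
# The p-values below are hypothetical, chosen only for illustration.

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= (rank / m) * alpha:
            k_max = rank
    # Reject the k_max smallest p-values.
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print(rejected)  # → [0, 1]
```

Note that only the two smallest p-values survive the correction, even though five of them fall below the uncorrected 0.05 threshold; this assumes independent (or positively dependent) tests, which is exactly the kind of dependence issue Leek flags as tricky in higher dimensions.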