The essence of the scientific method is the drawing of sound conclusions from carefully collected evidence. Scientific progress occurs when questions that expand the boundaries of knowledge are explored with the best available experiments, assessments, and analytic tools, but key to progress is what comes next. In assessing the merits of new work, the scientific community must consider the reliability of the evidence and whether the research approach was suitable to the inquiry. Scientists ask such questions as: Could these results be reproduced by remeasuring the same kind of evidence a second time, by broadening the evidence considered, or by expanding and triangulating the sources of evidence? Was the relevant evidence collected and interpreted in an objective manner, or was the work steered toward a desired conclusion? How can biased interpretations be avoided? Can the research be reproduced? Can the results be generalized to other people or contexts?
Within scientific fields, particularly branches of psychology, the life sciences, and computer science, there has been a growing recognition that many studies cannot be replicated, an issue that has attracted significant attention and is referred to as the replicability crisis. Advances in computational power, the collection of vast datasets, and the proliferation of statistical methods have raised additional challenges concerning the reproducibility, validity, and generalizability of results.
The National Academies recently conducted a congressionally mandated study of the issues and practices of reproducibility and replication across scientific and engineering research domains (National Academies of Sciences, Engineering, and Medicine, 2019).1 Major journals and agencies have published editorials and commentaries on the issues (Baker, 2016a; Collins and Tabak, 2014; McNutt, 2014a, 2014b). Ongoing discussion of reproducibility in social and behavioral sciences (SBS) fields, in allied fields that use big data and computation, and in biometrics and behavior metrics is likely to continue. In this appendix, we review the main components of reproducibility and describe three ideas that have been suggested for ensuring that research is reproducible.
COMPONENTS OF REPRODUCIBILITY
When researchers report the results of an experimental or computational study, another researcher or laboratory will ideally be able to carry out the same or a similar experiment and analysis, and derive similar findings and conclusions. But there are irreducible sources of randomness or variation in any experiment or observational data sample. Since subjects, organisms, or digital traces are samples from a larger group, each observational dataset is a sample that may differ from other samples. The result is unavoidable statistical and sampling variation, especially for designs with smaller datasets. There is also inevitable or intentional randomness in experimental protocols. The problem may be further exacerbated by “publication bias”—the implicit or explicit policy of journals to publish statistically significant findings, favoring them over studies that fail to find an effect. Rarely, such failures of reproducibility result from intentional selection of evidence.
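The interaction between sampling variation and publication bias described above can be made concrete with a toy simulation. The effect size, sample size, and significance threshold below are hypothetical choices for illustration, not values drawn from any study cited here:

```python
import random
import statistics

def simulate_publication_bias(true_effect=0.2, n=20, studies=2000,
                              threshold=2.0, seed=42):
    """Run many small simulated studies of the same true effect and
    'publish' only those whose t-like statistic clears a significance
    threshold (a toy model of journal publication bias)."""
    rng = random.Random(seed)
    all_means, published = [], []
    for _ in range(studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        all_means.append(mean)
        if mean / se > threshold:  # only "significant" studies survive
            published.append(mean)
    return statistics.fmean(all_means), statistics.fmean(published)

overall, published = simulate_publication_bias()
# The average effect in the published subset overstates the true effect,
# even though every individual simulated study was run honestly.
```

Even in this idealized setting, the published literature exaggerates the effect simply because nonsignificant studies never appear, which is why reproducibility discussions treat publication bias as a structural problem rather than individual misconduct.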
These issues of reproducibility occur in all domains of science: examples have been seen with topics as disparate as measuring and calculating physical properties (Lejaeghere et al., 2016); growing the same cell lines in different laboratories (Hines et al., 2014); impacts of housing practices on the physiology and behavior of animal models (Voelkl et al., 2018; Vogt et al., 2016); biometric measurement devices (Hamill et al., 2009); and the code used to generate research products or clean or filter data (Baker, 2016b; Stodden, 2015; Stodden et al., 2013, 2014). Yet debates about how best to address these issues are especially vigorous in the SBS, where there have been high-profile research efforts aimed at replicating a sample of experiments and a systematic focus on statistical analysis of data (Open Science Collaboration, 2015; Pashler and Wagenmakers, 2012).
1 Information about workshops conducted as part of this project is also available. Topics included reproducibility in federal statistics (see https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_070786 [October 2018]); research with animals and animal models (see NASEM, 2015); and statistical challenges in assessing and fostering reproducibility in scientific results (NASEM, 2016).
The language used to describe reproducibility is itself in flux. Scholars have defined components of reproducibility from different perspectives. Those focusing on the design of the experiment have identified the reproducibility of methods, results, and inferences as primary targets (Goodman et al., 2016). Those focusing on the computation carried out as part of the work identify empirical, statistical, and computational reproducibility as key (Stodden, 2017). And some consider the robustness or generalizability of findings with respect to variations in procedures, stimuli, or sampling and the validity of research approaches for the domain of application of equal or perhaps greater importance (see Table C-1).
TABLE C-1 Key Components of Reproducibility

| Component | Description |
| --- | --- |
| Reproducibility of methods (“transparency”) | Providing sufficient detail to enable the same procedures to be repeated by others |
| Reproducibility of results (“replication”) | An independent study with procedures or methods matched as closely as possible yields the same results, subject to statistical variation in samples |
| Inferential reproducibility | Equivalent inferences drawn from the same data with independently conceived analyses |
| Empirical reproducibility | Replication enabled by providing details of data methods and data collection, and by making the data themselves available |
| Statistical reproducibility | Specifying the models and statistical tests used, and their parameters, to enable independent replication |
| Computational reproducibility | Making the code, software, hardware, and implementation details used to conduct the original research freely available |
| Surface validity | Experimental procedures, results, and inferences are appropriate for the domain of application |
| Model validation | Methods for evaluating the match between the results of a model and data from the real-world system |
| Robustness or generalizability | Similar results and inferences obtain even with some variation in procedures or samples |

Validation refers to matching the design of the research to the inquiry, whether the approach entails collecting appropriate and applicable observational or experimental data or developing computational models or methods. To obtain results that support valid interpretations, a researcher must carefully consider the theory and assumptions that drive the experimental design or model for collecting evidence that can provide information about the questions to be answered or hypotheses to be tested. The determination of validity is complex, perhaps more so in SBS fields because human systems are particularly subject to variation and change. Methods for validation generally have focused on whether the same conclusions would have followed from a different subsampling of data.
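The subsampling idea can be sketched as a simple stability check. This is an illustrative heuristic with made-up data, not a formal validation procedure:

```python
import random
import statistics

def subsample_stability(data, frac=0.8, draws=1000, seed=0):
    """Recompute the mean on many random subsamples and report how often
    it keeps the same sign as the full-data mean: a rough check on
    whether a directional conclusion survives resampling."""
    rng = random.Random(seed)
    full = statistics.fmean(data)
    k = max(2, int(frac * len(data)))
    agree = sum(
        (statistics.fmean(rng.sample(data, k)) > 0) == (full > 0)
        for _ in range(draws)
    )
    return full, agree / draws

rng = random.Random(1)
data = [rng.gauss(0.5, 1.0) for _ in range(40)]  # hypothetical measurements
full_mean, stability = subsample_stability(data)
# A stability value near 1.0 suggests the sign of the effect is not an
# artifact of which subjects happened to be sampled.
```

More formal variants of this idea include cross-validation and the bootstrap; the point here is only that a conclusion worth trusting should not hinge on a particular draw of the sample.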
Validation techniques used for mechanical and physical systems often test a model against historical data. Some classical treatments of verification and validation methods in computer science and engineering can be found in Law and Kelton (1991) and Babuska and Oden (2004). These techniques often assume a constant system, whereas social situations may be more complex and changeable (e.g., new leadership, new technologies), raising questions about whether models based on historical data will be appropriate for forecasting future outcomes. One approach might be to generate several different models that triangulate a target domain, followed by a narrative description of the possibilities, which may be more persuasive to the end consumer.
The proper approach to assessing the validity and generalizability of research for more complex situations remains an open question, and different approaches may prove appropriate for distinct scientific domains or problems. Over the next decade, evolving policies on reproducibility in science are likely to produce nuanced standards adapted to individual fields and to the nature of the research questions.
The issues of reproducibility, generalizability, and validity will need to be addressed in appropriate and potentially different ways in undertaking the varied opportunities presented in this report. The recommendations of the National Academies committee and discussions of standards and practices in specific field domains should support future research practices and implementations.
Certain fields have embarked on systematic efforts to improve replication of experiments, while others have argued that replication may slow science by diverting resources (Bissell, 2013). Some are advocates of preregistration of studies, whereby researchers submit their research rationale, hypotheses, design, and analytic strategy to a scientific journal for peer review prior to beginning the study (Munafò et al., 2017; Wagenmakers et al., 2012). Preregistration would allow studies to be rejected, or revised and resubmitted, before data collection begins. Proponents of preregistration argue that the process will result in improved use of theory and stronger research methods, and ultimately better studies, as well as a decrease in false-positive publications (Chambers, 2014). Others (e.g., Scott, 2013) have noted that preregistration may reduce support for exploratory research. The process does not allow the research results to serve as an indicator of the value of a study, potentially leaving editors and reviewers to rely on the prestige of the researchers and on accepted methods, an outcome that would disproportionately affect graduate students and early-career researchers.
Another approach is to build heterogeneity explicitly into research programs, and proposals to this end have been developed in a number of domains. These proposals focus on heterogeneity of materials, subjects, or laboratories. For example, in the context of research on animal models in medical and physiological discovery, advocates of heterogeneity have proposed introducing explicit variation into subject samples, housing variables, and other factors (Vogt et al., 2016). Similar ideas have been suggested for work involving tracking of human movement and behavior (Pantic et al., 2007; Pentland and Liu, 1999). Cognitive scientists have proposed “meta-studies” for robust tests of theory (Baribault et al., 2017), coupled with Bayesian analysis (Etz and Vandekerckhove, 2016).2 These meta-studies purposely include variations in potentially important experimental variables to test theories or applications and increase the likelihood that conclusions will be generalizable within the relevant domain, supporting robust conclusions. Notably, this practice of introducing purposeful yet randomized variations may work successfully in some research domains but not others.
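One way to operationalize such purposeful yet randomized variation is to draw micro-experiments from a grid of design factors, in the spirit of the meta-studies cited above. The factor names and levels below are hypothetical, chosen only to show the mechanics:

```python
import itertools
import random

def sample_micro_experiments(factors, k, seed=0):
    """Draw k distinct randomized combinations of design factors, so that
    potentially important variables vary across micro-experiments
    rather than being held fixed at a single arbitrary setting."""
    rng = random.Random(seed)
    grid = list(itertools.product(*factors.values()))
    names = list(factors)
    return [dict(zip(names, combo)) for combo in rng.sample(grid, k)]

# Hypothetical design factors for a perception experiment.
factors = {
    "stimulus_duration_ms": [50, 100, 200],
    "response_key": ["left", "right"],
    "background": ["light", "dark"],
}
designs = sample_micro_experiments(factors, k=5)
# Each entry is one micro-experiment's randomized configuration.
```

A conclusion that holds across such deliberately varied configurations is more likely to generalize than one demonstrated under a single fixed protocol.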
For large-scale datasets and the associated data mining computations used to understand patterns in those data, the issues of reproducibility arise in the context of statistical reproducibility and computational reproducibility. One initiative in these domains has focused on a broad call for open publication of statistical and computational code so that others can repeat and test the products (Baker, 2016b; Stodden, 2015; Stodden et al., 2013, 2014). While the call for publication of databases, code, and statistical products would raise privacy and ethical considerations, the principle of independent validation and targeted replication tests could be prioritized, at least for a subset of study conditions.
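At a minimum, the computational reproducibility called for here requires fixing sources of randomness and recording provenance alongside results. A minimal sketch, with a toy analysis standing in for a real pipeline:

```python
import hashlib
import json
import platform
import random

def run_analysis(seed=12345):
    """Run a toy analysis with an explicit seed and return the result
    together with a provenance record that an independent researcher
    could compare against a re-run."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    result = sum(data) / len(data)
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "data_sha256": hashlib.sha256(json.dumps(data).encode()).hexdigest(),
        "result": result,
    }

first = run_analysis()
second = run_analysis()
# With the seed fixed, both runs produce identical data hashes and results.
```

Publishing such a record alongside the code lets others verify a replication byte for byte, while keeping raw data private when ethics or privacy require it: only the hash need be shared.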
2 Bayesian analysis addresses research questions about unknown parameters using probability statements. Bayesian approaches allow researchers to incorporate background knowledge into their analyses, taking into account the issues of reproducibility and replication. Because the statistical model incorporates background knowledge, new data can be evaluated against the plausibility of previous research findings (van de Schoot et al., 2014).
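The Bayesian updating described in footnote 2 can be illustrated with a conjugate Beta-Binomial example; the prior weights and replication counts below are hypothetical numbers chosen only to show the mechanics:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior combined with
    binomial data yields a Beta posterior via simple count updates."""
    return alpha + successes, beta + failures

# Prior encoding earlier findings: roughly a 30% replication rate,
# weighted as 10 prior observations (hypothetical).
a0, b0 = 3, 7
# New data: 5 successful replications out of 20 attempts (hypothetical).
a1, b1 = beta_binomial_update(a0, b0, successes=5, failures=15)
posterior_mean = a1 / (a1 + b1)  # (3 + 5) / (10 + 20) = 8/30
```

The posterior mean sits between the prior expectation and the new data's observed rate, which is precisely the sense in which new results are evaluated against the plausibility of earlier findings.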
REFERENCES

Babuska, I.M., and Oden, J.T. (2004). Verification and validation in computational engineering and science: Basic concepts. Computer Methods in Applied Mechanics and Engineering, 193(36–38), 4057–4066.
Baker, M. (2016a). Reproducibility crisis? Nature, 533(26).
Baker, M. (2016b). Why scientists must share their research code. Nature News, September 13. Available: https://www.nature.com/news/why-scientists-must-share-their-research-code-1.20504 [December 2018].
Baribault, B., Donkin, C., Little, D.R., Trueblood, J.S., Oravecz, Z., van Ravenzwaaij, D., White, C.N., De Boeck, P., and Vandekerckhove, J. (2017). Meta-studies for robust tests of theory. Proceedings of the National Academy of Sciences of the United States of America, 115(11), 2607–2612.
Bissell, M. (2013). Reproducibility: The risks of the replication drive. Nature News, 503(7476), 333–334. Available: https://www.nature.com/news/reproducibility-the-risks-of-the-replication-drive-1.14184 [December 2018].
Chambers, C. (2014). Psychology’s “registration revolution.” The Guardian, May 20. Available: http://www.theguardian.com/science/head-quarters/2014/may/20/psychology-registration-revolution [November 2018].
Collins, F.S., and Tabak, L.A. (2014). NIH plans to enhance reproducibility. Nature, 505(7485), 612–613.
Etz, A., and Vandekerckhove, J.A. (2016). Bayesian perspective on the reproducibility project: Psychology. PLoS One, 11(2), e0149794. doi:10.1371/journal.pone.0149794.
Goodman, S.N., Fanelli, D., and Ioannidis, J.P. (2016). What does research reproducibility mean? Science Translational Medicine, 8(341). doi:10.1126/scitranslmed.aaf5027.
Hamill, N., Romero, R., Hassan, S.S., Lee, W., Myers, S.A., Mittal, P., Kusanovic, J.P., Chaiworapongsa, T., Vaisbuch, E., Espinoza, J., Gotsch, F., Carletti, A., Gonçalves, L.F., and Yeo, L. (2009). Repeatability and reproducibility of fetal cardiac ventricular volume calculations using spatiotemporal image correlation and virtual organ computer-aided analysis. Journal of Ultrasound in Medicine, 28(10), 1301–1311.
Hines, W.C., Su, Y., Kuhn, I., Polyak, K., and Bissell, M.J. (2014). Sorting out the FACS: A devil in the details. Cell Reports, 6(5), 779–781.
Law, A.M., and Kelton, W.D. (1991). Simulation Modeling and Analysis (2nd ed.). New York: McGraw-Hill.
Lejaeghere, K., Bihlmayer, G., Björkman, T., Blaha, P., Blügel, S., Blum, V., Caliste, D., Castelli, I.E., Clark, S.J., Dal Corso, A., de Gironcoli, S., Deutsch, T., Dewhurst, J.K., Di Marco, I., Draxl, C., Dułak, M., Eriksson, O., Flores-Livas, J.A., Garrity, K.F., Genovese, L., Giannozzi, P., Giantomassi, M., Goedecker, S., Gonze, X., Grånäs, O., Gross, E.K., Gulans, A., Gygi, F., Hamann, D.R., Hasnip, P.J., Holzwarth, N.A., Iuşan, D., Jochym, D.B., Jollet, F., Jones, D., Kresse, G., Koepernik, K., Küçükbenli, E., Kvashnin, Y.O., Locht, I.L., Lubeck, S., Marsman, M., Marzari, N., Nitzsche, U., Nordström, L., Ozaki, T., Paulatto, L., Pickard, C.J., Poelmans, W., Probert, M.I., Refson, K., Richter, M., Rignanese, G.M., Saha, S., Scheffler, M., Schlipf, M., Schwarz, K., Sharma, S., Tavazza, F., Thunström, P., Tkatchenko, A., Torrent, M., Vanderbilt, D., van Setten, M.J., Van Speybroeck, V., Wills, J.M., Yates, J.R., Zhang, G.X., and Cottenier, S. (2016). Reproducibility in density functional theory calculations of solids. Science, 351(6280), aad3000. doi:10.1126/science.aad3000.
McNutt, M. (2014a). Journals unite for reproducibility. Science, 346(6210), 679. doi:10.1126/science.aaa1724.
McNutt, M. (2014b). Reproducibility. Science, 343(6168), 229. doi:10.1126/science.1250475.
Munafò, M.R., Nosek, B.A., Bishop, D.V., Button, K.S., Chambers, C.D., du Sert, N.P., Simonsohn, U., Wagenmakers, E.-J., Ware, J.J., and Ioannidis, J.P.A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. Available: https://www.nature.com/articles/s41562-016-0021 [December 2018].
National Academies of Sciences, Engineering, and Medicine (NASEM). (2015). Reproducibility Issues in Research with Animals and Animal Models: Workshop in Brief. Washington, DC: The National Academies Press.
NASEM. (2016). Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results: Summary of a Workshop. Washington, DC: The National Academies Press.
NASEM. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi:10.1126/science.aac4716.
Pantic, M., Pentland, A., Nijholt, A., and Huang, T.S. (2007). Human computing and machine understanding of human behavior: A survey. In Artificial Intelligence for Human Computing (pp. 47–71). Berlin/Heidelberg, Germany: Springer-Verlag.
Pashler, H., and Wagenmakers, E.J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.
Pentland, A., and Liu, A. (1999). Modeling and prediction of human behavior. Neural Computation, 11(1), 229–242.
Scott, S. (2013). Pre-registration would put science in chains. Times Higher Education, July 25. Available: http://www.timeshighereducation.co.uk/comment/opinion/pre-registration-would-put-science-in-chains/2005954.article [December 2018].
Stodden, V. (2015). Reproducing statistical results. Annual Review of Statistics and Its Application, 2, 1–19. doi:10.1146/annurev-statistics-010814-020127.
Stodden, V. (2017). Framing the Issues: Reproducibility in Many Forms. Available: https://web.stanford.edu/~vcs/talks/NASSacklerMar82017-STODDEN.pdf [December 2018].
Stodden, V., Guo, P., and Ma, Z. (2013). Toward reproducible computational research: An empirical analysis of data and code policy adoption by journals. PLoS One, 8(6), e67111. doi:10.1371/journal.pone.0067111.
Stodden, V., Leisch, F., and Peng, R.D. (2014). Implementing Reproducible Research. Boca Raton, FL: CRC Press.
van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J.B., Neyer, F.J., and van Aken, M.A. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3), 842–860.
Voelkl, B., Vogt, L., Sena, E., and Würbel, H. (2018). Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biology, 16(2), e2003693. doi:10.1371/journal.pbio.2003693.
Vogt, L., Reichlin, T.S., Nathues, C., and Würbel, H. (2016). Authorization of animal experiments is based on confidence rather than evidence of scientific rigor. PLoS Biology, 14(12), e2000598. doi:10.1371/journal.pbio.2000598.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H.L., and Kievit, R.A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.