
Reproducibility and Replicability in Science (2019)

Chapter: 4 Reproducibility

Suggested Citation: "4 Reproducibility." National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. doi: 10.17226/25303.


4 REPRODUCIBILITY

As defined by the committee, reproducibility relates strictly to computational reproducibility—obtaining consistent results using the same input data, computational methods, and conditions of analysis (see Chapter 3). This chapter reviews the technical and procedural challenges of ensuring reproducibility and assesses the extent of non-reproducibility in scientific and engineering research. The committee also examines factors that may deter or limit reproducibility.

WIDESPREAD USE OF COMPUTATIONAL METHODS

Most scientific disciplines today use computation as a tool (Hey et al., 2009). For example, public health researchers "data mine" large databases looking for patterns, earth scientists run massive simulations of complex systems to learn about geological changes in our planet, and psychologists use advanced statistical analyses to uncover subtle effects in randomized controlled experiments. Many researchers use software at some point during their work, and some create their own software to advance their research (Nangia and Katz, 2017). Researchers can use computation as a tool to enable data acquisition (e.g., from instruments), data management (e.g., transforming or "cleaning," processing, curating, and archiving), analysis (e.g., modeling, simulation, data analysis, and data visualization), automation, and various other tasks. Computation can also be the object of study, with researchers using computing to design and test new algorithms and systems. However, the vast majority of researchers do not have formal training in software development (e.g., managing workflow processes such as maintaining code and using version control, or performing unit testing).

While the abundance of data and widespread use of computation have transformed most disciplines and have enabled important scientific discoveries, this revolution is not yet reflected in how scientific results aided by computation are reported, published, and shared. Most computational experiments or analyses are discussed informally in papers, results are briefly described in table and figure captions, and the code that produced the results is seldom available. Buckheit and Donoho (1995, p. 5) paraphrase Jon Claerbout as saying that "An article about computational science [. . .] is merely advertising of the scholarship. The actual scholarship is the complete software development environment, and the complete set of instructions which generated the figures."

The connection between reproducibility and transparency (open code and data) was made early by the pioneers of the reproducible-research movement. Claerbout and Karrenbach (1992) advocated merging research publications with the availability of the underlying computational analysis and using a public license that allows others to reuse, copy, and redistribute the software. Buckheit and Donoho (1995, p. 4) support similar ideals, stating that "reproducibility . . . requires having the complete software environment available in other laboratories and the full source code available for inspection, modification, and application under varied parameter settings." Later, Donoho et al. (2009, p. 8) explicitly defined reproducible computational research as that in which "all details of the computation—code and data—are made conveniently available to others." The Yale Law School Roundtable on Data and Code Sharing (2010) issued a statement urging more transparency in the computational sciences and offered concrete recommendations for reproducibility: assign a unique identifier to every version of the data and code, describe within each publication the computing environment used, use open licenses and nonproprietary formats, and publish under open-access conditions (or post preprints). Peng (2011, p. 1226) explains:

    . . . every computational experiment has, in theory, a detailed log of every action taken by the computer. Making these computer codes available to others provides a level of detail regarding the analysis that is greater than the analogous non-computational experimental descriptions printed in journals using a natural language.

Non-Public Data and Code

In many cases, sharing data and code along with submission of a manuscript to a journal is the responsibility of the researcher. However, the researcher may not be allowed to do so when data or code are not publicly releasable for licensing, privacy, or commercial reasons. For example, data or code may be proprietary, as is often the case with commercial data sets, and privacy laws (such as the Health Insurance Portability and Accountability Act, HIPAA) may restrict the sharing of personal information.[1] Non-public data are often managed by national organizations or commercial (private) entities. In each case, protecting data and code serves a reasonable goal, although one at odds with the aim of computational reproducibility. In some instances, access is allowed to researchers for both original research and reproducibility efforts (e.g., the U.S. Federal Statistical Research Data Centers or the German Research Data Center of the Institute for Employment Research); in other cases, prior agreements with data or code owners allow a researcher to share data and code with others for reproducibility efforts (Vilhuber, 2018).

Non-public databases such as those storing national statistics are of particular interest to economists. Access is granted through a set of protocols, but the data sets used in research may still not be shared with others. Creating a data set for research is a considerable task that requires, in the case of databases, developing queries and cleaning the data set prior to use. Even if a second researcher has access to the same non-public database and the query used by the original researcher, differences in data-cleaning decisions will result in a different final data set. Additionally, many of the large databases used by economists continuously add data, so queries submitted at different times result in different initial data sets. In this case, reproducibility is not possible, while replicability is (Vilhuber, 2018).

[1] Journals that require data to be shared generally allow some exceptions to the data-sharing rule. For example, the Public Library of Science (PLOS) publications allow researchers to exclude data that would violate participant privacy, but they will not publish research that is based solely on proprietary data that are not made available or on data that are withheld for personal reasons (e.g., future publication or patents).

Resources and Costs of Reproducibility

Newly developed tools allow researchers to more easily follow Peng's advice by capturing detailed logs of a researcher's keystrokes or changes to code (for more details on these tools, see Chapter 6). Studies that have been designed with computational reproducibility as a key component may take advantage of these tools and efficiently track and retain relevant computational details. For studies and longstanding collaborations that have not designed their processes around computational reproducibility, retrofitting existing processes to capture logs of computational decisions represents a resource choice between advancing current research and redesigning a potentially large and complex system. Such studies have often developed other methods for gaining confidence in the functioning of the system, for example, through verification and validation checks and internal reviews. While efforts to improve reporting and reproducibility in the computational sciences have expanded to the broader scientific community ("Error prone," 2012; Vandewalle et al., 2007; Cassey and Blackburn, 2006; Konkol et al., 2019), the costs and resources required to support computational reproducibility are not well established and may well be substantial. As new computational tools and data storage options become available, and as the cost of massive digital storage continues to decline, these developments will eventually make computational reproducibility more affordable, feasible, and routine.

FINDING 4-1: Most scientific and engineering research disciplines use computation as a tool. While the abundance of data and widespread use of computation have transformed many disciplines and have enabled important scientific discoveries, this revolution is not yet uniformly reflected in how scientists develop and use software and how scientific results are published and shared.

FINDING 4-2: When results are produced by complex computational processes using large volumes of data, the methods section of a traditional scientific paper is insufficient to convey the necessary information for others to reproduce the results.

RECOMMENDATION 4-1: To help ensure the reproducibility of computational results, researchers should convey clear, specific, and complete information about any computational methods and data products that support their published results in order to enable other researchers to repeat the analysis, unless such information is restricted by non-public data policies. That information should include the data, study methods, and computational environment:

● the input data used in the study, either in extension (e.g., a text file or a binary) or in intension (e.g., a script to generate the data), as well as intermediate results and output data for steps that are nondeterministic and cannot be reproduced in principle;

● a detailed description of the study methods (ideally in executable form), together with its computational steps and associated parameters; and

● information about the computational environment where the study was originally executed, such as operating system, hardware architecture, and library dependencies. (Library dependency, in the context of research software as used here, is the relationship among pieces of software in which one piece of software is needed for another to run. Problems often occur when installed software depends on specific versions of other software.)[2]

[2] The definition of "library dependency" was corrected during copy editing of the prepublication version of the report.
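As a small, hedged illustration of the third item in Recommendation 4-1, the Python sketch below records basic facts about the computational environment (operating system, hardware architecture, language version, and the versions of selected libraries) in a machine-readable file. The package names and the output filename are placeholders rather than a prescribed format; the tools discussed in Chapter 6 offer more complete approaches.

```python
import json
import platform
import sys
from importlib import metadata

def describe_environment(packages):
    """Collect basic facts about the computational environment.

    `packages` is an illustrative list of installed distribution names
    whose versions should be recorded.
    """
    env = {
        "operating_system": platform.platform(),      # e.g., "Linux-5.15...-x86_64-..."
        "hardware_architecture": platform.machine(),  # e.g., "x86_64"
        "python_version": sys.version,
        "library_versions": {},
    }
    for name in packages:
        try:
            env["library_versions"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            env["library_versions"][name] = "not installed"
    return env

if __name__ == "__main__":
    # "numpy" and "pandas" stand in for whatever libraries an analysis actually uses.
    record = describe_environment(["numpy", "pandas"])
    with open("environment.json", "w") as f:
        json.dump(record, f, indent=2)
```

A record like this does not by itself recreate the original environment, but shared alongside the other digital artifacts of a study it tells another researcher which operating system, architecture, and library versions the reported results depend on.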

ASSESSING REPRODUCIBILITY

When a second researcher attempts to computationally reproduce the results of another researcher's work, the attempt is considered successful if the two results are consistent. For computations, one may expect the two results to be identical (obtaining a bitwise identical numeric result). In most cases, this is a reasonable expectation, and the assessment of reproducibility is straightforward. However, there are legitimate reasons for reproduced results to differ while still being considered consistent.[3] In some research settings, it may make sense to relax the requirement of bitwise reproducibility and settle for reproducible results within an accepted range of variation (or uncertainty). This can only be decided, however, after fully understanding the numerical-analysis issues affecting the outcomes. Researchers applying high-performance algorithms thus recognize (as noted in Diethelm, 2012) that when different runs with the same input data produce slightly different numeric outputs, each of these results is equally credible, and the output must be understood as an approximation to the correct value within a certain accepted uncertainty. Sources of the uncertainty could be, for example, floating-point averaging in parallel processors (Box 4-1) or even cosmic rays interacting with processors within a supercomputer in climate change research (Box 4-2). In other research settings, there may be a need to reproduce the result extremely accurately, and researchers must tackle variability in computations by using higher-precision arithmetic or by redesigning the algorithms (Bailey et al., 2012).

[3] As briefly mentioned in Chapter 2, reproducibility does not ensure that the results themselves are correct. If there was a mistake in the source code and another researcher used the same code to rerun the analysis, the reproduced results would be consistent but still incorrect. However, the fact that the information was transparently shared would allow other researchers to examine the data, code, and analysis closely and possibly detect errors. For example, an attempt by an economic researcher to reproduce earlier results highlighted software errors in a statistics program used by many researchers in the field (McCullough and Vinod, 2003). Without a high level of transparency, it is difficult to know if and where a computational error may have occurred.

BOX 4-1
Parallel Processing and Numerical Precision

Although it may seem evident that running an analysis with identical inputs would result in identical outputs, this is sometimes not true. One condition under which computed results can vary between runs of the same computational analysis occurs when using computers that rely on parallel processors. Two factors are at play: the way that numbers are represented in a computer, and how individual processors cooperate in a multicore or distributed system.

Numbers are represented in a computer using floating-point representation, consisting of a number of significant digits scaled by an exponent in a fixed base. For example, the speed of light is 299,792,458 m/s; in normalized floating-point representation, this is 2.99792458 × 10^8 (in base 10). The number of significant digits gives the precision of the floating-point approximation. Nine digits are needed for the exact value of the speed of light, but computers store numbers with limited precision and will round this to 2.997924 × 10^8 when working with only seven digits of precision. If some calculation were to involve, say, adding a speed of 10 m/s to the speed of light, the rules of floating-point arithmetic mean that to add the numbers, the smaller one has to be shifted to the same exponent as the larger one, so 10 m/s is represented as 0.00000010 × 10^8, which with seven-digit precision gets rounded off to zero. Adding floating-point numbers of disparate scales can thus result in lost accuracy in the result.

Diethelm (2012) discusses the limits of reproducibility in high-performance (parallel) computing, given the approximate nature of floating-point arithmetic. When a large calculation (such as adding millions of numbers) is divided up so that many processors cooperate in obtaining the result in parallel, the order in which each processor finishes computing its partial sum cannot be guaranteed. Partial results get combined, and loss of accuracy may occur when the numbers involved have disparate scales (as described above). The final result will be different depending on the order in which the partial results are gathered together by the master process. (That is, in mathematical terms, floating-point addition is commutative but not associative.) It is possible to prevent this lack of numerical reproducibility, but doing so involves artificial synchronization points in the calculation, which degrades performance. When the research requires expensive simulations that run for many days on supercomputers, the focus of research teams is understandably on maximizing performance. Thus, there is a tension between computational performance and strict numerical reproducibility of the results in parallel computing.
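The effect described in Box 4-1 can be illustrated on a single machine. In the Python sketch below, the same numbers are summed in two different orders, standing in for the nondeterministic order in which parallel partial sums are combined: the two totals are generally not bitwise identical, yet both agree with the mathematically exact answer within a small relative tolerance. The values and the tolerance are chosen only for illustration.

```python
import math
import random

# The same 1,000,001 numbers: one large value and a million small ones,
# so that rounding error depends on the order of addition.
values = [1.0e16] + [1.0] * 1_000_000

# Order 1: left to right, as a single processor might sum them.
sum_sequential = 0.0
for v in values:
    sum_sequential += v

# Order 2: a shuffled order, standing in for the nondeterministic order
# in which parallel partial sums are combined.
shuffled = values[:]
random.Random(42).shuffle(shuffled)  # fixed seed, for the example only
sum_shuffled = 0.0
for v in shuffled:
    sum_shuffled += v

exact = 1.0e16 + 1_000_000  # the mathematically exact total (representable here)

print(sum_sequential == sum_shuffled)      # expected False: not bitwise identical
print(abs(sum_sequential - sum_shuffled))  # nonzero, though tiny relative to the total

# Both totals agree with the exact value within a relative tolerance,
# which is the sense in which they can still be judged "consistent."
print(math.isclose(sum_sequential, exact, rel_tol=1e-9))  # True
print(math.isclose(sum_shuffled, exact, rel_tol=1e-9))    # True
```

Choosing such a tolerance for a real study is exactly the kind of judgment described above: it requires understanding the numerical-analysis issues affecting the outcome, not applying a generic default.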
A computational result may also be in the form of confirming a hypothesis that entails a complex relationship among variables. Consider this example: on observing a marked seasonal migration of a species of butterfly between Europe and North Africa, researchers posed the hypothesis that the migratory strategy evolved to track the availability of host plants (for breeding) and nectar sources (Stefanescu et al., 2017). After collecting field data on plant abundance and butterfly populations, the researchers built statistical models to confirm a correlation in the temporal patterns of migration and plant abundance. The computational results were presented in the form of model parameter estimates, computed using statistical software and custom scripts. A consistent computational result, in this case, means obtaining the same model parameter estimates and measures of statistical significance within some degree of sampling variation.

Artificial intelligence and machine learning present unique new challenges to computational reproducibility, and as these fields continue to grow, the techniques and approaches for documenting and capturing the relevant parameters to enable reproducibility and confirmation of study results need to keep pace.

BOX 4-2
Reproducing Climate Model Results

For global climate models (GCMs), computational reproducibility refers to the ability to rerun a model with a given set of initial conditions and produce the same results. Such a result is achievable for short time spans and individual locations and is essential for model testing and software debugging, but the dominance of this definition as a paradigm in the field is giving way to a more statistical way of understanding model output. Historically, climate modelers believed that they needed the more rigid definition of bitwise reproduction because the nonlinear equations governing Earth systems are chaotic and sensitive to initial conditions. However, this numerical reproducibility is difficult to achieve with the computing arrays required by modern GCMs. There is also a long history of occurrences in the models that have caused random errors and have never been reproduced, such as possible cosmic ray strikes.[a] Other reported events in uncontrolled model runs may or may not have been the result of internal model variability or software problems (see, e.g., Hall and Stouffer, 2001; Rind et al., in press). Reproducing the conditions that cause these random events is difficult, and scientists' lack of understanding of their effects diminishes the utility of the model. Features of computer architecture that undermine the ability to achieve bitwise reproducibility include fused multiply-add (which cannot preserve the order of operations), memory details, and issues of parallelism when a calculation is divided across multiple processors (see Box 4-1, above). Moreover, the environment in which GCMs are run is fragile and ephemeral on the scale of months to years, as compilers, libraries, and operating systems are continually updated, such that revisiting a 10-year-old study would require an impractical museum of supercomputers. Retaining bitwise reproducibility will become even more difficult in the near future as machine-learning algorithms and neural networks are introduced. Therefore, scientists are also interested in representing stochasticity in the physical models by harnessing noise inherent within the electronics, and some current devices have mixed or variable bit precision.

Therefore, the focus of the discipline has not been on reproducibility of model runs, but rather on replication of the model phenomena that are observed and their magnitudes (Hansen et al., 1984).

[a] Cosmic ray strikes within computer hardware are another source of undetected error; by mapping errors in model output, researchers have been able to reconstruct the path of a particle as it passed through the memory of a supercomputer stack.

SOURCE: Adapted from Bush (2018, pp. 12-13).

FINDING 4-3: Computational reproducibility, within the range of thoughtfully assessed uncertainties, can be expected for research results given sufficient access to and description of data, code, and methods, with a few notable exceptions, such as complex processing techniques and the use of proprietary or personal information.

FINDING 4-4: Understanding the limits of computational reproducibility in increasingly complex computational systems, such as artificial intelligence, high-performance computing, and deep learning, is an active area of research.

RECOMMENDATION 4-2: The National Science Foundation should consider investing in research that explores the limits of computational reproducibility in instances in which bitwise reproducibility is not reasonable in order to ensure that the meaning of consistent computational results remains in step with the development of new computational hardware, tools, and methods.

THE EXTENT OF NON-REPRODUCIBILITY

The committee was asked to assess what is known and, if necessary, to identify areas that may need more information to ascertain the extent of non-reproducibility in scientific and engineering research. The committee examined current efforts to assess the extent of non-reproducibility within several fields, reviewed literature on the topic, and heard from expert panels during its public meetings. It also drew on the previous work of committee members and other experts in the reproducibility of research. A summary of the reproducibility studies assembled by the committee is shown in Table 4-1.

As noted earlier, transparency is a prerequisite for reproducibility. Transparency represents the extent to which researchers provide sufficient information to enable others to reproduce the results. A number of studies have examined the extent of the availability of computational information within particular fields or publications as an indirect measure of computational reproducibility. Most of the studies shown in Table 4-1 assess transparency and are thus indirect measures of computational reproducibility. Four of the studies listed in Table 4-1 report results of direct reproducibility (reruns of the available data and code): Dewald, Jacoby, Moraila, and Chang and Li. In the Dewald study, attempts were made to reproduce nine original research results over a 2-year effort; of the nine, four were unsuccessful. Jacoby described the standing contract of the American Journal of Political Science with a university to computationally reproduce every article prior to publication; he reported to the committee that each article requires approximately 8 hours to reproduce. In Moraila's effort, software could be built for fewer than half of the 231 studies examined, highlighting the challenges of reproducing computational environments. Chang and Li were able to reproduce the results of half of the 67 studies they examined.

TABLE 4-1 Examples of Reproducibility-Related Studies

Prinz et al. (2011)
  Field: Biology (oncology, women's health, cardiovascular health)
  Scope of study: Data from 67 projects within Bayer Healthcare
  Reported concerns: Published data were in line with in-house results for ~20 to 25 percent of the total projects.

Iqbal et al. (2016)
  Field: Biomedical
  Scope of study: An examination of 441 biomedical studies published between 2000 and 2014
  Reported concerns: Of 268 papers with empirical data, 267 did not include a link to a full study protocol, and none provided access to all of the raw data used in the study.

Stodden et al. (2018a)
  Field: Computational physics
  Scope of study: An examination of the availability of artifacts for 307 articles published in the Journal of Computational Physics
  Reported concerns: Over half (50.9 percent) of the articles were impossible to reproduce. About 6 percent of the articles (17) made artifacts available in the publication itself, and about 36 percent discussed the artifacts (e.g., mentioned code) in the article. Of the 298 authors who were emailed with a request for artifacts, 37 percent did not reply, 48 percent replied but did not provide any artifacts, and 15 percent supplied some artifacts.

Stodden et al. (2018b)
  Field: Cross-disciplinary, computation-based research
  Scope of study: A randomly selected sample of 204 computation-based articles published in Science, which has a data-sharing requirement for publication
  Reported concerns: Fewer than half of the articles provided data: 24 articles had data available, and an additional 65 provided some data when requested.

Chang and Li (2018)
  Field: Economics
  Scope of study: An effort to reproduce 67 economics papers from 13 different journals
  Reported concerns: Of the 67 articles, 50 percent reproduced.

Dewald (1986)
  Field: Economics
  Scope of study: A 2-year study that collected programs and data from authors of published empirical economic research articles
  Reported concerns: Data were available for 72-78 percent of the nine articles; 2 reproduced successfully, 3 were "near" successful, and 4 were unsuccessful.

Duvendack (2015)
  Field: Economics
  Scope of study: A progress report on the number of economics journals with data-sharing requirements
  Reported concerns: In 27 of 333 economics journals, more than 50 percent of the articles included the authors' sharing of data and code (an increase from 4 journals in 2003).

Jacoby (2017)
  Field: Political science
  Scope of study: A review of the results of a standing contract between the American Journal of Political Science and a university to reproduce all articles submitted to the journal
  Reported concerns: Of the first 116 articles, 8 were reproduced on the first attempt.

Gunderson (2018)
  Field: Artificial intelligence
  Scope of study: A review of challenges and lack of reproducibility in artificial intelligence
  Reported concerns: In a survey of 400 algorithms presented in papers at two top artificial intelligence conferences in the past few years, 6 percent of the presenters shared the algorithm's code, 30 percent shared the data they tested their algorithms on, and 54 percent shared "pseudocode," a limited summary of an algorithm.

Setti (2018)
  Field: Imaging
  Scope of study: A review of the published availability of data and code for articles in Transactions on Imaging for 2004
  Reported concerns: For the year covered, 9 percent reported available code, and 33 percent reported available data.

Moraila et al. (2014)
  Scope of study: An empirical study of reproducibility in computer-systems research conferences
  Reported concerns: The software could be built for less than half of the studies for which artifacts were available (108 of 231).

Read et al. (2015)
  Field: Data work funded by the National Institutes of Health (NIH)
  Scope of study: A preliminary estimate of the number and type of NIH-funded datasets, focused on those datasets that were "invisible" or not deposited in a known repository; studied articles published in 2011, cited in PubMed, and deposited in PubMed Central
  Reported concerns: 12 percent explicitly mention deposition of datasets in recognized repositories, leaving 88 percent (approximately 200,000 of 235,000) with invisible datasets; of the invisible datasets, approximately 87 percent consisted of data newly collected for the research reported, and 13 percent reflected reuse of existing data. More than 50 percent of the datasets were derived from live human or nonhuman animal subjects.

Byrne (2017)
  Scope of study: An assessment of the open data policy of PLOS ONE as of 2016 (noting that rates of data and code availability are increasing)
  Reported concerns: 20 percent of the articles have data or code in a repository; 60 percent have data in the main text or supplemental information; and 20 percent have restrictions on data access.

Notable in the studies listed above is the lack of a uniform standard for success or failure. Determinations of transparency have layers of success: for example, data or code may be downloadable, downloadable but not functional, or available only after a single request to the author. Similar gradations apply to reproducibility attempts, for example, the "near" successful results reported by Dewald.

FINDING 4-5: There are relatively few direct assessments of reproducibility (replaying the computations to obtain consistent results) in comparison to assessments of transparency (the availability of data and code). Direct assessments of computational reproducibility are more limited in breadth and often take much more time and resources than assessments of transparency.

CONCLUSION 4-1: Assessments of computational reproducibility take more than one form—indirect and direct—and the standards for success of each are not universal and not clear-cut. In addition, the evidence base of non-reproducibility of computations across science and engineering research is incomplete. These factors contribute to the committee's assessment that determining the extent of issues related to computational reproducibility across fields or within fields of science and engineering is a massive undertaking with a low probability of success. Rather, the committee's collection of reproducibility attempts across a variety of fields allows us to note that a number of systematic efforts to reproduce computational results have failed in more than half of the attempts made, mainly because of insufficient detail on digital artifacts, such as data, code, and computational workflow.

Expecting computational reproducibility is considered by some to be too low a bar for scientific research, yet the data in Table 4-1 show that many attempts to reproduce results initially fail. As noted by Peng (2016), "[Reproducibility] may initially sound like a trivial task but experience has shown that it's not always easy to achieve this seemingly minimal standard."

SOURCES OF NON-REPRODUCIBILITY

The findings and conclusion in the previous section raise a key question: What makes reproducibility so difficult to achieve? A number of factors can contribute to the lack of reproducibility in research. In addition to the lack of access to non-public data and code, mentioned previously, the contributors include:

● Inadequate record-keeping: the original researchers did not properly record the relevant digital artifacts, such as the protocols or steps followed to obtain the results, the details of the computational environment and software dependencies, and/or information on the archiving of all necessary data.

● Non-transparent reporting: the original researchers did not transparently report, provide open access to, or archive the relevant digital artifacts necessary for reproducibility.

● Obsolescence of the digital artifacts: over time, the digital artifacts in the research compendium are compromised because of technological breakdown and evolution or a lack of continued curation.

● Flawed attempts to reproduce others' research: the researchers who attempted to reproduce the work lacked expertise or failed to correctly follow the research protocols.

● Barriers in the culture of research: a lack of resources and incentives to adopt computationally reproducible and transparent research across fields and researchers.

The rest of this section explores each of these factors.

Inadequate Record-Keeping

The information that needs to be shared in order for research to be reproducible may vary depending on the type of research and the methods and tools used. The essential requirement, however, is that the original researcher provide the relevant information needed for another researcher to obtain a consistent result (also referred to as "the full compendium of artifacts"). In order to transparently report and share the full compendium of artifacts required for reproducibility, a researcher must first take care to adequately record a detailed provenance of all the research results. Provenance refers to information about how a result was produced; it includes how, when, and by whom any data were collected; what steps were followed to transform, curate, or clean the data; and what software (and which version) was used to analyze it (Davidson and Freire, 2008). In general, the computational details that need to be captured and shared for reproducible research include the data, code, parameters, computational environment, and computational workflow:

● the data that were used in the analysis,[4] formatted appropriately for the research question and complemented with standard or sufficient metadata;

● written statements in a programming language (i.e., the source code of the software used in the analysis or to generate data products), including models, data-processing scripts, and software notebooks;

● numeric values of all configurable settings for software, instruments, or other hardware (that is, the parameters) for each individual experiment or run;

● a detailed specification of the computational environment, including system software and hardware requirements and the version number of each piece of software used; and

● the computational workflow: the collection of data-processing scripts, statistical model specifications, secondary data, and code that generated tables and figures in final published form, that is, how the software applications are configured and how the data flow between them.

Meticulous and complete record-keeping is increasingly challenging and potentially time consuming as scientific workflows involve ever more intricate combinations of digital and physical artifacts and entail complex computational processes that combine a multitude of tools and libraries.[5] Satisfying all of these challenging conditions for transparent computation requires that researchers be highly motivated to ensure reproducibility. If will and incentives are lacking, it is easier for researchers to forgo creating the conditions for reproducibility, as suggested by the results of the reproducibility studies shown in Table 4-1. Manually keeping track of every decision in the process in order to include the details in a scientific paper is time consuming and potentially error prone. Tools are available, and more are being developed, to automatically capture relevant details in these complex environments (see Chapter 6). A minimal example of what such a record can contain is sketched below.

[4] Final data sets used in analysis are the result of data collection and data culling (or cleaning). Decisions related to each step must be captured.

[5] For example, consider a scientific workflow that involves processing an image captured by an instrument, where the final presentation of the image enables the researcher to glean understanding from the data. If the researcher used image-processing software through a graphical user interface (GUI), that is, by clicking and dragging graphical elements on the computer screen, it might be impossible for another researcher to subsequently reproduce the resulting image. For this reason, reproducibility advocates find fault with any interactive programs "unless they include the ability to arrive in any previous state by means of a script" (Fomel and Claerbout, 2009, p. 6). Some observers go as far as saying that "two technologies are enemies of reproducible research: GUI-based image manipulation, and spreadsheets" (Barba et al., 2017). The use of spreadsheet software impairs reproducibility because spreadsheets conflate input, output, code, and presentation (Stark, 2016). Spreadsheets inhibit one's ability to make a record of all steps taken to construct a full analysis of the data, and they are notoriously hard to debug. Hettrick (2017) describes the difficulties faced when trying to reproduce an analysis originally conducted on spreadsheet software, and he concluded: "it's almost impossible to reconstruct the logic behind spreadsheet-based analysis."
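One minimal, hypothetical form such a record can take is sketched in Python below: for a single analysis run, it stores a checksum of the input data file, the parameter values, the random seed, and a timestamp in a small JSON file. The file names, parameter names, and record format are illustrative assumptions, not a standard; the provenance-capture tools discussed in Chapter 6 are far more complete.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path):
    """Checksum of an input file, so the exact data used can be verified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_provenance(data_path, parameters, seed, out_path="provenance.json"):
    """Record how a result was produced: which data, which settings, which seed, and when."""
    record = {
        "input_data": {"path": data_path, "sha256": sha256_of(data_path)},
        "parameters": parameters,        # configurable settings for this run
        "random_seed": seed,             # needed to rerun stochastic steps
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return record

if __name__ == "__main__":
    # Create a tiny placeholder data file so the example runs end to end;
    # "survey.csv" and the parameter values below are hypothetical.
    with open("survey.csv", "w") as f:
        f.write("id,response\n1,42\n")
    write_provenance("survey.csv", {"model": "logistic", "alpha": 0.05}, seed=2019)
```

Writing such a record at the end of every run costs little, and it gives a later reader (including the original researcher) the "how, when, and with what settings" information that provenance is meant to capture.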

Non-Transparent Reporting

A second barrier to computational reproducibility is the lack of sharing, or insufficient sharing, of the full compendium of artifacts necessary to rerun the analysis, including the data used,[6] the source code, information about the computational environment, and other digital artifacts. This information may not be reported for a number of reasons. First, a researcher may be unaware of a norm to share the information, or unaware of the details necessary to ensure reproducibility (as detailed above). Second, a researcher could be unwilling to share in order to ensure priority in patenting or publishing, or because he or she does not see any benefit to sharing. Third, a researcher might lack the ability to share because of limited infrastructure (i.e., tools to capture the provenance or a repository to store the data or code), non-public restrictions (see Non-Public Data and Code, above), or because the compendium of artifacts is too large. For example, the sharing policies of Science offer ideas for where to share data, but they do not "suggest specific repositories or give instructions for hosting and sharing code and computational methods," and there "is no consensus regarding repositories, metadata, or computational provenance" (Stodden et al., 2018b, p. 2584).

[6] Data quality issues also add to the complexity of identifying problems in a computational pipeline. According to J. Freire (New York University, personal communication): Because people now must manage (ingest, clean, integrate, analyze) vast amounts of data, and data come from multiple sources with different levels of reliability, it is often not practical to curate the data. To extract actionable insight from data, complex computational processes are required. They are hard to assemble, and, once deployed, they can break in unforeseen ways (e.g., due to a library upgrade or a small change in the simulation code). If you have an analysis consisting of many steps, there are many ways that you could be wrong and that data could be wrong.

Obsolescence of Digital Artifacts

The ability to reproduce published results can decline over time because digital artifacts can become unusable, inoperative, or unavailable as a result of technological breakdown and evolution or poor curation. This means that even if the original researcher properly recorded all of the relevant information and transparently reported it, and researchers with the relevant expertise and resources are available, reproduction attempts could still fail. Research software exists in an ecosystem of scientific libraries, system tools, and compilers. All of these are dynamic, receiving updates to improve security, fix bugs, or add features; some are no longer maintained and fail to operate with other software as the system evolves through upgrades. In the process of adding new features, a library could change how it interfaces with other software, making other code that depends on it unusable unless updated. Researchers often refer to this as "code rot." Potential solutions through archival systems have been proposed (see Chapter 6).
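One partial defense against this kind of drift, sketched below under the assumption that an environment record like the earlier environment.json example (with a "library_versions" mapping) has been kept, is to compare the currently installed library versions against the recorded ones before rerunning an analysis. The file name and behavior are illustrative; pinned environments and the archival approaches discussed in Chapter 6 are more robust solutions.

```python
import json
import warnings
from importlib import metadata

def check_recorded_versions(record_path="environment.json"):
    """Warn when installed library versions differ from those recorded with a result.

    Assumes `record_path` points to a JSON file with a "library_versions"
    mapping of package name to version, as in the earlier environment sketch.
    """
    try:
        with open(record_path) as f:
            recorded = json.load(f)["library_versions"]
    except FileNotFoundError:
        warnings.warn(f"no recorded environment found at {record_path}")
        return {}
    mismatches = {}
    for name, expected in recorded.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != expected:
            mismatches[name] = (expected, installed)
            warnings.warn(f"{name}: recorded {expected}, found {installed}; "
                          "results may not reproduce bitwise")
    return mismatches

if __name__ == "__main__":
    check_recorded_versions()
```

A mismatch does not necessarily mean the results will differ, but it flags exactly the kind of silent upgrade that leads to "code rot."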

Flawed Attempts to Reproduce Others' Research

Just as researchers conducting original studies may make mistakes or have insufficient expertise to conduct the experiments or analysis properly, a researcher attempting to reproduce a result may also make mistakes or fail to follow the original protocols. Even when the original study qualifies as reproducible research, because all of the relevant protocols were automated and the digital artifacts are available such that the work is capable of being checked, another researcher without proper training and capabilities may be unable to use those artifacts.

Barriers in the Culture of Research

While interest in open science practices is growing, and many stakeholders have adopted policies or created tools to facilitate transparent sharing, the research enterprise as a whole has not adopted sharing and transparency as near-universal norms and expectations for reproducibility (National Academies of Sciences, Engineering, and Medicine, 2018). As shown in Table 4-1, low levels of transparency are common. Currently, sharing and transparency are generally not rewarded in academic tenure and promotion systems, while the perception or reality that greater openness requires significant effort, along with apprehension about being scrutinized or "scooped," remains. In some disciplines and research groups, data are seen as resources that must be closely held, and it is widely believed that researchers best advance their careers by generating as many publications as possible from data before the data are shared. Shifting rewards and incentives will require thoughtful changes on the part of research institutions, working with funders and publishers (see Chapter 6).

