National Academies Press: OpenBook

Reproducibility and Replicability in Science (2019)

Chapter: 5 Replicability

Suggested Citation:"5 Replicability." National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. doi: 10.17226/25303.


Prepublication copy, uncorrected proofs.

5 REPLICABILITY

Replicability is a subtle and nuanced topic, especially when discussed broadly across scientific and engineering research. An attempt by a second researcher to replicate a previous study is an effort to determine whether applying the same methods to the same scientific question produces similar results. Beginning with an examination of methods to assess replicability, in this chapter we discuss evidence that bears on the extent of non-replicability in scientific and engineering research and examine factors that affect replicability.

Replication is one of the key ways scientists build confidence in the scientific merit of results. When the result from one study is found to be consistent by another study, it is more likely to represent a reliable claim to new knowledge. As Popper (2005, pp. 23-24) wrote (using "reproducibility" in its generic sense):

    We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated coincidence, but with events which, on account of their regularity and reproducibility, are in principle intersubjectively testable.

However, a successful replication does not guarantee that the original scientific results of a study were correct, nor does a single failed replication conclusively refute the original claims. A failure to replicate previous results can be due to any number of factors, including the discovery of an unknown effect, inherent variability in the system, inability to control complex variables, substandard research practices, and, quite simply, chance.
The nature of the problem under study and the prior likelihoods of possible results in the study, the type of measurement instruments and research design selected, and the novelty of the area of study (and therefore the lack of established methods of inquiry) can also contribute to non-replicability. Because of the complicated relationship between replicability and its variety of sources, the validity of scientific results should be considered in the context of an entire body of evidence, rather than an individual study or an individual replication. Moreover, replication may be a matter of degree, rather than a binary result of "success" or "failure."1 We explain in Chapter 7 how research synthesis, especially meta-analysis, can be used to evaluate the evidence on a given question.

ASSESSING REPLICABILITY

How does one determine the extent to which a replication attempt has been successful? When researchers investigate the same scientific question using the same methods and similar tools, the results are unlikely to be identical, unlike in computational reproducibility, in which bitwise agreement between two results can be expected (see Chapter 4). We repeat our definition of replicability, with emphasis added: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Determining consistency between two different results or inferences can be approached in a number of ways (Simonsohn, 2015; Verhagen and Wagenmakers, 2014). Even if one considers only quantitative criteria for determining whether two results qualify as consistent, there is variability across disciplines (Zwaan et al., 2018; Hanisch and Plant, 2018). The Royal Netherlands Academy of Arts and Sciences (2018, p. 20) concluded: ". . . it is impossible to identify a single, universal approach to determining [replicability]." As noted in Chapter 2, different scientific disciplines are distinguished in part by the types of tools, methods, and techniques used to answer questions specific to the discipline, and these differences include how replicability is assessed.

Acknowledging the different approaches to assessing replicability across scientific disciplines, we nonetheless emphasize several core characteristics and principles:

1. Attempts at replication of previous results are conducted following the methods and using similar equipment and analyses as described in the original study, or under sufficiently similar conditions (Cova et al., 2018).2 Yet regardless of how similar the replication study is, no second event can exactly repeat a previous event.

2. The concept of replication between two results is inseparable from uncertainty, as is also the case for reproducibility (as discussed in Chapter 4).

3. Any determination of replication (between two results) needs to take account of both proximity (the closeness of one result to the other, such as the closeness of the mean values) and uncertainty (variability in the measures of the results).

4. To assess replicability, one must first specify exactly what attribute of a previous result is of interest. For example, is only the direction of a possible effect of interest? Is the magnitude of the effect of interest? Is surpassing a specified threshold of magnitude of interest? With the attribute of interest specified, one can then ask whether two results fall within or outside the bounds of "proximity-uncertainty" that would qualify as replicated results.

5. Depending on the criteria (measure, attribute) selected, assessments of a set of attempted replications could appear quite divergent.3

6. A judgment that "result A is replicated by result B" must be identical to the judgment that "result B is replicated by result A." There must be a symmetry in the judgment of replication; otherwise, internal contradictions are inevitable.

7. There could be advantages to inverting the question from "Does result A replicate result B (given their proximity and uncertainty)?" to "Are results A and B sufficiently divergent (given their proximity and uncertainty) so as to qualify as a non-replication?" It may be advantageous, in assessing degrees of replicability, to define a relatively high threshold of similarity that qualifies as "replication," a relatively low threshold of similarity that qualifies as "non-replication," and an intermediate zone between the two thresholds that is considered "indeterminate." If a second study has low power and wide uncertainties, it may be unable to produce any but indeterminate results.

8. While a number of different standards for replicability/non-replicability may be justifiable, depending on the attributes of interest, a standard of "repeated statistical significance" has many limitations, because the level of statistical significance is an arbitrary threshold (Amrhein et al., 2019a; Boos et al., 2011; Goodman, 1992; Lazzeroni et al., 2016). For example, one study may yield a p-value of 0.049 (declared significant at the p = 0.05 level) and a second study a p-value of 0.051 (declared non-significant by the same threshold), and the studies are therefore said not to have replicated. However, if the second study had instead yielded a p-value of 0.03, a reviewer would say it had successfully replicated the first study, even though its result could diverge more sharply (by proximity and uncertainty) from the original study than in the first comparison. Rather than focus on an arbitrary threshold such as statistical significance, it would be more revealing to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, and standard deviations (or uncertainties), as well as additional metrics tailored to the subject matter.

The final point above is reinforced by a recent special issue of The American Statistician, in which the use of a statistical significance threshold in reporting is strongly discouraged because of overuse and wide misinterpretation (see the introduction to the special issue, Wasserstein et al., 2019). A figure from Amrhein et al. (2019b), shown in Figure 5-1, also demonstrates this point.

FIGURE 5-1 The comparison of two results to determine replicability. NOTES: The figure shows the problem with using statistical significance as an attribute of comparison (point 8 above in the text); the two results would be considered to have replicated under a "proximity-uncertainty" attribute (points 3 and 4 above). SOURCE: Amrhein et al. (2019b, p. 306).

1 See, for example, the cancer biology project in Table 5-1, below.
2 Cova et al. (2018, p. xx) discuss the challenge of defining "sufficiently similar" as well as the interpretation of the results: "In practice, it can be hard to determine whether the 'sufficiently similar' criterion has actually been fulfilled by the replication attempt, whether in its methods or in its results (Nakagawa and Parker 2015). It can therefore be challenging to interpret the results of replication studies, no matter which way these results turn out (Collins 1975; Earp and Trafimow 2015; Maxwell et al. 2015)."
3 See Table 5-1, below, for an example of this in the reviews of a psychology replication study by Open Science Collaboration (2015) and Patil et al. (2016).
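The p-value comparison discussed in point 8 can be made concrete with a small numerical sketch. The Python snippet below uses invented summary statistics (the effect estimates and standard errors are purely illustrative, not taken from any study in this chapter) to show how a "repeated statistical significance" criterion can misbehave: one replication with an estimate identical to the original fails the criterion, while another passes it despite a more divergent estimate.

```python
from math import erfc, sqrt

def p_value(effect, se):
    """Two-sided p-value for a normal (z) test of effect/se against zero."""
    z = abs(effect) / se
    return erfc(z / sqrt(2))

def ci95(effect, se):
    """Normal-approximation 95 percent confidence interval for the effect."""
    return (effect - 1.96 * se, effect + 1.96 * se)

# Invented (effect estimate, standard error) pairs, for illustration only.
studies = {
    "original":      (2.0, 1.016),  # p just below 0.05
    "replication 1": (2.0, 1.026),  # identical estimate, p just above 0.05
    "replication 2": (4.5, 2.070),  # divergent estimate, yet "significant"
}

for name, (effect, se) in studies.items():
    p = p_value(effect, se)
    lo, hi = ci95(effect, se)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: estimate={effect:.1f}, p={p:.3f} ({verdict}), "
          f"95% CI=({lo:.2f}, {hi:.2f})")
```

By a repeated-significance standard, replication 1 (p just above 0.05) "fails" even though its estimate equals the original's, while replication 2 (p below 0.05) "succeeds" despite lying farther away; comparing the estimates and their uncertainties directly, as the text recommends, avoids this inversion.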

One concern voiced by some researchers about using a proximity-uncertainty attribute to assess replicability is that such an assessment favors studies with large uncertainties; the potential consequence is that many researchers would choose to perform low-power studies to increase their chances of replication (Cova et al., 2018). While two results with large, overlapping uncertainties may be consistent with replication, the large uncertainties indicate that not much confidence can be placed in that conclusion.

CONCLUSION 5-1: Different types of scientific studies lead to different or multiple criteria for determining a successful replication. The choice of criteria can affect the apparent rate of non-replication, and that choice calls for judgment and explanation.

CONCLUSION 5-2: A number of parametric and nonparametric methods may be suitable for assessing replication across studies. However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained "statistical significance," that is, when the p-values in both studies have fallen below a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, and standard deviations (uncertainties), as well as additional metrics tailored to the subject matter.

THE EXTENT OF NON-REPLICABILITY

The committee was asked to assess what is known and, if necessary, to identify areas that may need more information to ascertain the extent of non-replicability in scientific and engineering research. The committee examined current efforts to assess the extent of non-replicability within several fields, reviewed the literature on the topic, and heard from expert panels during its public meetings.
We also drew on the previous work of committee members and other experts in the field of replicability of research. Some efforts to assess the extent of non-replicability in scientific research directly measure rates of replication, while others examine indirect measures to infer the extent of non-replication. Approaches to assessing non-replicability rates include:

● direct and indirect assessments of replicability;
● perspectives of researchers who have studied replicability;
● surveys of researchers; and
● retraction trends.

This section discusses each of these lines of evidence.

Assessments of Replicability

The most direct method to assess replicability is to perform a study following the original methods of a previous study and to compare the new results to the original ones. Some high-profile replication efforts in recent years include studies by Amgen, which showed low replication rates in biomedical research (Begley and Ellis, 2012), and work by the Center for Open Science on psychology (Open Science Collaboration, 2015), cancer research (Nosek and Errington, 2017), and social science (Camerer et al., 2018). In these examples, a set of studies was selected and a single

replication attempt was made to confirm the results of each previous study; that is, one-to-one comparisons were made. In other replication studies, teams of researchers performed multiple replication attempts on a single original result, or many-to-one comparisons (see, for example, Klein et al., 2014; Hagger et al., 2017; and Cova et al., 2018, in Table 5-1).

Other measures of replicability include assessments that can provide indicators of bias, errors, and outliers, including, for example, computational checks of reported numbers and comparison of reported values against a database of previously reported values. Such assessments can identify data that are outliers to previous measurements and may signal the need for additional investigation to understand the discrepancy.4 Table 5-1 summarizes the direct and indirect replication studies assembled by the committee. Other measures of non-replicability are discussed later in this chapter in the Sources of Non-Replicability section.

Many direct replication studies are not reported as such. Replication, especially of surprising results or those that could have a major impact, often occurs in science without being labeled as a replication. Many scientific fields conduct reviews of articles on a specific topic, especially those on new topics or likely to have a major impact, to assess the available data and determine which measurements and results are rigorous (see Chapter 7). Therefore, replicability studies that are included in the scientific literature but not cited as such add to the difficulty of assessing the extent of replication and non-replication. One example of this phenomenon relates to research on hydrogen storage capacity. The U.S. Department of Energy (DOE) issued a target storage capacity in the mid-1990s.
One group using carbon nanotubes reported surprisingly high values that met DOE's target (Hynek et al., 1997); other researchers who attempted to replicate these results could not do so. At the same time, other researchers were reporting high values of hydrogen capacity in other experiments. In 2003, an article reviewed previous studies of hydrogen storage values and reported new research results, which were later replicated (Broom and Hirscher, 2016). None of these studies was explicitly called an attempt at replication.

4 There is a risk of missing a new discovery by rejecting data outliers without further investigation.

TABLE 5-1 Examples of Replication Studies

● Experimental philosophy; Cova et al. (2018); direct. A group of 20 research teams performed replication studies of 40 experimental philosophy studies published between 2003 and 2015. 70 percent of the 40 studies were replicated, comparing the original effect size to the confidence interval of the replication.a

● Behavioral science, personality traits linked to life outcomes; Soto (in press); direct. Replications of 78 previously published associations between the Big Five personality traits and consequential life outcomes.b 87 percent of the replication attempts were statistically significant in the expected direction, and effects were typically 77 percent as strong as the corresponding original effects.

● Behavioral science, ego depletion; Hagger et al. (2017); direct. Multiple laboratories (23 in total) conducted replications of a standardized ego-depletion protocol based on a sequential-task paradigm by Sripada et al. Meta-analysis of the studies revealed that the size of the ego-depletion effect was small, with a 95 percent confidence interval (CI) that encompassed zero (d = 0.04, 95% CI [−0.07, 0.15]).

● General biology, preclinical animal studies; Prinz et al. (2011); direct. Attempt by researchers from Bayer HealthCare to validate data on potential drug targets obtained in 67 projects by copying models exactly or by adapting them to internal needs. Published data were completely in line with the results of the validation studies in 20-25 percent of cases.

● Oncology, preclinical studies; Begley and Ellis (2012); direct. Attempt by an Amgen team to reproduce the results of 53 "landmark" studies. Scientific results were confirmed in 11 percent of the studies.

● Genetics, preclinical studies; Ioannidis (2009); direct. Replication of data analyses provided in 18 articles on microarray-based gene expression studies. Of the 18 studies, 2 analyses (11 percent) were replicated, 6 were partially replicated or showed some discrepancies in results, and 10 could not be replicated.

● Experimental psychology; Klein et al. (2014); direct. Replication of 13 psychological phenomena across 36 independent samples. 77 percent of the phenomena were replicated consistently.

● Experimental psychology, Many Labs 2; Klein et al. (2018); direct. Replications of 28 classic and contemporary published studies. 54 percent of replications produced a statistically significant effect in the same direction as the original study; 75 percent yielded effect sizes smaller than the originals, and 25 percent yielded larger effect sizes.

● Experimental psychology; Open Science Collaboration (2015); direct. Attempt to independently replicate selected results from 100 studies in psychology. 36 percent of the replication studies produced significant results, compared to 97 percent of the original studies; the mean effect sizes were halved.

● Experimental psychology; Patil et al. (2016); direct. Reanalysis of the reported data from the Open Science Collaboration (2015) replication study in psychology. 77 percent of the studies replicated, comparing the original effect size to an estimated 95 percent confidence interval of the replication.

● Experimental psychology; Camerer et al. (2018); direct. Attempt to replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science in 2010-2015. A significant effect in the same direction as the original study was found for 62 percent (13 of 21) of the studies, and the effect size of the replications was on average about 50 percent of the original effect size.

● Empirical economics; Dewald (1986); direct. A 2-year study that collected programs and data from authors and attempted to replicate their published results on empirical economic research. Two of nine replications were successful, three were "near" successful, and four were unsuccessful; the findings suggest that inadvertent errors in published empirical articles are commonplace rather than a rare occurrence.

● Economics; Duvendack et al. (2015); not applicable. A progress report on the number of journals with data-sharing requirements and an assessment of 167 replication studies. Ten journals explicitly note that they publish replications; of 167 published replication studies, approximately 66 percent were unable to confirm the original results, and 12 percent disconfirmed at least one major result of the original study while confirming others.

● Economics; Camerer et al. (2016); direct. An effort to replicate 18 studies published in the American Economic Review and the Quarterly Journal of Economics in 2011-2014. A significant effect in the same direction as the original study was found for 11 replications (61 percent); on average, the replicated effect size was 66 percent of the original.

● Chemistry; Park et al. (2017), Sholl (2017); indirect. Collaboration with the National Institute of Standards and Technology (NIST) to check new data against the NIST database (13,000 measurements). 27 percent of papers reporting adsorption properties contained outlier data, as did 20 percent of papers reporting CO2 isotherms.

● Chemistry; Plant (2018); indirect. Collaboration with the NIST Thermodynamics Research Center (TRC) databases for prepublication checks of solubility, viscosity, critical temperature, and vapor pressure. 33 percent of experiments had data problems, such as uncertainties that were too small or reported values outside the TRC database distributions.

● Cancer biology; Reproducibility Project: Cancer Biology (RP:CB); direct. Large-scale replication project to replicate key results in 29 cancer papers published in Nature, Science, Cell, and other high-impact journals. The first five articles have been published: two replicated important parts of the original papers, one did not replicate, and two were uninterpretable.

● Psychology, statistical test checks; Nuijten et al. (2016); indirect. The statcheck tool was used to test statistical values within psychology articles published in 1985-2013. 49.6 percent of the articles with null hypothesis significance testing (NHST) results contained at least one inconsistency (8,273 of the 16,695 articles), and 12.9 percent (2,150) contained at least one gross inconsistency.

● Engineering, computational fluid dynamics; Mesnard et al. (2017); direct. Full replication studies of previously published results on bluff-body aerodynamics, using four different computational methods. Replication of the main result was achieved in three of the four computational efforts.

● Psychology, Many Labs 3; Ebersole et al. (2016); direct. Attempted to replicate 10 psychology studies in one online session. Three of the 10 studies replicated at p < 0.05.

● Psychology; Luttrell et al. (2017); direct. Argued that one of the failed replications in Ebersole et al. (2016) was due to changes in the procedure, and randomly assigned participants to a version closer to the original or to Ebersole et al.'s version. The original study replicated when the original procedures were followed more closely, but not when the Ebersole et al. procedures were used.

● Psychology; Wagenmakers et al. (2016); direct. Seventeen different laboratories attempted to replicate one study on facial feedback by Strack et al. (1988). None of the studies replicated the result at p < 0.05.

● Psychology; Noah, Schul, and Mayo (2018); direct. Pointed out that all of the studies in the Wagenmakers et al. (2016) replication project changed the procedure by videotaping participants, and conducted a replication in which participants were randomly assigned to be videotaped or not. The original study was replicated when the original procedure was followed (p = 0.01) but not when the video camera was present (p = 0.85).

● Psychology; Alogna et al. (2014); direct. Thirty-one laboratories attempted to replicate a study by Schooler and Engstler-Schooler (1990). The original study was replicated, and the effect size was much larger when the original study was replicated more faithfully (the first set of replications had inadvertently introduced a change in the procedure).

NOTES: Some of the studies in this table also appear in Table 4-1, as they evaluated both reproducibility and replicability.

a From Cova et al. (2018, p. 14): "For studies reporting statistically significant results, we treated as successful replications for which the replication 95 percent CI [confidence interval] was not lower than the original effect size. For studies reporting null results, we treated as successful replications for which original effect sizes fell inside the bounds of the 95 percent CI."
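The confidence-interval criterion described in note a of Table 5-1 can be sketched in a few lines of code. The snippet below is an illustrative reading of that rule, not the authors' actual analysis script; it assumes effects are coded so that positive values are in the predicted direction and uses a normal approximation for the replication CI, and all numbers in the example are hypothetical.

```python
def ci95(effect, se):
    """Normal-approximation 95 percent confidence interval."""
    return (effect - 1.96 * se, effect + 1.96 * se)

def successful_replication(original_effect, replication_effect, replication_se,
                           original_was_significant=True):
    """Illustrative version of the Cova et al. (2018) rule (Table 5-1, note a).

    Significant original result: success unless the replication CI lies
    entirely below the original effect size (effects coded so positive is
    the predicted direction -- an assumption of this sketch).
    Null original result: success if the original effect size falls inside
    the replication CI.
    """
    lo, hi = ci95(replication_effect, replication_se)
    if original_was_significant:
        return hi >= original_effect
    return lo <= original_effect <= hi

# Hypothetical numbers: a replication CI of roughly (0.20, 0.60) reaches an
# original effect of 0.5, so it counts as a success under this rule, while a
# CI topping out below 0.5 does not.
print(successful_replication(0.5, 0.4, 0.1))   # True
print(successful_replication(0.5, 0.1, 0.1))   # False
```

A design note: the rule is deliberately asymmetric between significant and null originals, which is one concrete example of how the choice of criterion (point 5 in the text) shapes the apparent replication rate.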

b From Soto (in press, p. 7): "Previous large-scale replication projects have typically treated the individual study as the primary unit of analysis. Because personality-outcome studies often examine multiple trait-outcome associations, we selected the individual association as the most appropriate unit of analysis for estimating replicability in this literature."

Based on the content of the studies collected in Table 5-1, one can observe that:

● the majority of the studies are in the social and behavioral sciences (including economics) or in biomedical fields; and
● the methods of assessing replicability are inconsistent, and the replicability percentages depend strongly on the methods used.

Replication studies such as those shown in Table 5-1 are not necessarily indicative of the actual rate of non-replicability across science, for a number of reasons: the studies to be replicated were not randomly chosen, the replications had methodological shortcomings, many replication studies are not reported as such, and the reported replication studies found widely varying rates of non-replication (Gilbert et al., 2016). At the same time, replication studies often provide more and better-quality evidence than most original studies alone, and they highlight such methodological features as high precision or statistical power, preregistration, and multi-site collaboration (Nosek, 2016). Some would argue that focusing on replication of a single study as a way to improve the efficiency of science is ill-placed; rather, reviews of cumulative evidence on a subject, gauging both the overall effect size and generalizability, may be more useful (Goodman, 2018; and see Chapter 7).

Apart from specific efforts to replicate others' studies, investigators will typically confirm their own results, as in a laboratory experiment, prior to publication.
More generally, independent investigators may replicate prior results of others before conducting, or in the course of conducting, a study to extend the original work. These types of replications are not usually published as separate replication studies.

Perspectives of Researchers Who Have Studied Replicability

Several experts who have studied replicability within and across fields of science and engineering provided their perspectives to the committee. Brian Nosek, cofounder and director of the Center for Open Science, said there was “not enough information to provide an estimate with any certainty across fields and even within individual fields.” In a recent paper discussing scientific progress and problems, Richard Shiffrin, professor of psychology and brain sciences at Indiana University, argued that there are “no feasible methods to produce a quantitative metric, either across science or within the field” to measure the progress of science (Shiffrin et al., 2018, p. 2632). Skip Lupia, now serving as head of the Directorate for Social, Behavioral, and Economic Sciences at the National Science Foundation, said that there is not sufficient information to definitively determine the extent of non-reproducibility and non-replicability, but that there is evidence of p-hacking and publication bias (see below), which are problems. Steven Goodman, the codirector of the Meta-Research Innovation Center (METRICS) at Stanford University, suggested that the focus ought not be on the rate of non-replication of individual studies, but rather on the cumulative evidence provided by all studies and convergence to the truth. He suggested the proper question is “How efficient is the scientific enterprise in generating reliable knowledge, what affects that reliability, and how can we improve it?”

Surveys

Surveys of scientists about issues of replicability or about scientific methods are indirect measures of non-replicability. For example, Nature published the results of a survey in 2016 with the title “1,500 Scientists Lift the Lid on Reproducibility”;5 this survey reported that a large percentage of researchers who had published articles in Nature believe that replicability is a problem (Baker, 2016). This article has been widely cited by researchers studying subjects ranging from cardiovascular disease to crystal structures (Warner et al., 2018; Ziletti et al., 2018). Surveys and studies have also assessed the prevalence of specific problematic research practices, such as a 2018 survey about questionable research practices in ecology and evolution (Fraser et al., 2018). However, many of these surveys rely on poorly defined sampling frames to identify populations of scientists and do not use probability sampling techniques. The fact that nonprobability samples “rely mostly on people … whose selection probabilities are unknown [makes it] difficult to estimate how representative they are of the [target] population” (Dillman, Smyth, and Christian, 2014, pp. 70 and 92). In fact, we know that people with a particular interest in or concern about a topic, such as replicability and reproducibility, are more likely to respond to surveys on the topic (Brehm, 1993). As a result, we caution against using surveys based on nonprobability samples as the basis of any conclusion about the extent of non-replicability in science.
High-quality researcher surveys are expensive and pose significant challenges, including constructing exhaustive sampling frames, reaching adequate response rates, and minimizing other non-response biases that might differentially affect respondents at different career stages or in different professional environments or fields of study (Corley et al., 2011; Peters et al., 2008; Scheufele et al., 2009). As a result, the attempts to date to gather input on topics related to replicability and reproducibility from larger numbers of scientists (Baker, 2016; Boulbes et al., 2018) have relied on convenience samples and other methodological choices that limit the conclusions that can be drawn from such surveys about attitudes among the larger scientific community or even within specific subfields. More methodologically sound surveys following guidelines on adoption of open science practices and other replicability-related issues are beginning to emerge.6 See Appendix E for a discussion of conducting reliable surveys of scientists.

Retraction Trends

Retractions of published articles may be related to their non-replicability. As noted in a recent study on retraction trends (Brainard, 2018, p. 392), “Overall, nearly 40% of retraction notices did not mention fraud or other kinds of misconduct. Instead, the papers were retracted because of errors, problems with reproducibility [or replicability], and other issues.” Overall, about one-half of all retractions appear to involve fabrication, falsification, or plagiarism. Journal article retractions in biomedicine increased from 50–60 per year in the mid-2000s to 600–700 per year by the mid-2010s (National Library of Medicine, 2018), and this increase attracted much commentary and analysis (see, e.g., Grieneisen and Zhang, 2012).

A recent comprehensive review of an extensive database of 18,000 retracted papers dating back to the 1970s found that while the number of retractions has grown, the rate of increase has slowed; approximately 4 of every 10,000 papers are now retracted (Brainard, 2018). Overall, the number of journals that report retractions has grown from 44 journals in 1997 to 488 journals in 2016; however, the average number of retractions per journal has remained essentially flat since 1997. These data suggest that more journals are attending to the problem of articles that need to be retracted, rather than that there is a growing problem in any one discipline of science. Fewer than 2 percent of authors in the database account for more than one-quarter of the retracted articles, and the retractions of these frequent offenders are usually based on fraud rather than errors that lead to non-replicability. The Institute of Electrical and Electronics Engineers alone has retracted more than 7,000 abstracts from conferences that took place between 2009 and 2011, most of which had authors based in China (McCook, 2018).

The body of evidence on the extent of non-replicability gathered by the committee is not a comprehensive assessment across all fields of science or even within any given field of study. Such a comprehensive effort would be daunting due to the vast amount of research published each year and the diversity of scientific and engineering fields. Among the studies of replication that are available, there is no uniform approach across scientific fields to gauge replication between two studies. The experts who contributed their perspectives to the committee all question the feasibility of such a science-wide assessment of non-replicability. While the evidence base assessed by the committee may not be sufficient to permit a firm quantitative answer on the scope of non-replicability, it does support several findings and conclusions.

5 Nature uses the word “reproducibility” to refer to what we call “replicability.”
6 See https://cega.berkeley.edu/resource/the-state-of-social-science-betsy-levy-paluck-bitss-annual-meeting-2018/ [May 2019].
FINDING 5-1: There is an uneven level of awareness of issues related to replicability across fields and even within fields of science and engineering.

FINDING 5-2: Efforts to replicate studies aimed at discerning the effect of an intervention in a study population may find a similar direction of effect, but a different (often smaller) size of effect.

FINDING 5-3: Studies that directly measure replicability take substantial time and resources.

FINDING 5-4: Comparing results across replication studies may be compromised because different replication studies may test different study attributes and rely on different standards and measures for a successful replication.

FINDING 5-5: Replication studies in the natural and clinical sciences (general biology, genetics, oncology, chemistry) and social sciences (including economics and psychology) report frequencies of replication ranging from fewer than one of five studies to more than three of four studies.

CONCLUSION 5-3: Because many scientists routinely conduct replication tests as part of follow-on work and do not report replication results separately, the evidence base of non-replicability across all science and engineering research is incomplete.

SOURCES OF NON-REPLICABILITY

Non-replicability can arise from a number of sources. In some cases, it arises from the inherent characteristics of the systems under study. In others, decisions made by a researcher or research team in study execution that reasonably differ from the original study, such as judgment calls on data cleaning or the selection of parameter values within a model, may result in non-replication. Other sources of non-replicability arise from conscious or unconscious bias in reporting, mistakes and errors (including misuse of statistical methods), and problems in study design, execution, or interpretation in either the original study or the replication attempt. In many instances, non-replication between two results could be due to a combination of multiple sources, but it is not generally possible to identify the source without careful examination of the two studies. Below, we review these sources of non-replicability and discuss how researchers’ choices can affect each. Unless otherwise noted, the discussion below focuses on non-replicability between two results (i.e., a one-to-one comparison) when assessed using the proximity and uncertainty of both results.

Non-Replicability that Is Potentially Helpful to Science

Non-replicability is a normal part of the scientific process and can be due to the intrinsic variation and complexity of nature, the scope of current scientific knowledge, and the limits of current technologies. Highly surprising and unexpected results are often not replicated by other researchers. In other instances, a second researcher or research team may purposefully make decisions that lead to differences in parts of the study. As long as these differences are reported with the final results, these may be reasonable actions to take, yet they may result in non-replication.
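A one-to-one comparison of two results by proximity and uncertainty can be sketched as a simple consistency check. This is an illustrative normal-approximation sketch, not a test the committee prescribes; the numbers are invented.

```python
import math

def consistent(estimate1, se1, estimate2, se2, z_crit=1.96):
    """Crude check of whether two results are statistically consistent:
    is their difference small relative to its combined standard error?
    (Normal approximation; independent estimates assumed.)"""
    diff = abs(estimate1 - estimate2)
    combined_se = math.sqrt(se1 ** 2 + se2 ** 2)
    return diff <= z_crit * combined_se

# Two measurements that agree within their uncertainties:
print(consistent(10.2, 0.3, 9.8, 0.4))  # True: |10.2 - 9.8| = 0.4 <= 1.96 * 0.5
# Two that do not:
print(consistent(10.2, 0.1, 9.0, 0.1))  # False: difference of 1.2 is ~8.5 combined SEs
```

As the surrounding text emphasizes, failing such a check does not by itself identify the source of the discrepancy; that requires careful examination of both studies.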
In scientific reporting, uncertainties within the study (such as the uncertainty within measurements, the potential interactions between parameters, and the variability of the system under study) are estimated, assessed, characterized, and accounted for through uncertainty and probability analysis. When uncertainties are unknown and not accounted for, this can also lead to non-replicability. In these instances, non-replicability of results is a normal consequence of studying complex systems with imperfect knowledge and tools. When non-replication of results due to sources such as those listed above is investigated and resolved, it can lead to new insights, better uncertainty characterization, and increased knowledge about the systems under study and the methods used to study them. See Box 5-2 for examples of how investigations of non-replication have been helpful in increasing knowledge.

BOX 5-2
Varied Sources of Non-Replication

Below are two examples of studies in which non-replication of results led researchers to investigate the source of the discrepancies and ultimately increased understanding of the systems under study.

Shaken or Stirred: Two separate labs were conducting experiments on breast tissue, using what they assumed was the same protocol (Hines et al., 2014), yet their results continued to differ. When the researchers from the two labs sat side by side to conduct the experiment, they discovered that one lab was stirring the cells gently while the other lab was using a more vigorous shaking system. Both of these methods are commonplace, so neither researcher thought to mention the details of the mixing process (Harris, 2017). Before these researchers discovered the variation in technique, it was not known that the mixing method could affect the outcome in this experiment. After their discovery, the mixing technique became an avoidable source of non-replicability—something that researchers who are using best practices would account for in their research (e.g., by reporting which method was used in the experiment or by systematically varying the method in order to fully understand its effect).

The Lifespan of Worms: In 2013, three researchers set out to clarify inconsistent research results on compounds that could extend the lifespan of lab animals (Phillips et al., 2017). Some research had found that the compound resveratrol (found in red wine) could dramatically extend the life of worms in the lab, but other scientists had difficulty replicating the results. The researchers found a number of reasons for this lack of replicability. For example, they found differences in lab protocol that affected outcomes: worms that were handled by gentle lab technicians lived a full day longer than others. Another difference lay in how labs measured the age of the worms: for example, one lab determined age on the basis of when an egg was laid; another began counting when it hatched. After more than a year of painstaking work to align protocols among the labs, the variability decreased. Once these sources of non-replicability were eliminated, the researchers discovered inherent variability in the system that was responsible for some of the non-replicability. The three researchers found that some cohorts of worms could partition into short-lived or long-lived modes of aging.
This characteristic was previously unknown, and, based on this new information, scientists in the field realized they needed to test compounds on a wider variety and a larger number of worms in order to obtain reliable results. This example demonstrates the variety of legitimate sources of non-replicability and the time and effort required to perform replication studies—even when the researchers are making their best efforts. It also demonstrates that non-replicability can result in advances in scientific knowledge.

The susceptibility of any line of scientific inquiry to sources of non-replicability depends on many factors, including factors inherent to the system under study, such as:

● the complexity of the system under study;
● understanding of the number of and relations among variables within the system under study;
● the ability to control the variables;
● levels of noise within the system (or signal-to-noise ratios);
● a mismatch between the scale of the phenomenon and the scale at which it can be measured;
● stability across time and space of the underlying principles;
● fidelity of the available measures to the underlying system under study (e.g., direct or indirect measurements); and
● prior probability (pre-experimental plausibility) of the scientific hypothesis.

Studies that pursue lines of inquiry that are better able to estimate and analyze the uncertainties associated with the variables in the system, and to control the methods used to conduct the experiment, are more replicable. On the other end of the spectrum, studies that are more prone to non-replication often involve indirect measurement of very complex systems (e.g., human behavior) and require statistical analysis to draw conclusions. To illustrate how these characteristics can lead to results that are more or less likely to replicate, consider the attributes of complexity and controllability. The complexity and controllability of a system contribute to the underlying variance of the distribution of expected results and thus to the likelihood of non-replication.7

The systems that scientists study vary in their complexity. Although all systems have some degree of intrinsic or random variability, some systems are less well understood, and their intrinsic variability is more difficult to assess or estimate. Complex systems tend to have numerous interacting components (e.g., cell biology, disease outbreaks, friction coefficients between two unknown surfaces, urban environments, complex organizations and populations, and human health). Interrelations and interactions among multiple components cannot always be predicted, and neither can the resulting effects on the experimental outcomes, so an initial estimate of uncertainty may be an educated guess.

Systems under study also vary in their controllability. If the variables within a system can be known, characterized, and controlled, research on such a system tends to produce more replicable results.
For example, in the social sciences, a person’s response to a stimulus (e.g., a person’s behavior when placed in a specific situation) depends on a large number of variables—including social context, biological and psychological traits, and verbal and nonverbal cues from researchers—all of which are difficult or impossible to control completely. In contrast, a physical object’s response to a physical stimulus (e.g., a liquid’s response to a rise in temperature) depends almost entirely on variables that can either be controlled or adjusted for, such as temperature, air pressure, and elevation. Because of these differences, one expects that studies conducted in relatively more controllable systems will replicate with greater frequency than those conducted in less controllable systems. Scientists seek to control the variables relevant to the system under study and the nature of the inquiry, but when these variables are more difficult to control, the likelihood of non-replicability will be higher. Figure 5-2 illustrates the combinations of complexity and controllability.

7 Complexity and controllability in an experimental system affect its susceptibility to non-replicability independently from the way prior odds, power, or p-values associated with hypothesis testing affect the likelihood that an experimental result represents the true state of the world.
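The point that more controllable systems replicate more often can be made concrete with a toy simulation: two hypothetical study designs measure the same true effect, but one system is three times noisier. This is an illustrative sketch, not a model of any study discussed in the text; the significance rule and all numbers are invented.

```python
import random
import statistics

def one_study_significant(effect, noise_sd, n, rng):
    """A toy 'study': draw n noisy measurements of a true effect and
    declare significance when the sample mean exceeds 1.96 * SE."""
    sample = [rng.gauss(effect, noise_sd) for _ in range(n)]
    se = noise_sd / n ** 0.5
    return statistics.mean(sample) > 1.96 * se

def replication_rate(effect, noise_sd, n=30, trials=2000, seed=0):
    """Fraction of simulated original/replication pairs in which both
    studies detect the effect."""
    rng = random.Random(seed)
    both = sum(
        one_study_significant(effect, noise_sd, n, rng)
        and one_study_significant(effect, noise_sd, n, rng)
        for _ in range(trials)
    )
    return both / trials

# Same true effect, different controllability (noise level):
controlled = replication_rate(effect=0.5, noise_sd=1.0)
uncontrolled = replication_rate(effect=0.5, noise_sd=3.0)
print(controlled, uncontrolled)  # the low-noise system replicates far more often
```

The mechanism is exactly the one described in the text: the less controllable system has a wider distribution of expected results, so two independent studies of it agree less often, even though the underlying effect is identical.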

FIGURE 5-2 Controllability and complexity: spectrum of studies with varying degrees of the combination of controllability and complexity. NOTE: See text for examples from the fields of engineering, physics, and psychology that illustrate various combinations of complexity and controllability that affect susceptibility to non-replication.

Many scientific fields have studies that span these quadrants, as demonstrated by the following examples from engineering, physics, and psychology. Véronique Kiermer, executive editor of PLOS, noted in her briefing to the committee: “There is a clear correlation between the complexity of the design, the complexity of measurement tools, and the signal to noise ratio that we are trying to measure.” (See also Goodman et al., 2016, on the complexity of statistical and inferential methods.)

Engineering: Aluminum-lithium alloys were developed by engineers because of their strength-to-weight ratio, primarily for use in aerospace engineering. The process of developing these alloys spans the four quadrants. The early generation of binary alloys was a simple system that showed high replicability (Quadrant A). Second-generation alloys had higher amounts of lithium and exhibited lower replicability, which appeared as failures in manufacturing operations because the interactions of the elements were not understood (Quadrant C). The third-generation alloys contained less lithium and higher relative amounts of other alloying elements, which made them a more complex, but better controlled, system (Quadrant B), with improved replicability. The development of any alloy is subject to a highly controlled environment. Unknown aspects of the system, such as interactions among the components, cannot be controlled initially and can lead to failures. Once these are understood, conditions can be modified (e.g., heat treatment) to bring about higher replicability.
Physics: In physics, measurement of the electronic band gap of semiconducting and conducting materials using scanning tunneling microscopy is a highly controlled, simple system (Quadrant A). The searches for the Higgs boson and gravitational waves were separate efforts, and each required the development of large, complex experimental apparatus and careful characterization of the measurement and data analysis systems (Quadrant B). Some systems, such as radiation portal monitors, require the setting of thresholds for alarms without knowledge of when and if a threat will ever pass through them; the variety of potential signatures is high, and there is little controllability of the system during operation (Quadrant C). Finally, a simple system with little controllability is that of precisely predicting the path of a feather dropped from a given height (Quadrant D).

Psychology: In psychology, Quadrant A includes studies of basic sensory and perceptual processes that are common to all human beings, such as the Purkinje shift (a change in sensitivity of the human eye under different levels of illumination). Quadrant D includes studies of complex social behaviors that are influenced by culture and context: for example, a study of the effects of a father’s absence on children’s ability to delay gratification revealed stronger effects among younger children (Mischel, 1961).

Inherent sources of non-replicability arise in every field of science, but they can vary widely depending on the specific system under study. When the sources are knowable, or arise from experimental design choices, researchers need to identify and assess these sources of uncertainty insofar as they can be estimated. Researchers also need to report on steps that were intended to reduce uncertainties inherent in the study or that differ from the original study (e.g., data cleaning decisions that resulted in a different final data set). The committee agrees with those who argue that the testing of assumptions and the characterization of the components of a study are as important to report as the ultimate results of the study (Hanisch and Plant, 2018), including for studies using statistical inference and reporting p-values (Boos et al., 2012).

Every scientific inquiry encounters an irreducible level of uncertainty, whether due to random processes in the system under study, limits to our understanding of or ability to control that system, or limitations in the ability to measure. If researchers do not adequately consider and report these uncertainties and limitations, this can contribute to non-replicability.
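One concrete way of characterizing the stochastic uncertainty in a reported result is to accompany the point estimate with a resampling-based interval. The following is a generic percentile-bootstrap sketch with made-up measurements; it is one common technique, not a method the committee prescribes.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=5000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for a statistic: resample the data
    with replacement many times and take the central (1 - alpha) range
    of the resampled statistic."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented repeated measurements of the same quantity:
measurements = [9.8, 10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 9.7]
low, high = bootstrap_ci(measurements)
print(f"mean = {statistics.mean(measurements):.2f}, "
      f"95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the mean communicates the stochastic component of uncertainty; the other sources named in Recommendation 5-1 (measurement, modeling, knowledge) still require separate characterization.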
RECOMMENDATION 5-1: Researchers should, as applicable to the specific study, provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. Researchers should thoughtfully communicate all recognized uncertainties and estimate or acknowledge other potential sources of uncertainty that bear on their results, including stochastic uncertainties and uncertainties in measurement, computation, knowledge, modeling, and methods of analysis.

Unhelpful Sources of Non-Replicability

Non-replicability can also be the result of human error or poor researcher choices. Shortcomings in the design, conduct, and communication of a study may all contribute to non-replicability. These defects may arise at any point along the process of conducting research, from design and conduct to analysis and reporting, and errors may be made because the researcher was ignorant of best practices, was sloppy in carrying out research, made a simple error, or had an unconscious bias toward a specific outcome. Whether arising from lack of knowledge, perverse incentives, sloppiness, or bias, these sources of non-replicability warrant continued attention because they reduce the efficiency with which science progresses; time spent resolving non-replicability issues that are found to be caused by these sources does not add to scientific understanding. That is, they are unhelpful in making scientific progress. We consider here a selected set of such avoidable sources of non-replication:

● publication bias;
● misaligned incentives;
● inappropriate statistical inference; and
● poor study design, errors, and incomplete reporting of a study.

We discuss each source in turn.

Publication Bias

Both researchers and journals want to publish new, innovative, ground-breaking research. The publication preference for statistically significant, positive results produces a biased literature through the exclusion of statistically non-significant results (i.e., those that do not show an effect that is sufficiently unlikely if the null hypothesis is true). As noted in Chapter 2, there is great pressure to publish in high-impact journals and for researchers to make new discoveries. Furthermore, it may be difficult for researchers to publish even robust non-significant results, except in circumstances where the results contradict what has come to be an accepted positive effect. Replication studies and studies with valuable data but inconclusive results may be similarly difficult to publish. This publication bias results in a published literature that does not reflect the full range of evidence about a research topic.

One powerful example is a set of clinical studies performed on the effectiveness of tamoxifen, a drug used to treat breast cancer. In a systematic review (see Chapter 7) of the drug’s effectiveness, 23 clinical trials were reviewed; 22 of the 23 studies did not reach the statistical significance criterion of p < 0.05, yet the cumulative review of the set of studies showed a large effect: a reduction of 16 (±3) percent in the odds of death among women of all ages assigned to tamoxifen treatment (Peto et al., 1988, p. 1684).
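The logic of the tamoxifen example, in which individually inconclusive trials combine into a clearly significant pooled estimate, can be illustrated with a standard inverse-variance fixed-effect pool. The trial numbers below are invented for illustration; they are not the Peto et al. data.

```python
import math

def fixed_effect_pool(studies):
    """Inverse-variance fixed-effect meta-analysis of (effect, SE)
    pairs: returns the pooled estimate, its SE, and the z statistic."""
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = 1.0 / math.sqrt(sum(weights))
    return pooled, pooled_se, pooled / pooled_se

# Ten invented underpowered trials, each estimating a log odds ratio
# near -0.18; no single trial reaches |z| > 1.96 on its own.
trials = [(-0.20, 0.16), (-0.15, 0.15), (-0.25, 0.17), (-0.10, 0.15),
          (-0.22, 0.16), (-0.18, 0.15), (-0.12, 0.16), (-0.28, 0.17),
          (-0.16, 0.15), (-0.14, 0.16)]
pooled, se, z = fixed_effect_pool(trials)
print(f"pooled log OR = {pooled:.3f} +/- {se:.3f}, z = {z:.2f}")
```

Each trial's |z| is below 1.96, yet the pooled |z| is well above it. This is why publication bias is so damaging: if only the occasional "significant" trial reaches print, the cumulative evidence that drives conclusions like the tamoxifen result never accumulates in the literature.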
Another approach to quantifying the extent of non-replicability is to model the false discovery rate, that is, the proportion of research results that are expected to be “false.” Ioannidis (2005) developed a simulation model to do so for studies that rely on statistical hypothesis testing, incorporating the pre-study (prior) odds, the statistical tests of significance, investigator bias, and other factors. Ioannidis concluded, and used as the title of his paper, that “most published research findings are false.” Some researchers have criticized Ioannidis’s assumptions and mathematical argument (Goodman and Greenland, 2007); others have pointed out that the takeaway message is that any initial results that are statistically significant need further confirmation and validation.

Analyzing the distribution of published results for a particular line of inquiry can offer insights into potential bias, which can relate to the rate of non-replicability. Several tools are being developed to compare a distribution of results to what that distribution would look like if all claimed effects were representative of the true distribution of effects. Figure 5-3 shows how publication bias can result in a skewed view of the body of evidence when only positive results that meet the statistical significance threshold are reported. When a new study fails to replicate previously published results—for example, if a study finds no relationship between variables when such a relationship had been shown in previously published studies—it appears to be a case of non-replication. However, if the published literature is not an accurate reflection of the state of the evidence because only positive results are regularly published, the new study could actually have replicated previous but unpublished negative results.8

FIGURE 5-3 Funnel charts showing estimated coefficients and standard errors (a) if all hypothetical study experiments are reported and (b) if only statistically significant results are reported. SOURCE: National Academies of Sciences, Engineering, and Medicine (2016d, p. 29).

Several techniques are available to detect and potentially adjust for publication bias, all of which are based on the examination of a body of research as a whole (cumulative evidence), rather than individual replication studies (i.e., one-to-one comparisons between studies). These techniques cannot determine which of the individual studies are affected by bias (i.e., which results are false positives) or identify the particular type of bias, but they arguably allow one to identify bodies of literature that are likely to be more or less accurate representations of the evidence. The techniques, discussed below, are funnel plots, the p-curve, the test of excess significance, and assessing unpublished literature.

Funnel Plots

One of the most common approaches to detecting publication bias involves constructing a funnel plot that displays each effect size against its precision (e.g., the sample size of the study). Asymmetry in the plotted values can reveal the absence of studies with small effect sizes, especially among studies with small sample sizes—a pattern that could suggest publication/selection bias for statistically significant effects (see Figure 5-3, above). There are criticisms of funnel plots, however; some argue that the shape of a funnel plot is largely determined by the choice of method (Tang and Liu, 2000), and others maintain that funnel plot asymmetry may not accurately reflect publication bias (Lau et al., 2006).

P-Curve

One fairly new approach is to compare the distribution of results (e.g., p-values) to the expected distributions (see Simonsohn et al., 2014a, 2014b). P-curve analysis tests whether the distribution of statistically significant p-values shows a pronounced right-skew,9 as would be expected when the results reflect true effects (i.e., the null hypothesis is false), or whether the distribution is not as right-skewed (or is even flat or, in the most extreme cases, left-skewed), as would be expected when the original results do not reflect the proportion of real effects (Gadbury and Allison, 2012; Nelson et al., 2018; Simonsohn et al., 2014a).

8 Earlier in this chapter, we discuss an indirect method for assessing non-replicability in which a result is compared to previously published values; results that do not agree with the published literature are identified as outliers. If the published literature is biased, this method would inappropriately reject valid results. This is another reason for investigating outliers before rejecting them.
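The expected shapes just described can be checked in a small simulation: among results that cross the significance threshold, p-values from a true effect pile up near zero, while p-values from a true null are roughly uniform between 0 and .05. The sketch below uses one-sided z-tests on invented data and a crude "fraction below .025" summary, not the formal p-curve tests of Simonsohn et al.

```python
import random
from statistics import NormalDist, mean

def significant_pvalues(effect, n=20, studies=5000, seed=1):
    """Simulate many one-sided z-tests of a sample mean (known SD = 1)
    and keep only the p-values below .05, i.e., the values that a
    p-curve analysis would examine."""
    rng = random.Random(seed)
    norm = NormalDist()
    pvals = []
    for _ in range(studies):
        sample_mean = mean(rng.gauss(effect, 1) for _ in range(n))
        z = sample_mean * n ** 0.5  # SE of the mean is 1 / sqrt(n)
        p = 1 - norm.cdf(z)
        if p < 0.05:
            pvals.append(p)
    return pvals

# Fraction of significant p-values that are "very small" (p < .025):
true_effect = significant_pvalues(effect=0.5)
null_effect = significant_pvalues(effect=0.0)
frac_tiny_true = sum(p < 0.025 for p in true_effect) / len(true_effect)
frac_tiny_null = sum(p < 0.025 for p in null_effect) / len(null_effect)
print(frac_tiny_true, frac_tiny_null)  # right-skewed (true effect) vs roughly flat (null)
```

Under the null, significant p-values are uniform on (0, .05), so about half fall below .025; under a true effect, far more do. This difference in shape is what p-curve analysis formalizes.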
9 Distributions that have more low p-values than high are referred to as “right-skewed”; similarly, “left-skewed” distributions have more high p-values than low.

Test of Excess Significance

A closely related statistical idea for checking publication bias is the test of excess significance. This test evaluates whether the number of statistically significant results in a set of studies is improbably high given the size of the effect and the power to test it in the set of studies (Ioannidis and Trikalinos, 2007), which would imply that the set of results is biased and may include exaggerated results or false positives. When there is a true effect, one expects the proportion of statistically significant results to be equal to the statistical power of the studies. If a researcher designs her studies to have 80 percent power against a given effect, then, at most, 80 percent of her studies would produce statistically significant results if the effect is at least that large (fewer if the null hypothesis is sometimes true). Schimmack (2012) has demonstrated that the proportion of statistically significant results across a set of psychology studies often far exceeds the estimated statistical power of those studies. This pattern of results that is “too good to be true” suggests that the results were either not obtained following the rules of statistical inference (i.e., conducting a single statistical test that was chosen a priori) or that not all studies attempted were reported (i.e., there is a “file drawer” of non-statistically significant studies that do not get published); it is also possible that the results were p-hacked or cherry picked (see Chapter 2).

In many fields, the proportion of published papers that report a positive (i.e., statistically significant) result is around 90 percent (Fanelli, 2012). This raises concerns when combined with the observation that most studies have far less than 90 percent statistical power (i.e., would only successfully detect an effect, assuming an effect exists, far less than 90 percent of the time) (Button et al., 2013; Fraley and Vazire, 2014; Stanley et al., 2018; Szucs and Ioannidis, 2017; Yarkoni, 2009). Some researchers believe that the publication of false positives is common and that reforms are needed to reduce it. Others believe that there has been an excessive focus on Type I errors (false positives) in hypothesis testing at the possible expense of an increase in Type II errors (false negatives, or failing to confirm true hypotheses) (Fiedler et al., 2012; Finkel et al., 2015; LeBel et al., 2017).

Assessing Unpublished Literature

One approach to countering publication bias is to search for and include unpublished papers and results when conducting a systematic review of the literature. Such comprehensive searches are not standard practice. For medical reviews, one estimate is that only 6 percent of reviews included unpublished work (Hartling et al., 2017), although another found that 50 percent of reviews did so (Ziai et al., 2017). In economics, there is a large and active group of researchers collecting and sharing “grey” literature, that is, research results outside of peer-reviewed publications (Vilhuber, 2018). In psychology, an estimated 75 percent of reviews included unpublished research (Rothstein, 2006). Unpublished but recorded studies (such as dissertation abstracts, conference programs, and research aggregation websites) may become easier for reviewers to access with computerized databases and with the availability of preprint servers. When a review includes unpublished studies, researchers can directly compare their results with those from the published literature, thereby estimating file-drawer effects.

Misaligned Incentives

Academic incentives—such as tenure, grant money, and status—may influence scientists to compromise on good research practices (Freeman, 2018).
Faculty hiring, promotion, and tenure decisions are often based in large part on the “productivity” of a researcher, such as the number of publications, the number of citations, and the amount of grant money received (Edwards and Roy, 2017). Some have suggested that these incentives can lead researchers to ignore standards of scientific conduct, rush to publish, and overemphasize positive results (Edwards and Roy, 2017). Formal models have shown how these incentives can lead to high rates of non-replicable results (Smaldino and McElreath, 2016). Many of these incentives may be well intentioned, but they could have the unintended consequence of reducing the quality of the science produced, and poorer quality science is less likely to be replicable. Although it is difficult to assess how widespread such unhelpful sources of non-replicability are, factors such as publication bias toward results qualifying as “statistically significant” and misaligned incentives for academic scientists create conditions that favor publication of non-replicable results and inferences.

Inappropriate Statistical Inference

Confirmatory research is research that starts with a well-defined research question and a priori hypotheses before collecting data; confirmatory research can also be called “hypothesis testing research.” In contrast, researchers pursuing exploratory research collect data and then examine the data for potential variables of interest and relationships among variables, forming a posteriori hypotheses; as such, exploratory research can be considered “hypothesis generating research.”

Exploratory and confirmatory analyses are often described as two different stages of the research process. Some have distinguished between the “context of discovery” and the “context of justification” (Reichenbach, 1938), while others have argued that the distinction lies on a spectrum rather than being categorical. Regardless of the precise line between exploratory and confirmatory research, researchers’ choices between the two affect how they and others interpret the results.

A fundamental principle of hypothesis testing is that the same data that were used to generate a hypothesis cannot be used to test that hypothesis (de Groot, 2014). In confirmatory research, the details of how a statistical hypothesis test will be conducted must be decided before looking at the data on which it is to be tested. When this principle is violated, significance testing, confidence intervals, and error control are compromised, and it cannot be assured that false positives are controlled at a fixed rate. In short, when exploratory research is interpreted as if it were confirmatory research, there can be no legitimate “statistically significant” result.

Researchers often learn from their data, and some of the most important discoveries in the annals of science have come from unexpected results that did not fit any prior theory. For example, Arno Allan Penzias and Robert Woodrow Wilson found unexpected noise in data collected in the course of their work on microwave receivers for radio astronomy observations. After attempts to explain the noise failed, the “noise” was eventually determined to be cosmic microwave background radiation, and these results helped scientists to refine and confirm theories about the “big bang.” While exploratory research generates new hypotheses, confirmatory research is equally important because it tests the hypotheses generated and can give valid answers as to whether those hypotheses have any merit. Exploratory and confirmatory research are essential parts of science, but they need to be understood and communicated as two separate types of inquiry, with two different interpretations.
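The principle that hypothesis-generating and hypothesis-testing data must be kept separate can be sketched with sample splitting, one common way to honor it. Everything below is hypothetical (invented data, 40 pure-noise outcome measures): the “best” relationship found by scanning half of the data is then tested, as a single pre-selected hypothesis, on the held-out half:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 100 subjects, 40 pure-noise outcome measures, random group labels:
# no outcome has any true relationship with group membership.
n, k = 100, 40
outcomes = rng.normal(size=(n, k))
group = rng.integers(0, 2, size=n).astype(bool)

# Exploratory half: scan every outcome and keep the most "promising" one.
explore = slice(0, n // 2)
p_explore = [
    stats.ttest_ind(outcomes[explore][group[explore], j],
                    outcomes[explore][~group[explore], j]).pvalue
    for j in range(k)
]
best = int(np.argmin(p_explore))

# Confirmatory half: test only that pre-selected hypothesis on fresh data.
confirm = slice(n // 2, n)
p_confirm = stats.ttest_ind(outcomes[confirm][group[confirm], best],
                            outcomes[confirm][~group[confirm], best]).pvalue

print(min(p_explore))  # the best of 40 scans is often "significant" noise
print(p_confirm)       # the held-out test is not inflated the same way
```

Testing the selected hypothesis on the same half that suggested it would reuse the data that generated the hypothesis, which is exactly the violation described above.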
A well-conducted exploratory analysis can help illuminate possible hypotheses to be examined in subsequent confirmatory analyses. Even a stark result in an exploratory analysis has to be interpreted cautiously, pending further work to test the hypothesis using a new or expanded dataset. It is often unclear from publications whether the results came from an exploratory or a confirmatory analysis. This lack of clarity can misrepresent the reliability and broad applicability of the reported results.

In Chapter 2 we discuss the meaning of, overreliance on, and frequent misunderstanding of “statistical significance,” including misinterpreting the meaning and overstating the utility of a particular threshold, such as p < 0.05. More generally, a number of flaws in design and reporting can reduce the reliability of a study’s results. Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives (John et al., 2012; Munafo et al., 2017).

A study from the late 1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations.
In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most (Peto, 2011). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified.
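The arithmetic behind Peto’s point is simple: with twelve independent subgroups and no true effect anywhere, the chance that at least one subgroup reaches p < 0.05 is about 1 − 0.95^12 ≈ 0.46. A hypothetical simulation (the subgroup count matches the astrological example, but the sample sizes are invented) makes the same point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def any_spurious_subgroup(n_subgroups=12, n_per_arm=50):
    """One simulated null trial: the treatment has no effect in any
    subgroup. Return True if some subgroup still reaches p < .05."""
    for _ in range(n_subgroups):
        treated = rng.normal(0.0, 1.0, n_per_arm)  # no true benefit
        control = rng.normal(0.0, 1.0, n_per_arm)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            return True
    return False

n_trials = 2000
false_alarm_rate = np.mean([any_spurious_subgroup() for _ in range(n_trials)])
print(false_alarm_rate)  # roughly 0.46, matching 1 - 0.95**12
```

Nearly half of such null trials would hand the analyst at least one “significant” subgroup to report, which is why subgroup analyses need to be prespecified or corrected for multiplicity.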

Little information is available about the prevalence of such inappropriate statistical practices as p-hacking, cherry picking, and hypothesizing after results are known (HARKing), discussed below. While surveys of researchers raise the issue, they often rely on convenience samples, and methodological shortcomings mean that they are not necessarily a reliable source for a quantitative assessment.10

P-hacking and Cherry Picking

P-hacking is the practice of collecting, selecting, or analyzing data until a result of statistical significance is found. Ways to p-hack include stopping data collection once p ≤ 0.05 is reached, analyzing many different relationships and reporting only those for which p ≤ 0.05, varying the exclusion and inclusion rules for data so that p ≤ 0.05, and analyzing different subgroups in order to get p ≤ 0.05. Researchers may p-hack without knowing it or without understanding the consequences (Head et al., 2015). A related practice is “cherry picking,” in which researchers (unconsciously or deliberately) sift through their data and results and selectively report those that meet criteria such as reaching a threshold of statistical significance or supporting a positive result, rather than reporting all of the results from their research.

Hypothesizing After Results Are Known (HARKing)

Confirmatory research begins with identifying a hypothesis based on observations, exploratory analysis, or previous research. Data are then collected and analyzed to see whether they support the hypothesis. HARKing occurs when research is presented as confirmatory although the hypothesis was in fact based on the data collected, with those same data then used as evidence to support the hypothesis. It is unknown to what extent inappropriate HARKing occurs in various disciplines, but some have attempted to quantify its consequences.
For example, a 2015 article compared hypothesized effect sizes against non-hypothesized effect sizes and found that effects were significantly larger when the relationships had been hypothesized, a finding consistent with the presence of HARKing (Bosco et al., 2015).

10. For an example of one study of this issue, see https://osf.io/sd5u7/ [January 2019].

Poor Study Design

Before conducting an experiment, a researcher must make a number of decisions about study design. These decisions, which vary depending on the type of study, could include the research question, the hypotheses, the variables to be studied, how to avoid potential sources of bias, and the methods for collecting, classifying, and analyzing data. Researchers’ decisions at various points along this path can contribute to non-replicability. Ways in which poor study design can contribute to avoidable non-replicability include not recognizing or adjusting for known biases, not following best practices for randomization, using poorly designed materials and tools (ranging from physical equipment to questionnaires to biological reagents), introducing confounding in data manipulation, using poor measures, and failing to characterize and account for known uncertainties.

Errors

In 2010, economists Carmen Reinhart and Kenneth Rogoff published an article reporting that when a country’s debt exceeds 90 percent of the country’s gross domestic product, economic growth slows and declines slightly (to −0.1 percent). These results were widely publicized and used to support austerity measures around the world (Herndon et al., 2013). However, in 2013, with access to Reinhart and Rogoff’s original spreadsheet of data and analysis (which the authors had saved and made available for the replication effort), researchers reanalyzing the original studies found several errors in the analysis and data selection. One error was an incomplete set of countries used in the analysis that established the relationship between debt and economic growth. When data from Australia, Austria, Belgium, Canada, and Denmark were correctly included, and other errors were corrected, the economic growth in the countries with debt above 90 percent of gross domestic product was actually +2.2 percent, rather than −0.1 percent. In response, Reinhart and Rogoff acknowledged the errors, calling it “sobering that such an error slipped into one of our papers despite our best efforts to be consistently careful.” Reinhart and Rogoff said that while the error led to a “notable change” in the calculation of growth in one category, they did not believe it “affects in any significant way the central message of the paper.”11

The Reinhart and Rogoff error was fairly high profile, and a quick Internet search would let any interested reader know that the original paper contained errors. Many errors go undetected or are acknowledged only through a brief correction in the publishing journal. A 2015 study looked at a sample of more than 250,000 p-values reported in eight major psychology journals over a period of 28 years. The study found that many of the p-values reported in papers were inconsistent with a recalculation of the p-value, and that in one out of eight papers this inconsistency was large enough to affect the statistical conclusion (Nuijten et al., 2016).
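The kind of consistency check that Nuijten and colleagues automated can be sketched in a few lines: recompute the p-value implied by a reported test statistic and its degrees of freedom, then compare it with the printed p-value. The reported numbers and the tolerance below are hypothetical choices for illustration, not the actual rules of tools such as statcheck:

```python
from scipy import stats

def check_reported_p(t_value, df, reported_p, tol=0.005):
    """Recompute a two-sided p-value from a reported t statistic and flag
    the result if it disagrees with the p-value printed in the paper."""
    recomputed = 2 * stats.t.sf(abs(t_value), df)
    return recomputed, abs(recomputed - reported_p) > tol

# Hypothetical reported values, for illustration only.
p_ok, flag_ok = check_reported_p(t_value=2.10, df=28, reported_p=0.045)
p_bad, flag_bad = check_reported_p(t_value=2.10, df=28, reported_p=0.020)
print(flag_ok, flag_bad)  # the first report passes; the second is flagged
```

Because the check needs only numbers that already appear in the published text, it can be run at scale across entire literatures, which is what made the analysis of 250,000 p-values feasible.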
11. Available: https://archive.nytimes.com/www.nytimes.com/interactive/2013/04/17/business/17economix-response.html [January 2019].

Errors can occur at any point in the research process: measurements can be recorded inaccurately, typographical errors can occur when inputting data, and calculations can contain mistakes. If these errors affect the final results and are not caught prior to publication, the research may be non-replicable. Unfortunately, these types of errors can be difficult to detect. In the case of computational errors, transparency in data and computation may make it more likely that the errors can be caught and corrected. Other errors, such as mistakes in measurement, might not be detected until and unless a failed replication that does not make the same mistake indicates that something was amiss in the original study.

Incomplete Reporting of a Study

During the course of research, researchers make numerous choices about their studies. When a study is published, some of these choices are reported in the methods section. A methods section often covers what materials were used, how participants or samples were chosen, what data collection procedures were followed, and how data were analyzed. The failure to report some aspect of the study, or to do so in sufficient detail, may make it difficult for another researcher to replicate the result. For example, if a researcher reports only that she “adjusted for comorbidities” within the study population, this does not specify exactly how the comorbidities were adjusted for, and it does not give enough guidance for future researchers to follow the protocol. Similarly, if a researcher does not give adequate information about the biological reagents used in an experiment, a second researcher may have difficulty replicating the experiment. Even if a researcher reports all of the critical information about the conduct of a study, other seemingly inconsequential details that have an effect on the outcome could remain unreported. Just as reproducibility requires transparent sharing of data, code, and analysis, replicability requires transparent sharing of how an experiment was conducted and the choices that were made. This allows future researchers, if they wish, to attempt replication as close to the original conditions as possible.

Box 5-3
A Note on Generalizability

At times, selective variation in the conditions of an experiment will be the goal. When results are consistent across studies that used slightly different methods or conditions, this strengthens the validity of the results. To generalize results, a systematic variation of the important parameters and variables would be conducted with the aim of learning the limits of their effects and improving the characterization of uncertainties. Experiments conducted under identical conditions may run the risk of finding “truths” that are valid only in the narrow experimental context. For example, in animal research, it has long been known that the environmental conditions in which the animals live can have an impact on the outcome of experiments. Because of this, animal researchers have attempted to standardize environments in order to increase comparability between studies and reduce the need to replicate studies involving animals (Richter et al., 2009). However, a 2009 study suggests that such standardization may actually be a cause of non-replicability, rather than a cure. The authors of this study reported that environmental standardization may compromise replicability by “systematically increasing the incidence of results that are idiosyncratic to study-specific environmental conditions” (Richter et al., 2009).
In other words, studies performed in such highly standardized environments result in “local ‘truths’ with little external validity” (Richter et al., 2009).

Fraud and Misconduct

At the extreme, sources of non-replicability that do not advance scientific knowledge, and that do much to harm science, include misconduct and fraud in scientific research. Instances of fraud are uncommon but can be sensational. Despite its infrequent occurrence, and regardless of how highly publicized individual cases may be, fraud is uniformly bad for science and is therefore worthy of attention within this study. Researchers who knowingly use questionable research practices (QRPs) with the intent to deceive are committing misconduct or fraud. In practice it can be difficult to differentiate between honest mistakes and deliberate misconduct, because the underlying action may be the same while the intent is not.

Reproducibility and replicability emerged as general concerns in science around the same time that research misconduct and detrimental research practices were receiving renewed attention. Interest in both topics was spurred by some of the same trends and by a small number of widely publicized cases in which discovery of fabricated or falsified data was delayed, and the practices of journals, research institutions, and individual laboratories were implicated in enabling such delays (National Academies of Sciences, Engineering, and Medicine, 2017; Levelt Committee et al., 2012).

In the case of Anil Potti at Duke University, a researcher using genomic analysis on cancer patients was later found to have falsified data. This experience prompted the study and report Evolution of Translational Omics: Lessons Learned and the Way Forward (Institute of Medicine, 2012), which in turn led to new guidelines for omics research at the National Cancer Institute. Around the same time, in a case that came to light in the Netherlands, social psychologist Diederik Stapel had gone from manipulating to fabricating data over the course of a career that included dozens of fraudulent publications. Similarly, highly publicized concerns about misconduct by Cornell University professor Brian Wansink highlight how consistent failure to adhere to best practices for collecting, analyzing, and reporting data, whether intentional or not, can blur the line between helpful and unhelpful sources of non-replicability. In this case, a Cornell faculty committee found that Wansink had committed “academic misconduct in his research and scholarship, including misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship.”12

A subsequent report, Fostering Integrity in Research (National Academies of Sciences, Engineering, and Medicine, 2017), emerged in this context, and several of its central themes are relevant to questions posed in this report. According to the definition adopted by the U.S. federal government in 2000, research misconduct is fabrication of data, falsification of data, or plagiarism “in proposing, performing, or reviewing research, or in reporting research results” (Office of Science and Technology Policy, 2000, p. 76262).
The federal policy requires that research institutions report all allegations of misconduct in research projects supported by federal funding that have advanced from the inquiry stage to a full investigation, and that they report on the results of those investigations. Other detrimental research practices (see National Academies of Sciences, Engineering, and Medicine, 2017) include failing to follow sponsor requirements or disciplinary standards for retaining data, authorship misrepresentation other than plagiarism, refusing to share data or methods, and misleading statistical analysis that falls short of falsification. In addition to the behaviors of individual researchers, detrimental research practices also include actions taken by organizations, such as failure on the part of research institutions to maintain adequate policies, procedures, or capacity to foster research integrity and assess allegations of research misconduct, and abusive or irresponsible publication practices by journal editors and peer reviewers.

Just as information on rates of non-reproducibility and non-replicability in research is limited, knowledge about research misconduct and detrimental research practices is scarce. Reports of research misconduct allegations and findings are released by the National Science Foundation Office of Inspector General and the Department of Health and Human Services Office of Research Integrity (see National Science Foundation, 2018d). As discussed above, new analyses of retraction trends have shed some light on the frequency of fraud and misconduct. Allegations and findings of misconduct increased from the mid-2000s to the mid-2010s but may have leveled off in the past few years. Analysis of retractions of scientific articles in journals may also shed some light on the problem (Steen et al., 2013). One analysis of biomedical articles found that misconduct was responsible for more than two-thirds of retractions (Fang et al., 2012).
12. See: http://statements.cornell.edu/2018/20180920-statement-provost-michael-kotlikoff.cfm [April 2019].

As mentioned earlier, a wider analysis of all retractions of scientific papers found about one-half attributable to misconduct or fraud (Brainard, 2018). Others have found some differences by discipline (Grieneisen and Zhang, 2012).

One theme of Fostering Integrity in Research is that research misconduct and detrimental research practices form a continuum of behaviors (National Academies of Sciences, Engineering, and Medicine, 2017). While current policies and institutions aimed at preventing and dealing with research misconduct are certainly necessary, detrimental research practices likely arise from some of the same causes and may cost the research enterprise more than misconduct does: resources are wasted on the fabricated or falsified work and on following it up, public health may be harmed by treatments based on acceptance of incorrect clinical results, and collaborators and institutions suffer reputational harm, among other costs. No branch of science is immune to research misconduct, and the committee did not find any basis for differentiating the relative level of occurrence in various branches of science. Some but not all researcher misconduct has been uncovered through reproducibility and replication attempts, that is, through the self-correcting mechanisms of science. From the available evidence, documented cases of researcher misconduct are relatively rare, as suggested by a rate of retractions in scientific papers of approximately 4 in 10,000 (Brainard, 2018).

CONCLUSION 5-4: The occurrence of non-replicability is due to multiple sources, some of which impede and others of which promote progress in science. The overall extent of non-replicability is an inadequate indicator of the health of science.


