National Academies Press: OpenBook

Reproducibility and Replicability in Science (2019)

Chapter: 7 Confidence in Science

Suggested Citation:"7 Confidence in Science." National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. doi: 10.17226/25303.


Prepublication copy, uncorrected proofs.

7 CONFIDENCE IN SCIENCE

The committee was asked to “draw conclusions and make recommendations for improving rigor and transparency in scientific and engineering research.” Certainly, reproducibility and replicability play an important role in achieving rigor and transparency, and for some lines of scientific inquiry, replication is one way to gain confidence in scientific knowledge. For other lines of inquiry, however, direct replications may be impossible due to the characteristics of the phenomena being studied. The robustness of science is less well represented by the replications between two individual studies than by a more holistic web of knowledge reinforced through multiple lines of examination and inquiry. In this chapter, the committee illustrates a spectrum of pathways to attain rigor and confidence in scientific knowledge, beginning with an overview of research synthesis and meta-analysis, and then citing illustrative approaches and perspectives from geoscience, genetics, psychology, and big data in social sciences. The chapter concludes with consideration of public understanding and confidence in science.

When results are computationally reproduced or replicated, confidence in robustness of the knowledge derived from that particular study is increased. However, reproducibility and replicability are focused on the comparison between individual studies. By looking more broadly and using other techniques to gain confidence in results, multiple pathways can be found to consistently support certain scientific concepts and theories while rejecting others. Research synthesis is a widely accepted and practiced method for gauging the reliability and validity of bodies of research, although like all research methods, it can be used in ways that are more or less valid (de Vrieze, 2018).
The common principles of science (gathering evidence, developing theories and/or hypotheses, and application of logic) allow us to explore and predict systems that are inherently not replicable. We use several of these systems below to highlight how scientists gain confidence when direct assessments of reproducibility or replicability are not feasible.

RESEARCH SYNTHESIS

As we note throughout this report, studies purporting to investigate similar scientific questions can produce inconsistent or contradictory results. Research synthesis addresses the central question of how the results of studies relate to each other, what factors may be contributing to variability across studies, and how study results coalesce or not in developing the knowledge network for a particular science domain. In current use, the term research synthesis describes the ensemble of research activities involved in identifying, retrieving, evaluating, synthesizing, interpreting, and contextualizing the available evidence from studies on a particular topic and comprises both systematic reviews and meta-analyses. For example, a research synthesis may classify studies based on some feature and then test whether the effect size differs between studies with and without that feature. The term meta-analysis is reserved for the quantitative analysis conducted as part of research synthesis.

Although the terms used to describe research synthesis vary, the practice is widely used, in fields ranging from medicine to physics. In medicine, Cochrane reviews are systematic reviews that are performed by a body of experts who examine and synthesize the results of medical research.1 These reviews provide an overview of the best available evidence on a wide variety of topics, and they are updated periodically as needed. In physics, the Task Group on Fundamental Constants performs research syntheses as part of its task to adjust the values of the fundamental constants of physics. The task group compares new results to each other and to the current estimated value, and uses this information to calculate an adjusted value (Mohr, Newell, and Taylor, 2016).

The exact procedure for research syntheses varies by field and by the scientific question at hand; the following is a general description of the approach. Research synthesis begins with formal definitions of the scientific issues and the scope of the investigation and proceeds to search for published and unpublished sources of potentially relevant information (e.g., study results). The ensemble of studies identified by the search is evaluated for relevance to the central scientific question, and the resulting subset of studies undergoes review for methodological quality, typically using explicit criteria and the assignment of quality scores. The next step is the extraction of qualitative and quantitative information from the selected studies. The former includes study-level characteristics of design and study processes; the latter includes quantitative results, such as study-level estimates of effects and variability overall and by subsets of study participants or units, or individual-level data on study participants or units (see Institute of Medicine, 2011, Ch. 4). Using summary statistics or individual-level data, meta-analysis provides estimates of overall central tendencies, effect sizes, or association magnitudes, along with estimates of the variance or uncertainty in those estimates.
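The pooling step described above can be sketched in a few lines. The example below is a minimal illustration, not the procedure of any particular review body: it assumes each study reports one effect estimate and its sampling variance, combines them with inverse-variance (fixed-effect) weights, and reports Cochran's Q and the I² statistic as a rough gauge of between-study variation. The function name `fixed_effect_meta` and the input numbers are hypothetical.

```python
import math


def fixed_effect_meta(effects, variances):
    """Inverse-variance (fixed-effect) pooling of study-level estimates.

    effects   -- one effect estimate per study (hypothetical inputs)
    variances -- the sampling variance of each estimate
    Returns the pooled estimate, its standard error, an approximate
    95% interval, Cochran's Q, and the I^2 heterogeneity statistic.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching lists with at least two studies")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # More precise studies (smaller variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation beyond what sampling error alone predicts.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    return {
        "pooled": pooled,
        "se": se,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "Q": q,
        "I2": i2,
    }


if __name__ == "__main__":
    # Two hypothetical studies with equal precision.
    print(fixed_effect_meta([0.2, 0.4], [0.01, 0.01]))
```

A fixed-effect model is used here only because it is the simplest case; when heterogeneity is substantial, a random-effects model is usually the more appropriate choice.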
For example, the meta-analysis of the comparative efficacy of two treatments for a particular condition can provide estimates of an overall effect in the target clinical population. Replicability of an effect is reflected in the consistency of effect sizes across the studies, especially when a variety of methods, each with different weaknesses, converge on the same conclusion. As a tool for testing whether patterns of results across studies are anomalous, meta-analyses have, for example, suggested that well-accepted results in a scientific field are or could plausibly be largely due to publication bias.

Meta-analyses also test for variation in effect sizes and, as a result, can suggest potential causes of non-replicability in existing research. Meta-analyses can quantify the extent to which results appear to vary from study to study solely due to random sampling variation or vary in a systematic way by subgroups (including sociodemographic, clinical, genetic, and other subject characteristics), as well as by characteristics of the individual studies (such as important aspects of the design of studies, the treatments used, and the time period and context in which studies were conducted). Of course, these features of the original studies need to be described sufficiently to be retrieved from the research reports.

For example, a meta-analytic aggregation across 200 meta-analyses published in the top journal for reviews in psychology, Psychological Bulletin, showed that only 8 percent of studies had adequate statistical power; variation across studies testing the same hypothesis was very high, with 74 percent of variation due to unexplained heterogeneity; and reporting bias overall was low (Stanley et al., 2018). In social psychology, Malle (2006) conducted a meta-analysis of studies comparing how actors explain their own behavior with how observers explain it and identified an unrecognized confounder: the positivity of the behavior.
In studies that tested positive behaviors, actors took credit for the action and attributed it more to themselves than did observers. In studies that tested negative behaviors, actors justified the behavior and viewed it as due to the situation they were in more than did observers. Similarly, meta-analyses have often shown that the association of obesity with various outcomes (e.g., dementia) depends on the age in life at which the obesity is considered.

1 For an overview of Cochrane, see www.cochrane.org [January 2019].

Systematic reviews and meta-analyses are typically conducted as retrospective investigations, in the sense that they search and evaluate the evidence from studies that have been conducted. Systematic reviews and meta-analyses are susceptible to biased data sets, for example, if the scientific literature on which a systematic review or a meta-analysis is based is biased due to publication bias of positive results. However, the potential for a prospective formulation of evidence synthesis is clear and is beginning to transform the landscape. Some research teams are beginning to monitor the scientific literature on a particular topic and conduct periodic updates of systematic reviews on the topic.2 Prospective research synthesis may offer a partial solution to the challenge of biased data sets.

Meta-research is a new field that involves evaluating and improving the practice of research. Meta-research encompasses and goes beyond meta-analysis. As Ioannidis et al. (2015) aptly argued, meta-research can go beyond single substantive questions to examine factors that affect rigor, reproducibility, replicability, and, ultimately, the truth of research results across many topics.

CONCLUSION 7-1: Further development in and the use of meta-research would facilitate learning from scientific studies.
These developments would include the study of research practices such as research on the quality and effects of peer review of journal manuscripts or grant proposals, research on the effectiveness and side effects of proposed research practices, and research on the variation in reproducibility and replicability between fields or over time.

GEOSCIENCE

What distinguishes geoscience from much of chemistry, biology, and physics is its focus on phenomena that emerge out of uncontrolled natural environments, as well as its special concern with understanding past events documented in the geologic record. Emergent phenomena on a global scale include climate variations at Earth’s surface, tectonic motions of its lithospheric plates, and the magnetic field generated in its iron-rich core. The geosystems responsible for these phenomena have been active for billions of years, and the geologic record indicates that many of the terrestrial processes in the distant geologic past were similar to those that are occurring today. Geoscientists seek to understand the geosystems that produced these past behaviors and to draw implications regarding the future of the planet and its human environment. While one cannot replicate geologic events, such as earthquakes or hurricanes, scientific methods are used to generate increasingly accurate forecasts and predictions.

Emergent phenomena from complex natural systems are infinite in their variety; no two events are identical and, in this sense, no event repeats itself. Events can be categorized according to their statistical properties, however, such as the parameters of their space, time, and size distributions. The satisfactory explanation of an emergent phenomenon requires building a geosystem model (usually a numerical code) that can replicate the statistics of the phenomenon by simulating the causal processes and interactions. In this context, replication means achieving sufficient statistical agreement between the simulated and observed phenomena. Understanding of a geosystem and its defining phenomena is often measured by scientists’ ability to replicate behaviors that were previously observed (retrospective testing) and predict new ones that can be subsequently observed (prospective testing). These evaluations can be in the form of null-hypothesis significance tests (e.g., expressed in terms of p-values) or in terms of skill scores relative to a prediction baseline (e.g., weather forecasts relative to mean-climate forecasts).

2 In the broad area of health care research, for example, this approach has been adopted by Cochrane, an international group for systematic reviews, and by U.S. government organizations such as the Agency for Healthcare Research and Quality and the U.S. Preventive Services Task Force.

In the study of geosystems, reproducibility and replicability are closely tied to verification and validation.3 Verification confirms the correctness of the model by checking that the numerical code correctly solves the mathematical equations. Validation is the process of deciding whether a model replicates the data-generating process accurately enough to warrant some specific application, such as the forecasting of natural hazards.

Hazard forecasting is an area of applied geoscience in which the issues of reproducibility and replicability are sharply framed by the operational demands for delivering high-quality information to a variety of users in a timely manner. Federal agencies tasked with providing authoritative hazard information to the public have undertaken substantial programs to improve reproducibility and replicability standards in operational forecasting.
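The idea of a skill score relative to a prediction baseline, mentioned above, can be made concrete with a small sketch. The report does not name a specific score; the Brier skill score is used here as one common choice for probabilistic forecasts of yes/no events, and it is an assumption of this illustration. The baseline is a constant forecast equal to the observed event frequency, standing in for a mean-climate forecast. All function names and numbers are hypothetical.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("need two non-empty lists of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_skill_score(forecast_probs, outcomes, baseline_prob=None):
    """Skill of a forecast relative to a constant baseline forecast.

    1.0 means a perfect forecast, 0.0 means no better than the baseline,
    and negative values mean worse than the baseline.
    If baseline_prob is omitted, the observed event frequency in `outcomes`
    is used as a stand-in for a climatological forecast (an assumption).
    """
    if baseline_prob is None:
        baseline_prob = sum(outcomes) / len(outcomes)
    score = brier_score(forecast_probs, outcomes)
    reference = brier_score([baseline_prob] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("baseline is already perfect; skill is undefined")
    return 1.0 - score / reference


if __name__ == "__main__":
    observed = [1, 0, 1, 0]          # hypothetical: did the event occur?
    forecast = [0.9, 0.1, 0.8, 0.2]  # hypothetical forecast probabilities
    print(brier_skill_score(forecast, observed))
```

In practice the baseline frequency would come from a long historical record rather than from the same events being scored, as it does in this toy example.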
The cyberinfrastructure constructed to support operational forecasting also enhances capabilities for exploratory science in geosystems.

Natural hazards (from windstorms, droughts, floods, and wildfires to earthquakes, landslides, tsunamis, and volcanic eruptions) are notoriously difficult to predict because of the scale and complexity of the geosystems that produce them. Predictability is especially problematic for extreme events of low probability but high consequence that often dominate societal risk, such as the “500-year flood” or “2,500-year earthquake.” Nevertheless, across all sectors of society, expectations are rising for timely, reliable predictions of natural hazards based on the best available science.4 A substantial part of applied geoscience now concerns the scientific forecasting of hazards and their consequences. A forecast is deemed scientific if it meets five criteria:

1. formulated to predict measurable events;
2. respectful of physical laws;
3. calibrated against past observations;
4. as reliable and skillful as practical, given the available information; and
5. testable against future observations.

To account for the unavoidable sources of non-replicability (i.e., the randomness of nature and lack of knowledge about this variability), scientific forecasts must be expressed as probabilities. The goal of probabilistic forecasting is to develop forecasts of natural events that are statistically ideal, that is, the best forecasts possible given the available information. Progress towards this goal requires the iterated development of forecasting models over many cycles of data gathering, model calibration, verification, simulation, and testing.

3 The meanings of the terms verification and validation, like reproducibility and replicability, differ among fields. Here we conform to the usage in computer and information science. In weather forecasting, a model is verified by its agreement with data, which is here called validation.

4 For example, the 2015 Paris Agreement adopted by the U.N. Framework Convention on Climate Change specifies that “adaptation action . . . should be based on and guided by the best available science.” The California Earthquake Authority is required by law to establish residential insurance rates that are based on “the best available science” (Marshall, 2018, p. 106).

In some fields, such as weather and hydrological forecasting, the natural cycles are rapid enough and the observations are dense and accurate enough to permit the iterated development of system-level models with high explanatory and predictive power. Through steady advances in data collection and numerical modeling over the past several decades, the skill of the ensemble forecasting models developed and maintained by the weather prediction centers has been steadily improved (Bauer et al., 2015). For example, forecasting skill in the range from 3 to 10 days ahead has been increasing by about 1 day per decade; that is, today’s 6-day forecast is as accurate as the 5-day forecast was 10 years ago. This is a familiar illustration of gaining confidence in scientific knowledge without doing repeat experiments.

GENETICS

One of the principal tools to gain knowledge about genetic risk factors for disease is a genome-wide association study, or GWAS. It is an observational study of a genome-wide set of genetic variants with the aim of detecting which variants may be associated with the development of a disease, or more broadly, associated with any expressed trait. A GWAS can be complex to mount, involve massive data collection, and require application of a range of sophisticated statistical methods for correct interpretation. The community of investigators undertaking genome-wide association studies has adopted a series of practices and standards to improve the reliability of their results.
These practices include a wide range of activities:

• efforts to ensure consistency in data generation and extensive quality control steps to ensure the reliability of genotype data;
• genotype and phenotype harmonization;
• a push for large sample sizes through the establishment of large international disease consortia;
• rigorous study design and standardized statistical analysis protocols, including consensus building on controlling for key confounders, such as genetic ancestry/population stratification, the use of stringent criteria to account for multiple testing, and the development of norms for conducting independent replication studies and meta-analyzing multiple cohorts;
• a culture of large-scale international collaboration and sharing of data, results and tools, empowered by strong infrastructure support; and
• an incentive system, which is created to meet scientific needs and is recognized and promoted by funding agencies and journals, as well as grant and paper reviewers, for scientists to perform reproducible, replicable, and accurate research.

For a description of the general approach taken by this community of investigators, see Lin (2018).
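One of the practices listed above, the use of stringent criteria to account for multiple testing, can be illustrated with a short sketch. The report does not specify which correction the community uses; the example assumes a simple Bonferroni correction, under which a family-wise error rate of 0.05 spread over roughly one million independent tests gives the commonly cited genome-wide threshold of about 5 × 10⁻⁸. The variant identifiers and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_variants(p_values, alpha=0.05, n_tests=None):
    """Return the variants whose p-value passes the Bonferroni threshold.

    p_values -- mapping of variant identifier to association p-value
    n_tests  -- number of tests to correct for; defaults to len(p_values)
    """
    m = n_tests if n_tests is not None else len(p_values)
    threshold = bonferroni_threshold(alpha, m)
    return [variant for variant, p in p_values.items() if p < threshold]


if __name__ == "__main__":
    # Hypothetical association results for three variants.
    results = {"variant_A": 1e-9, "variant_B": 1e-6, "variant_C": 0.03}
    print(bonferroni_threshold(0.05, 1_000_000))
    print(significant_variants(results, n_tests=1_000_000))
```

Only `variant_A` passes here, even though all three p-values would count as significant at the conventional single-test level of 0.05, which is the point of the stringent criterion.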

PSYCHOLOGY

The idea that there is a “replication crisis” in psychology has received a good deal of attention in professional and popular media, including the New York Times, the Atlantic, the National Review, and Slate. However, there is no consensus within the field on this point. Some researchers believe that the field is rife with lax methods that threaten validity, including low statistical power, failure to distinguish between a priori and a posteriori hypothesis testing, and the potential for p-hacking (e.g., Pashler and Wagenmakers, 2012; Simmons et al., 2011). Other researchers disagree with this characterization and have discussed the costs of what they see as misportraying psychology as a field in crisis, such as the possible chilling effects of such claims on young investigators and an overemphasis on Type I errors (false positives) at the expense of Type II errors (false negatives), with a resulting failure to discover important new phenomena (Fanelli, 2018; Fiedler et al., 2012). Yet others have noted that psychology has long been concerned with improving its methodology, and the current discussion of reproducibility is part of the normal progression of science. An analysis of experimenter bias in the 1960s is a good example, especially as it spurred the use of double-blind methods in experiments (Rosenthal, 1979). In this view, the current concerns can be situated within a history of continuing methodological improvements as psychological scientists continue to develop better understanding and implementation of statistical and other methods and reporting practices.

One reason to believe in the fundamental soundness of psychology as a science is that a great deal of useful and reliable knowledge is being produced. Researchers are making numerous replicable discoveries about the causes of human thought, emotion, and behavior (Shiffrin et al., 2018).
To give but a few examples, research on human memory has documented the fallibility of eyewitness testimony, leading to the release of many wrongly convicted prisoners (Loftus, 2017). Research on “overjustification” shows that rewarding children can undermine their intrinsic interest in desirable activities (Lepper and Henderlong, 2000). Research on how decisions are framed has found that more people participate in social programs, such as retirement savings or organ donation, when they are automatically enrolled and have to make a decision to leave (opt out), compared with when they have to make a decision to join (opt in) (Jachimowicz et al., 2018). Increasingly, researchers and governments are using such psychological knowledge to meet social needs and solve problems, including improving educational outcomes, reducing government waste from ineffective programs, improving people’s health, and reducing stereotyping and prejudice (Walton and Wilson, 2018; Wood and Neal, 2016).

It is possible that accompanying this progress are lower levels of reproducibility than would be desirable. As discussed throughout this report, no field of science produces perfectly replicable results, but it may be useful to estimate the current level of replicability of published psychology results and ask whether that level is as high as the field believes it needs to be. Indeed, psychology has been at the forefront of empirical attempts to answer this question with large-scale replication projects, in which researchers from different labs attempt to reproduce a set of studies (see Table 5-1).

The replication projects themselves have proved to be controversial, however, generating wide disagreement about the attributes used to assess replication and the interpretation of the results. Some view the results of these projects as cause for alarm. In his remarks to the committee, for example, Brian Nosek observed: “The evidence for reproducibility [replicability] has fallen short of what one might expect or what one might desire” (Nosek, 2018). Researchers who agree with this perspective offer a range of evidence.5

First, many of the replication attempts had levels of rigor (e.g., sample size, transparency, preregistration) similar to or higher than those of the original studies, and yet many were not able to reproduce the original results (Cheung et al., 2016; Ebersole et al., 2016; Eerland et al., 2016; Hagger et al., 2016; Klein et al., in press; O’Donnell et al., 2018; Wagenmakers et al., 2016). Given the high degree of scrutiny on replication studies (Zwaan et al., 2018), it is unlikely that most failed replications are the result of sloppy research practices.

Second, some of the replication attempts have focused specifically on results that have garnered a lot of attention, are taught in textbooks, and are in other ways high profile, that is, results that one might expect have a high chance of being robust. Some of these replication attempts were successful, but many were not (e.g., Hagger et al., 2016; O’Donnell et al., 2018; Wagenmakers et al., 2016).

Third, a number of the replication attempts were collaborative, with researchers closely tied to the original result (e.g., the authors of the original studies or people with a great deal of expertise on the phenomenon) playing an active role in vetting the replication design and procedure (Cheung et al., 2016; Eerland et al., 2016; Hagger et al., 2016; O’Donnell et al., 2018; Wagenmakers et al., 2016). This has not consistently led to positive replication results.

Fourth, when potential mitigating factors have been identified for the failures to replicate, these are often speculative and yet to be tested empirically. For example, failures to replicate have been attributed to context sensitivity, the idea that some phenomena are simply more difficult to recreate in another time and place (Van Bavel et al., 2016).
However, without prospective empirical tests of this or other proposed mitigating factors, it remains a real possibility that the original result is not replicable.

And fifth, even if a substantial portion (say, one-third) of failures to replicate are false negatives, it would still lead to the conclusion that the replicability of psychology results falls short of the ideal. Thus, to conclude that replicability rates are acceptable (say, near 80 percent), one would need to have confidence that most failed replications have significant flaws.

Others, however, have a quite different view of the results of the replication projects that have been conducted so far, and offer their own arguments and evidence. First, some replication projects have found relatively high rates of replication: for example, Klein et al. (2014) replicated 10 of 13 results. Second, some high-profile replication projects (e.g., Open Science Collaboration, 2015) may have underestimated the replication rate by failing to correct for errors and by introducing changes in the replications that were not in the original studies (e.g., Bench et al., 2017; Etz and Vandekerckhove, 2016; Gilbert et al., 2015; Van Bavel et al., 2016). Moreover, several cases have come to light in which studies failed to replicate because of methodological changes in the replications, rather than problems with the original studies, and when these changes were corrected, the study replicated successfully (e.g., Alogna et al., 2014; Luttrell et al., 2017; Noah et al., 2018). Finally, the generalizability of the replication results is unknown, because no project randomly selected the studies to be replicated, and many were quite selective in the studies they chose to try to replicate.

5 For a list of replication studies in psychology, see http://curatescience.org/#replications-section [January 2018].

An unresolved question in any analysis of replicability is what criteria to use to determine success or failure. Meta-analysis across a set of results may be a more promising technique to assess replicability, because it can evaluate moderators of effects as well as uniformity of results. However, meta-analysis may not achieve sufficient power given only a few studies.

Despite opposing views about how to interpret large-scale replication projects, there seems to be an emerging consensus that it is not helpful, or justified, to refer to psychology as in a state of “crisis.” Nosek put it this way in his comments to the committee: “How extensive is the lack of reproducibility in research results in science and engineering in general? The easy answer is that we don’t know. We don’t have enough information to provide an estimate with any certainty for any individual field or even across fields in general.” He added, “I don’t like the term ‘crisis’ because it implies a lot of things that we don’t know are true.”

Moreover, even if there were a definitive estimate of replicability in psychology, no one knows the expected level of non-replicability in a healthy science. Empirical results in psychology, like science in general, are inherently probabilistic, meaning that some failures to replicate are inevitable. As we stress throughout this report, innovative research will likely produce inconsistent results as it pushes the boundaries of knowledge. Ambitious research agendas that, for example, link brain and behavior, genetic and environmental influences, computational models and empirical results, and hormonal fluctuations with emotions necessarily yield some dead ends and failures. In short, some failures to replicate can reflect normal progress in science, and they can also highlight a lack of theoretical understanding or methodological limitations.
Whatever the extent of the problem, scientific methods and data analytic techniques can always be improved, and this discussion follows a long tradition in psychology of methodological innovation. New practices, such as checks on the efficacy of experimental manipulations, are now accepted in the field. Funding proposals now include power analyses as a matter of course. Longitudinal studies no longer just note attrition (participant dropout), but instead routinely estimate its effects (e.g., intention-to-treat analyses). At the same time, not all researchers have adopted best practices, sometimes failing to keep pace with current knowledge (Sedlmeier and Gigerenzer, 1989). Only recently are researchers starting to systematically use power calculations in research reports or to provide online access to data and materials. Pressures on researchers to improve practices and to increase transparency have been heightened in the past decade by new developments in information technology that increase public access to information and scrutiny of science (Lupia, 2017).

SOCIAL SCIENCE RESEARCH USING BIG DATA

With close to 7 in 10 Americans now using social media as a regular news source (Pew, 2018), social scientists in communication research, psychology, sociology, and political science routinely analyze a variety of information disseminated on commercial social media platforms, such as Twitter and Facebook, how that information flows through social networks, and how it influences attitudes and behaviors. Analyses of data from these commercial platforms may rely on publicly available data that can be scraped and collected by any researcher without input from or collaboration with industry partners (model 1). Alternatively, industry staff may collaborate with researchers and provide access to proprietary data for analysis (such as code or underlying algorithms) that may not be made available to others (model 2). Variations on these two basic models will depend on the type of intellectual property being used in the research.

Both models raise challenges for reproducibility and replicability. In terms of reproducibility, when data are proprietary and undisclosed, the computation by definition is not reproducible by others. This might put this kind of research at odds with publication requirements of journals and other academic outlets. An inability to publish results from such industry partnerships may in the long term create a disincentive for work on datasets that cannot be made publicly available and increase pressure from within the scientific community on industry partners for more openness. This process may be accelerated if funding agencies only support research that follows the standards for full documentation and openness detailed in this report.

Both models also raise issues with replicability. Social media platforms, such as Twitter and Facebook, regularly modify their application programming interfaces (APIs) and other modalities of data access, which influences the ability of researchers to access, document, and archive data consistently. In addition, data are likely confounded by ongoing A/B testing⁶ and tweaks to underlying algorithms. In model 1, these confounds are not transparent to researchers and therefore cannot be documented or controlled for in the original data collections or attempts to replicate the work. In model 2, they are known to the research team, but because they are proprietary they cannot be shared publicly. In both models, changes implemented by social media platforms in algorithms, APIs, and other internal characteristics over time make it impossible to computationally reproduce analytic models and to have confidence that equivalent data for reproducibility can be collected over time.

In summary, the considerations for social science using big data of the type discussed above illustrate a spectrum of challenges and approaches toward gaining confidence in scientific studies.
In these and other scientific domains, science progresses through growing consensus in the scientific community of what counts as scientific knowledge. At the same time, public trust in science is premised on public confidence in the ability of scientists to demonstrate and validate what they assert is scientific knowledge. In the examples above, diverse fields of science have developed methods for investigating phenomena that are difficult or impossible to replicate. Yet, as in the case of hazard prediction, scientific progress has been made, as evidenced by forecasts with increased accuracy. This progress is built from the results of many trials and errors. Differentiating a success from a failure of a single study cannot be done without looking more broadly at the other lines of evidence. As noted by Goodman et al. (2016, p. 3): “[A] preferred way to assess the evidential meaning of two or more results with substantive stochastic variability is to evaluate the cumulative evidence they provide.”

CONCLUSION 7-2: Multiple channels of evidence from a variety of studies provide a robust means for gaining confidence in scientific knowledge over time. The goal of science is to understand the overall effect or inference from a set of scientific studies, not to strictly determine whether any one study has replicated any other.

⁶ A/B testing is a randomized experiment with two variants that includes application of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics.
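The idea of evaluating cumulative evidence rather than judging each study on its own can be sketched with a simple fixed-effect, inverse-variance pooling of study estimates. This is one common approach and is offered only as an illustration: the function names and the three (estimate, standard error) pairs below are hypothetical, and the fixed-effect model assumes that all studies estimate the same underlying effect, which may not hold in practice.

```python
from math import sqrt
from statistics import NormalDist


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of study estimates and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


def two_sided_p(estimate, std_error):
    """Two-sided p-value for an estimate under a normal approximation."""
    z = abs(estimate / std_error)
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical effect estimates and standard errors from three studies.
estimates = [0.40, 0.15, 0.30]
std_errors = [0.18, 0.20, 0.15]

for est, se in zip(estimates, std_errors):
    print(f"single study: estimate={est:.2f}, p={two_sided_p(est, se):.3f}")

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
print(f"pooled: estimate={pooled:.2f}, se={pooled_se:.2f}, p={two_sided_p(pooled, pooled_se):.4f}")
```

In this made-up example the second study, taken alone, would look like a failed replication, yet the pooled estimate is more precise than any single study. When studies are expected to differ in their true effects, a random-effects model would be the more cautious choice.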

PUBLIC PERCEPTIONS OF REPRODUCIBILITY AND REPLICABILITY

The congressional mandate that led to this study expressed the view that “there is growing concern that some published research results cannot be replicated, which can negatively affect the public’s trust in science.” The statement of task for this report reflected this concern, asking the committee to “consider if the lack of replication and reproducibility impacts . . . the public’s perception” of science (see Box 1-1, in Chapter 1). This committee is not aware of any data that have been collected that specifically address how non-reproducibility and non-replicability have affected the public’s perception of science. However, there are data about topics that may shed some light on how the public views these issues. These include data about the public’s understanding of science, the public’s trust in science, and the media’s coverage of science.

Public Understanding of Science

When examining public understanding of science for the purposes of this report, at least four areas are particularly relevant: factual knowledge, understanding of the scientific process, awareness of scientific consensus, and understanding of uncertainty.

Factual knowledge about scientific terms and concepts in the United States has been fairly stable in recent years. In 2016, Americans correctly answered an average of 5.6 of the 9 true-or-false or multiple-choice items asked on the Science & Engineering Indicators surveys. This number was similar to the averages from data gathered over the past decade. In other words, there is no indication that knowledge of scientific facts and terms has decreased in recent years. It is clear from the data, however, that “factual knowledge of science is strongly related to individuals’ level of formal schooling and the number of science and mathematics courses completed” (National Science Foundation, 2018e, p. 7-35).

Americans’ understanding of the scientific process is mixed. The Science & Engineering Indicators surveys ask respondents about their understanding of three aspects related to the scientific process. In 2016, 64 percent could correctly answer two questions related to the concept of probability, 51 percent provided a correct description of a scientific experiment, and 23 percent were able to describe the idea of a scientific study. While these numbers have not been declining over time, they nonetheless indicate relatively low levels of understanding of the scientific process and suggest an inability of “[m]any members of the public . . . to differentiate a sound scientific study from a poorly conducted one and to understand the scientific process more broadly” (Scheufele, 2013, p. 14041).

Another area in which the public lacks a clear understanding of science is the idea of scientific consensus on a topic. There are widespread perceptions that no scientific consensus has emerged in areas that are supported by strong and consistent bodies of research. In a 2014 U.S. survey (Funk and Raine, 2015, p. 8), for instance, two-thirds of respondents (67 percent) thought that scientists did “not have a clear understanding about the health effects of GM [genetically modified] crops,” in spite of more than 1,500 peer-refereed studies showing that there is no difference between genetically modified and traditionally grown crops in terms of their health effects for human consumption (National Academies of Sciences, Engineering, and Medicine, 2016a). Similarly, even though there is broad consensus among scientists, one-half of Americans (52 percent) thought “scientists are divided” that the universe was created in a single, violent event often called the big bang, and about one-third thought that scientists are divided on the human causes of climate change (37 percent) and on evolution (29 percent).

For the fourth area, the public’s understanding about uncertainty, its role in scientific inquiry, and how uncertainty ought to be evaluated, research is sparse. Some data are available on uncertainties surrounding public opinion poll results. In a 2007 Harris Interactive poll, for instance, only about 1 in 10 Americans (12 percent) could correctly identify the source of error quantified by margin-of-error estimates. Yet slightly more than one-half (52 percent) agreed that pollsters should use the phrase “margin of error” when reporting on survey results.

Some research has shown that scientists believe that the public is unable to understand or contend with uncertainty in science (Besley and Nisbet, 2011; Davies, 2008; Ecklund et al., 2012) and that providing information related to uncertainty creates distrust, panic, and confusion (Frewer et al., 2003). However, people appear to expect some level of uncertainty in scientific information, and seem to have a relatively high tolerance for scientific uncertainty (Howell, 2018). Currently, research is being done to explore how best to communicate uncertainties to the public and how to help people accurately process uncertain information.

Public Trust in Science

Despite a sometimes shaky understanding of science and the scientific process, the public continues largely to trust the scientific community. In its biennial Science & Engineering Indicators reports, the National Science Board (see, e.g., National Science Foundation, 2018e) tracks public confidence in a range of institutions (see Figure 7-1). Over time, trust in science has remained stable, in contrast to other institutions, such as Congress, major corporations, and the press, which have all shown significant declines in public confidence over the past 50 years. Science has been eclipsed in public confidence only by the military, during Operation Desert Storm in the early 1990s and since the 9/11 terrorist attacks.

FIGURE 7-1: Levels of public confidence in selected U.S. institutions over time.
SOURCE: National Science Foundation (2018e, Figure 7-16) and General Social Survey (2018 data from http://gss.norc.org/Get-The-Data).

In the most recent iteration of the Science & Engineering Indicators surveys (National Science Foundation, 2018e), almost 9 in 10 (88 percent) Americans also “strongly agreed” or “agreed” with the statement that “[m]ost scientists want to work on things that will make life better for the average person.” A similar proportion (89 percent) “strongly agreed” or “agreed” that “[s]cientific researchers are dedicated people who work for the good of humanity.” Even for potentially controversial issues, such as climate change, levels of trust in scientists as information sources remain relatively high, with 71 percent in a 2015 Yale University Project on Climate Change survey saying that they trust climate scientists “as a source of information about global warming,” compared with 60 percent trusting television weather reporters as information sources, and 41 percent trusting mainstream news media. Controversies around scientific conduct, such as “climategate,” have not led to significant shifts in public trust. In fact, “more than a decade of public opinion research on global warming . . . [shows] that these controversies . . . had little if any measurable impact on relevant opinions of the nation as a whole” (MacInnis and Krosnick, 2016, p. 507).

In recent years, some scholars have raised concerns that unwarranted attention on emerging areas of science can lead to misperceptions or even declining trust among public audiences, especially if science is unable to deliver on early claims or subsequent research fails to replicate initial results (Scheufele, 2014). Public opinion surveys show that these concerns are not completely unfounded. In national surveys, one in four Americans (27 percent) think that it is a “big problem” and almost half of Americans (47 percent) think it is at least a “small problem” that “[s]cience researchers overstate the implications of their research;” only one in four (24 percent) see no problem (Funk et al., 2017). In other words, “science may run the risk of undermining its position in society in the long term if it does not navigate this area of public communication carefully and responsibly” (Scheufele and Krause, 2019, p. 6).

Media Coverage of Science

The concerns noted above are exacerbated by the fact that the public’s perception of science, and of reproducibility and replicability issues, is heavily influenced by the media’s coverage of science. News is an inherently event-driven profession. Research on news values (Galtung and Ruge, 1965) and journalistic norms (Shoemaker and Reese, 1996) has shown that rare, unexpected, or novel events and topics are much more likely to be covered by news media than recurring issues or those seen as routine. As a result, scientific news coverage often tends to favor articles about single-study, breakthrough results over stories that might summarize cumulative evidence, describe the process of scientific discovery, or distinguish among the systemic, application-focused, or intrinsic uncertainties surrounding science, as discussed throughout this report.

In addition to being event driven, news is also subject to audience demand. Experimental studies have demonstrated that respondents prefer conflict-laden debates over deliberative exchanges (Mutz and Reeves, 2005). Audience demand may drive news organizations to cover scientific stories that emphasize conflict (for example, studies that contradict previous work) rather than reporting on studies that support the consensus view or make incremental additions to existing knowledge.
In addition to what is covered by the media, there are also concerns about how the media cover scientific stories. There is some evidence that media stories contain exaggerations or make causal statements or inferences that are not warranted when reporting on scientific studies. For example, a study that looked at journal articles, press releases about these articles, and the subsequent news stories found that more than one-third of press releases contained exaggerated advice, causal claims, or inferences from animals to humans (Sumner et al., 2016). When the press release contained these exaggerations, the news stories that followed were far more likely also to contain exaggerations in comparison with news stories based on press releases that did not exaggerate.

Public confidence in science journalism reflects this concern about coverage, with 73 percent of Americans saying that the “biggest problem with news about scientific research findings is the way news reporters cover it,” and 43 percent saying it is a “big problem” that the news media are “too quick to report research findings that may not hold up” (Funk et al., 2017). Implicit in discussions of sensationalizing and exaggeration of research results is the concept of uncertainty. While scientific publications almost always include at least a brief discussion of the uncertainty in the results (whether presented in error bars, confidence intervals, or other metrics), this discussion of uncertainty does not always make it into news stories. When results are presented without the context of uncertainty, it can contribute to the perception of hyping or exaggerating a study’s results.

In recent years, the term “replication crisis” has been used in both academic writing (e.g., Shrout and Rodgers, 2018) and in the mainstream media (see, e.g., Yong, 2016), despite a lack of reliable data about the existence of such a “crisis.” Some have raised concerns that highly visible instances of media coverage of the issue of replicability and reproducibility have contributed to a larger narrative in public discourse around science being “broken” (Jamieson, 2018).
The frequency and prominence with which an issue is covered in the media can influence the perceived importance among audiences about that issue relative to other topics and ultimately how audiences evaluate actors in their performance on the issue (National Academies of Sciences, Engineering, and Medicine, 2016). However, large-scale analyses suggest that media coverage of the issue is not widespread. A preliminary analysis of print and online news outlets, for instance, shows that overall media coverage on reproducibility and replicability remains low, with fewer than 200 unique, on-topic articles captured for a 10-year period, from June 1, 2008, to April 30, 2018 (Howell, 2018). Thus, there is currently limited evidence that media coverage of a “replication crisis” has significantly influenced public opinion.

Scientists also bear some responsibility for misrepresentation in the public’s eye, with many Americans believing that scientists overstate the implications of their research. The purported existence of a replication “crisis” has been reported in several high-profile articles in the mainstream media; however, overall coverage remains low, and it is unclear whether this issue has reached the ears of the general population.

CONCLUSION 7-3: Based on evidence from well-designed and long-standing surveys of public perceptions, the public largely trusts scientists. Understanding of the scientific process and methods has remained stable over time, though it is not widespread. The NSF’s most recent Science & Engineering Indicators survey shows that 51 percent of Americans understand the logic of experiments and 23 percent understand the idea of a scientific study.

As discussed throughout this report, uncertainty is an inherent part of science. Unfortunately, while people show some tolerance for uncertainty in science, it is often not well communicated by researchers or the media.
There is, however, a large and growing body of research outlining evidence-based approaches for scientists to more effectively communicate different dimensions of scientific uncertainty to nonexpert audiences (for an overview, see Fischhoff and Davis, 2014). Similarly, journalism teachers and scholars have long examined how journalists cover scientific uncertainty (e.g., Stocking, 1999) and best practices for communicating uncertainty in science news coverage (e.g., Blum et al., 2005).

Broader trends in how science is promoted and covered in modern news environments may indirectly influence public trust in science related to replicability and reproducibility. Examples include concerns about hyperbolic claims in university press releases (for a summary, see Weingart, 2017) and false balance in reporting, especially when scientific topics are covered by nonscience journalists: in these cases, the established scientific consensus around issues such as climate change is put on equal footing with nonfactual claims by nonscientific organizations or interest groups for the sake of “showing both sides” (Boykoff and Boykoff, 2004).

RECOMMENDATION 7-1: Scientists should take care to avoid overstating the implications of their research and also exercise caution in their review of press releases, especially when the results bear directly on matters of keen public interest and possible action.

RECOMMENDATION 7-2: Journalists should report on scientific results with as much context and nuance as the medium allows. In covering issues related to replicability and reproducibility, journalists should help their audiences understand the differences between non-reproducibility and non-replicability due to fraudulent conduct of science and instances in which the failure to reproduce or replicate may be due to evolving best practices in methods or inherent uncertainty in science.
Particular care in reporting on scientific results is warranted when:
● the scientific system under study is complex, with limited control over alternative explanations or confounding influences;
● a result is particularly surprising or at odds with existing bodies of research;
● the study deals with an emerging area of science that is characterized by significant disagreement or contradictory results within the scientific community; and
● research involves potential conflicts of interest, such as work funded by advocacy groups, affected industry, or others with a stake in the outcomes.

Finally, members of the public and policy makers have a role to play to improve reproducibility and replicability. When reports of a new discovery are made in the media, one needs to ask about the uncertainties associated with the results and what other evidence exists that the discovery might be weighed against.

RECOMMENDATION 7-3: Anyone making personal or policy decisions based on scientific evidence should be wary of making a serious decision based on the results, no matter how promising, of a single study. Similarly, no one should take a new, single contrary study as refutation of scientific conclusions supported by multiple lines of previous evidence.

One of the pathways by which the scientific community confirms the validity of a new scientific discovery is by repeating the research that produced it. When a scientific effort fails to independently confirm the computations or results of a previous study, some fear that it may be a symptom of a lack of rigor in science, while others argue that such an observed inconsistency can be an important precursor to new discovery.

Concerns about reproducibility and replicability have been expressed in both scientific and popular media. As these concerns came to light, Congress requested that the National Academies of Sciences, Engineering, and Medicine conduct a study to assess the extent of issues related to reproducibility and replicability and to offer recommendations for improving rigor and transparency in scientific research.

Reproducibility and Replicability in Science defines reproducibility and replicability and examines the factors that may lead to non-reproducibility and non-replicability in research. Unlike the typical expectation of reproducibility between two computations, expectations about replicability are more nuanced, and in some cases a lack of replicability can aid the process of scientific discovery. This report provides recommendations to researchers, academic institutions, journals, and funders on steps they can take to improve reproducibility and replicability in science.
