Science is a mode of inquiry that aims to pose questions about the world, arriving at the answers and assessing their degree of certainty through a communal effort designed to ensure that they are well grounded.1 “World,” here, is to be broadly construed: it encompasses natural phenomena at different time and length scales, social and behavioral phenomena, mathematics, and computer science. Scientific inquiry focuses on four major goals: (1) to describe the world (e.g., taxonomy classifications), (2) to explain the world (e.g., the evolution of species), (3) to predict what will happen in the world (e.g., weather forecasting), and (4) to intervene in specific processes or systems (e.g., making solar power economical or engineering better medicines).
1 Many different definitions of “science” exist. In line with the committee’s task, we aim for this description to apply to a wide variety of scientific and engineering studies.
Human interest in describing, explaining, predicting, and intervening in the world is as old as humanity itself. People across the globe have sought to understand the world and use this understanding to advance their interests. Long ago, Pacific Islanders used knowledge of the stars to navigate the seas; the Chinese developed earthquake alert systems; many civilizations domesticated and modified plants for farming; and mathematicians around the world developed laws, equations, and symbols for quantifying and measuring. With the work of such eminent figures as Copernicus, Kepler, Galileo, Newton, and Descartes, the scientific revolution in Europe in the 16th and 17th centuries intensified the growth in knowledge and understanding of the world and led to ever more effective methods for producing that very knowledge and understanding.
Over the course of the scientific revolution, scientists demonstrated the value of systematic observation and experimentation, which was a major change from the Aristotelian emphasis on deductive reasoning from ostensibly known facts. Drawing on this work, Francis Bacon (1889) developed an explicit structure for scientific investigation that emphasized empirical observation, systematic experimentation, and inductive reasoning to question previous results. Shortly thereafter, the concept of communicating a scientific experiment and its result through a written article was introduced by the Royal Society of London.2 These contributions created the foundations for the modern practice of science: the investigation of a phenomenon through observation, measurement, and analysis and the critical review of others through publication.
The American Association for the Advancement of Science (AAAS) describes approaches to scientific methods by recognizing the common features of scientific inquiry across the diversity of scientific disciplines and the systems each discipline studies (Rutherford and Ahlgren, 1991, p. 2):
Scientific inquiry is not easily described apart from the context of particular investigations. There simply is no fixed set of steps that scientists always follow, no one path that leads them unerringly to scientific knowledge. There are, however, certain features of science that give it a distinctive character as a mode of inquiry.
Scientists, regardless of their discipline, follow common principles to conduct their work: the use of ideas, theories, and hypotheses; reliance on evidence; the use of logic and reasoning; and the communication of results, often through a scientific article. Scientists introduce ideas, develop theories, or generate hypotheses that suggest connections or patterns in nature that can be tested against observations or measurements (i.e., evidence). The collection and characterization of evidence, including the assessment of variability (or uncertainty), is central to all of science. Analysis of the collected data that leads to results and conclusions about the strength of a hypothesis or proposed theory requires the use of logic and reasoning, inductive, deductive, or abductive. A published scientific article allows other researchers to review and question the evidence, the methods of collection and analysis, and the scientific results.
While these principles are common to all scientific and engineering research disciplines, different scientific disciplines use specific tools and approaches that have been designed to suit the phenomena and systems that are particular to each discipline. For example, the mathematics taught to graduate students in astronomy will be different from the mathematics taught to graduate students studying zoology. Laboratory equipment and experimental methods for studying biology will likely differ from those for studying materials science (Rutherford and Ahlgren, 1991). In general, one may say that different scientific disciplines are distinguished by the nature of the phenomena of interest to the field, the kinds of questions asked, and the types of tools, methods, and techniques used to answer those questions. In addition, scientific disciplines are dynamic, regularly engendering subfields and occasionally combining and reforming. In recent years, for example, what began as an interdisciplinary interest of biologists and physicists emerged as a new field of biophysics, while psychologists and economists working together defined a field of behavioral economics. Countless similar interweavings of questions and methods have occurred over the history of science.
No matter how far removed one’s daily life is from the practice of science, the concrete results of science and engineering are inescapable. They are manifested in the food people eat, their clothes, the ways they move from place to place, the devices they carry, and the fact that most people will outlive by decades the average human born before the last century. So ubiquitous are these scientific achievements that it is easy to forget that there was nothing inevitable about humanity’s ability to achieve them.
Scientific progress is made when the drive to understand and control the world is guided by a set of core principles and scientific methods. While challenges to previous scientific results may force researchers to examine their own practices and methods, the core principles and assumptions underlying scientific inquiry remain unchanged. In this context, the consideration of reproducibility and replicability in science is intended to maintain and enhance the integrity of scientific knowledge.
Science is inherently forward thinking, seeking to discover unknown phenomena, increase understanding of the world, and answer new questions. As new knowledge is found, earlier ideas and theories may need to be revised. The core principles and assumptions of scientific inquiry embrace this tension, allowing science to progress while constantly testing, checking, and updating existing knowledge. In this section, we explore five core principles and assumptions underlying science:
- Nature is not capricious.
- Knowledge grows through exploration of the limits of existing rules and mutually reinforcing evidence.
- Science is a communal enterprise.
- Science aims for refined degrees of confidence, rather than complete certainty.
- Scientific knowledge is durable and mutable.
A basic premise of scientific inquiry is that nature is not capricious. “Science . . . assumes that the universe is, as its name implies, a vast single system in which the basic rules are everywhere the same. Knowledge gained from studying one part of the universe is applicable to other parts” (Rutherford and Ahlgren, 1991, p. 5). In other words, scientists assume that if a new experiment is carried out under the same conditions as another experiment, the results should replicate. In March 1989, the electrochemists Martin Fleischmann and Stanley Pons claimed to have achieved the fusion of hydrogen into helium at room temperature (i.e., “cold fusion”). In an example of science’s capacity for self-correction, dozens of laboratories attempted to replicate the result over the next several months. A consensus soon emerged within the scientific community that Fleischmann and Pons had erred and had not in fact achieved cold fusion.
Imagine a fictional history, in which the researchers responded to the charge that their original claim was mistaken, as follows: “While we are of course disappointed at the failure of our results to be replicated in other laboratories, this failure does nothing to show that we did not achieve cold fusion in our own experiment, exactly as we reported. Rather, what it demonstrates is that the laws of physics or chemistry, on the occasion of our experiment (i.e., in that particular place, at that particular time), behaved in such a way as to allow for the generation of cold fusion. More exactly, it is our contention that the basic laws of physics and chemistry operate one way in those regions of space and time outside of the location of our experiment, and another way within that location.”
It goes without saying that this would be absurd. But why, exactly? Why, that is, should scientists not take seriously the fictional explanation above? The brief answer, sufficient for our purposes, is that scientific inquiry (indeed, almost any sort of inquiry) would grind to a halt if one took seriously the possibility that nature is capricious in the way it would have to be for this fictional explanation to be credible. Science operates under a standing presumption that nature follows rules that are consistent, however subtle, intricate, and challenging to discern they may be. In some systems, these rules are consistent across space and time—for example, a physics study should replicate in different countries and in different centuries (assuming that differences in applicable factors, such as elevation or temperature, are accounted for). In other systems, the rules may be limited to specific places or times; for example, a rule of human behavior that is true in one country and one time period may not be true in a different time and place. In effect, all scientific disciplines seek to discover rules that are true beyond the specific context within which they are discovered.
Knowledge Grows Through Exploration of the Limits of Existing Rules and Mutually Reinforcing Evidence
Scientists seek to discover rules about relationships or phenomena that exist in nature, and ultimately they seek to describe, explain, and predict. Because nature is not capricious, scientists assume that these rules will remain true as long as the context is equivalent. And because knowledge grows through evidence about new relationships, researchers may find it useful to ask the same scientific questions using new methods and in new contexts, to determine whether and how those relationships persist or change. Most scientists seek to find rules that are not only true in one specific context but that are also confirmable by other scientists and are generalizable, that is, rules that remain true even if the context of a separate study is not entirely the same as the original. Scientists thus seek to generalize their results and to discover the limits of proposed rules. These limits can often be a rich source of new knowledge about the system under study. For example, if a particular relationship was observed in an older group but not a younger group, this suggests that the relationship may be affected by age, cohort, or other attributes that distinguish the groups and may point the researcher toward further inquiry.
Robert Merton (1973) described modern science as an institution of “communalism, universalism, disinterestedness, and organized skepticism.” Science is an ongoing, communal conversation and a joint problem-solving enterprise that can include false starts and blind alleys, especially when taking risks in the quest to find answers to important questions. Scientists build on their own research as well as the work of their peers, and this building can sometimes span generations. Scientists today still rely on the work of Newton, Darwin, and others from centuries past.
Researchers have to be able to understand others’ research in order to build on it. When research is communicated with clear, specific, and complete accounting of the materials and methods used, the results found, and the uncertainty associated with the results, other scientists can know how to interpret the results. The communal enterprise of science allows scientists to build on others’ work, develop the necessary skills to conduct high quality studies, and check results and confirm, dispute, or refine them.
Scientific results should be subject to checking by peers, and any scientist competent to perform such checking has the standing to do so. Confirming the results of others, for example, by replicating the results, serves as one of several checks on the processes by which researchers produce knowledge. The original and replicated results are ideally obtained following well-recognized scientific approaches within a given field of science, including collection of evidence and characterization of the associated sources and magnitude of uncertainties. Indeed, without understanding uncertainties associated with a scientific result (as discussed throughout this report), it is difficult to assess whether or not it has been replicated.
Uncertainty is inherent in all scientific knowledge, and many types of uncertainty can affect the reliability of a scientific result. It is important that researchers understand and communicate potential sources of uncertainty in any system under study. Decision makers looking to use study results need to be able to understand the uncertainties associated with those results. Understanding the nature of uncertainty associated with an analysis can help inform the selection and use of quantitative measures for characterizing the results (see Box 2-1). At any stage of growing scientific sophistication, the aim is both to learn what science can now reveal about the world and to recognize the degree of uncertainty attached to that knowledge.
As researchers explore the world through new scientific studies and observations, new evidence may challenge existing and well-known theories. The scientific process allows for the consideration of new evidence that, if credible, may result in revisions or changes to current understanding. Testing of existing models and theories through the collection of new data is useful in establishing their strength and their limits (i.e., generalizability), and it ultimately expands human knowledge. Such change is inevitable as scientists develop better methods for measuring and observing the world. The advent of new scientific knowledge that displaces or reframes previous knowledge should not be interpreted as a weakness in science. Scientific knowledge is built on previous studies and tested theories, and the progression is often not linear. Science is engaged in a continuous process of refinement to uncover ever-closer approximations to the truth.
CONCLUSION 2-1: The scientific enterprise depends on the ability of the scientific community to scrutinize scientific claims and to gain confidence over time in results and inferences that have stood up to repeated testing. Reporting of uncertainties in scientific results is a central tenet of the scientific process. It is incumbent on scientists to convey the appropriate degree of uncertainty in reporting their claims.
Many scientific studies seek to measure, explain, and make predictions about natural phenomena. Other studies seek to detect and measure the effects of an intervention on a system. Statistical inference provides a conceptual and computational framework for addressing the scientific questions in each setting. Estimation and hypothesis testing are broad groupings of inferential procedures. Estimation is suitable for settings in which the main goal is the assessment of the magnitude of a quantity, such as a measure of a physical constant or the rate of change in a response corresponding to a change in an explanatory variable. Hypothesis testing is suitable for settings in which scientific interest is focused on the possible effect of a natural event or intentional intervention, and a study is conducted to assess the evidence for and against this effect. In this context, hypothesis testing helps answer binary questions. For example, will a plant grow faster with fertilizer A or fertilizer B? Do children in smaller classes learn more? Does an experimental drug work better than a placebo? Several types of more specialized statistical methods are used in scientific inquiry, including methods for designing studies and methods for developing and evaluating prediction algorithms.
Because hypothesis testing has been involved in a major portion of reproducibility and replicability assessments, we consider this mode of statistical inference in some detail. However, considerations of reproducibility and replicability apply broadly to other modes and types of statistical inference. For example, the issue of drawing multiple statistical inferences from the same data is relevant to both hypothesis testing and estimation.
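One way to see why multiple inferences from the same data matter is to compute the chance of at least one false positive across several tests. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the tests are independent and that every null hypothesis is true, which real analyses of a single data set often do not satisfy.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumptions (not from the report): tests are independent and every
    null hypothesis is true, so each test errs with probability `alpha`.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Probability that no test produces a false positive is (1 - alpha)^m.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 5, 20):
        rate = familywise_error_rate(0.05, m)
        print(f"{m:>2} tests at alpha = 0.05: {rate:.3f}")
```

Under these assumptions the chance of some false positive grows quickly with the number of tests, even though each individual test keeps the same threshold.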
Studies involving hypothesis testing typically involve many factors that can introduce variation in the results. Some of these factors are recognized, and some are unrecognized. Random assignment of subjects or test objects to one or the other of the comparison groups is one way to control for the possible influence of both unrecognized and recognized sources of variation. Random assignment may help avoid systematic differences between groups being compared, but it does not affect the variation inherent in the system (e.g., population or an intervention) under study.
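Random assignment as described above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name, the subject identifiers, and the seed are hypothetical, and real studies often use stratified or blocked designs instead of a simple shuffle.

```python
import random


def randomly_assign(subjects, seed=None):
    """Split subjects into two comparison groups by a random shuffle.

    Every subject has the same chance of landing in either group, so
    recognized and unrecognized characteristics are balanced on average.
    """
    rng = random.Random(seed)  # a local generator; the seed makes the split repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd number of subjects the second group gets the extra one.
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # Hypothetical subject identifiers 1..20.
    treatment, control = randomly_assign(range(1, 21), seed=2019)
    print("treatment:", sorted(treatment))
    print("control:  ", sorted(control))
```

Recording the seed alongside the results lets others reproduce the same assignment, while the shuffle itself does nothing to reduce the variation inherent in the subjects.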
Scientists use the term null hypothesis to describe the supposition that there is no difference between the two intervention groups or no effect of a treatment on some measured outcome (Fisher, 1935). A commonly used formulation of hypothesis testing is based on the answer to the following question: If the null hypothesis is true, what is the probability of obtaining a difference at least as large as the observed one? In general, the greater the observed difference, the smaller the probability that a difference at least as large as the observed would be obtained when the null hypothesis is true. This probability of obtaining a difference at least as large as the observed when the null hypothesis is true is called the “p-value.”3 As traditionally interpreted, if a calculated p-value is smaller than a defined threshold, the results may be considered statistically significant. A typical threshold may be p ≤ 0.05 or, more stringently, p ≤ 0.01 or p ≤ 0.005.4 In a statement issued in 2016, the American Statistical Association Board (Wasserstein and Lazar, 2016, p. 129) noted:
While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted. This has led to some scientific journals discouraging the use of p-values, and some scientists and statisticians recommending their abandonment, with some arguments essentially unchanged since p-values were first introduced.
More recently, it has been argued that p-values, properly calculated and understood, can be informative and useful; however, a conclusion of statistical significance based on an arbitrary threshold of likelihood (even a familiar one such as p ≤ 0.05) is unhelpful and frequently misleading (Wasserstein et al., 2019; Amrhein et al., 2019b).
Understanding what a p-value does not represent is as important as understanding what it does indicate. In particular, the p-value does not represent the probability that the null hypothesis is true. Rather, the p-value is calculated on the assumption that the null hypothesis is true. The probability that the null hypothesis is true, or that the alternative hypothesis is true, can be based on calculations informed in part by the observed results, but this is not the same as a p-value.
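The definition of the p-value given above can be made concrete with an exact permutation test, which is one of several ways to compute such a probability and is chosen here only because it needs no distributional formula. The sketch assumes a two-group comparison of means; the function name and the fertilizer measurements are hypothetical. Under the null hypothesis the group labels are interchangeable, so the p-value is the fraction of all relabelings whose difference in means is at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Computed on the assumption that the null hypothesis is true, i.e.
    that group labels could be swapped freely without changing anything.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    pooled_sum = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    total = 0
    at_least_as_large = 0
    # Enumerate every way of choosing which pooled values are labeled "A".
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


if __name__ == "__main__":
    # Hypothetical plant growth (cm) under fertilizer A and fertilizer B.
    growth_a = [4.1, 5.0, 5.6, 6.2, 6.8]
    growth_b = [3.0, 3.9, 4.2, 4.8, 5.1]
    print("p-value:", permutation_p_value(growth_a, growth_b))
```

The returned number is a statement about the data given the null hypothesis, not the probability that the null hypothesis itself is true. Full enumeration is practical only for small groups, since the number of relabelings grows rapidly.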
3 Text modified December 2019. In discussions related to the p-value, the original report used “likelihood” rather than “probability” and failed to note that the p-value includes the observed “and more extreme” results (See Section 3.2, Principles of Statistical Inference, Cox, 2006). Although the words probability and likelihood are interchangeable in everyday English, they are distinguished in technical usage in statistics.
4 The threshold for statistical significance is often referred to as p “less than” 0.05; we refer to this threshold as “less than or equal to.”
In scientific research involving hypotheses about the effects of an intervention, researchers seek to avoid two types of error that can lead to non-replicability:
- Type I error: a false positive or a rejection of the null hypothesis when it is correct
- Type II error: a false negative or failure to reject a false null hypothesis, allowing the null hypothesis to stand when an alternative hypothesis, and not the null hypothesis, is correct
Ideally, both Type I and Type II errors would be simultaneously reduced in research. For example, increasing the statistical power of a study by increasing the number of subjects in a study can reduce the likelihood of a Type II error for any given likelihood of Type I error.5 Although the increase in data that comes with higher powered studies can help reduce both Type I and Type II errors, adding more subjects typically means more time and cost for a study.
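The link between sample size and Type II error can be illustrated with a simple power calculation. The sketch below assumes a one-sided z-test comparing two equal-sized groups with known, equal variance; the function name, the standardized effect size of 0.3, and the sample sizes are assumptions chosen for illustration and do not come from the report.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group_z_test(effect_size, n_per_group, alpha=0.05):
    """Power of a one-sided two-group z-test (known, equal variances).

    `effect_size` is the true difference in means divided by the common
    standard deviation. Power is 1 minus the Type II error probability.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    std_normal = NormalDist()
    # Reject the null hypothesis when the test statistic exceeds this value.
    critical_value = std_normal.inv_cdf(1.0 - alpha)
    # Mean of the test statistic when the alternative hypothesis is true.
    shift = effect_size * sqrt(n_per_group / 2.0)
    return 1.0 - std_normal.cdf(critical_value - shift)


if __name__ == "__main__":
    for n in (20, 50, 100, 200):
        power = power_two_group_z_test(effect_size=0.3, n_per_group=n)
        print(f"n per group = {n:>3}: power = {power:.2f}, "
              f"Type II error = {1.0 - power:.2f}")
```

With the Type I error rate held at the chosen `alpha`, the printed Type II error shrinks as the number of subjects per group grows, which is the behavior described above.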
Researchers are often forced to make tradeoffs in which reducing the likelihood of one type of error increases the likelihood of the other. For example, when p-values are deemed useful, Type I errors may be minimized by lowering the significance threshold to a more stringent level (e.g., by lowering the standard p ≤ 0.05 to p ≤ 0.005). However, this would simultaneously increase the likelihood of a Type II error. In some cases, it may be useful to define separate interpretive zones, where p-values above one significance threshold are not deemed significant, p-values below a more stringent significance threshold are deemed significant, and p-values between the two thresholds are deemed inconclusive. Alternatively, one could simply accept the calculated p-value for what it is—the probability of obtaining the observed result or one more extreme if the null hypothesis were true—and refrain from further interpreting the results as “significant” or “not significant.” The traditional reliance on a single threshold to determine significance can incentivize behaviors that work against scientific progress (see the Publication Bias section in Chapter 5).
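The idea of separate interpretive zones can be written down as a small classifier. This is only a sketch: the function name is hypothetical, the default thresholds reuse the 0.05 and 0.005 values mentioned above, and treating a p-value exactly equal to a threshold as falling in the lower zone is an assumption made to match the “less than or equal to” convention.

```python
def interpret_p_value(p_value, lenient=0.05, stringent=0.005):
    """Place a p-value into one of three interpretive zones.

    Assumption: boundaries are inclusive on the lower zone, so a p-value
    equal to `stringent` counts as significant.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if stringent > lenient:
        raise ValueError("stringent threshold must not exceed lenient threshold")
    if p_value <= stringent:
        return "significant"
    if p_value <= lenient:
        return "inconclusive"
    return "not significant"


if __name__ == "__main__":
    for p in (0.001, 0.02, 0.2):
        print(p, "->", interpret_p_value(p))
```

A design that follows the alternative described above would skip the labels entirely and report the calculated p-value as is.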
Tension can arise between replicability and discovery, specifically, between the replicability and the novelty of the results. Hypotheses with low a priori probabilities are less likely to be replicated. In this vein, Wilson and Wixted (2018) illustrated how fields that are investigating potentially ground-breaking results will produce results that are less replicable, on average, than fields that are investigating highly likely, almost-established results. Indeed, a field could achieve near-perfect replicability if it limited its investigations to prosaic phenomena that were already well known. As Wilson and Wixted (2018, p. 193) state, “We can imagine pages full of findings that people are hungry after missing a meal or that people are sleepy after staying up all night,” which would not be very helpful “for advancing understanding of the world.” In the same vein, it would not be helpful for a field to focus solely on improbable, outlandish hypotheses.
5 Statistical power is the probability that a test will reject the null hypothesis when a specific alternative hypothesis is true.
The goal of science is not, and ought not to be, for all results to be replicable. Reports of non-replication of results can generate excitement as they may indicate possibly new phenomena and expansion of current knowledge. Also, some level of non-replicability is expected when scientists are studying new phenomena that are not well established. As knowledge of a system or phenomenon improves, replicability of studies of that particular system or phenomenon would be expected to increase.
Assessing the probability that a hypothesis is correct in part based on the observed results can also be approached through Bayesian analysis. This approach starts with a priori (before data observation) assumptions, known as prior probabilities, and revises them on the basis of the observed data using Bayes’ theorem, sometimes described as the Bayes formula.
Appendix D illustrates how a Bayesian approach to inference can, under certain assumptions on the data generation mechanism and on the a priori likelihood of the hypothesis, use observed data to estimate the probability that a hypothesis is correct. One of the most striking lessons from Bayesian analysis is the profound effect that the pre-experimental odds have on the post-experimental odds. For example, under the assumptions shown in Appendix D, if the prior probability of an experimental hypothesis was only 1 percent and the obtained results were statistically significant at the p ≤ 0.01 level, only about one in eight of such conclusions that the hypothesis was true would be correct. If the prior probability was as high as 25 percent, then more than four of five such studies would be deemed correct. As common sense would dictate and Bayesian analysis can quantify, it is prudent to adopt a lower level of confidence in the results of a study with a highly unexpected and surprising result than in a study for which the results were a priori more plausible (e.g., see Box 2-2).
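The effect of the prior probability on the conclusion can be sketched with a direct application of Bayes’ theorem to a significance test. The assumptions of Appendix D are not reproduced here: the function name, the false positive rate of 0.05, and the power of 0.8 are illustrative choices, so the printed values need not match the figures quoted above. The qualitative lesson is what the sketch is meant to show.

```python
def posterior_probability(prior, alpha, power):
    """Probability that a hypothesis is true given a significant result.

    Bayes' theorem with two outcomes: a true hypothesis yields a
    significant result with probability `power`; a false one does so
    with probability `alpha` (the Type I error rate).
    """
    for name, value in (("prior", prior), ("alpha", alpha), ("power", power)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative assumptions, not the Appendix D values.
    alpha, power = 0.05, 0.8
    for prior in (0.01, 0.25):
        post = posterior_probability(prior, alpha, power)
        print(f"prior = {prior:.2f} -> posterior = {post:.2f}")
```

Under these assumed values the same significant result leaves a surprising hypothesis far less likely to be true than a plausible one, because false positives from the many false hypotheses outnumber true positives from the few true ones.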
Highly surprising results may represent an important scientific breakthrough, even though it is likely that only a minority of them may turn out over time to be correct. It may be crucial, in terms of the example in the previous paragraph, to learn which of the eight highly unexpected (prior probability, 1%) results can be verified and which one of the five moderately unexpected (prior probability, 25%) results should be discounted.
Keeping the idea of prior probability in mind, research focused on making small advances to existing knowledge would result in a high replication rate (i.e., a high rate of successful replications) because researchers would be looking for results that are very likely correct. But doing so would have the undesirable effect of reducing the likelihood of making major new discoveries (Wilson and Wixted, 2018). Many important advances in science have resulted from a bolder approach based on more speculative hypotheses, although this path also leads to dead ends and to insights that seem promising at first but fail to survive after repeated testing.
The “safe” and “bold” approaches to science have complementary advantages. One might argue that a field has become too conservative if all attempts to replicate results are successful, but it is reasonable to expect that researchers follow up on new but uncertain discoveries with replication studies to sort out which promising results prove correct. Scientists should be cognizant of the level of uncertainty inherent in speculative hypotheses and in surprising results in any single study.