Prepublication copy, uncorrected proofs.

2 SCIENTIFIC METHODS AND KNOWLEDGE

The specific questions posed about reproducibility and replicability in the committee's statement of task are part of the broader question of how scientific knowledge is gained, questioned, and modified. In this chapter we introduce concepts central to scientific inquiry by discussing the nature of science and outlining core values of the scientific process. We outline how scientists accumulate scientific knowledge through discovery, confirmation, and correction and highlight the process of statistical inference, which has been a focus of recently publicized failures to confirm original results.

WHAT IS SCIENCE?

Science is a mode of inquiry that aims to pose questions about the world, arriving at the answers and assessing their degree of certainty through a communal effort designed to ensure that they are well grounded.1 "World," here, is to be broadly construed: it encompasses natural phenomena at different time and length scales, social and behavioral phenomena, mathematics, and computer science. Scientific inquiry focuses on four major goals: to describe the world (e.g., taxonomy classifications), to explain the world (e.g., the evolution of species), to predict what will happen in the world (e.g., weather forecasting), and to intervene on specific processes or systems (e.g., making solar power economical or engineering better medicines).

Human interest in describing, explaining, predicting, and intervening in the world is as old as humanity itself. People across the globe have sought to understand the world and use this understanding to advance their interests: long ago, Pacific Islanders used knowledge of the stars to navigate the seas; the Chinese developed earthquake alert systems; many civilizations domesticated and modified plants for farming; and mathematicians around the world developed laws, equations, and symbols for quantifying and measuring.
With the work of such eminent figures as Copernicus, Kepler, Galileo, Newton, and Descartes, the scientific revolution in Europe in the 16th and 17th centuries intensified the growth in knowledge and understanding of the world and led to ever more effective methods for producing that very knowledge and understanding. Over the course of the scientific revolution, scientists demonstrated the value of systematic observation and experimentation, which was a major change from the Aristotelian emphasis on deductive reasoning from ostensibly known facts. Drawing on this work, Francis Bacon (1620) developed an explicit structure for scientific investigation that emphasized empirical observation, systematic experimentation, and inductive reasoning to question previous results. Shortly thereafter, the concept of communicating a scientific experiment and its result through a written article was introduced by the Royal Society of London.2 These contributions created the foundations for the modern practice of science: the investigation of a phenomenon through observation, measurement, and analysis, and the critical review of others through publication.

1 Many different definitions of "science" exist; in line with the committee's task, this description aims to apply to a wide variety of scientific and engineering studies.
2 See http://blog.efpsa.org/2013/04/30/the-origins-of-scientific-publishing/ [April 2019].

The American Association for the Advancement of Science (AAAS) describes approaches to scientific methods by recognizing the common features of scientific inquiry across the diversity of scientific disciplines and the systems each discipline studies (AAAS, 1991, p. 2):

"Scientific inquiry is not easily described apart from the context of particular investigations. There simply is no fixed set of steps that scientists always follow, no one path that leads them unerringly to scientific knowledge. There are, however, certain features of science that give it a distinctive character as a mode of inquiry."

Scientists, regardless of their discipline, follow common principles to conduct their work: the use of ideas, theories, and hypotheses; reliance on evidence; the use of logic and reasoning; and the communication of results, often through a scientific article. Scientists introduce ideas, develop theories, or generate hypotheses that suggest connections or patterns in nature that can be tested against observations or measurements (i.e., evidence). The collection and characterization of evidence, including the assessment of variability (or uncertainty), is central to all of science. Analysis of the collected data that leads to results and conclusions about the strength of a hypothesis or proposed theory requires the use of logic and reasoning, whether inductive, deductive, or abductive. A published scientific article allows other researchers to review and question the evidence, the methods of collection and analysis, and the scientific results.
While these principles are common to all scientific and engineering research disciplines, different scientific disciplines use specific tools and approaches that have been designed to suit the phenomena and systems that are particular to each discipline. For example, the mathematics taught to graduate students in astronomy will be different from the mathematics taught to graduate students studying zoology. Laboratory equipment and experimental methods will likely be different for those studying biology than for those studying materials science (AAAS, 1991). In general, one may say that different scientific disciplines are distinguished by the nature of the phenomena of interest to the field, the kinds of questions asked, and the types of tools, methods, and techniques used to answer those questions. In addition, scientific disciplines are dynamic, regularly engendering subfields and occasionally combining and reforming. In recent years, for example, what began as an interdisciplinary interest of biologists and physicists emerged as a new field of biophysics, and psychologists and economists working together defined a field of behavioral economics. There have been similar interweavings of questions and methods in countless examples over the history of science.

No matter how far removed one's daily life is from the practice of science, the concrete results of science and engineering are inescapable: they are manifested in the food people eat, their clothes, the ways they move from place to place, the devices they carry, and the fact that most people will outlive by decades the average human born before the last century. So ubiquitous are these scientific achievements that it is easy to forget that there was nothing inevitable about humanity's ability to achieve them. Scientific progress is made when the drive to understand and control the world is guided by a set of core principles and scientific methods.
While challenges to previous scientific results may force researchers to examine their own practices and methods, the core principles and assumptions underlying scientific inquiry remain unchanged. In this context, the consideration of reproducibility and replicability in science is intended to maintain and enhance the integrity of scientific knowledge.

CORE PRINCIPLES AND ASSUMPTIONS OF SCIENTIFIC INQUIRY

Science is inherently forward thinking, seeking to discover unknown phenomena, increase understanding of the world, and answer new questions. As new knowledge is found, earlier ideas and theories may need to be revised. The core principles and assumptions of scientific inquiry embrace this tension, allowing science to progress while constantly testing, checking, and updating existing knowledge. In this section, we explore five core principles and assumptions underlying science:

• Nature is not capricious.
• Knowledge grows through exploration and mutually reinforcing evidence.
• Science is a communal enterprise.
• Science aims for refined degrees of confidence, rather than complete certainty.
• Scientific knowledge is durable and mutable.

Nature Is Not Capricious

A basic premise of scientific inquiry is that nature is not capricious. "Science . . . assumes that the universe is, as its name implies, a vast single system in which the basic rules are everywhere the same. Knowledge gained from studying one part of the universe is applicable to other parts" (AAAS, 1991, p. 5). In other words, scientists assume that if a new experiment is carried out under the same conditions as another experiment, the results should replicate. In March 1989, the electrochemists Martin Fleischmann and Stanley Pons claimed to have achieved the fusion of hydrogen into helium at room temperature ("cold fusion"). In an example of science's capacity for self-correction, dozens of laboratories attempted to replicate the result over the next several months. A consensus soon emerged within the scientific community that Pons and Fleischmann had erred and had not in fact achieved cold fusion.
Imagine a fictional history, in which the researchers responded to the charge that their original claim was mistaken, as follows:

"While we are of course disappointed at the failure of our results to be replicated in other laboratories, this failure does nothing to show that we did not achieve cold fusion in our own experiment, exactly as we reported. Rather, what it demonstrates is that the laws of physics or chemistry, on the occasion of our experiment (i.e., in that particular place, at that particular time), behaved in such a way as to allow for the generation of cold fusion. More exactly, it is our contention that the basic laws of physics and chemistry operate one way in those regions of space and time outside of the location of our experiment, and another way within that location."

It goes without saying that this would be absurd. But why, exactly? Why, that is, should scientists not take seriously the fictional explanation above? The brief answer, sufficient for our purposes, is that scientific inquiry (indeed, almost any sort of inquiry) would grind to a halt if one took seriously the possibility that nature is capricious in the way it would have to be for this fictional explanation to be credible. Science operates under a standing presumption that nature follows rules that, however subtle, intricate, and challenging to discern, are consistent. In some systems, these rules are consistent across space and time; for example, a physics study should replicate in different countries and in different centuries (assuming that differences in applicable factors, such as elevation or temperature, are accounted for). In other systems, the rules may be limited to specific places or times. For example, a "rule" of human behavior that is true in one country and one time period may not be true in a different time and place. In effect, all scientific disciplines seek to discover rules that are true beyond the specific context within which they are discovered.

Knowledge Grows through Exploration of the Limits of Existing Rules and Mutually Reinforcing Evidence

Scientists seek to discover rules about relationships or phenomena that exist in nature, and ultimately they seek to describe, explain, and predict. Because nature is not capricious, scientists assume that these rules will remain true as long as the context is equivalent. And because knowledge grows through evidence about new relationships, researchers may find it useful to ask the same scientific questions using new methods and in new contexts, to determine whether and how those relationships persist or change. Most scientists seek not only to find rules that are true in one specific context, but rules that are confirmable by other scientists and are generalizable: rules that remain true even if the context of a separate study is not entirely the same as the original. Scientists thus seek to generalize their results and to discover the limits of proposed rules. These limits can often be a rich source of new knowledge about the system under study. For example, if a particular relationship was observed in an older group, but not a younger group, this suggests that the relationship may be affected by age, cohort, or other attributes that distinguish the groups and may point the researcher toward further inquiry.
Science Is a Communal Enterprise

Robert Merton (1973) described modern science as an institution of "communalism, universalism, disinterestedness, and organized skepticism." Science is an ongoing, communal conversation, a joint problem-solving enterprise that can include false starts and blind alleys, especially when taking risks in the quest to find answers to important questions. Scientists build on their own research as well as the work of their peers, and this building can sometimes span generations. Scientists today still rely on the work of Newton, Darwin, and others from centuries past.

Researchers have to be able to understand others' research in order to build on it. When research is communicated with clear, specific, and complete accounting of the materials and methods used, the results found, and the uncertainty associated with the results, other scientists can know how to interpret the results. The communal enterprise of science allows scientists to build on others' work, to develop the necessary skills to conduct high-quality studies, and to check results and confirm, dispute, or refine them.

Scientific results should be subject to checking by peers, and any scientist competent to perform such checking has the standing to do so. Confirming the results of others, for example, by replicating the results, serves as one of several checks on the processes by which researchers produce knowledge. The original and replicated results are ideally obtained following well-recognized scientific approaches within a given field of science, including collection of evidence and characterization of the associated sources and magnitude of uncertainties. Indeed, without understanding uncertainties associated with a scientific result (as discussed throughout this report), it is difficult to assess whether or not it has been replicated.
Science Aims for Refined Degrees of Confidence, Rather Than Complete Certainty

Uncertainty is inherent in all scientific knowledge, and many types of uncertainty can affect the reliability of a scientific result. It is important that researchers understand and communicate potential sources of uncertainty in any system under study. Decision makers looking to use study results need to be able to understand the uncertainties associated with those results. Understanding the nature of uncertainty associated with an analysis can help inform the selection and use of quantitative measures for characterizing the results; see Box 2-1. At any stage of growing scientific sophistication, the aim is both to learn what science can now reveal about the world and to recognize the degree of uncertainty attached to that knowledge.

Scientific Knowledge Is Durable and Mutable

As researchers explore the world through new scientific studies and observations, new evidence may challenge existing and well-known theories. The scientific process allows for the consideration of new evidence that, if credible, may result in revisions or changes to current understanding. Testing of existing models and theories through the collection of new data is useful in establishing their strength and their limits (i.e., generalizability), and it ultimately expands human knowledge. Such change is inevitable as scientists develop better methods for measuring and observing the world. The advent of new scientific knowledge that displaces or reframes previous knowledge should not be interpreted as a weakness in science. Scientific knowledge is built on previous studies and tested theories, and the progression is often not linear. Science is engaged in a continuous process of refinement to uncover ever-closer approximations to the truth.
BOX 2-1
Scientific Uncertainty and Its Importance in Measurement Science

Dictionary definitions of the term uncertainty refer to the condition of being uncertain (unsure, doubtful, not possessing complete knowledge). It is a subjective condition because it pertains to the perception or understanding that one has about the value of some property of an object of interest. In measurement science, measurement uncertainty represents the doubt about the true value of a particular quantity subject to measurement (the "measurand"), and quantifying this uncertainty is fundamental to precise measurements.

Uncertainty in measurement is a unifying principle of measurement science; it is a key factor in the work of the national metrology institutes, including the National Institute of Standards and Technology (NIST). NIST and its 100+ sister laboratories in other countries quantify uncertainties as a way of qualifying measurements. This practice guarantees the comparability of measurement results worldwide. The work in metrology at national laboratories affects international trade and regulations that assure safety and quality of products, advances technologies to stimulate innovation and to facilitate the translation of discoveries into efficiently manufactured products, and, in general, serves to improve the quality of life. The concepts and technical devices that are used to characterize measurement uncertainty evolve continuously to address emerging challenges across an expanding array of disciplines and subdisciplines in chemistry, physics, materials science, and biology.

SOURCE: Adapted from Hanisch and Plant (2018).
CONCLUSION 2-1: The scientific enterprise depends on the ability of the scientific community to scrutinize scientific claims and to gain confidence over time in results and inferences that have stood up to repeated testing. Reporting of uncertainties in scientific results is a central tenet of the scientific process. It is incumbent on scientists to convey the appropriate degree of uncertainty in reporting their claims.

STATISTICAL INFERENCE AND HYPOTHESIS TESTING

Many scientific studies seek to measure, explain, and make predictions about a natural phenomenon. Other studies seek to detect and measure the effects of an intervention on a system. Statistical inference provides a conceptual and computational framework for addressing the scientific questions in each setting. Estimation and hypothesis testing are broad groupings of inferential procedures. Estimation is suitable for settings in which the main goal is the assessment of the magnitude of a quantity, such as a measure of a physical constant or the rate of change in a response corresponding to a change in an explanatory variable. Hypothesis testing is suitable for settings in which scientific interest is focused on the possible effect of a natural event or intentional intervention, and a study is conducted to assess the evidence for and against this effect. In this context, hypothesis testing helps answer binary questions. For example, will a plant grow faster with fertilizer A or fertilizer B? Do children in smaller classes learn more? Does an experimental drug work better than a placebo? Several types of more specialized statistical methods are used in scientific inquiry, including methods for the design of studies and methods for developing and evaluating prediction algorithms.

Because hypothesis testing has been involved in a major portion of reproducibility and replicability assessments, we consider this mode of statistical inference in some detail.
However, considerations of reproducibility and replicability apply broadly to other modes and types of statistical inference. For example, the issue of drawing multiple statistical inferences from the same data is relevant to all hypothesis testing and to estimation.

Studies involving hypothesis testing typically involve many factors that can introduce variation in the results. Some of these factors are recognized, and some are unrecognized. Random assignment of subjects or test objects to one or the other of the comparison groups is one way to control for the possible influence of both unrecognized and recognized sources of variation. Random assignment may help avoid systematic differences between groups being compared, but it does not affect the variation inherent in the system (population or an intervention) under study.

Scientists use the term "null hypothesis" to describe the supposition that there is no difference between the two intervention groups or no effect of a treatment on some measured outcome (Fisher, 1935). A standard statistical test aims to answer the question: If the null hypothesis is true, what is the likelihood of having obtained the observed difference? In general, the greater the observed difference, the smaller the likelihood it would have occurred by chance when the null hypothesis is true. This measure of the likelihood that an obtained value occurred by chance is called the "p-value." As traditionally interpreted, if a calculated p-value is smaller than a defined threshold, the results may be considered "statistically significant." A typical threshold may be p ≤ 0.05 or, more stringently, p ≤ 0.01 or p ≤ 0.005. In a statement issued in 2016, the American Statistical Association Board (Wasserstein and Lazar, 2016, p. 1290) noted:
"[W]hile the p-value can be a useful statistical measure, it is commonly misused and misinterpreted. This has led to some scientific journals discouraging the use of p-values including one which banned its use (Trafimow and Marks, 2015), and some scientists and statisticians recommending their abandonment, with some arguments essentially unchanged since p-values were first introduced."

More recently, it has been argued that p-values, properly calculated and understood, can be informative and useful; however, a conclusion of "statistical significance" based on an arbitrary threshold of likelihood (even a familiar one such as p ≤ 0.05) is unhelpful and frequently misleading (Wasserstein et al., 2019; Amrhein et al., 2019b).

Understanding what a p-value does not represent is as important as understanding what it does indicate. In particular, the p-value does not represent the probability that the null hypothesis is true. Rather, the p-value is calculated on the assumption that the null hypothesis is true. The probability that the null hypothesis is true, or that the alternative hypothesis is true, can be based on calculations informed in part by the observed results, but this is not the same as a p-value.

In scientific research involving hypotheses about the effects of an intervention, researchers seek to avoid two types of error that can lead to non-replicability:

• Type I error: a false positive or a rejection of the null hypothesis when it is correct; and
• Type II error: a false negative or failure to reject a false null hypothesis, that is, allowing the null hypothesis to stand when an alternative hypothesis, and not the null hypothesis, is correct.

Ideally, both Type I and Type II errors would be simultaneously reduced in research.
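The two error types can be made concrete with a small simulation. The sketch below is illustrative only and is not drawn from the report: it assumes normally distributed outcomes with a known standard deviation of 1, a hypothetical true effect of 0.5, and a simple two-sided z-test. The function names, sample sizes, and thresholds are all assumptions chosen for the illustration.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(group_a, group_b, sigma=1.0):
    """p-value for a difference in means, computed on the assumption that the
    null hypothesis (no true difference) holds and that sigma is known."""
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_effect, n_per_group, alpha, n_studies=2000, seed=0):
    """Fraction of simulated studies whose p-value is at or below alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        if two_sided_p_value(treated, control) <= alpha:
            rejections += 1
    return rejections / n_studies


# Type I error rate: the null hypothesis is true (effect = 0) but is rejected.
type_i_rate = rejection_rate(true_effect=0.0, n_per_group=20, alpha=0.05)

# Type II error rate: a real effect exists (hypothetical size 0.5) but is missed.
type_ii_small_sample = 1 - rejection_rate(0.5, n_per_group=20, alpha=0.05)
type_ii_large_sample = 1 - rejection_rate(0.5, n_per_group=80, alpha=0.05)
type_ii_strict_threshold = 1 - rejection_rate(0.5, n_per_group=20, alpha=0.005)

print(f"Type I rate at p <= 0.05, n = 20:   {type_i_rate:.3f}")
print(f"Type II rate at p <= 0.05, n = 20:  {type_ii_small_sample:.3f}")
print(f"Type II rate at p <= 0.05, n = 80:  {type_ii_large_sample:.3f}")
print(f"Type II rate at p <= 0.005, n = 20: {type_ii_strict_threshold:.3f}")
```

Under these assumptions, the simulated Type I rate should sit near the chosen threshold, the Type II rate should fall as the sample grows, and it should rise when the threshold is tightened without adding subjects.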
Increasing the statistical power of a study, for example by increasing the number of subjects, can reduce the likelihood of a Type II error for any given likelihood of Type I error.3 Although the increased data that comes with higher-powered studies can help reduce both Type I and Type II errors, adding more subjects typically means more time and cost for a study.

Researchers are often forced to make tradeoffs in which reducing the likelihood of one type of error increases the likelihood of the other. For example, when p-values are deemed useful, Type I errors may be minimized by lowering the significance threshold to a more stringent level (for example, by lowering the standard p ≤ 0.05 to p ≤ 0.005).4 However, this would simultaneously increase the likelihood of a Type II error. In some cases, it may be useful to define separate interpretive zones, where p-values above one significance threshold are not deemed significant, p-values below a more stringent significance threshold are deemed significant, and p-values between the two thresholds are deemed inconclusive. Alternatively, one could simply accept the calculated p-value for what it is, the likelihood of obtaining the observed result if the null hypothesis were true, and refrain from further interpreting the results as "significant" or "not significant." The traditional reliance on a single threshold to determine significance can incentivize behaviors that work against scientific progress (see section, Publication Bias, in Chapter 5).

3 "Statistical power" is the probability that a test will reject the null hypothesis when a specific alternative hypothesis is true.
4 The threshold for statistical significance is often referred to as p "less than" 0.05; we refer to this threshold as "less than or equal to."

Tension can arise between replicability and discovery, specifically, between the replicability and the novelty of the results. Hypotheses with low a priori probabilities are less likely to be replicated. In this vein, Wilson and Wixted (2018) illustrated how fields investigating potentially ground-breaking results will produce results that are less replicable, on average, than fields that investigate highly likely, almost-established results. Indeed, a field could achieve near-perfect replicability if it limited its investigations to prosaic phenomena that were already well known. As Wilson and Wixted (2018, p. 193) state: "We can imagine pages full of findings that people are hungry after missing a meal or that people are sleepy after staying up all night," which would not be very helpful "for advancing understanding of the world." In the same vein, it would not be helpful for a field to focus solely on improbable, outlandish hypotheses.

The goal of science is not, and ought not to be, for all results to be replicable. Reports of non-replication of results can generate excitement as they may indicate possibly new phenomena and expansion of current knowledge. Also, some level of non-replicability is expected when scientists are studying new phenomena that are not well established. As knowledge of a system or phenomenon improves, replicability of studies of that particular system or phenomenon would be expected to increase.

Assessing the probability that a hypothesis is correct in part based on the observed results can also be approached through Bayesian analysis. This approach starts with a priori (before data observation) assumptions, known as prior probabilities, and revises them on the basis of the observed data using Bayes' theorem, sometimes described as the Bayes formula. Appendix D illustrates how a Bayesian approach to inference can, under certain assumptions on the data generation mechanism and on the a priori likelihood of the hypothesis, use observed data to estimate the probability that a hypothesis is correct.
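The mechanics of such an update can be sketched in a few lines. The example below is a simplified illustration, not the calculation in Appendix D: it assumes that the only evidence is whether a result crossed a significance threshold, and it uses a hypothetical power of 0.80 and a threshold of 0.01, so its numbers should not be expected to match figures derived under the report's own assumptions. All function and variable names are invented for the illustration.

```python
def posterior_probability(prior_probability: float, bayes_factor: float) -> float:
    """Bayes' theorem in odds form: posterior odds = prior odds * Bayes factor."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def bayes_factor_for_significant_result(power: float, alpha: float) -> float:
    """How much more likely a significant result is when the hypothesis is true
    (probability = power) than when the null is true (probability = alpha)."""
    return power / alpha


# Hypothetical study characteristics, assumed only for this illustration.
ASSUMED_POWER = 0.80
ASSUMED_ALPHA = 0.01

evidence = bayes_factor_for_significant_result(ASSUMED_POWER, ASSUMED_ALPHA)

# The same evidence applied to a surprising and to a more plausible hypothesis.
surprising = posterior_probability(prior_probability=0.01, bayes_factor=evidence)
plausible = posterior_probability(prior_probability=0.25, bayes_factor=evidence)

print(f"Prior 1%:  posterior probability = {surprising:.3f}")
print(f"Prior 25%: posterior probability = {plausible:.3f}")
```

The point of the sketch is qualitative: identical evidence leaves a hypothesis that started out improbable much less likely to be true than one that started out plausible.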
One of the most striking lessons from Bayesian analysis is the profound effect that the pre-experimental odds have on the post-experimental odds. For example, under the assumptions shown in Appendix D, if the prior probability of an experimental hypothesis was only 1 percent, and the obtained results were statistically significant at the p ≤ 0.01 level, only about one in eight of such conclusions that the hypothesis was true would be correct. If the prior probability was as high as 25 percent, then more than four of five such studies would be deemed correct. As common sense would dictate and Bayesian analysis can quantify, it is prudent to adopt a lower level of confidence in the results of a study with a highly unexpected and surprising result than in a study for which the results were a priori more plausible: for an example, see Box 2-2.

Highly surprising results may represent an important scientific breakthrough, even though it is likely that only a minority of them may turn out over time to be correct. It may be crucial, in terms of the example in the previous paragraph, to learn which of the eight highly unexpected (prior probability, 1 percent) results can be verified and which one of the five moderately unexpected (prior probability, 25 percent) results should be discounted.

BOX 2-2
Pre-Experimental Probability: An Example

The importance of pre-experimental probability can be illustrated by considering a hypothetical case of an experiment involving homeopathy. Suppose a homeopathic practitioner is convinced of the basic principle of homeopathy: that extremely dilute solutions of a substance can effectively treat ailments related to the substance. His theory is that when homeopathy fails, it is either because the treatment solution has been adulterated (e.g., by using imperfectly distilled water) or it is not sufficiently dilute to produce the desired effect. He designs an experiment to test the efficacy of a 1 percent solution that is then diluted 1 to 100, and then each subsequent dilution similarly diluted by 1 to 100 for a total of 1,000 dilutions. To avoid possible bias in the conduct of the experiment, the homeopathic practitioner enlists a researcher who, like the patients in the study, is unaware of whether any particular patient is receiving the dilution or pure distilled water (so-called double-masked or double-blind study design).

The study comparing this final dilution to pure distilled water finds a difference favoring the dilution. The practitioner believes it was plausible, even likely, because he was predisposed to that conclusion. For a chemist schooled in the physical reality of her discipline, the theory is unfounded and the experimental result would barely affect her conclusion that the likelihood that the conclusion is true is close to zero. The practitioner and the chemist may agree on every aspect of the study and its analysis yet reach diametrically different estimates of the likelihood that the scientific conclusion is correct based on their prior beliefs and assumptions, independent of this study.

These differing conclusions illustrate the importance of considering the results of any single study in the context of other results, particularly if the results are inherently surprising. This is an important step toward building a body of evidence on which to make a conclusion and not being swayed by one novel, and perhaps unreliable, result.

Keeping the idea of prior probability in mind, research focused on making small advances to existing knowledge would result in a high replication rate (i.e., a high rate of successful replications) because researchers would be looking for results that are very likely correct. But doing so would have the undesirable effect of reducing the likelihood of making major new discoveries (Wilson and Wixted, 2018).
Many important advances in science have resulted from a bolder approach based on more speculative hypotheses, although this path also leads to dead ends and to insights that seem promising at first but fail to survive after repeated testing.

The "safe" and "bold" approaches to science have complementary advantages. One might argue that a field has become too conservative if all attempts to replicate results are successful, but it is reasonable to expect that researchers follow up on new but uncertain discoveries with replication studies to sort out which promising results prove correct. Scientists should be cognizant of the level of uncertainty inherent in speculative hypotheses and in surprising results in any single study.