Prepublication copy, uncorrected proofs.

APPENDIX D: Using Bayes Analysis for Hypothesis Testing

After a study is conducted that produces a scientific conclusion, what is the likelihood that the conclusion is correct? In the case of research that involves hypothesis testing, the scientific result may point to the null or to the alternative hypothesis. An estimate of the likelihood that the scientific conclusion is correct is represented by the post-experimental (a posteriori) probability of, or the odds favoring, the particular hypothesis. These odds or, equivalently, the probability, can be obtained from the Bayes formula. For purposes of exposition, it is convenient to express the Bayes formula using likelihood ratios in the simplified context of observing a data point xo and using it to test the null hypothesis H0 versus the alternative hypothesis H1, as shown in Equation D.1. For a study comparing two groups, H0 would typically be the hypothesis of no difference between the groups, H1 would specify a difference of a particular size, and the observed data point would be the difference in the group means.

In mathematical terms, the Bayes formula is represented as follows (Equation D.1):

$$\frac{P[H_1 \mid x_o]}{P[H_0 \mid x_o]} = \frac{P[H_1]}{P[H_0]} \times \frac{f[x_o \mid H_1]}{f[x_o \mid H_0]}$$

In this representation:
P[H1|xo] is the probability that H1 (the alternative hypothesis) is correct given the observed findings (xo).
P[H0|xo] is the probability that H0 (the null hypothesis) is correct given the observed findings (xo).
P[H1] is the prior (pre-experimental¹) probability of H1.
P[H0] is the prior (pre-experimental) probability of H0.
f[xo|H1] is the likelihood of xo under the alternative hypothesis, assumed here to follow a normal distribution.
f[xo|H0] is the likelihood of xo under the null hypothesis, assumed here to follow a normal distribution.
It is assumed that P[H0] + P[H1] = 1.0, which also implies P[H0|xo] + P[H1|xo] = 1.0.
The ratio of P[H1] to P[H0] is the prior odds favoring the alternative hypothesis H1, while the ratio of P[H1|xo] to P[H0|xo] is the posterior odds favoring the alternative hypothesis. The ratio of f[xo|H1] to f[xo|H0] is called the Bayes factor. In words, the Bayes formula (Equation D.1) shows that the post-experimental odds favoring a hypothesis depend on the pre-experimental odds favoring the hypothesis and on the likelihood of observing the results when the hypothesis is true relative to the likelihood of observing them when the hypothesis is false.

¹ Pre-experimental or prior probability may also be referred to as "a priori probability." We chose to use "prior probability" throughout this appendix.
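The odds form of the Bayes formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original appendix: the function names are hypothetical, and the inputs used in the example (a prior P[H1] of 0.25 and a Bayes factor of 3.87) are the values the appendix itself uses later for a one-sided p = 0.05.

```python
def prob_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds in favor."""
    return p / (1.0 - p)


def odds_to_prob(odds: float) -> float:
    """Convert odds in favor to a probability."""
    return odds / (1.0 + odds)


def posterior_odds(prior_prob_h1: float, bayes_factor: float) -> float:
    """Equation D.1: posterior odds = prior odds x Bayes factor.

    Assumes P[H0] + P[H1] = 1, as the appendix does.
    """
    return prob_to_odds(prior_prob_h1) * bayes_factor


# Illustration: prior P[H1] = 0.25 and the Bayes factor of 3.87 that the
# appendix reports for a one-sided p = 0.05.
odds = posterior_odds(0.25, 3.87)
print(f"posterior odds favoring H1: {odds:.2f}")
print(f"posterior P[H1|xo]:         {odds_to_prob(odds):.2f}")
```

With these inputs the sketch gives posterior odds of about 1.29, which corresponds to a probability of about 0.56.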

The p-value, in classical statistics, is defined as the probability of finding an observed, or more extreme, result under the assumption that the null hypothesis is true. The p-value is thus related to the expression f[xo|H0] in that the p-value represents the area under one or both tails of the curve defined by the possible observed results when the null hypothesis is true. It is assumed that the possible results under the null hypothesis (H0) and under the alternative hypothesis (H1) have normal distributions with the same variance (σ²) but different means (μ).² In the case of a two-group comparison, the mean under the null hypothesis would be zero and under the alternative hypothesis non-zero. With these assumptions, and the p-value calculated on the basis of the results observed in an experiment, one can apply a Bayesian approach to estimate the post-experimental odds favoring the alternative or the null hypothesis based on the pre-experimental odds and the measured p-value. The pre- and post-experimental odds may be equivalently expressed as probabilities.³ In principle, under a Bayesian approach, the alternative hypothesis (H1) may take on any value (indicating the distance of its mean from the H0 mean) and any prior (pre-experimental) probability (subject to the constraint that the probabilities of the null and the alternative sum to 1.0). The pre-experimental probabilities of the hypotheses (P[H0] and P[H1] in the Bayes formula) reflect the prior expectation that a hypothesis would be true. If an inference from a study is very surprising, this means that the pre-experimental probability of the corresponding hypothesis was low. If a particular inference was highly anticipated, this indicates that the pre-experimental probability of its corresponding hypothesis was high.
² The height of a normal curve is defined as $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, where μ is the mean and σ is the standard deviation (σ² is the variance); π is a constant representing the ratio of the circumference to the diameter of a circle and is ≈ 3.14159; and e is the base of natural logarithms and is ≈ 2.718282.
³ To convert from odds to probabilities, divide the odds by one plus the odds. To convert from a probability to odds, divide the probability by one minus that probability. Odds of 3 (or 3 to 1 in favor) thus convert to a probability of 3 ÷ 4 = 0.75.

As noted above, in a study that compares an experimental and a control group, the null hypothesis specifies that the difference between the means of the experimental and control group is zero. The alternative hypothesis can specify that the difference in means takes on any pre-experimental value reflecting the degree of effect that the experimenter posits. For example, there may be a threshold for action that the experimenter identifies, and the experimenter wishes to test whether this threshold has been exceeded. For purposes of illustration here, consider an alternative hypothesis where the underlying mean effect size (μ1) is the same as the effect size actually observed (xo). This is for expository purposes only; this choice of value for the mean effect size of the alternative hypothesis illustrates the maximum degree to which the observed results of the study can diminish the post-experimental probability of the null hypothesis. Put another way, if the observed results happened to coincide with the mean value of the previously chosen alternative hypothesis, one would obtain the maximum possible change in the a posteriori (post-experimental) probability of the experimental hypothesis in comparison with the null hypothesis.

One can use the ratio of the two likelihood functions (for H1 and for H0) at the observed results (xo) to estimate the odds favoring the more likely (higher) hypothesis given the observed effect. This ratio, f[xo|H1] / f[xo|H0], at the observed effect, gives the Bayes factor that pertains to this study. When we specify an effect size that generates a particular p-value under the assumption that the null hypothesis is correct, such as p = 0.05 for H0, this determines the Bayes factor for that p-value, namely f[xo|H1] / f[xo|H0] evaluated at the xo that corresponds to p = 0.05. Under the assumption of normal distributions with the same variance for the data under the null and alternative hypotheses, with μ1 = xo, the Bayes factor does not depend on the specific values of μ0, μ1, or σ (the means or standard deviation of the distributions). Rather, the Bayes factor reduces to a function of the standard deviate units (z) that correspond to the specified p-value in each case: specifically, under these assumptions, the Bayes factor = $e^{z^2/2}$. This follows because, with μ1 = xo, f[xo|H1] is the peak height of the normal curve, $1/(\sigma\sqrt{2\pi})$, while f[xo|H0] is that same height multiplied by $e^{-z^2/2}$, where $z = (x_o - \mu_0)/\sigma$. The z-score for any specified p-value may be found in any table of standard normal probabilities: for example, a one-sided p-value of 0.05 corresponds to z = 1.645. Under our assumptions, the Bayes factor for p = 0.05 is $e^{(1.645)^2/2}$ = 3.87. Importantly, only if one also knows the pre-experimental (prior) odds favoring the experimental hypothesis, expressed as P[H1] / P[H0], can one calculate the post-experimental likelihood that the alternative hypothesis is true on the basis of the results of a specific study. In principle, one would want to specify the prior odds without knowing the specific results of the study, based only on knowledge obtained prior to the study. Here the odds are expressed as favoring the experimental or alternative hypothesis, but one could equivalently use the same results to estimate the post-experimental odds favoring the null hypothesis based on pre-experimental odds of P[H0] / P[H1]. Consider the case in which a study produces results with a one-tailed test of statistical significance at p = 0.05, and the pre-experimental likelihood that the experimental hypothesis is true was 25 percent.
With these assumptions, the post-experimental probability that the experimental hypothesis is true rises only to about 56 percent (see Table D-1; posterior odds of 1.289 favoring H1 are equivalent to a probability of about 56 percent). One of the most striking lessons from Bayesian analysis is the profound effect of the pre-experimental odds that a hypothesis is true on the post-experimental odds. Similar calculations show, for example, how the a posteriori probability of a disease following a positive test result depends crucially on the prior probability. For any given level of statistical significance observed in a study, the likelihood that an inference is correct can vary widely depending on the likelihood it was true before the experiment. For example, if an experiment resulted in a one-sided p-value of 0.01, the post-experimental probability that the hypothesis is true ranges from about 13 percent, if the prior likelihood was as low as 1 percent, to nearly 94 percent, if the prior likelihood was as high as 50 percent. Another way of thinking about this is that if one had done a series of studies in which the prior probability of each experimental hypothesis was only 1 percent, and the results were statistically significant at the 0.01 level, only about one in eight of those study results would be likely to hold up as true. In contrast, if the prior probability was as high as 25 percent, then the post-experimental probability would rise to about 83 percent, and one would expect more than four of five such studies to hold up over time. It is clearly inappropriate to apply the same confidence to the results of a study with a highly unexpected and surprising result as to the results of a study in which the results were a priori more plausible. If one quantifies the prior expectations, then the Bayes formula can be used to calculate the appropriate adjustment to the post-experimental probabilities.
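The dependence on the prior can be checked numerically. The sketch below is illustrative only and uses hypothetical function names. It relies on the appendix's assumptions (normal likelihoods with equal variance, alternative mean equal to the observed effect), takes the z-score from Python's standard-library `statistics.NormalDist` rather than a printed table, and also computes the Bayes factor directly as a ratio of normal densities to show that the result does not depend on σ.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def bayes_factor_from_p(p_one_sided: float) -> float:
    """Bayes factor exp(z^2 / 2) for a one-sided p-value.

    Relies on the appendix's assumptions: normal likelihoods with equal
    variance and an alternative mean equal to the observed effect.
    """
    z = STD_NORMAL.inv_cdf(1.0 - p_one_sided)
    return math.exp(z * z / 2.0)


def bayes_factor_from_densities(x_o: float, mu0: float, sigma: float) -> float:
    """Same quantity computed directly as f[xo|H1] / f[xo|H0] with mu1 = xo."""
    h1 = NormalDist(mu=x_o, sigma=sigma)
    h0 = NormalDist(mu=mu0, sigma=sigma)
    return h1.pdf(x_o) / h0.pdf(x_o)


def posterior_prob_h1(prior_prob_h1: float, bayes_factor: float) -> float:
    """Posterior P[H1|xo] from the prior P[H1] and the Bayes factor."""
    odds = prior_prob_h1 / (1.0 - prior_prob_h1) * bayes_factor
    return odds / (1.0 + odds)


# One-sided p = 0.01 with three different priors for H1.
bf = bayes_factor_from_p(0.01)
for prior in (0.01, 0.25, 0.50):
    print(f"prior {prior:.2f} -> posterior {posterior_prob_h1(prior, bf):.2f}")
```

Because the z-score here is computed rather than read from a table, the Bayes factor can differ in the second or third decimal place from the values printed in the tables; the rounded posterior probabilities (about 13, 83, and 94 percent) agree with the text.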

If study results are significant only at the 0.05 level (rather than 0.01 level), then the post-experimental probabilities of the experimental hypothesis (P[H1|xo]) would range from under 4 percent, when the pre-experimental probability was 1 percent, to nearly 80 percent, when the pre-experimental probability was 50 percent. Comparisons across levels of significance show the degree to which more statistically significant results affect the likelihood that the experimental hypothesis is correct: see Tables D-1, D-2, D-3, and D-4. However, the effect of the observed level of statistical significance is indirect, affected by sample size and variance, and mediated by the Bayes factor and the prior probabilities of the null and experimental hypotheses.
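The comparison across significance levels can also be laid out as a small grid. The following sketch is an illustration under the appendix's assumptions, with hypothetical names (`Z_BY_P`, `posterior_prob_h1`); it uses the z-scores given in the titles of Tables D-1 through D-4 and the simplified Bayes factor exp(z²/2), so its output may differ slightly in the third decimal from the tabulated values.

```python
import math

# z-scores as stated in the titles of Tables D-1 to D-4 (one-sided tests).
Z_BY_P = {0.05: 1.645, 0.025: 1.96, 0.01: 2.325, 0.005: 2.575}


def posterior_prob_h1(prior_prob_h1: float, z: float) -> float:
    """Posterior P[H1|xo] using the simplified Bayes factor exp(z^2 / 2)."""
    bayes_factor = math.exp(z * z / 2.0)
    odds = prior_prob_h1 / (1.0 - prior_prob_h1) * bayes_factor
    return odds / (1.0 + odds)


# Rows are significance levels; columns are prior probabilities of H1.
priors = (0.01, 0.25, 0.50)
print("p-value  " + "  ".join(f"prior={q:.2f}" for q in priors))
for p, z in Z_BY_P.items():
    cells = "  ".join(f"{posterior_prob_h1(q, z):10.3f}" for q in priors)
    print(f"{p:<7}  {cells}")
```

Reading down a column shows how a smaller p-value raises the posterior for a fixed prior; reading across a row shows how strongly the prior still matters at any one significance level.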

TABLE D-1 Posterior Odds Based on Bayes Formula for p = 0.05, 1-Sided Test, z = 1.645

Prior     Prior     Prior Odds                           Posterior Odds
P[H1]     P[H0]     P[H1]/P[H0]    P[H1|xo]   P[H0|xo]   P[H1|xo]/P[H0|xo]
0.01      0.99      0.010          0.038      0.962      0.039
0.05      0.95      0.053          0.169      0.831      0.204
0.10      0.90      0.111          0.301      0.699      0.430
0.20      0.80      0.250          0.492      0.508      0.967
0.25      0.75      0.333          0.563      0.437      1.289
0.30      0.70      0.429          0.624      0.376      1.658
0.40      0.60      0.667          0.721      0.279      2.579
0.50      0.50      1.000          0.795      0.205      3.868

NOTES: In this table:
Bayes factor: f[xo|H1] / f[xo|H0] = 3.868427157 ≈ 3.87.
Bayes factor (simplified calculation): $e^{z^2/2}$ = $e^{(1.645)^2/2}$ ≈ 3.87.

TABLE D-2 Posterior Odds Based on Bayes Formula for p = 0.025, 1-Sided Test, z = 1.96

Prior     Prior     Prior Odds                           Posterior Odds
P[H1]     P[H0]     P[H1]/P[H0]    P[H1|xo]   P[H0|xo]   P[H1|xo]/P[H0|xo]
0.01      0.99      0.010          0.065      0.935      0.069
0.05      0.95      0.053          0.264      0.736      0.359
0.10      0.90      0.111          0.431      0.569      0.758
0.20      0.80      0.250          0.631      0.369      1.707
0.25      0.75      0.333          0.695      0.305      2.275
0.30      0.70      0.429          0.745      0.255      2.926
0.40      0.60      0.667          0.820      0.180      4.551
0.50      0.50      1.000          0.872      0.128      6.826

NOTES: In this table:
Bayes factor: f[xo|H1] / f[xo|H0] = 6.826417419 ≈ 6.83.
Bayes factor (simplified calculation): $e^{z^2/2}$ = $e^{(1.96)^2/2}$ ≈ 6.83.

TABLE D-3 Posterior Odds Based on Bayes Formula for p = 0.01, 1-Sided Test, z = 2.325

Prior     Prior     Prior Odds                           Posterior Odds
P[H1]     P[H0]     P[H1]/P[H0]    P[H1|xo]   P[H0|xo]   P[H1|xo]/P[H0|xo]
0.01      0.99      0.010          0.132      0.868      0.151
0.05      0.95      0.053          0.441      0.559      0.789
0.10      0.90      0.111          0.625      0.375      1.666
0.20      0.80      0.250          0.789      0.211      3.748
0.25      0.75      0.333          0.833      0.167      4.997
0.30      0.70      0.429          0.865      0.135      6.425
0.40      0.60      0.667          0.909      0.091      9.994
0.50      0.50      1.000          0.937      0.063      14.991

NOTES: In this table:
Bayes factor: f[xo|H1] / f[xo|H0] = 14.99121706 ≈ 15.0.
Bayes factor (simplified calculation): $e^{z^2/2}$ = $e^{(2.325)^2/2}$ ≈ 14.9.

TABLE D-4 Posterior Odds Based on Bayes Formula for p = 0.005, 1-Sided Test, z = 2.575

Prior     Prior     Prior Odds                           Posterior Odds
P[H1]     P[H0]     P[H1]/P[H0]    P[H1|xo]   P[H0|xo]   P[H1|xo]/P[H0|xo]
0.01      0.99      0.010          0.218      0.782      0.279
0.05      0.95      0.053          0.592      0.408      1.453
0.10      0.90      0.111          0.754      0.246      3.067
0.20      0.80      0.250          0.873      0.127      6.900
0.25      0.75      0.333          0.902      0.098      9.201
0.30      0.70      0.429          0.922      0.078      11.829
0.40      0.60      0.667          0.948      0.052      18.401
0.50      0.50      1.000          0.965      0.035      27.602

NOTES: In this table:
Bayes factor: f[xo|H1] / f[xo|H0] = 27.60199354 ≈ 27.6.
Bayes factor (simplified calculation): $e^{z^2/2}$ = $e^{(2.575)^2/2}$ ≈ 27.5.
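The rows of Tables D-1 through D-4 follow mechanically from the Bayes factor reported in each table's notes. The sketch below, with hypothetical names (`PRIORS_H1`, `table_rows`), shows one way to regenerate them; it takes the Bayes factor as an input and does not attempt to reproduce how the appendix computed that value.

```python
PRIORS_H1 = (0.01, 0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50)


def table_rows(bayes_factor: float, priors=PRIORS_H1):
    """Return (P[H1], P[H0], prior odds, P[H1|xo], P[H0|xo], posterior odds)."""
    rows = []
    for p_h1 in priors:
        p_h0 = 1.0 - p_h1                       # P[H0] + P[H1] = 1
        prior_odds = p_h1 / p_h0
        post_odds = prior_odds * bayes_factor   # Equation D.1
        post_h1 = post_odds / (1.0 + post_odds)
        rows.append((p_h1, p_h0, prior_odds, post_h1, 1.0 - post_h1, post_odds))
    return rows


# Bayes factor taken from the notes to Table D-1 (p = 0.05, one-sided).
for row in table_rows(3.868427157):
    print("  ".join(f"{value:7.3f}" for value in row))
```

Passing the Bayes factor from another table's notes (for example, 27.60199354 for Table D-4) yields that table's rows in the same way.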

If the observed results produce a p-value equal to 0.005 and the prior probability of the experimental hypothesis is 0.25, then the post-experimental probability that the experimental hypothesis is true is about 90 percent. It is reasoning such as this (using different assumptions in applying the Bayes formula) that led a group of statisticians to recommend setting the threshold p-value to 0.005 for claims of new discoveries (Benjamin et al., 2018). One drawback of this very stringent threshold for statistical significance is that it would fail to detect legitimate discoveries that by chance had not attained the more stringent p-value in an initial study. Regardless of the threshold level of p-value that is chosen, in no case is the p-value a measure of the likelihood that an experimental hypothesis is true. When the prior probability of an experimental hypothesis (P[H1]) is 0.3 (meaning its pre-experimental likelihood of being true is about 1 in 3) and the p-value is 0.05, Table D-1 shows the post-experimental probability to be about 62 percent (posterior odds favoring H1 of 1.658 are equivalent to a probability of about 62 percent). If replication efforts of studies with these characteristics were to fail about 40 percent of the time, one would say this is in line with expectations, even assuming the studies were flawlessly executed. When a study fails to be replicated, it may be because of shortcomings in study design or execution, or it may be related to the boldness of the experiment and surprising nature of the results, as manifested in a low pre-experimental probability that the scientific inference is correct (Wilson and Wixted, 2018). For this reason, failures to replicate can be a sign of error, may relate to variability in the data and sample size of a study, or may signal investigators' eagerness to make important, unexpected discoveries and represent a natural part of the scientific process.
Without losing sight of the importance of errors in experimental design and execution or instances of fraud as sources of non-replicability, this excursion into Bayesian reasoning demonstrates how non-replicability can reflect the probabilistic nature of scientific research and be an integral part of progress in science. Just as it would be wrong to assume that any particular instance of non-replicability indicates a fundamental problem with that study or with a whole branch of science, it is equally wrong to ignore sources of non-replicability that are avoidable and the result of error or malfeasance. It is incumbent on those who produce scientific results to use sound research design and technique and to be clear, precise, and accurate in depicting the uncertainty inherent in their results; those who use scientific results need to understand the limitations of any one study in demonstrating that a scientific hypothesis is more or less likely to be correct.

REFERENCES

Benjamin, D.J., et al. [Daniel J. Benjamin, James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher D. Chambers, Merlise Clyde, Thomas D. Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, Ernst Fehr, Fiona Fidler, Andy P. Field, Malcolm Forster, Edward I. George, Richard Gonzalez, Steven Goodman, Edwin Green, Donald P. Green, Anthony Greenwald, Jarrod D. Hadfield, Larry V. Hedges, Leonhard Held, Teck Hua Ho, Herbert Hoijtink, Daniel J. Hruschka, Kosuke Imai, Guido Imbens, John P. A. Ioannidis, Minjeong Jeon, James Holland Jones, Michael Kirchler, David Laibson, John List, Roderick Little, Arthur Lupia, Edouard Machery, Scott E. Maxwell, Michael McCarthy, Don Moore, Stephen L. Morgan, Marcus Munafó, Shinichi Nakagawa, Brendan Nyhan, Timothy H. Parker, Luis Pericchi, Marco Perugini, Jeff Rouder, Judith Rousseau, Victoria Savalei, Felix D. Schönbrodt, Thomas Sellke, Betsy Sinclair, Dustin Tingley, Trisha Van Zandt, Simine Vazire, Duncan J. Watts, Christopher Winship, Robert L. Wolpert, Yu Xie, Cristobal Young, Jonathan Zinman, and Valen E. Johnson] (2018). Redefine Statistical Significance. Nature Human Behaviour, 2, 6-10.

Wilson, B.M., and Wixted, J.T. (2018). The Prior Odds of Testing a True Effect in Cognitive and Social Psychology. Advances in Methods and Practices in Psychological Science, 1(2), 186-197.