Intuitive Judgment and the Evaluation of Evidence

Dale Griffin
Stanford University
What I do wish to maintain -- and it is here that the scientific attitude becomes imperative -- is that insight, untested and unsupported, is an insufficient guarantee of truth, in spite of the fact that much of the most important truth is first suggested by its means. (Bertrand Russell, 1969, p. 16)

Intuitive judgment is often misleading. Natural processes of judgment and decision-making are subject to systematic errors. The formal structures of science -- objective measurement, statistical evaluation, and strict research design -- are designed to minimize the effect of such errors. Relying only on intuitive methods of assessing evidence may lead to faulty beliefs about the world, and may make those beliefs difficult or impossible to change. When very important decisions are to be made, the absence of a well-defined formal strategy is apt to prove costly.

The systematic biases which plague our gathering and evaluation of evidence are adaptive on an individual level, in that they increase the ease of decision-making and protect our emotional well-being. Such benefits, however, may be purchased at a high price. In the context of beliefs and decisions about national policy issues, the speedy and conflict-free resolution of uncertainty is not adaptive when the cost is poor utilization of the evidence.

Bertrand Russell (1969) argued that science needs both intuition and logic, the first to generate (and appreciate) ideas and the second to evaluate their truth. But problems arise when intuitive processes replace logic as the arbiter of truth. Especially in the arena of public policy, evaluation decisions must be based on grounds that can be defined, described, and publicly observed.1

This paper examines the risks of assessing evidence by subjective judgment. Specific examples will focus on the difficulty of assessing claims for techniques designed to enhance human performance, especially those related to parapsychological phenomena. More generally, the themes will include why personal experience is not a trustworthy source of evidence, why people disagree about beliefs despite access to the same evidence, and why evidence so rarely leads to belief change. The underlying message is this: The checks and balances of formal science have developed as protection against the unreliability of unaided human judgment.

Overview of the analysis

Organized science can be modeled as a formalized extension of the ways that humans naturally learn about the world (Kelly, 1955).2 In order to predict and control their environment, people generate hypotheses about what events go together, and then gather evidence to test these hypotheses. If the evidence seems to support the current belief, the working hypothesis is retained; otherwise it is rejected.

1 Or in the words of Dawes (1980, p. 68): "In a wide variety of psychological contexts, systematic decisions based on a few explicable and defensible principles are superior to intuitive decisions -- because they work better, because they are not subject to conscious or unconscious biases on the part of the decision maker, because they can be explicated and debated, and because their basis can be understood by those most affected by them."

2 Since there is no single model of "formal science", my references will be to the most consensual features: experimental method and quantitative measurement and analysis. The specific contrasts between intuition and formal methods will involve only the features of extensional probability and experimental design that can be found in standard introductory experimental texts (e.g.
Carlsmith, Ellsworth & Aronson, 1976; Freedman, Pisani & Purves, 1978; Neale & Liebert, 1980). The contrast between intuitive and scientific methods is not meant to imply that scientific procedure is always motivated by rational processes (Broad & Wade, 1982). But scientific methods do attempt to minimize the impact of the biases that strike laypeople and scientists alike. A good account of current philosophic criticisms of the social sciences can be found in Fiske & Shweder (1986).
Science adds quantitative measurement to this process. This measurement can be explicitly recorded, and the strength of the evidence for a particular hypothesis can be objectively tallied. The key difference between intuitive and scientific methods is that the measurement and analysis of the scientific investigation are publicly available, while intuitive hypothesis-testing takes place inside one person's mind.

Recent psychological research has examined ways in which intuitive judgment departs from formal models of analysis -- and in focusing on such "errors" and "biases", this research has pinpointed some natural mechanisms of human judgment. In particular, attention has been focused on heuristic "shortcuts" that make our judgments more efficient, and on protective defenses that maintain the emotional state of the decision-maker.

The first section of this paper will examine the costs associated with our mental shortcuts (information-processing or cognitive biases). The second section will discuss the problems caused by our self-protective mechanisms (motivational biases). The third section will discuss how both these types of biases come into play when we are called upon to evaluate evidence that has passed through some mediator: press, TV, or government. This source of information has special properties in that we must evaluate both the source of the evidence and the quality of the evidence. In the final section, the benefits of formal research will be demonstrated.
Problems in Evaluating Evidence I: Information-processing biases

The investigation of cognitive biases in judgment has followed the tradition of the study of perceptual illusions. Much that we know about the organization of the human visual system, for example, comes from the study of situations in which our eye and brain are "fooled" into seeing something that is not there (Gregory, 1970). The most remarkable capacity of the human perceptual system is that it can take in an array of ambiguous information and construct a coherent, meaningful representation of the world. But we generally do not realize how subjective this construction is. Perception seems so immediate to us that we feel as if we are taking in a copy of the true world as it exists. Cognitive judgments have the same feeling of "truth" -- it is difficult to believe that our personal experience does not perfectly capture the objective world.

The systematic biases I will be discussing throughout this section operate at a basic and automatic level. Controlled psychological experimentation has given us many insights into these processes beneath our awareness and beyond our control. The conclusions of these experiments are consistent: these processes are set up to promote efficiency and a sense of confidence. Efficient shortcuts are set up to minimize computation and avoid paralyzing uncertainty. But the shortcuts also lead to serious flaws in our inferential processes, and the illusions of objectivity and certainty prevent us from recognizing the need for using formal methods when the decision is important.

In the Müller-Lyer visual illusion, the presence of opposite-facing arrowheads on two lines of the same length makes one look longer than the other (see Figure 1). But when we have a ruler, we can check that they are the same length, and we believe the formal evidence, rather than that of our fallible visual system. With cognitive biases, the analogue of the ruler is not clear.
Against what should we validate our judgmental system?
The traditional comparison: Clinical versus statistical prediction

The most common standard against which human judgment has been measured is the efficiency of actuarial, or statistical, prediction. In the 1950's, researchers began to compare expert intuition with simple statistical combining rules in predicting mental health prognoses and other personnel outcomes. Typically, such studies involved giving several pieces of information -- such as personality and aptitude test scores -- about a number of patients or job applicants to a panel of experts. Each of these clinical judges would give their opinion about the likely outcome of each case. The actuarial predictions were obtained by a simple statistical "best fit" procedure that defined some mathematical way of combining the pieces of information, and determined the cutoff score that would separate "health" from "pathology" or job "success" from "failure". The predictions from the human judges and the statistical models were then compared with the actual outcomes.

The clinical judges involved in these studies were exceedingly confident that statistical models based on obvious relationships could not capture the subtle strategies that they had developed over years of personal experience. But not only were the actuarial predictions superior to the expert intuitions, many studies indicated "that the amount of professional training and experience of the judge does not relate to his judgmental accuracy" (Goldberg, 1968, p. 484). These actuarial models were not highly sophisticated mathematical formulae that went beyond the computational power of human judges. Instead, the simplest models were the most effective. For example, when clinical psychologists attempted to diagnose psychotics on the basis of their MMPI profile, simply adding up four scales (the choice of the "best fit" criterion) led to better prediction than the expert judgment of the best of the 29 clinicians (Goldberg, 1965).
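The core of this result -- that a fixed additive rule can beat an expert who sees the same information but weights it inconsistently -- can be illustrated with a small simulation. Everything here is invented for illustration (the scales, the noise levels, the "expert's" case-to-case inconsistency); it is a sketch of the phenomenon, not a reconstruction of the MMPI studies.

```python
import random

random.seed(42)

def make_case():
    # Four hypothetical test scales; the outcome depends noisily on their sum.
    scales = [random.gauss(0, 1) for _ in range(4)]
    outcome = 1 if sum(scales) + random.gauss(0, 1.5) > 0 else 0
    return scales, outcome

cases = [make_case() for _ in range(2000)]

def actuarial(scales):
    # The "best fit" style rule: unit weights and a fixed cutoff.
    return 1 if sum(scales) > 0 else 0

def expert(scales):
    # A simulated "expert" who sees the same scales but weights them
    # inconsistently from one case to the next.
    w = [random.uniform(0, 2) for _ in range(4)]
    return 1 if sum(wi * s for wi, s in zip(w, scales)) > 0 else 0

def accuracy(rule):
    return sum(rule(s) == y for s, y in cases) / len(cases)

print(f"actuarial accuracy: {accuracy(actuarial):.2f}")
print(f"expert accuracy:    {accuracy(expert):.2f}")
```

The unreliability of the expert's weighting is enough to cost accuracy even though the expert attends to exactly the same evidence as the rule.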
A minor upheaval in clinical psychology occurred in reaction to Meehl's (1955) monograph, which reviewed a number of studies demonstrating the superiority of objective statistical prediction to the clinical intuition most often used to make judgments of prognosis. Meehl's review was followed by a flood of publications illustrating that simple prediction methods based on the objective tabulation of relationships were almost always superior to expert clinical intuition in diagnosing brain damage, categorizing psychiatric patients, predicting criminal recidivism, and predicting college success (e.g. Kelly & Fiske, 1951). These analyses of clinical judgment in the 1950's were the first to pinpoint many of the weaknesses of human intuition that are the subject of the first part of this paper.

The most important aspect of the clinical-statistical prediction debate is that the clinicians involved were very confident in their intuitive judgment. This combination of demonstrably sub-optimal judgments and continued confidence of the judges set the stage for the two themes of the judgment literature: What is wrong with human judgment? and Why don't people naturally realize the limitations of human intuitive judgment?

Another aspect of this debate that is still important today is the strong reaction of the proponents of human intuition. Some go so far as to define rational judgment by what humans do (Cohen, 1981). In particular, many clinically-oriented theorists fear that an emphasis on measurable outcomes dehumanizes social science. Though Meehl was supportive of clinical work, and the point of his article was to change the focus of clinical psychology from prediction and categorization to therapy, his conclusions received virulent criticism. The problem, the critics of statistical prediction argued, was that clinical intuition deals with deeper holistic integrations that cannot be reflected in prediction equations.
Many opponents of reductionism even question the validity of quantitative evaluation.
This position fails to appreciate how much we can learn about human judgment by following up on the observed superiority of quantitative prediction. In what part of the decision process are we deficient? How important are these deficiencies? If we use formal quantitative models as "measuring sticks", then several areas of comparison are suggested. First, does our intuitive choice of evidence match up to that of "mindless" formal models? Second, do we retrieve and combine information as well as these models do? Third, how accurately do we follow the rules of statistics when we try to evaluate the combined information? Fourth, how well do we learn from experience? Finally, how can we protect ourselves against these errors?

1- Intuition versus formal models: selecting the information

One reason for the superiority of statistical judgment is that it utilizes information based on the observed quantifiable relationship between the predictors and the outcome. A prediction equation starts by identifying those predictors that are meaningful in a purely statistical sense. Survival tables for insurance companies, for example, are created by collecting information on many dimensions of possible relevance. The obvious variables -- sex, weight, ethnicity -- are represented, but so are others, such as family size or income, that are chosen simply because they are statistically related to life span.

Humans cannot attend to and measure every part of the social or physical environment, and cannot observe the interrelationships of every part. Instead, we must have some method of choosing a subset of the available data to monitor most closely. Generally, we rely on pre-existing theories to guide our attention. Our confidence in our intuition prevents us from appreciating the power of such theories to determine the results of the data collection. When we attend only to confirming evidence, it becomes very hard to disprove a theory.
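A toy version of this confirmation-first search strategy, in the style of Wason's (1960) rule-discovery task, shows why confirming tests alone can never disprove a theory. The particular rules below are invented for illustration:

```python
# Suppose the true rule for "valid" number triples is simply "ascending",
# while the tester's hypothesis is the narrower "even numbers ascending by 2".

def true_rule(triple):
    a, b, c = triple
    return a < b < c

def my_hypothesis(triple):
    a, b, c = triple
    return a % 2 == 0 and b == a + 2 and c == b + 2

# Confirming tests: triples chosen to FIT the hypothesis. They all pass,
# so the hypothesis looks correct no matter how many are tried.
confirming_tests = [(2, 4, 6), (10, 12, 14), (20, 22, 24)]
assert all(true_rule(t) for t in confirming_tests)

# A single disconfirming probe -- a triple that violates the hypothesis --
# is what actually reveals that the hypothesis is too narrow.
disconfirming_test = (1, 2, 3)
print(true_rule(disconfirming_test), my_hypothesis(disconfirming_test))  # True False
```

Searching only for triples that "fit" leaves the false hypothesis standing forever; one deliberately theory-violating test settles the matter.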
The confirmation bias -- Even in basic cognitive processes, there are costs to the theory-driven search for information. Humans tend to learn only one schematic representation of a problem -- and then reapply that representation in an inflexible manner to subsequent problems (Duncker, 1945; Luchins, 1942). Often the tendency to apply a familiar schema to a new problem causes those with prior training to miss easier, more efficient ways to solve the problem. People trying to solve logical puzzles doggedly set out to prove their hypothesis by searching out confirming examples, when they would be much more efficient if they would search for disconfirming examples (Wason, 1960). It seems much more natural to search for examples that "fit" with the theory being tested than to search for items that would disprove the theory.

The most dramatic example of theory-driven data collection is the problem of the "self-fulfilling prophecy" (Merton, 1948). This phrase has now become part of popular culture and refers to the way that our theories can actually cause others to act towards us in the way that we expect. The classic work by Rosenthal and his colleagues on this topic is reviewed in detail in another paper in this series, and so will be touched upon only briefly.

Especially well-known is the study by Rosenthal and Jacobson (1973) entitled Pygmalion in the Classroom. Teachers were given false information on the expected achievement of some of their students. Based on the expectations created by this information, the teachers went on to treat the randomly selected "late-bloomers" so differently that these students scored especially highly on subsequent achievement tests.

The standard wisdom is that such demonstrations point out the absolute necessity of employing experimenters "blind" to the hypothesis in scientific research. When experimenters know how the subjects in a particular condition "should" behave, it is impossible not to give unconscious clues to the subjects.
But in everyday experience we always have some guiding theory or stereotype. We are not
blind to our expectations or theories about how people will behave. Snyder (1981) has examined how people investigate theories in social situations. People who try to determine if others are extroverted ask questions about extroverted qualities -- and discover that most people are extroverts. People who try to determine if others are introverted ask about introverted qualities -- and discover that most people are introverts. Men who believe that they are having a phone conversation with an attractive woman talk in an especially friendly way. When they do this, their unseen woman partner responds in friendly and "attractive" ways.

Everyone is familiar with the vicious competitor who is certain that it is a "dog-eat-dog" world. Studies of competitive games reveal that these people have beliefs about the world that cause others to act in a way that maintains those very beliefs (Kelley & Stahelski, 1970). Aggressive competitors in these studies believed that they had to "get" their opponent before their opponents got them. Sure enough, their opponents responded to their aggressive moves with aggressive countermoves, "proving" the competitive theory of human nature.

Such biases do not need to come from strong long-standing theories; they can be created within one situation. When people observe a contestant start out with a string of correct answers, they assimilate the rest of his or her performance to their first impression. A person who starts out well is judged more intelligent than a person who gets the same total number of answers correct but starts out poorly (Jones, Rock, Shaver, Goethals & Ward, 1968).

This research provides one answer to the question: Why do people remain confident in the validity of poor theories? If you hold a theory strongly and confidently, then your search for evidence will be dominated by those attention-getting events that confirm your theory.
The confirmation bias: An example -- People bias their search for data about the social world in such a way as to support their existing beliefs. This is the starting point of the unreliability of personal experience. It is easy to see how social relations can be biased by pre-existing stereotypes about individuals and groups, but it is less obvious that our collection of more objective data is distorted by prior theories.

An elegant and relevant demonstration of the assimilation of objective evidence to theory was Marks and Kammann's (1980) conceptual replication of the famous Stanford Research Institute's remote viewing experiments (Targ & Puthoff, 1974). In this example, formal statistical methods meet "head to head" with intuitive judgment. Remote viewing is the name given to the ability to "see" a setting that is physically removed from the viewer. In the original SRI study, viewers described their impressions of a number of targets and sketched an outline of those impressions. Judges then rated how well each description matched the possible targets. The accuracy of viewing in these experiments was remarkable. (Marks and Kammann believed that the judges actually knew -- through clues about the date and order of targets -- which drawings and impressions went with which targets. But that is not essential to this story.)

Marks and Kammann ran 35 remote viewing studies of their own and discovered that despite the fact that both the subjects and the judges were confident that they could make the correct matchups, the results were never statistically significant. The judges themselves, using objective criteria to rank how well each viewing transcript matched each target, were convinced that they had found strong and convincing matches -- yet formal analysis did not corroborate this belief. To understand the nature of the theory-confirming process, Marks and Kammann accompanied a subject to the various targets and watched what they call "subjective validation" take place.
"When [a subject] goes to see the target site after finishing his description, he tends to notice the matching elements and ignore the nonmatching elements. Equally, when the judge compares transcripts to the target and makes a relative judgment, he can easily make up his mind that a particular transcript is the correct one and fall into the same
trap: he will validate his hypothesis by attending strongly to the matching elements. The fact is that any target can be matched to any description to some degree....With subjective validation on their side, we are not surprised if a naive person, unfamiliar with the power of subjective validation, visits a location with a description fresh in his mind -- any description -- he will easily and effortlessly find that the description will match" (Marks & Kammann, 1980, pp. 24-25).

Marks and Kammann conclude that it is impossible for people to fairly judge evidence when they think they know the truth. That is why formal science insists on setting the criteria for "success" before the data is collected, and using statistical models to determine the likelihood of the outcome. In our personal experience we can find matches that fit our theory as long as we are able to choose among the presented evidence. We may try a new method of enhancing athletic performance and then, after the trial, search for domains in which our performance is improved. Perhaps we didn't run faster or feel stronger, but we did feel a little more alert or vigorous.

When can we ever use unaided personal experience without bias? We can never be sure. Sometimes the biases of attention and memory are random and "average out" to give accurate conclusions. But because we are not aware of these biased processes as they occur, we cannot know when to trust our intuitive judgments. Personal experience is the only source of evidence in many areas of life, but it can never be decisive when pitted against objective quantitative evidence.3

3 This is not to say that objective quantitative evidence should necessarily overthrow strongly held theories. When a theory parsimoniously explains a whole range of empirical phenomena, it is appropriate to be conservative about accepting new objective evidence that challenges it.
The classic example of this is the reluctance of physicists to accept the "non-replications" of the Michelson-Morley experiment carried out by D.C. Miller between 1902 and 1906 (Polanyi, 1962). Notice the key difference between holding a useful explanatory theory (itself partly empirically based) in the face of objective contradictory evidence and trusting intuitive data-gathering over objective measurement.
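The kind of formal analysis that Marks and Kammann relied on can be sketched as a permutation test over a judge's matchups: fix the criterion (number of correct transcript-target pairings) before looking at the data, then ask how often chance alone would do as well. The judge's assignment below is invented data for illustration:

```python
import itertools

def hits(assignment):
    """Number of transcripts matched to their true target."""
    return sum(1 for i, t in enumerate(assignment) if i == t)

n_targets = 5
# Invented judging result: transcript i assigned to target assignment[i];
# this judge got 3 of 5 matchups right.
observed = hits([0, 2, 1, 3, 4])

# Exact null distribution: every possible way of pairing five transcripts
# with five targets, as if the judge were matching blindly.
null = [hits(p) for p in itertools.permutations(range(n_targets))]
p_value = sum(h >= observed for h in null) / len(null)

print(f"observed hits: {observed}, p = {p_value:.3f}")
```

Even a seemingly impressive 3-of-5 match arises by chance alone about 9% of the time, which is exactly the sort of fact that subjective validation hides from an unaided judge.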
2- Intuition versus formal models: memory versus counting

Not only do we "choose" a subset of information to pay attention to, there are qualities of the information itself that bias our attention. Especially in the area of social perception, psychologists have demonstrated how people or things that are "eye-catching" are seen as having more influence than other people or things who act the same way but are less noticeable (Taylor & Fiske, 1978). The effect of salience -- the tendency to notice some things more than others because they are brighter, louder, unique, or noticeable in some other way -- underlies many of the failures of human judgment of probability and improbability. In baseball "you can't hit what you can't see"; in judging likelihood "you can't count what you don't notice".

The greatest advantage of research over intuition is the objective quantitative tabulation of results. All the evidence collected, whether vivid and exciting or dull and tedious, is recorded so that the strength of relationships can be measured. But human experts have to combine impressions from memory. And some things come to mind more easily than others. What happens when we have to retrieve instances to use to make judgments? Do we retrieve unbiased samples of our experiences and knowledge?

Some things in our memory are more noticeable and vivid than others, just as some things in our environment are more salient than others. Often the things that are most vivid in memory are the "good stories". Most often such good stories are about individual cases, and are much more memorable than the bland summary statistics we get from formal analysis (Nisbett & Ross, 1980). Consider this everyday example: You want to buy a new car. The several thousand people who write in to Consumer Reports provide you with the aggregate statistic that a Volvo is extremely reliable and efficient.
However, when you tell your dentist that you are thinking of buying a Volvo, he tells you a handful of hilarious stories about the terrible trouble his brother-in-law had with his Volvo. Suddenly, you are
much less certain about buying a Volvo, though your thousand-case data base has been incremented by only one. Further, when you lie in bed thinking about the pros and cons of buying a Volvo, the vivid stories of the dentist's brother-in-law's car steaming, screeching and moaning come to mind with greater impact than do the blue and white dots from Consumer Reports.

The shortcut strategy that defines the most likely alternative as the one which most easily comes to mind is the "availability heuristic" (Tversky & Kahneman, 1973). In many cases there is a correlation between the things that are most numerous in the world and those things that come to mind most easily. But vividness introduces a systematic bias when things are easily remembered for some reason other than actual frequency. An excellent example of this is the bias introduced by the media's proclivity for the exciting and bizarre event. People think that more deaths are caused by tornados than by asthma, though asthma kills roughly 9 times as many people as tornados (Slovic, Fischhoff & Lichtenstein, 1982). People think accidents kill more people than strokes. They do not. Both accidents and tornados are widely reported and are vivid enough to stay in the memory of the reader. They are more cognitively available and so seem more likely.

In addition to deciding how common a certain event is, people often judge the likelihood that a certain case belongs to a certain category. A statistical model treats these two problems the same way, by counting. Are suicides more numerous than homicides? Counting says yes, but the shortcut of bringing instances to mind suggests (incorrectly) that homicides are more common. Is a bespectacled, mild-mannered student of love poetry more likely to be a truck driver or an Ivy League Classics professor? Counting the (proportionally few) poetry students among the vast legions of truck drivers is apt to come up with a greater number than the total number of Ivy League Classics professors.
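The arithmetic behind the truck driver example is a simple application of Bayes' rule to base rates. The specific counts and percentages below are invented for illustration; the point is only that a huge base rate can swamp even a highly "fitting" description:

```python
# Invented numbers: suppose the meek, poetry-loving description fits
# 90% of classics professors but only 1% of truck drivers.
n_truck_drivers = 1_000_000
n_classics_profs = 1_000

fits_given_driver = 0.01   # assumed diagnosticity of the description
fits_given_prof = 0.90

drivers_who_fit = n_truck_drivers * fits_given_driver   # 10,000 drivers
profs_who_fit = n_classics_profs * fits_given_prof      # 900 professors

# Among everyone who fits the description, what fraction are drivers?
p_driver = drivers_who_fit / (drivers_who_fit + profs_who_fit)
print(f"P(truck driver | description) = {p_driver:.2f}")
```

Even though the description "represents" the professor far better, the fitting truck drivers outnumber the fitting professors by more than ten to one, so the counting answer is still "truck driver".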
Humans don't make judgments of conditional probability by counting the number in each category that fit the conditions. This laborious routine goes beyond a reasonable use of our cognitive capacity. Instead, we can make amazingly quick judgments of how well a case fits a category by computing the similarity of the case to the category. This intuitive method is the "representativeness heuristic" (Kahneman & Tversky, 1982a). Judging likelihood by similarity leads to predictable errors in judgments. First, we tend to ignore the background statistical likelihood, or the "base rate", of the category. In the truck driver example, we compute the similarity between the case description and the prototypes of truck drivers and classics professors -- we do not think of the much greater number of truck drivers. Second, when the individuating evidence is given (the fact that the target person was meek and liked poetry), we do not consider the validity of the evidence but only how well it "represents" the category prototype.

This problem of the "diagnosticity" of individuating information is explicitly taken into account in a statistical prediction model. A piece of evidence that is present in almost all cases of one category, and absent in almost all cases of the other, is highly diagnostic -- it differentiates one category from another.

In one study (Tversky & Kahneman, 1982b), students were asked to use the results of projective personality tests to predict the future career choice of a person. In one condition, they were told that people predicting on the basis of such personality sketches were usually highly accurate. In another condition, the student judges were told that the personality sketches rarely led to accurate predictions. In the first condition, the judges were aware that they had valid information about the particular case, and so could safely assume that the sketches contained diagnostic information about future career choice.
In the second condition, the judges were aware that the information had little validity, and should instead have been influenced by how many people (the base rate) entered each profession.
The results showed that the students' predictions were based entirely on the similarity of the personality description to the stereotype of the profession. The predictions were the same in both conditions: the lack of diagnostic information did not prevent the students from using the invalid information they had.

A similar problem arises with specificity: a description becomes more representative of a category or a causal model as it becomes more detailed, yet each added detail makes it statistically more unlikely. When people imagine scenarios, the more complicated and concrete they imagine the scene, the more probable they judge that the particular outcome will occur. One group of international policy experts was asked to estimate the probability that the U.S.S.R. would invade Poland AND the U.S. would break off diplomatic relations with the Soviet Union within the next year. On average, they gave this combination of events a 4% probability. Another group of policy experts was asked to estimate the probability that the U.S. would break off diplomatic relations with the Soviet Union within the next year. This was judged to have a probability of 1% (Tversky & Kahneman, 1983). When the policy experts had a combination of events that caused them to create a plausible causal scenario, they did the impossible: judged that a combination of events was more likely than one of its components.

In particular, the theories behind innovative practices seem more compelling the more they are fleshed out with specific descriptions and possible causal relationships. We evaluate the quality of the "story" we have made up, but fail to stop and consider that as the explanation depends on more and more details, the less likely it is to be completely true.

Skeptics of the paranormal are on the firmest logical grounds when they consider base rates and diagnosticity. The base rate of fraud and cheating in the history of commercial psychic
phenomena -- from the 19th century British spiritualists to Uri Geller -- is remarkably high.4 At the same time, the ability of observers to detect cheating (the diagnosticity of their personal experience) has been remarkably low. This is a state of affairs that should lead to caution when drawing conclusions on the basis of even the most impressive demonstration. Yet each of us has a powerful illusion that our personal experience is valid and our conclusions about a demonstration are diagnostic of their truth.

Many biases of personal experience work together to make us overly impressed by exciting novel events. The likelihood of experiencing some phenomenon unknown to scientific laws is extremely small -- and neither our perceptions nor our sources of information about the situation are perfectly reliable. If the a priori likelihood of a novel explanation is very low, and the sources of information (assurances of others of honesty of setup, accuracy of machinery and so on) are not absolutely certain, a few experiences should not be enough to markedly change our beliefs in the unlikelihood of the explanation. But the events are vivid, we are willing to make broad inferences on the basis of a few memorable cases, and we attribute the results to the salient cause without considering possible base rate explanations that would reveal simple, unsurprising explanations for the results -- that is, we jump to conclusions.

4 The rate of deliberate fraud found in modern laboratory research on psychic phenomena may well be no greater than that found in any other discipline dependent upon successful findings for grant support. However, commercial exploitations of alleged psychic ability -- the evidence with which the lay population is most familiar -- are highly likely to be fraudulent (Gardner, 1981).
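The argument that a few impressive experiences should not markedly change a very low prior belief can be made concrete with a small Bayesian sketch. All the numbers below are invented for illustration; the qualitative conclusion is what matters:

```python
# Hedged illustration: a tiny prior that the phenomenon is real, combined
# with observers who cannot rule out fraud, error, or coincidence.
prior = 1e-6                 # assumed a priori chance the phenomenon is real
p_demo_if_real = 0.99        # an impressive demo is near-certain if it is real
p_demo_if_not = 0.01         # assumed chance of an equally impressive demo
                             # arising from fraud, error, or coincidence

# Bayes' rule: P(real | demo).
posterior = (prior * p_demo_if_real) / (
    prior * p_demo_if_real + (1 - prior) * p_demo_if_not
)
print(f"posterior = {posterior:.6f}")
```

On these assumptions, even a striking demonstration leaves the probability that the phenomenon is real at roughly one in ten thousand; the demonstration is only as diagnostic as the observer's ability to rule out the mundane alternatives.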
Our cognitive efficiencies lead to these systematic mistakes in combining information and judging likelihood and causality. Similarly, once we have come to some summary statement, we are less likely to be conservative about what our judgment means than formal models would suggest. An important part of judgment is moving beyond the immediate case or example, and generalizing our conclusions to some larger domain.

3- Intuition versus formal models: the generalizability of sample conclusions

Informal inference is prone to systematic bias in data collection-- either from simple cognitive attentional biases or from biases in retrieving information. In order to keep the effects of this biased sampling procedure minimal, there are some basic cautions that must be taken in the treatment of the data. The evidence collected will lead to false conclusions unless the peculiarities of the sample are noted, and we understand what "population", or "true" state of the world, our sample represents. Making inferences from a sample to a population involves deciding whether a small amount of evidence can be generalized to support a principle in the larger world.

The first limit on generalization is the bias of the sample. Has it been randomly chosen from the larger population? If not, we cannot know about what population we have learned. But people are not cautious about generalizing from samples of known bias or unknown makeup. "Consider, for example, the high school disciplinary officer who is convinced that the school is filled with drug users or the choral director who marvels at the musicality of the students in the school....Often one is handicapped by one's location in a social system...which can mislead by producing large and compellingly consistent samples of evidence nonetheless hopelessly tainted by biased selection" (Nisbett & Ross, 1980, p. 89).
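The disciplinary officer's predicament can be put in numbers. The sketch below uses invented figures (a 1,000-student school, a 5% true rate of drug use, and an assumed tenfold over-representation of users among students sent to the office) to show how a sample of known bias misleads:

```python
# A hypothetical school: 1,000 students, 5% of whom use drugs.
users, non_users = 50, 950

# Suppose (hypothetically) a drug user is 10 times as likely as a
# non-user to end up in the disciplinary office. The officer's "sample"
# is then weighted 10 to 1 toward users.
office_users = users * 10
office_non_users = non_users * 1

true_rate = users / (users + non_users)
office_rate = office_users / (office_users + office_non_users)

print(true_rate)              # 0.05
print(round(office_rate, 2))  # 0.34 -- large, consistent, and hopelessly tainted
```

No amount of additional observation from the officer's vantage point corrects the estimate; only noting the selection mechanism does.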
Any proponent of a social or technological innovation naturally comes into contact with that subgroup of people who feel that they have benefited from the innovation-- rarely is such a proponent cautious enough to limit his or her faith until a random sample is examined.5 People are often as likely to make strong inferences about a population of people when they are given no information about how the sample was chosen as when they are told the sample was chosen randomly or to be representative of the larger group (e.g., Hamill, Wilson & Nisbett, 1980). In one particularly unsettling set of experiments, subjects were willing to make strong inferences about the attitudes of the population of prison guards and about the attitudes of the population of welfare recipients, on the basis of one case study, even when they were explicitly told that the case study was atypical.

The second limit in generalizing from a sample of evidence is the reliability of that sample-- which depends on the size of the sample. This is explicitly taken into account in statistical models, which are very cautious about conclusions made on the basis of a few cases. But even experts seem to have an intuition that relies on a "universal sampling distribution", which leads to the same faith in summaries based on 50 or 1,000 cases (Kahneman and Tversky, 1982a). This leads to an underappreciation of the reliability of very large samples, and a drastic overuse of information gained from talking to a few people. Conclusions based on personal experience are quite often made from the few cases that come to mind as particularly memorable.

5 Rarely is it possible to test large-scale meaningful innovations by clean, controlled experimentation utilizing random sampling, of course. Consider the Head Start intervention program for intellectually disadvantaged children. Scientific assessment of this intervention has involved the "messy" real-life program in place.
But note that the problems with this evaluation are public knowledge-- we know enough about the threats to the validity of such studies to qualify our conclusions, and incorporate uncertainty into future policy. But the personal experience of any one person involved in this program would likely lead to strong beliefs about its efficacy without the appropriate caution-- and without publicly observable methods of evaluation.
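The point about sample reliability can be made concrete with the standard error of an estimated proportion, which shrinks with the square root of the sample size. An intuition that grants the same faith to 50 cases as to 1,000 is off by a factor of more than four:

```python
import math

# Standard error of an estimated proportion: se = sqrt(p * (1 - p) / n).
# A "universal sampling distribution" intuition ignores the n in the
# denominator; the margin of error actually shrinks as 1 / sqrt(n).
p = 0.5  # assumed true proportion (the worst case for the standard error)
for n in (10, 50, 1000):
    se = math.sqrt(p * (1 - p) / n)
    print(n, round(se, 3))  # n=10: 0.158, n=50: 0.071, n=1000: 0.016
```

A summary based on a handful of memorable conversations sits at the top of this table; the formal models' caution about small samples is simply this arithmetic.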
At the heart of intuitive misunderstandings about chance and sampling theory is a widespread confusion about randomness. People believe the "laws" of chance are stricter in local regions than they actually are: processes occurring by chance alone can look quite systematic in any particular sample. "Subjects act as if every segment of the random sequence must reflect the true proportion: if the sequence has strayed from the population proportion, a corrective bias in the other direction is expected. This has been called the gambler's fallacy" (Kahneman and Tversky, 1982a, p. 24). Kahneman and Tversky argue that the laws of chance are replaced in subjective estimates of likelihood by the rule of representativeness. The probability of an uncertain event is determined by the extent to which it is similar in essential properties to its parent population. People believe that true random samples cannot have long runs of one kind or regular patterns of alternation. This is caused by the intuition that every part of a run must look "similar" to a prototype of randomness. People will even bet on a less likely outcome when it looks subjectively random (all parts are representative of the population proportion and none is too regular) rather than on the more likely patterned alternative.

If people believe that random processes cannot produce sequences that look systematic, how do they respond to patterns in random data? They search for more meaningful causes than chance alone. Gamblers and professional athletes become superstitious and attribute their good or bad luck to some part of their behavior or clothing. Even lower animals such as pigeons "discover" contingencies that don't exist and show their own form of superstitious behavior (Skinner, 1948). When food is delivered on a random schedule, the pigeons at first try to control the delivery by pecking at the food dispenser. When the food comes, the pigeon can be in the middle of any action, since there is no relation between its action and the food. But it will continue to repeat the action that co-occurred with feeding time-- and eventually its efforts will be "rewarded" by more food. This strengthens the behavior, and it is kept up because it appears to be successful.

A real-life example of the tendency to discover patterns in random data is the "hot hand" phenomenon in professional basketball. The hot hand phenomenon is the compelling perception that some players have "hot streaks" and have runs of successes and failures. Researchers examined the shooting records of a number of outstanding NBA players and determined that the number of "runs" of successful shots did not depart from what could have been predicted by chance models (Gilovich, Vallone, and Tversky, 1986). (This does not deny the true element of skill, but assumes a chance model based on a given probability of success for each person individually.) Fans (and the players themselves) "perceived" that a successful shot was more likely to be followed by another successful shot, while a failure was more likely to be followed by another failure. When university players were offered bets contingent on their ability to predict the results of their next shot, their bets showed strong evidence of their belief in the "hot hand" but their performance offered no evidence of its validity. The lesson for evaluating quantitative evidence by subjective means is clear: these basketball records are comparable to the results of a trial of some new method of performance enhancement. Even if there is no systematic structure in the data, people will see meaningful patterns, without any theories other than the belief that random processes must look random.

The most serious flaw in our understanding of randomness is the overinterpretation of coincidence. In order to decide whether an event or collection of events is "unlikely", we must somehow compute a sample space-- a list of all the other ways that the occasion could have turned out. Then we must decide which outcomes are comparable. If we have a bag with 99 green balls and 1 red ball, the total sample space is made up of 100 balls. The probability is small, just 1 in 100, that we can pull the red ball out of the bag on our first try.
That is, by planning our success-- defining the part of the sample space that we will call success-- we have divided the 100 balls into two sets of comparable events: a successful retrieval of the red ball or the unsuccessful retrieval of any one of the 99 green balls. Real life resembles the unplanned case. We reach into the bag and pull out a red ball, and then are impressed by the unlikelihood of that act. But without planning and defining a success, we have only one set of comparable events. EACH ball is equally unlikely or equally likely. Any PARTICULAR green ball is as unlikely as the red one. That is the simple probabilistic critique of coincidence that is familiar to everyone, and seems trite when compared to the richness of daily experience.

The eminent statistician (and magician) Persi Diaconis (1978) offered a more sophisticated view of coincidence in his critique of ESP research. Before we attribute unusual and startling events to synchronicity-- unseen laws and relationships that govern our behavior-- we would do well to ponder what Diaconis calls the problem of "multiple endpoints". Multiple endpoints are the many ways that a surprising intersection of events can occur. It seems more like a miracle than a chance event when our neighbor in a Paris hotel turns out to be a classmate from first grade. The probability of this intersection of elementary events-- being in Paris, being in this hotel, and meeting this long-lost friend-- is indeed small. But the intersection of a union of elementary events-- being abroad in a large tourist city, meeting some friend in some monument, restaurant or hotel-- is not so unlikely (Falk, 1981). Not only do we focus on the single event because its suspense value makes it salient and memorable, but it has the special property that it happened to us. "One's uniqueness in one's own eyes makes all the components of an event that happened to oneself seem like a singular combination. It is difficult to perceive one's own adventure as just one element in a sample space of people, meetings, places, and times. The hardest would probably be to perceive oneself as one replaceable element among many others" (Falk, 1981, p. 24). The occurrence of some "small-world" coincidences becomes the norm in a very large world.

In most of these instances-- such as dreams of disaster followed by the death of a relative-- there is simply no way of defining the sample space and determining the likelihood of the event happening "simply by chance". Such phenomena cannot be proof either of the existence or nonexistence of paranormal events. Yet it is just these very personal experiences that cause many people to reject objective measurement in favor of their own subjective impressions.

4- How much do we learn from experience?

Given these harsh criticisms of the way that humans collect, analyze, and judge evidence, more optimistic readers will ask: So how did we get to the moon? Science and technology have progressed precisely because of the use of formal methods, strict record-keeping, and repeatable demonstrations. Intuitive judgments based on personal experience do not have these structures for learning from experience. Instead, human judgment seems designed as much for protecting the ego of the decision-maker as for generating accurate predictions and assessments.

Consider the "hindsight bias", also termed the "knew-it-all-along" effect (Fischhoff, 1975). A group of physicians is given a list of symptoms and a list of possible diseases and asked to rate the likelihood that a person with those symptoms would have each of the diseases. Another group of physicians is given the same two lists, except that they are also told which disease the patient really had. The doctors who know the right answer judge that the "correct" disease is very likely given those symptoms, while those who do not know the particular answer are quite uncertain (Arkes, Wortmann, Saville & Harkness, 1981).
This demonstration is easily repeatable with many other professions who need to make judgments and predictions-- once people are aware of the correct answer, they are certain that they would have known it even if uninformed. We build a network of relationships around the correct answer, and when the answer is "taken away" in imagination, the network of relations built on that answer is still there-- making the logic of the correct answer seem obvious to anyone. When the answers seem obvious to us after the fact, we believe that our intuitive abilities are being confirmed.

People in virtually all circumstances and professions (except horse-racing handicappers and weather forecasters, who receive repeated objective feedback) are much more confident in their judgments and predictions than their performance would justify. One of the few ways to attenuate this overconfidence is to explicitly ask the decision-makers to list the ways they might be wrong-- for people will only consider the confirmatory evidence unless prodded (Koriat, Lichtenstein & Fischhoff, 1980).6

Even when we learn that our beliefs were formed on the basis of completely false information, we cannot "adjust" them back to their original point. Demonstrations indicate that if we falsely tell one group of people that they are showing high aptitude for a task, and tell another group that they are showing low aptitude for the task, the two groups will come up with explanations for their high or low showing. But when the subjects are told that the feedback they received was actually random and had no relation to their actual performance, the "high" group still believes it is relatively good at the task and the "low" group still believes it is relatively poor (Ross, Lepper & Hubbard, 1975). The original information has been taken away, but the causal explanations remain.
The explanations were created on the basis of false information, but humans are so good at composing causal scenarios that the explanations are still given with confidence.

A prime source of our confidence in our own judgments is our perceived ability to introspect and examine the evidence on which our decisions are based. This makes us feel that we can determine whether we are biased and emotionally involved or evaluating objectively. Psychological studies indicate that this compelling feeling of privileged access to our decision processes is quite exaggerated. In many cases, we seem to have little more information about how we make decisions than an experienced observer of our behavior. In a series of behavioral studies (Nisbett & Wilson, 1977), psychologists manipulated a number of dimensions in an attempt to shift their subjects' preferences. Some manipulations-- order of presentation, verbal suggestion, "warmth" of a stimulus person-- measurably affected the subjects' judgments; others-- distractions, reassurances about safety-- did not. But the participants were unable to introspectively determine which influences affected them; instead they made up theories about why they preferred the objects they did. These explanations were not accurate reflections of the manipulations but were based on guesses that were similar to those that outside observers made.

A series of more biological studies supported these conclusions. These investigations used split-brain patients who had lost the connection between the right and left hemispheres (Gazzaniga, 1985). Certain shapes or pictures were flashed to the right hemisphere, which could interpret the pictures but could not communicate verbally. When the verbal left hemisphere was asked to explain the behavioral reactions to the pictures, the verbal explanations were utter guesswork. The "creative" explanations were offered and accepted by the patient automatically, without discomfort.

6 Built into this paradigm is the not-so-subtle hint that people are usually overconfident, but this is one of the most important messages that decision analysts can give decision-makers.
As the author explains, "The left-brain's cognitive system needed a theory and instantly supplied one that made sense given the information it had on this particular task" (Gazzaniga, 1985, p. 72). Human judgment allows little room for uncertainty; it is set up to explain the world-- and to prevent the anxiety that comes with uncertainty. Yet acknowledging and understanding uncertainty is the essence of formal decision-making. It signals the need for caution, for hedging one's bets, for searching for more information, and for relying on background base-rates.

5- Protecting against information-processing biases

Research design is the formal structure through which scientists protect themselves from the biasing effects of prior theories, salient evidence, compelling subsets of evidence, and other natural pitfalls that beset researchers, whether lay or professional. The process of testing hypotheses through personal experience leads to certain common violations of the basic tenets of research design.

A simple but non-obvious rule of correct research design is that relationships can only be supported by examining all four possible outcomes of a success/fail trial (see Figure 2). How do we test the ability of a biofeedback device to improve creativity? The usual intuitive assessment of the efficacy of the device is to compare how many times the device led to improvement (hits) and how many times it led to worsened performance (misses). In order to analyze the results statistically, however, we must also examine cells C and D: the number of times that performance was improved or decremented when we did not use the device. The use of this last cell is particularly puzzling to people who are not familiar with the logic of research design. Why do we need to tally the number of trials on which we did not use the device and did not improve? Simply because we need all four cells in order to get proportions of success for those times when we use the device and those times when we don't. If success happens a large proportion of times when the device is not used, then the apparent relationship between the device and the successful outcome can be discounted.
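As a sketch with invented tallies, the four cells and the two proportions they yield can be computed directly; the evidence lies in the comparison of the two rates, not in the count of "hits" alone:

```python
# Hypothetical tallies for a biofeedback device and creative performance.
#                  improved    worsened
# device used         30          20        (cells A and B: hits and misses)
# device not used     28          22        (cells C and D: the neglected baseline)
a, b, c, d = 30, 20, 28, 22

rate_with_device = a / (a + b)        # 0.60
rate_without_device = c / (c + d)     # 0.56

# Thirty "hits" sound impressive on their own, but the comparison of the
# two proportions shows the device adds almost nothing to the base rate.
print(rate_with_device, rate_without_device)
```

Intuitive assessment stops at cell A; the formal analysis cannot even begin without cells C and D.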
The natural tendency to search for "hits" (examining the manipulation-present, success-present cell) leads to the development of "illusory correlations"-- strong beliefs in the existence of certain contingent relationships that do not actually exist. Even clinical psychologists have been shown to maintain invalid diagnostic theories based on cultural stereotypes (Chapman & Chapman, 1982). This occurred precisely because the clinicians were overly impressed by "successful" pairings of a symptom and a diagnostic outcome and did not notice the many trials where the relationship did not hold. Clinicians were struck by the number of trials where paranoid patients drew staring eyes, but did not consider the number of trials where non-paranoid patients drew staring eyes to be relevant.

Illusory correlations caused by the focus on "hits" seem to be particularly difficult to avoid when we are working with trial-by-trial observation (Ward & Jenkins, 1965). In a study which examined the ability of people to intuitively determine the contingency between cloud-seeding and rain, subjects correctly used the strategy of comparing proportions of success (with cloud-seeding versus without cloud-seeding) only if they first saw an overall summary table of all four cells. However, if the subjects had experienced trial-by-trial presentation of evidence before they had seen the summary table, they persisted in using incorrect strategies based primarily on the confirming cases. Another clear lesson for decision-makers: look at the summary of the whole dataset, don't get caught up in the excitement of getting a personal feel for the results-- and if you're testing yourself, keep an objective tally of all four possible outcomes.

Another essential facet of research design that is neglected in the search for evidence through personal experience is the need for experimental control. Valid conclusions can only come when data is also collected on occasions when the manipulation of interest is not used.
Control is used in an experiment to ensure that the only thing that differs between the "present" and "absent" conditions is the manipulation itself. If we are testing the efficacy of a biofeedback device to improve creativity, we cannot compare the performance of a person attached to the biofeedback machine with the performance of another person without biofeedback who is in another room. The presence of the machine may affect the performance of the person, the person attached to the machine knows that he or she is in the special performance condition, and the researcher knows who is in the special performance condition.

In order to find the proportion of successes that will occur even without our chosen manipulation, we must include a control group that differs from the experimental group only in the one critical variable.7 In order to make sure there are no other variables operating (such as placebo or Hawthorne effects, discussed below), we must treat the two conditions completely alike in all other ways-- neither the researcher administering the treatment nor the subject receiving the treatment can know (or guess through cues) the experimental condition. Of course, this never occurs in personal experience-- as subject or researcher, we are always involved.

A good analogue of the problems of control in our daily inferences is the problem of experimentation in medicine. When new types of surgery come along, physicians are reluctant to give the treatment only to a randomly chosen subset. Instead they tend to give the new surgeries to the patients who would seem to benefit the most. The results of such trials often seem very impressive, especially compared to the survival rates of those who do not receive the surgery. However, those receiving the surgery start out differently on health variables than those who do not, know they are receiving treatment, and are cared for by staff who know they are receiving special treatment. Gilbert, McPeek & Mosteller (1978) compared such uncontrolled field trials with randomized experimental trials and

7 Obviously, this is a gross simplification of Mill's canons of causal analysis: vary the factors one at a time, keeping all others constant. As Wimsatt (1986) points out, when this is followed mechanistically, important interdependencies among variables are ignored. Most techniques proposed to improve human performance are actually combinations of a number of distinct interventions. In such cases, conceptual combinations of variables must be identified that are distinct from attention and demand effects in order for "evaluation" to be sensibly applied.
discovered that about 50% of such innovations in surgery were either of no help or actually caused harm.

The necessity for placebo controls has also been best demonstrated in the area of medical research. When patients are given a drug of no medical value, a substantial proportion will improve simply because of their belief in the efficacy of the drug. A new drug or treatment must be compared to a placebo to test whether it is of greater value than simple suggestion. The analogue to the placebo effect in industrial research is the Hawthorne effect, so named because of an investigation into methods of improving the efficiency of the workers at the Hawthorne plant of the Western Electric Company (Roethlisberger & Dickson, 1939). The investigators found that every alteration in working conditions led to improved efficiency-- not because the changes in conditions affected productivity but because the attention of the researchers improved the workers' morale. Similar problems plague research into psychotherapy. Control groups must offer supportive attention without the actual psychotherapy in order to be a valid comparison to the group receiving the actual therapy.

Informal examination of evidence gained through personal experience is subject to flaws both in the gathering and in the analysis of the data. We cannot be "blind" to our theories when collecting the data, and we always know whether each data point collected supports or weakens the evidence for a theory. Without careful consideration of research design, people cannot help but bias the sample of data they collect.

The risks of gathering evidence through personal experience: an example

Part of the responsibility of the Committee on Techniques for the Enhancement of Human Performance has been to make site visits to various research establishments. A number of these visits have been to groups who have not yet subjected their theories to rigorous experimental test, but who have developed considerable enthusiasm and faith in the validity of their phenomena. Though these product developers are for the most part research scientists, before they begin controlled experimental evaluations of their products they use the same processes of intuition as the "lay" intuitive scientist.

An evaluation visit by some scientists on the committee to Cleve Backster's laboratory in San Diego led to critical reports (by Richard Thompson of Stanford University and Ray Hyman of the University of Oregon) that highlighted many of the problems mentioned in the first part of this paper. Most visitors to this laboratory come away impressed with the first-hand evidence they have gathered. But the scientific evaluations of Thompson and Hyman indicate two general problems that prevent any conclusions about the phenomena under study: the post-hoc selection of outcome criteria, and the opportunity for theory-driven data generation by both experimenter and subject.

One of the most provocative paradigms under study in the Backster laboratory is the reaction of human leucocytes that have been removed from the body to the emotional experiences of the donor. The subject's cheek cells are placed in a solution, and the electrical potential of the solution is continually monitored and related to the emotional reactions of the subject. The post-hoc data selection identified by Hyman involves the practice of identifying a substantial electrical reaction recorded from the solution and then asking the subject whether he or she had experienced an emotional reaction at that time. This ignores the base rate of emotional reactions in general, focuses only on the "hits", and leads to an illusory correlation between the "sign" (the in vitro leucocyte response) and the "cause" (the human emotional response). Further, there is no a priori definition of emotional response, so the sample space of successful outcomes is defined after the fact, making probability estimates meaningless.
What makes the subjectivity of the criteria even more serious is the fact that both the experimenter and the subject know the hypothesis being tested, and have an opportunity to search for success on each trial. When the experimenter asks the subject if he or she can remember some emotional reaction around the time of the electrical response of the solution, the subject is free to search until he or she can find some subjective validation of the suggestion.

It is not enough to forewarn the investigators and the subjects about the dangers of personal validation, for people are unable to "adjust" their conclusions after an experience-- personal experience is too powerful, and beliefs remain even after their evidential basis is questioned. There is no training that can "inoculate" researchers against seeing what they want or expect to see. Biases cannot be completely removed-- but they can be minimized by research methodology that can be publicly scrutinized and objectively evaluated.

Problems in Evaluating Evidence II: Motivational factors in judgment

Not only are our perceptions biased towards our expectations, but we also actively distort our perceptions in order to see what we want to see. In general, expectations and desires are highly correlated, and distortions in judgment cannot be strictly attributed to one or the other. A healthy ego seems to be correlated with the use of self-enhancing distortions (Alloy & Abramson, 1977). It is in fact the depressed who show a lack of protective distortions. Depressives seem to be more willing to accept that they do not have control over random events, while normals show an "illusion of control". The illusion of control is the belief that a person has control over chance events that have personal relevance (Langer, 1982). Anecdotal examples from gambling are easy to generate: dice players believe their throwing styles are responsible for high numbers or low numbers, and Las Vegas casinos may blame their dealers for runs of bad luck (Goffman, 1967). People will wager more before they have tossed the dice than after the toss but before the result is disclosed (Strickland, Lewicki & Katz, 1966).

In a series of studies, Langer demonstrated that people believed their probability of winning a game of chance was greatest when irrelevant details were introduced that reminded the participant of games of skill. Allowing the player to choose their own lottery numbers, or introducing a "schnook" as a competitor-- without changing the obviously chance nature of the outcome-- made people more confident in their likelihood of winning. While depressives appear to be less vulnerable to this illusion, normal people will see themselves "in control" whenever they can find some reason.

In a study of mental telepathy, Ayeroff and Abelson (1976) found no evidence for the existence of ESP. They did, however, manipulate the amount of "skill details" in the situation. They found that when subjects were able to choose their own occult symbol to send, and when the sender and receiver were able to discuss their communicative technique, they believed that they were operating at a success rate three times the chance rate. But when they were arbitrarily assigned a symbol and had no particular involvement in the task, they believed they were operating at about the chance rate. A similar experiment used a psychokinesis task to test this hypothesis (Benassi, Sweeney, & Drevno, 1979). Again, no significant ESP results were obtained, but subjects' beliefs in their ability to influence the movement of a die did vary with active involvement in the task and with practice at the task.

Most often we do have the opportunity to surround our activities with the trappings of skill. This motivation to feel in control, combined with the tendency to notice and remember successful confirming outcomes, adds another probable and plausible cause of the tendency of people to see meaning in chance events.
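A quick calculation shows how readily chance alone supplies "confirming" outcomes for a motivated observer to notice. The numbers below are hypothetical (20 trials, 4 symbols, 100 subjects), not those of the studies above:

```python
import math

def p_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): hits expected from blind guessing."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A hypothetical telepathy test: 20 trials with 4 symbols, so chance is
# 0.25 (5 expected hits). How often does guessing alone yield 10+ hits,
# i.e. an apparent success rate of "twice chance"?
p_impressive = p_at_least(20, 10, 0.25)

# With 100 subjects all guessing blindly, someone will very likely look
# "psychic" anyway -- the successes are noticed, the other endpoints are not.
p_someone = 1 - (1 - p_impressive) ** 100

print(round(p_impressive, 3), round(p_someone, 2))
```

Any one subject's "twice chance" run is rare, but across many subjects and many sessions such runs are nearly guaranteed; only an objective tally over all trials and all subjects reveals the chance-level base rate.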
The ability of people to give themselves the benefit of the doubt was more directly examined in a series of studies on the phenomenon of "self-deception" (Quattrone & Tversky, 1984). Subjects are told that certain uncontrollable indices diagnose whether they have a heart that is likely to cause trouble later in life. They are then given an opportunity to take part in the diagnostic task which will indicate which type of heart they have. The task is painful and unpleasant-- but subjects who believe that continuing with the task indicates a healthy heart report little pain and continue with the task for a long period. Those who believe that sensitivity to pain indicates a healthy heart find themselves unable to bear the discomfort for more than a minute. A few of the participants are aware that they are "cheating" on the diagnosis, but only those who are not aware of their own motivation are confident that they have the healthy type of heart.

These people are not deceiving the investigator, they are deceiving themselves. Continuing with the painful task does not cause their heart to be of a certain type; but in order for them to believe the diagnosis, they have to remain unaware that they are controlling the outcome. When we really want to believe something, we have to convince ourselves first. Introspection cannot discriminate between those times when we hold a belief because we want to and those times when we hold a belief because the evidence is incontrovertible. Probably the most powerful force motivating our desire to protect our beliefs-- from others' attacks, from our own questioning, and from the challenge of new evidence-- is commitment.

Commitment as motivation

Modern social psychology came to public consciousness with the development of Leon Festinger's (1957) theory of cognitive dissonance. This theory caught public imagination both because of its provocative real-life applications and because of the way it explained human irrationality in simple and reasonable terms. Cognitive dissonance explains apparently irrational acts not in terms of troubled personality types but in terms of a general human "need" for consistency. People feel unpleasantly aroused when two cognitions are dissonant-- when they contradict one another-- or when behavior is dissonant with a stated belief. To avoid this unpleasant arousal, people will often react to disconfirming evidence by strengthening their beliefs and creating more consonant explanations. This drive to avoid dissonance is especially strong when the belief has led to public commitment.

In a dramatic field study of this phenomenon, Festinger and two colleagues (Festinger, Riecken & Schachter, 1956) joined a messianic movement to examine what would happen to the group when the "end of the world" did not occur as scheduled. A woman in the midwestern U.S. who claimed to be in contact with aliens in flying saucers had gathered a group of supporters who were convinced that a great flood would wash over the earth on December 21, 1955. These supporters made great sacrifices to be ready to be taken away by the flying saucers on that day, and suffered public ridicule. If the flood did not occur and the flying saucers did not arrive, the members of the group would individually and collectively feel great dissonance between their beliefs and the actual events.

Festinger et al. (1956) felt that the members of the group had three alternatives: they could give up their beliefs and restore consonance; they could deny the reality of the evidence that the flood had not come; or they could alter the meaning of the evidence to make it congruent with the rest of their belief system. Public commitment made it unlikely that the members of the group would deny their beliefs. The evidence of the existence of the unflooded world was too obvious to be repressed or denied. Therefore, the psychologically "easiest" resolution was to make the evidence congruent with the prior beliefs.
No flying saucers arrived, no deluge covered the earth; but a few hours after the appointed time, the medium received a message: the earth had been spared due to the efforts of the faithful group. The "disconfirmation" had turned into a "confirmation" (Alcock, 1980, p. 56).
When we are committed to a belief, it is unpleasant to even consider that contradictory evidence may be true. In this sense, it is generally easier to be a skeptic in the face of novel evidence; skeptics may be overly conservative, but they are rarely held up to ridicule. Researchers exposed believers in and skeptics of ESP to either a "successful" or an "unsuccessful" demonstration of ESP (Russell & Jones, 1980). Skeptics recalled both demonstrations accurately, but believers showed distorted memories of the unsuccessful trial. In a follow-up study, the researchers also measured arousal, since dissonance is claimed to operate through unpleasant arousal. In this study, believers who successfully distorted their memories in the "ESP disproven" condition suffered less arousal than believers who remembered the result.

Overcoming our discomfort and actually considering the truth of some threatening idea or evidence does not always lead to a weakening of our commitment. Batson (1975) studied the reaction of teenage girls who were committed Christians to scholarly attacks on the divinity of Christ. He found that only those girls who gave some credence to the evidence became more religious as a result of their exposure to the attacks. Those who thought about the evidence became sufficiently distressed to be motivated to resolve the dissonance by strengthening their own beliefs.

Though most of the research on dissonance theory has involved attitudinal or emotional commitment, financial commitment to an enterprise sets up a similar psychological system. A person with an economic stake in the worth of some process may behave in a way indistinguishable from someone with an emotional stake in a belief. The process of "selling" a product, even if that product is an idea, involves public commitment to a position.
Such a situation makes it unpleasant to consider that one's stated position may be invalid and may increase the strength of one's belief in the efficacy of the product.
Self-perception theory also provides mechanisms by which financial commitment can lead to a stronger belief. Self-perception theory (Bem, 1972) builds on the evidence that people do not always have privileged access to their own motivations or low-level decision processes. Instead, this theory claims that people infer their own motivation by observing their own behavior. During the process of selling a product, a person observes his or her own claims for that product-- and unless the person is content to conclude that he or she is motivated only by the money-- will likely conclude that he or she has very good reason to believe in the quality of the product.

Beyond perceptual and judgmental biases, misunderstandings of chance phenomena and motivated distortions, lies the essential reason why personal experience cannot be decisive: we can never determine the true cause of our behavior or our experience. When an experimenter manipulates variables, he or she is briefly omniscient. He or she can alter one variable (or even a combination of variables) between groups and observe the effect. A single actor relying on intuitive judgment cannot accurately estimate what it would be like to be in the other group. If we think we have evidence that a process is useful and we have a financial stake in the outcome, we may attempt to "discount" for our financial involvement and assess how much of our belief is due only to the evidence that we have gathered. Or if we have an emotional investment in an ideology, we may try to determine how much of our outrage at an opponent is caused by our emotional reaction and how much by the obviously specious set of arguments presented. But we have no way of reliably examining our internal processes, and psychologists have found it surprisingly easy to manipulate preferences and choices without the awareness of the actor (Nisbett & Schachter, 1966).
Problems in Evaluating Evidence III: Mediated Evidence

In modern society much of the information that we evaluate is already processed-- by television, by the print media, by schools, by government, or by research scientists themselves. Our informational and motivational biases operate on this processed evidence just as they operate on the evidence from our own eyes. The primary difference is that we start out with a suspicion that we are not receiving a complete or representative set of information, but information that has been selected by others who may or may not share our goals and values. This suspicion can foster a healthy skepticism, a demand for converging evidence from a number of sources, but it can also give us a reason for ignoring evidence that contradicts our belief. Much of the information that reaches us through a processing medium is technically complex-- and we must evaluate the credibility of the channel through which it passed.

While media presentations may do little to sway the beliefs of the committed, they have a powerful effect on the formation of beliefs in the uncommitted. This is appropriate to the extent that people are able to evaluate second-hand information rationally. But all the attentional biases that are active in our personal experience are doubly pernicious when we evaluate processed evidence, because the media further emphasize the vivid, emotionally gripping aspects of information while ignoring or downplaying cautions and unexciting statistical summaries.

The first step in a reasonable evaluation of evidence that is channeled through some medium is to assess the source credibility. A common-sense finding in research on persuasion and attitude change is that people change their attitudes more in response to a communication from an unbiased expert source (Hovland, Janis, & Kelley, 1953).
But this same research tradition also revealed the "sleeper effect", a phenomenon in which the information received is separated in memory from its source (Hovland, Lumsdaine, & Sheffield, 1949). Thus while the claims of an unreliable source are immediately discounted, the information obtained may become part of the general knowledge of the recipient. In the classic demonstration of this phenomenon, students were given persuasive arguments about the use of nuclear power and were told that the source of the arguments was either Pravda or an American nuclear scientist. The only students who showed immediate attitude change were those who read the statements attributed to the scientist, the credible trusted source. A delayed measurement, however, showed that Pravda had as much effect on attitude change once the source was forgotten.8

A serious problem with the dual role of media as entertainer and informer is the tendency to stress the excitement of the message and ignore the credibility of the source. For example, the San Francisco Chronicle, in its generally respectable book review section, carried a review of "The Serpent and the Rainbow" (Teish, 1986). This book, described in the review as "an account of a Harvard scientist's search for the ingredients of the Haitian zombi formula", is discussed in terms of its importance for both academic anthropology and medicine. A cursory or even careful reading of the article reveals that science has established the existence of Voodoo zombis! But who is the reviewer? The small print on the second page of the article reveals the source is "Luisah Teish, a priest of the Voudou".

Exacerbating the problems associated with the media's "excitement" criterion is the human tendency to retain the beliefs created by discredited evidence. This is especially relevant because of the great prevalence of fraud involved in the selling of paranormal phenomena. After some unexplainable case study receives media publicity, there is a great rush to explain it on the basis of new theories-- most commonly by reference to quantum mechanics and non-deterministic physics.
Then the original evidence is found to be a fraudulent attempt to gain attention or wealth (see Gardner, 1981), but the belief in the new transcendent physics remains.

8 More recent research (Cook and Flay, 1978) has set out specific conditions under which the sleeper effect is likely to be found.

Some recent examples of the "vividness" criterion in media reports are the press coverage given to metal-bending children (e.g. Defty, Washington Post, March 2, 1980) and the tremendous attention given the Columbus, Ohio, poltergeist (Safran, Reader's Digest, December 1984; San Francisco Chronicle, March 7, 1984, from Associated Press). Both stories developed through extremely unreliable personal experience (Randi, 1983; Kurtz, 1984b) and demonstrate the way that personal reports fit the requirements of the media better than caution or rigor. Experimental analysis is rarely as dramatic or newsworthy as personal reports, especially since rigorous analysis emphasizes a cautious conservative approach. Follow-up stories on the "debunking" of these phenomena rarely receive attention comparable to the first excited reports.

The public television program Nova is regarded as one of the best popular treatments of scientific affairs in any communication medium. Yet its program on ESP has been vilified by skeptics of paranormal phenomena (Kurtz, 1984b). It tried to show both sides of the issue-- it included dramatic "recreations" of the most famous ESP experiments and interviews with critics of ESP who proposed alternative explanations of these experiments. The recreated stories were more exciting and vividly memorable than the interviews. The enthusiasm and hopefulness of the believers was more gripping than the skeptics' "accentuation of the negative". What were the producers of Nova to do about the fact that what made a good story also was memorable and persuasive-- even though these elements were irrelevant to what was true? In this case, they went for the good story.
Perceptual biases and mediated information

People with strong preexisting beliefs are rarely affected by any presentation of evidence. Instead, they manage to find some confirmation in all presentations. The "biased assimilation" of evidence relevant to our beliefs is a phenomenon that seems obviously true of others, but sometimes difficult to believe in ourselves. Consider a classic social psychological study of students' perceptions of the annual Princeton-Dartmouth football game (Hastorf and Cantril, 1954). Students from the opposing schools watched a movie of the rough 1951 football game and were asked to carefully record all infractions. The two groups ended up with different scorecards based on the same game. Of course, this is not remarkable at all. We see this in sports enthusiasts and political partisans every day. But what is worth noting is that the students used objective trial-by-trial recording techniques and they still saw different games if they were on different sides.

This is a clue to the reason that people cannot understand why others continue to disagree with them, even after they have been shown the "truth". We construct our perceived world on the basis of expectations and theories, and then we fail to take this constructed nature of the world into account. When we talk about the same "facts" we may not be arguing on the basis of the same construed evidence. This is especially important when we are faced with interpreting mixed evidence. In almost all real-world cases, evidence does not come neatly packaged as "pro" or "con", and we have to interpret how each piece of evidence supports each side.

In a more recent extension of this idea, social psychologists at Stanford University presented proponents and opponents of capital punishment with some studies that purported to show that deterrence worked, and some studies apparently showing that capital punishment had no deterrence effect (Lord, Ross & Lepper, 1979).
They reasoned that common sense must dictate that mixed evidence should lead to a decrease in certainty in the beliefs of both partisan groups. But if partisans accept supportive evidence at face value, critically scrutinize contradictory evidence, and construe ambiguous evidence according to their theories, both sides might actually strengthen their beliefs on the basis of the mixed evidence.

"The answer was clear in our subjects' assessment of the pertinent deterrence studies. Both groups believed that the methodology that had yielded evidence supportive of their view had been clearly superior, both in its relevance and freedom from artifact, to the methodology that had yielded non-supportive evidence. In fact, however, the subjects were evaluating exactly the same designs and procedures, with only the purported results varied....To put the matter more bluntly, the two opposing groups had each construed the "box-score" vis-a-vis empirical evidence as 'one good study supporting my view, and one lousy study supporting the opposite view'-- a state of affairs that seemingly justified the maintenance and even the strengthening of their particular viewpoint" (Ross, 1986, p. 14).

This result leads to a sense of pessimism for those of us who think that "truth" comes from the objective scientific collection of data, and from a solid replicable base of research. Giving the same mixed evidence to two opposing groups may drive the partisans farther apart. How is intellectual and emotional rapprochement possible?

One possible source of optimism comes from related work by Ross and his colleagues (Ross, Lepper & Hubbard, 1975) in which the experimenters gave subjects false information about their ability on some task. After subjects built up a theory to explain this ability, the experimenters discredited the original information, but the subjects retained a weaker form of the theory they had built up. The only form of debriefing that effectively abolished the (inappropriate) theory involved telling the subjects about the perseverance phenomenon itself.
This debriefing about the actual psychological process involved finally allowed the subjects to remove the effect of the false information. Biased assimilation may be weakened in a similar way: when we understand that our most "objective" evaluations of evidence involve such bias, we may be better able to understand that our opponents may also be reasonable people.
Another reaction to processed evidence is the perception of hostile media bias. Why should politicians from both ends of the spectrum believe that the media is particularly hostile to their side? At first glance, this widespread phenomenon seems to contradict assimilative biases-- often, we don't react to stories in the press by selectively choosing supportive evidence; instead we perceive that the news story is deliberately slanted in favor of evidence against our side. Ross and colleagues speculated that the same biasing construal processes are at work. A partisan has a fixed construction of the truth that lines up with his or her beliefs, and when "evenhanded" evaluations are presented, they seem to stress the questionable evidence for the opposition.

Support for these speculations came from studies on the news coverage of both the 1980 and 1984 presidential elections and the 1982 "Beirut Massacre" (Vallone, 1986; Vallone, Ross & Lepper, 1985). These issues were chosen because there were actively involved partisans on both sides available. The opposing parties watched clips of television news coverage. Not only did they disagree about the validity of the facts presented, and about the likely beliefs of the producers of the program, but they acted as if they saw different news clips. "Viewers of the same 30-minute videotapes reported that the other side had enjoyed a greater proportion of favorable facts and references, and a smaller proportion of negative ones, than their own side" (Ross, 1986). However, objective viewers tended to rate the broadcasts as relatively unbiased. These "objective" viewers were defined by the experimenters as those without personal involvement or strong opinions about the issues. But the partisans themselves-- whether they are involved in college football, the capital punishment debate, party politics or the Arab-Israeli conflict-- claim to be evaluating the evidence on its own merits.
And in a sense they are: they evaluate the quality of the evidence as they have constructed it in their mind. It is the illusion of "direct perception" that is the fatal barrier to understanding why others disagree with us. To the extent that we "fill in" ambiguities in the information given, we can find interpretations that make the evidence fit our model. Because scientific practice demands public definition of concepts, measures and phenomena, personal constructions are minimized and meaningful debate can take place. But when we rely on casual observation, personal experience and entertaining narratives as sources of evidence, we have too much room to create our own persuasive construal of the evidence.

Problems in Evaluating Evidence IV: The Effect of Formal Research

Formal research structure and quantitative analysis may not be the only, or best, route to "understanding" problems. Often, an in-depth qualitative familiarity with a subject area is necessary to truly grasp the nature of a problem. But in all public policy programs, a private understanding must be followed by a public demonstration of the efficacy of the program. Only quantitative analysis leads to such a demonstration, and only quantitative evidence will force partisans to take the other side seriously. The effect of the acceptance of this argument can be seen in different ways in two domains: parapsychological research, and medicine. The effect of the rejection of this argument can be seen in the development of the human potential movement.

Modern parapsychology is almost entirely an experimental science, as any cursory look through its influential journals will demonstrate. Articles published in the Journal of Parapsychology or the Journal of the Society for Psychical Research explicitly discuss the statistical assumptions and controlled research design used in their studies. Most active parapsychological researchers believe that the path to scientific acceptance lies through the adoption of rigorous experimental method.
Robert Jahn, formerly dean of engineering and applied sciences at Princeton University and an active experimenter in this field, argues that "further careful study of this formidable field seems justified, but only within the context of very well conceived and technically impeccable experiments of large data-base capability, with disciplined attention to the pertinent aesthetic factors, and with more constructive involvement of the critical community" (Jahn, 1982, quoted in Hyman, 1985, p. 4).

This attitude has not caused the traditional scientific institutions to embrace parapsychology, so what have parapsychologists gained from it? Parapsychologists have now amassed a large literature of experiments, and this compendium of studies and results can now be assessed using the language of science. Discussions of the status of parapsychological theories can be argued on the evidence: quantified, explicit evidence. As it stands, the evidence for psychic phenomena is not convincing to most traditional scientists (Hyman, 1981). But critical discussions of the evidence can take place on the basis of specifiable problems, and not only on the basis of beliefs and attitudes (e.g. the exchange between Hyman and Honorton on the quality of the design and analysis of the psi ganzfeld experiments, starting with Hyman, 1977; and Honorton, 1979).

In direct contrast to this progression is the attitude of the human potential movement towards evaluation and measurement. Kurt Back (1972) titled his personal history of the human potential movement "Beyond Words", but it could have been just as accurately called "Beyond Measurement". He begins his book and his history with an examination of the roots of the movement in the post-war enthusiasm for applied psychology. Academic psychologists and sociologists were anxious to measure the increase in efficiency that would result from group educational activities. They examined group productivity, the solidarity and cohesion of the groups themselves, as well as the well-being of the group members. Few measurable changes were found, and this led the research-oriented scientists to either lose interest in these group phenomena or to lose interest in quantitative measurement.
Many of those involved in the group experiments-- even some of the scientists who began with clearly experimental outlooks-- were caught up in the phenomenology, the experience of the group processes. Back describes many influential workers in this movement who started out with keen beliefs that controlled experiments on group processes would reveal significant observable effects. When these were not forthcoming, the believers made two claims: the effects of group processes were too subtle, diffuse and holistic to be measured by reductionist science, and the only evidence that really mattered was subjective experience-- the individual case was the only level of interest, and this level could never be captured by external "objective" measurements.

"Believing the language of the movement, one might look for research, proof, and the acceptability of disproof. In fact, the followers of the movement are quite immune to rational argument or persuasion. The experience they are seeking exists, and the believers are happy in their closed system which shows them that they alone have the insights and emotional beliefs....Seen in this light, the history of sensitivity training is a struggle to get beyond science" (Back, p. 204).

The dangers in trying to get beyond science in an important policy area are best described by an example from surgical medicine. This example is often used in introductory statistics classes because it demonstrates that good research really matters in the world. It shows how opinions based on personal experience or even uncontrolled research can cause the adoption or continuation of dangerous policies. One treatment for severe bleeding caused by cirrhosis of the liver is to send the blood through a portacaval shunt. This operation is time-consuming and risky. Many studies (at least 50), of varying sophistication, have been undertaken to determine if the benefits outweigh the risks. (These studies are reviewed in Grace, Muench, and Chalmers, 1966; the statistical meaning is discussed in Freedman, Pisani & Purves, 1978.)
The message of the studies is clear: the poorer studies exaggerate the benefits of the surgery. Seventy-five percent of the studies without control groups (24 out of 32) were very enthusiastic about the benefits of the shunt. In the studies which had control groups which were not randomly assigned, 67% (10 out of 15) were very enthusiastic about the benefits. But none of the studies with random assignment to control and experimental groups had results that led to a high degree of enthusiasm. Three of these studies showed the shunt to have no value whatsoever.

In the experiments without controls, the physicians were accidentally biasing the outcome by including only the most healthy patients in the study. In the experiments with nonrandomized controls, the physicians were accidentally biasing the outcome by assigning the poorest patients to the control group that did not receive the shunt. Only when the confound of patient health was removed by randomization was it clear that the risky operation was of little or no value.

Good research does matter. Even physicians, highly selected for intelligence and highly trained in intuitive assessment, were misled by their daily experience. Because the formal studies were publicly available, and because the quality of the studies could be evaluated on the basis of their experimental methods, the overall conclusions were decisive. Until the human potential movement agrees on the importance of quantitative evaluation, it will remain split into factions based on ideologies maintained by personal experience.

Formal research methods are not the only or necessarily the best way to learn about the true state of nature. But good research is the only way to ensure that real phenomena will drive out illusions. The story of the "discovery" of N-rays in France in 1903 reveals how even physics, the hardest of the hard sciences, could be led astray by subjective evaluation (Broad & Wade, 1982). This "new" form of X-rays made sparks brighten when viewed by the naked eye. The best physical scientists in France accepted this breakthrough because they wanted to believe in it.
It took considerable logical and experimental effort to convince the scientific establishment that the actual phenomenon was self-deception. Good research can disconfirm theories; subjective judgment rarely does.
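The portacaval-shunt tallies reported above reduce to simple arithmetic. The sketch below, a minimal illustration in Python, uses only the two counts quoted from the Grace, Muench, and Chalmers (1966) review; the total number of randomized trials is not stated in the text, so none is assumed here.

```python
# Counts quoted above: (very enthusiastic studies, total studies) by design.
enthusiastic_counts = {
    "no control group":       (24, 32),
    "nonrandomized controls": (10, 15),
}

for design, (enthusiastic, total) in enthusiastic_counts.items():
    share = 100 * enthusiastic / total
    print(f"{design}: {enthusiastic}/{total} = {share:.0f}% very enthusiastic")
# Randomized trials: none were very enthusiastic, whatever their number.
```

The point of the arithmetic is the gradient: the weaker the design, the larger the apparent benefit.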
In his critique of the use of poor research practices, Pitfalls of Human Research, Barber (1976) points out that many flaws of naive inference can creep into scientific research. "The validity and generalizability of experiments can be significantly improved by making more explicit the pitfalls that are integral to their planning...and by keeping the pitfalls in full view of researchers who conduct experimental studies" (pp. 90-91). While scientists and scientific methods are not immune to the flaws of subjective judgment, good research is designed to minimize the impact of these problems.

The proper use of science in public policy involves replacing a "person-oriented" approach with a "method-oriented" approach (Hammond, 1978). When critics or supporters focus on the person who is setting policy criteria, the debate involves the bias and motivations of the people involved. But attempts to precisely define the variables of interest and to gather data that relate to these variables focus the adversarial debate on the quality of the methods used. This "is scientifically defensible not because it is flawless (it isn't), but because it is readily subject to scientific criticism" (Hammond, 1978, p. 135).

Intuitive judgment and the evaluation of evidence: A summary

Personal experience seems a compelling source of evidence because it involves the most basic processing of information: perception, attention, and memory storage and retrieval. Yet while we have great confidence in the accuracy of our subjective impressions, we do not have conscious access to the actual processes involved. Psychological experimentation has revealed that we have too much confidence in our own accuracy and objectivity. Humans are designed for quick thinking rather than accurate thinking. Quick, confident assessment of evidence is adaptive when hesitation, uncertainty and self-doubt have high costs. But natural shortcut methods are subject to systematic errors, and our introspective feelings of accuracy are misleading.
These errors of intuitive judgment lead people to search out confirming evidence, to interpret mixed evidence in ways that confirm their expectations, and to see meaning in chance phenomena. This same biased processing of information makes it very difficult to change our beliefs and to understand the point of view of those with opposing beliefs. These errors and biases are now well-documented by psychologists and decision theorists, and the improvement of human judgment is of central concern in current research. The long-term response to this knowledge requires broad educational programs in basic statistical inference and formal decision-making, such as those proposed and examined by various authors in Kahneman et al. (1982). Already, business schools include "de-biasing" procedures in their programs of formal decision-making. But with the complex technological nature of our society, most researchers believe that some instruction in how to be a better consumer of information should start in public schools. The immediate response should be a renewed commitment to formal approaches in deciding important policy, and a new realization that personal experience cannot be decisive in forming such policy.

As Gilbert, Light and Mosteller (1978) point out in their review of the efficacy of social innovations, only true experimental trials can yield knowledge that is reliable and cumulative. While formal research is slow and expensive, and scientific knowledge increases by tiny increments, the final result is impressively useful. Perhaps most important, explicit public evidence is our best hope for moving toward a consensus on appropriate public policy.
Footnote 2. (Indicator 2 should be added to page 51, line 11, r=.1~2)

After preparation of this paper we learned of a possible problem in the randomization procedures employed by the investigator contributing the largest number (9) of ganzfeld studies to the set of 28 summarized in this section. Accordingly we constructed Table 4a to investigate the effect on the mean and median effect sizes of omitting all the studies conducted by this investigator. The top half of Table 4a shows this effect when we consider the 28 studies as the units of analysis. Omitting the 9 questioned studies lowers the mean effect size from .28 to .26 and does not change the median effect size, which remains at .32. The lower half of Table 4a shows this effect when we consider the 10 investigators as the units of analysis. Omitting the investigator in question lowers the mean effect size from .23 to .22 but raises the median effect size from .32 to .34. It seems clear that the questioned randomization of the 9 studies of this investigator cannot have contributed substantially to an inflation of the overall effect size.
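The sensitivity check described in this footnote can be sketched in a few lines of Python. The individual effect sizes below are purely illustrative stand-ins (the footnote reports only summary values), chosen to show the pattern it describes: the mean drops slightly while the median stays put when the flagged studies are omitted.

```python
from statistics import mean, median

def sensitivity(effect_sizes, questioned):
    """Summarize effect sizes with and without the questioned studies."""
    kept = [es for es, flag in zip(effect_sizes, questioned) if not flag]
    summarize = lambda xs: (round(mean(xs), 2), round(median(xs), 2))
    return {"all studies": summarize(effect_sizes),
            "questioned omitted": summarize(kept)}

# Illustrative values only; the last study is flagged as questioned.
demo = sensitivity([0.10, 0.32, 0.32, 0.34, 0.50],
                   [False, False, False, False, True])
print(demo)  # (mean, median) pairs: mean drops, median unchanged
```

The same comparison would be run twice for Table 4a, once with studies and once with investigators as the units of analysis.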