PART I. Issues of Theory and Methodology

Human Performance Research: An Overview

Monica J. Harris and Robert Rosenthal
Harvard University

Table of Contents

Interpersonal Expectancy Effects
    Definition
    Evidence for Expectancy Effects
    Methodological Implications of Expectancy Effects
    Mediation of Interpersonal Expectancy Effects
        Basic Issues
        The Four-Factor "Theory"
        Meta-Analysis of Expectancy Mediation
Human Performance Technologies and Expectancy Effects
    Research on Accelerated Learning
    Neurolinguistic Programming
    Imagery and Mental Practice
    Biofeedback
    Parapsychology
Situational Taxonomy of Human Performance Technologies
Suggestions for Future Research
    Expectancy Control Designs
    Controlling for Expectancy Effects
    Expectancies and the Enhancement of Human Performance
Conclusion
References

Interpersonal Expectancy Effects and Human Performance Research

Monica J. Harris and Robert Rosenthal

Humans have long tried to surmount their traditional limitations and to increase their performance. Long ago such efforts were aided by the social institutions of religion, proto-medicine, and magic. More recently, such efforts have been aided by the social institutions of science and its associated technologies. Systematic programs have been developed with such aims as improving communication, accelerating learning, and increasing conscious control over physiological processes. Because the promise of enhancing human performance is so appealing, considerable resources, in terms of both time and money, are being invested in these programs. The time has come to step back and evaluate human performance technologies so that resources may be directed more appropriately.

The purpose of this paper is to aid in such an evaluation. We will focus specifically on the possible influence of interpersonal expectancy effects on several human performance technologies. The paper advances in three steps: First, we describe the methodological, theoretical, and empirical issues relevant to the study of expectancy effects, including how expectancy effects are mediated. Second, we describe each of several types of human performance research and speculate on the extent to which expectancy effects may be responsible for the experimental results. Finally, we discuss more generally how the literature on expectancy effects can be applied to the development and evaluation of human performance technologies.

Interpersonal Expectancy Effects

Definition

An interpersonal expectancy effect occurs when a person (A), acting in accordance with a set of expectations, treats another person (B) in such a

manner as to elicit behavior that tends to confirm the original expectations (Rosenthal, 1966, 1976). For example, a teacher who believes that certain pupils are especially bright may act more warmly toward them, teach them more material, and spend more time with them. Over time, such a process could result in greater gains in achievement for those students than would have occurred otherwise.

The concept of an expectancy effect was first introduced by Merton (1948) in his discussion of the self-fulfilling prophecy, which he defined as "a false definition of the situation evoking a new behavior which makes the originally false conception come true" (p. 195). The first systematic application of the concept of expectancy effects in the field of psychology came in the 1960s with a program of research on experimenter expectancy effects (e.g., Rosenthal, 1963). This research demonstrated that the experimenter's hypothesis may act as an unintended determinant of experimental results. In other words, experimenters may obtain the results they predicted not because the relationship exists as predicted in the real world but because the experimenters expected the subjects to behave as they did.

Evidence for Interpersonal Expectancy Effects

Although originally fraught with controversy, the existence of interpersonal expectancy effects is no longer in serious doubt. In 1978, Rosenthal and Rubin reported the results of a meta-analysis of 345 studies of expectancy effects. A meta-analysis is the quantitative combination of the results of a group of studies on a given topic. This meta-analysis showed that the probability that there is no relationship between experimenters' expectations and their subjects' subsequent behavior is less than .0000001. The practical importance of expectancy effects was also substantial; the mean

effect size of expectancy effects across the 345 studies was equivalent to a correlation coefficient of .33.

This meta-analysis also investigated the importance of expectancy effects within a wide variety of research domains. There were eight categories of expectancy studies: reaction time experiments, inkblot tests, animal learning, laboratory interviews, psychophysical judgments, learning and ability, person perception, and everyday situations or field studies. Although effect sizes varied across categories, the importance of expectancy effects within each category was firmly established. These results suggest that expectancy effects may occur in many different areas of behavioral research and emphasize the importance of taking into account the possibility of expectancy effects when designing and conducting studies.

Although initially focused on the psychological experiment as the domain of interest, research on expectancy effects turned quickly to other domains where expectancy effects might be operating, such as teacher-student, employer-employee, and therapist-client interactions. Over the years, research interest has also turned from merely documenting the existence of expectancy effects to delineating the processes underlying them.

Methodological Implications of Expectancy Effects

Experimenter expectancy effects are a source of rival hypotheses in accounting for experimental results. In other words, a given result could be caused not by the independent variable under investigation but rather by the experimenter's expectation that such a result would be obtained. As rival hypotheses, expectancy effects can be considered a threat to the internal validity of a study; they are a source of systematic bias rather than random error. Consequently, expectancy effects present a serious danger to the

interpretation of results: Increasing random noise merely makes it more difficult to obtain significant results, but increasing systematic bias can result in completely erroneous conclusions.

Experimenter expectancy effects are a potential source of problems for any research area, but they may be especially influential in more recent research areas lacking well-established findings. This is because the first studies on a given treatment or technique are typically carried out by creators or proponents of the technique, who tend to hold very positive expectations for the efficacy of the technique. It is not until later that the technique may be investigated by more impartial or skeptical researchers, who may be less prone to expectancy effects operating to favor the technique. Many of the human performance technologies of interest in the present paper are relatively recent innovations, and thus may be especially susceptible to expectancy effects.

In principle, expectancy effects could be investigated by introducing expectations as a manipulation in addition to the independent variable of theoretical interest. This method, which will be described in detail later, allows the direct comparison of the magnitudes of the effects due to the phenomenon and the effects due to expectancies. Another approach, perhaps even richer theoretically, is to examine directly the processes underlying expectancy effects as they occur in various areas. In some areas, such as the area of teacher expectancy effects, a considerable amount of research has been conducted in this manner, and there is now a good general understanding of what variables are important in mediating teacher expectancies. However, in other areas, such as the human performance technologies of interest here, this background research is lacking. The best that can be done in such cases is: (a) to

analyze the situations of interest, (b) to determine whether mediating mechanisms shown to be important in traditional research areas are likely to be present in the new areas, and (c) to estimate the extent to which expectancy effects could be influential in the new area. The present paper undertakes such an analysis.

Mediation of Interpersonal Expectancy Effects

Basic Issues

A primary question of interest with respect to expectancy effects is the question of mediation: How are one person's expectations communicated to another person so as to create a self-fulfilling prophecy? This question in turn can be broken down into two components. The first component is the differential behaviors that are displayed by the expecter as a result of holding differential expectancies (the expecter-behavior link). For example, in what ways do teachers treat their high expectancy students differently? The second component is the differential behaviors that are associated with actual change in expectee behavior and self-concept (the behavior-outcome link). For example, what teacher behaviors result in better academic performance by the students? Both these aspects are critical in understanding expectancy mediation, for even if we could show an enormous effect of expectancy on expecter behavior (e.g., teachers smile more at high expectancy students), that behavior would not be important in expectancy mediation unless it actually had an impact on the expectee to create better outcomes (e.g., being smiled at leads to better grades).

The Four-Factor "Theory"

Rosenthal (1973a, 1973b) proposed a four-factor "theory" of the mediation of teacher expectancy effects. In this view, four broad groupings of teacher

behaviors are hypothesized to be involved in teacher expectancy effects. The first factor is climate, referring to the warmer socioemotional climate that teachers may create for their high expectancy students. This factor includes warmth communicated in both verbal and nonverbal channels. The second factor, feedback, refers to teachers' tendency to give more differentiated feedback to high expectancy students. The third factor, input, refers to the tendency to teach more material and more difficult material to high expectancy students. The fourth factor is output, or the tendency for teachers to spend more time with high expectancy students and provide them with greater opportunities for responding. Although the four-factor theory was originally proposed to account for the mediation of teacher expectancy effects, it seems reasonable to think that these factors may also operate in other domains where expectancy effects may be operating.

Meta-Analysis of Expectancy Mediation

The question of how expectancy effects are mediated is ultimately an empirical one. Fortunately, many studies address the mediation of expectancy effects, and we have conducted a meta-analysis of this literature (Harris & Rosenthal, 1985). Essentially, we read all the studies we could find that examined expectancy mediation (resulting in an initial pool of 180 studies) and classified them according to the mediating variables that were investigated. This resulted in 31 mediating behaviors, each of which was examined in at least four studies. We then computed an overall significance level and effect size for each of the 31 categories, separately for the expectancy-behavior effects and the behavior-outcome effects.

The results of this meta-analysis pointed to the practical importance of 16 behaviors in mediation: negative climate, physical distance, input,

positive climate, off-task behavior, duration of interactions, frequency of interactions, asking questions, encouragement, eye contact, smiles, praise, accepting students' ideas, corrective feedback, nods, and wait-time for responses. Table 1 summarizes the results of the meta-analysis for these 16 behaviors, presenting the effect sizes for the expectancy-behavior links and the behavior-outcome links separately.

An intuitive way of understanding these effect sizes is given by the Binomial Effect Size Display (BESD; Rosenthal & Rubin, 1982). The BESD expresses correlations in terms of the percent increase in "success" rates due to a given "treatment," with the treatment group success rate computed as .50+(r/2) and the control group success rate computed as .50-(r/2). So, for example, the correlation of .21 for Positive Climate can be interpreted using the BESD as meaning that the percentage of teachers exhibiting above average amounts of Positive Climate will increase from 39.5% [.50-(.21/2)] for low expectancy students to 60.5% [.50+(.21/2)] for high expectancy students. The other effect sizes can be interpreted similarly.

Note that in Table 1 the effect sizes for behavior-outcome relations tend to be larger than the effect sizes for expectancy-behavior relations. One possible reason for this is that expectancies are manifested in myriad ways, meaning that the relationship between expectations and any particular behavior is not likely to be very strong. However, we can more accurately predict a person's response to a particular behavior once we know that the behavior has occurred. In other words, if we can condition on the behaviors emitted, we are in a better position to make more accurate predictions.

We also presented a summary analysis evaluating the four-factor theory. The ten behavior categories with the most studies (and therefore providing the most stable estimates) were reclassified into the four factors of climate, feedback, input, and output. We then computed an overall significance level
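The BESD arithmetic described above is simple enough to verify directly. The following minimal Python sketch (an illustration added here, not part of the original chapter) converts a correlation coefficient into the two equivalent "success" rates and reproduces the Positive Climate example (r = .21) as well as the overall expectancy effect of r = .33 reported by Rosenthal and Rubin (1978).

    def besd_rates(r):
        """Binomial Effect Size Display (BESD): convert a correlation r into
        the equivalent 'success' rates for the treatment and control groups."""
        treatment = 0.50 + r / 2
        control = 0.50 - r / 2
        return treatment, control

    # The Positive Climate example from the text: r = .21
    hi, lo = besd_rates(0.21)
    print(f"high-expectancy students: {hi:.1%}")   # 60.5%
    print(f"low-expectancy students:  {lo:.1%}")   # 39.5%

    # The overall mean expectancy effect of r = .33 (Rosenthal & Rubin, 1978)
    hi, lo = besd_rates(0.33)
    print(f"treatment: {hi:.1%}, control: {lo:.1%}")  # 66.5% vs. 33.5%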

belief in the new transcendent physics remains. Some recent examples of the "vividness" criterion in media reports are the press coverage given to metal-bending children (e.g., Defty, Washington Post, March 2, 1980) and the tremendous attention given the Columbus, Ohio, poltergeist (Safran, Reader's Digest, December 1984; San Francisco Chronicle, March 7, 1984, from Associated Press). Both stories developed through extremely unreliable personal experience (Randi, 1983; Kurtz, 1984b) and demonstrate the way that personal reports fit the requirements of the media better than caution or rigor. Experimental analysis is rarely as dramatic or newsworthy as personal reports, especially since rigorous analysis emphasizes a cautious, conservative approach. Follow-up stories on the "debunking" of these phenomena rarely receive attention comparable to the first excited reports.

The public television program Nova is regarded as one of the best popular treatments of scientific affairs in any communication medium. Yet its program on ESP has been vilified by skeptics of paranormal phenomena (Kurtz, 1984b). It tried to show both sides of the issue: it included dramatic "recreations" of the most famous ESP experiments and interviews with critics of ESP who proposed alternative explanations of these experiments. The recreated stories were more exciting and vividly memorable than the interviews. The enthusiasm and hopefulness of the believers was more gripping than the skeptics' "accentuation of the negative." What were the producers of Nova to do about the fact that what made a good story also was memorable and persuasive, even though these elements were irrelevant to what was true? In this case, they went for the good story.

Perceptual biases and mediated information

People with strong pre-existing beliefs are rarely affected by any presentation of evidence. Instead, they manage to find some confirmation in all presentations. The "biased assimilation" of evidence relevant to our beliefs is a phenomenon that seems obviously true of others, but sometimes difficult to believe of ourselves.

Consider a classic social psychological study of students' perceptions of the annual Princeton-Dartmouth football game (Hastorf and Cantril, 1954). Students from the opposing schools watched a movie of the rough 1951 football game and were asked to carefully record all infractions. The two groups ended up with different scorecards based on the same game. Of course, this is not remarkable at all. We see this in sports enthusiasts and political partisans every day. But what is worth noting is that the students used objective, trial-by-trial recording techniques, and they still saw different games if they were on different sides. This is a clue to the reason that people cannot understand why others continue to disagree with them, even after they have been shown the "truth." We construct our perceived world on the basis of expectations and theories, and then we fail to take this constructed nature of the world into account. When we talk about the same "facts" we may not be arguing on the basis of the same construed evidence. This is especially important when we are faced with interpreting mixed evidence. In almost all real-world cases, evidence does not come neatly packaged as "pro" or "con," and we have to interpret how each piece of evidence supports each side.

In a more recent extension of this idea, social psychologists at Stanford University presented proponents and opponents of capital punishment with some studies that purported to show that deterrence worked, and some studies apparently showing that capital punishment had no deterrence effect (Lord, Ross & Lepper, 1979). They reasoned that common sense must dictate that mixed evidence should lead to a decrease in certainty in the beliefs of both partisan groups. But if partisans accept

supportive evidence at face value, critically scrutinize contradictory evidence, and construe ambiguous evidence according to their theories, both sides might actually strengthen their beliefs on the basis of the mixed evidence.

"The answer was clear in our subjects' assessment of the pertinent deterrence studies. Both groups believed that the methodology that had yielded evidence supportive of their view had been clearly superior, both in its relevance and freedom from artifact, to the methodology that had yielded non-supportive evidence. In fact, however, the subjects were evaluating exactly the same designs and procedures, with only the purported results varied....To put the matter more bluntly, the two opposing groups had each construed the "box-score" vis a vis empirical evidence as 'one good study supporting my view, and one lousy study supporting the opposite view'-- a state of affairs that seemingly justified the maintenance and even the strengthening of their particular viewpoint" (Ross, 1986, p. 14).

This result leads to a sense of pessimism for those of us who think that "truth" comes from the objective scientific collection of data, and from a solid replicable base of research. Giving the same mixed evidence to two opposing groups may drive the partisans farther apart. How is intellectual and emotional rapprochement possible? One possible source of optimism comes from related work by Ross and his colleagues (Ross, Lepper & Hubbard, 1975) in which the experimenters gave subjects false information about their ability on some task. After subjects built up a theory to explain this ability, the experimenters discredited the original information, but the subjects retained a weaker form of the theory they had built up. The only form of debriefing that effectively abolished the (inappropriate) theory involved telling the subjects about the perseverance phenomenon itself. This debriefing about the actual psychological process involved finally allowed the subjects to remove the effect of the false information. Biased assimilation may be weakened in a similar way: when we understand that our most "objective" evaluations of evidence involve such bias, we may be more able to understand that our opponents truly are reasonable people.

Another reaction to processed evidence is the perception of hostile media bias. Why should politicians from both ends of the spectrum believe that the media is particularly hostile to their side? At first glance, this widespread phenomenon seems to contradict assimilative biases: often, we don't react to stories in the press by selectively choosing supportive evidence; instead we perceive that the news story is deliberately slanted in favor of evidence against our side. Ross and colleagues speculated that the same biasing construal processes are at work. A partisan has a rigid construction of the truth that lines up with his or her beliefs, and when "evenhanded" evaluations are presented, they seem to stress the questionable evidence for the opposition.

Support for these speculations came from studies on the news coverage of both the 1980 and 1984 presidential elections and the 1982 "Beirut Massacre" (Vallone, 1986; Vallone, Ross & Lepper, 1985). These issues were chosen because there were actively involved partisans on both sides available. The opposing parties watched clips of television news coverage. Not only did they disagree about the validity of the facts presented, and about the likely beliefs of the producers of the program, but they acted as if they saw different news clips. "Viewers of the same 30-minute videotapes reported that the other side had enjoyed a greater proportion of favorable facts and references, and a smaller proportion of negative ones, than their own side" (Ross, 1986, p. 18). However, objective viewers tended to rate the broadcasts as relatively unbiased. These "objective" viewers were defined by the experimenters as those without personal involvement or strong opinions about the issues. But the partisans themselves-- whether they are involved in college football, the capital punishment debate, party politics, or the Arab-Israeli conflict-- claim to be evaluating the evidence on its own merits. And in a sense they are: they evaluate the quality of the evidence as they have constructed it in their minds. It is the illusion of "direct perception" that is the fatal barrier to understanding why others disagree with us. To the extent that we "fill in" ambiguities

in the information given, we can find interpretations that make the evidence fit our model. Because scientific practice demands public definition of concepts, measures, and phenomena, personal constructions are minimized and meaningful debate can take place. But when we rely on casual observation, personal experience, and entertaining narratives as sources of evidence, we have too much room to create our own persuasive construal of the evidence.

Problems in Evaluating Evidence IV: The Effect of Formal Research

Formal research structure and quantitative analysis may not be the only, or best, route to "understanding" problems. Often, an in-depth qualitative familiarity with a subject area is necessary to truly grasp the nature of a problem. But in all public policy programs, a private understanding must be followed by a public demonstration of the efficacy of the program. Only quantitative analysis leads to such a demonstration, and only quantitative evidence will force partisans to take the other side seriously. The effect of the acceptance of this argument can be seen in different ways in two domains: parapsychological research and medicine. The effect of the rejection of this argument can be seen in the development of the human potential movement.

Modern parapsychology is almost entirely an experimental science, as any cursory look through its influential journals will demonstrate. Articles published in the Journal of Parapsychology or the Journal of the Society for Psychical Research explicitly discuss the statistical assumptions and controlled research designs used in their studies. Most active parapsychological researchers believe that the path to scientific acceptance lies through the adoption of rigorous experimental method. Robert Jahn, formerly dean of engineering and applied sciences at Princeton University and an active experimenter in this field, argues that "further careful study of this formidable field seems justified, but only within the context of very well conceived and technically impeccable experiments of

large data-base capability, with disciplined attention to the pertinent aesthetic factors, and with more constructive involvement of the critical community" (Jahn, 1982, quoted in Hyman, 1985, p. 4).

This attitude has not caused the traditional scientific institutions to embrace parapsychology, so what have parapsychologists gained from it? Parapsychologists have now amassed a large literature of experiments, and this compendium of studies and results can now be assessed using the language of science. Discussions of the status of parapsychological theories can be argued on the evidence: quantified, explicit evidence. As it stands, the evidence for psychic phenomena is not convincing to most traditional scientists (Hyman, 1981). But critical discussions of the evidence can take place on the basis of specifiable problems, and not only on the basis of beliefs and attitudes (e.g., the exchange between Hyman and Honorton on the quality of the design and analysis of the psi ganzfeld experiments, starting with Hyman, 1977; and Honorton, 1979).

In direct contrast to this progression is the attitude of the human potential movement towards evaluation and measurement. Kurt Back (1972) titled his personal history of the human potential movement "Beyond Words," but it could have been just as accurately called "Beyond Measurement." He begins his book and his history with an examination of the roots of the movement in the post-war enthusiasm for applied psychology. Academic psychologists and sociologists were anxious to measure the increase in efficiency that would result from group educational activities. They examined group productivity, the solidarity and cohesion of the groups themselves, as well as the well-being of the group members. Few measurable changes were found, and this led the research-oriented scientists either to lose interest in these group phenomena or to lose interest in quantitative measurement. Many of those involved in the group experiments-- even some of the scientists who began with clearly experimental

outlooks-- were caught up in the phenomenology, the experience of the group processes.

Back describes many influential workers in this movement who started out with keen beliefs that controlled experiments with group processes would reveal significant observable effects. When these were not forthcoming, the believers made two claims: the effects of group processes were too subtle, diffuse, and holistic to be measured by reductionist science, and the only evidence that really mattered was subjective experience-- the individual case was the only level of interest, and this level could never be captured by external "objective" measurements. "Believing the language of the movement, one might look for research, proof, and the acceptability of disproof. In fact, the followers of the movement are quite immune to rational argument or persuasion. The experience they are seeking exists, and the believers are happy in their closed system which shows them that they alone have true insights and emotional beliefs....Seen in this light, the history of sensitivity training is a struggle to get beyond science" (Back, p. 204).

The dangers in trying to get beyond science in an important policy area are best described by an example from surgical medicine. This example is often used in introductory statistics classes because it demonstrates that good research really matters in the world. It shows how acting on personal experience or even uncontrolled research can cause the adoption or continuation of dangerous policies. One treatment for severe bleeding caused by cirrhosis of the liver is to send the blood through a portacaval shunt. This operation is time-consuming and risky. Many studies (at least 50), of varying sophistication, have been undertaken to determine if the benefits outweigh the risks. (These studies are reviewed in Grace, Muench, and Chalmers, 1966; the statistical meaning is discussed in Freedman, Pisani & Purves, 1978.) The message of the studies is clear: the poorer studies exaggerate the benefits of the surgery.

Seventy-five percent of the studies without control groups (24 out of 32) were very enthusiastic about

the benefits of the shunt. In the studies which had control groups that were not randomly assigned, 67% (10 out of 15) were very enthusiastic about the benefits. But none of the studies with random assignment to control and experimental groups had results that led to a high degree of enthusiasm. Three of these studies showed the shunt to have no value whatsoever. In the experiments without controls, the physicians were accidentally biasing the outcome by including only the most healthy patients in the study. In the experiments with non-randomized controls, the physicians were accidentally biasing the outcome by assigning the poorest patients to the control group that did not receive the shunt. Only when the confound of patient health was removed by randomization was it clear that the risky operation was of little or no value.

Good research does matter. Even physicians, highly selected for intelligence and highly trained in intuitive assessment, were misled by their daily experience. Because the formal studies were publicly available, and because the quality of the studies could be evaluated on the basis of their experimental method, the overall conclusions were decisive. Until the human potential movement agrees on the importance of quantitative evaluation, it will remain split into factions based on ideologies maintained by personal experience.

Formal research methods are not the only or necessarily the best way to learn about the true state of nature. But good research is the only way to ensure that real phenomena will drive out illusions. The story of the "discovery" of N-rays in France in 1903 reveals how even physics, the hardest of the hard sciences, could be led astray by subjective evaluation (Broad & Wade, 1982). This "new" form of X-rays made sparks brighten when viewed by the naked eye. The best physical scientists in France accepted this breakthrough because they wanted to believe in it. It took considerable logical and experimental effort to convince the scientific establishment that the actual phenomenon was self-deception. Good research can disconfirm theories; subjective judgment rarely does.
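The pattern in the shunt studies is easy to see when the quoted counts are tabulated. The short Python sketch below is an illustration added here, using only the counts given above; the total number of randomized trials is not stated in this excerpt, so that row records only the reported result.

    # Proportion of "very enthusiastic" portacaval-shunt studies, by design,
    # using the counts quoted in the text above.
    study_counts = {
        "no controls": (24, 32),               # (enthusiastic, total)
        "non-randomized controls": (10, 15),
    }

    for design, (enthusiastic, total) in study_counts.items():
        rate = 100 * enthusiastic / total
        print(f"{design:25}: {enthusiastic}/{total} = {rate:.0f}% enthusiastic")

    # The randomized trials: none enthusiastic (denominator not given in this excerpt).
    print("randomized controls      : 0 enthusiastic studies reported")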

In his critique of the use of poor research practices, Pitfalls of Human Research, Barber (1976) points out that many flaws of natural inference can creep into scientific research. "The validity and generalizability of experiments can be significantly improved by making more explicit the pitfalls that are integral to their planning...and by keeping the pitfalls in full view of researchers who conduct experimental studies" (pp. 90-91). While scientists and scientific methods are not immune to the flaws of subjective judgment, good research is designed to minimize the impact of these problems. The proper use of science in public policy involves replacing a "person-oriented" approach with a "method-oriented" approach (Hammond, 1978). When critics or supporters focus on the person who is setting policy criteria, the debate involves the bias and motivations of the people involved. But attempts to precisely define the variables of interest and to gather data that relate to these variables focus the adversarial debate on the quality of the methods used. This "is scientifically defensible not because it is flawless (it isn't), but because it is readily subject to scientific criticism" (Hammond, 1978, p. 135).

Intuitive judgment and the evaluation of evidence: A summary

Personal experience seems a compelling source of evidence because it involves the most basic processing of information: perception, attention, and memory storage and retrieval. Yet while we have great confidence in the accuracy of our subjective impressions, we do not have conscious access to the actual processes involved. Psychological experimentation has revealed that we have too much confidence in our own accuracy and objectivity. Humans are designed for quick thinking rather than accurate thinking. Quick, confident assessment of evidence is adaptive when hesitation, uncertainty, and self-doubt have high costs. But natural shortcut methods are subject to systematic errors, and our introspective feelings of accuracy are misleading.

These errors of intuitive judgment lead people to search out confirming evidence, to interpret mixed evidence in ways that confirm their expectations, and to see meaning in chance phenomena. This same biased processing of information makes it very difficult to change our beliefs and to understand the point of view of those with opposing beliefs. These errors and biases are now well-documented by psychologists and decision theorists, and the improvement of human judgment is of central concern in current research. The long-term response to this knowledge requires broad educational programs in basic statistical inference and formal decision-making, such as those proposed and examined by various authors in Kahneman et al. (1982). Already, business schools include "de-biasing" procedures in their programs of formal decision-making. But given the complex technological nature of our society, most researchers believe that some instruction on how to be a better consumer of information should start in public schools.

The immediate response should be a renewed commitment to formal structures in deciding important policy, and a new realization that personal experience cannot be decisive in forming such policy. As Gilbert, Light and Mosteller (1978) point out in their review of the efficacy of social innovations, only true experimental trials can yield knowledge that is reliable and cumulative. While formal research is slow and expensive, and scientific knowledge increases by tiny increments, the final result is impressively useful. Perhaps most important, explicit public evidence is our best hope for moving toward a consensus on appropriate public policy.
