Evaluation Issues

Implementation of an enhancement technique, in the committee's view, should depend on two general kinds, or levels, of evaluation. The first examines primarily the scientific justification for the effectiveness of the technique and the potential of the technique for improving performance in practice. The second kind examines field tests of a pilot program incorporating the technique to determine how feasible it is and to what extent it brings about effects that Army officials consider useful.

Convincing scientific justification can come only from basic research, that is, from carefully controlled studies that usually take place in laboratory settings and that preferably are related to a body of theory. Such research can provide evidence for the existence of the causal effect on which a technique is based and can help explain, or indicate a mechanism for, the effect. Analysis in connection with basic research should go beyond scientific justification to operational potential and likely cost-effectiveness. Only field tests can assess a program's actual operations and effects, however, and for such tests a broader array of evaluative criteria is needed, related primarily to the technique's utility.

Because strong claims of support from basic research have been made for some of the techniques the committee examined, we review here what it takes to justify a scientific claim; specifically, we review some standards for evaluating basic research. We then examine in more detail some standards for evaluating field tests of pilot programs. In the third section of this chapter, we set forth briefly some of our impressions of how the Army now manages the solicitation and evaluation of new performance-enhancing techniques. This chapter concludes with a note on informal, qualitative approaches to evaluation, which are sometimes suggested as alternatives to basic research and field tests.

This chapter does not aspire to a comprehensive treatment of evaluation issues, and it barely touches on research methods. Articles, journals, books, and handbooks testify to the scope and complexity of this burgeoning field (e.g., Barber, 1976; Cook and Campbell, 1979). Our objective here is to highlight the topics that have impressed us as most germane. The various sources just mentioned would need to be consulted for even a minimal elaboration of these topics, and other committees would be required if recipes for evaluation of the Army's enhancement programs were sought as extensions of our work. Still, we believe this chapter will help the Army set general evaluation standards.

STANDARDS FOR EVALUATING BASIC RESEARCH

The purpose of basic research is to permit inferences to be drawn in accordance with scientific standards, including inferences about novel concepts, about causation, about alternative explanations of causal relations, and about the generalizability of causal relations.

For novel concepts, evidence must be gathered that both the purported enhancement technique and the relevant performance have been (1) defined in a way to highlight their critical elements, (2) differentiated from related variables that might bring about similar effects, and (3) put into operation (manipulated or measured) in ways that include the critical parts. The burden is on the evaluator to analyze how the components of each new technique differ from concepts already in the literature. The need for this standard is illustrated well by packages for accelerated learning, as discussed in Chapter 4.

Evidence needs also to be adduced that supposed cause and effect variables vary together in a systematic manner.
Relevant procedures include comparison of performance before and after introduction of the technique, contrasts of experimental and control groups in an experimental design, and calculation of statistical significance. Illusory covariation can occur more easily in nonstatistical studies, which are used often to support the existence of paranormal effects, as discussed in Chapter 9.

Especially demanding is the need for evidence that the performance effect observed is due to the postulated cause and not to some other variable. Ruling out alternative explanations or mechanisms requires intimate knowledge of a research area. Historical findings and critical commentary are needed to identify alternatives, determine their plausibility, and judge how well they have been ruled out in particular sets of experiments. Common threats to the validity of any presumed cause-effect relation include effects stemming from subject selection, unexpected changes in organizational forces, the spontaneous maturation of subjects, and the sensitizing effects of a pretest measurement on a posttest assessment. Experiments with random assignment of subjects to treatments are preferred, but some of the better quasi-experimental designs are also useful. Another class of threats to validity is associated with subject reactions to such conceptual irrelevancies as experimenter expectations about how subjects should perform or subjects' performing better merely because they are receiving attention. Procedures that have evolved to reduce this sort of threat include double-blind experiments, placebo control groups, mechanical delivery of treatments, and the elimination of all communication between experimenters and subjects or among subjects. These safeguards, however, are not certain, and implementing them is not a simple matter.

Finally, for a technique to be of value, one must ascertain that a causal relation observed in one setting is likely to be observed in other settings in which the technique is to be employed. Replication of an experiment by an independent investigator is a first step. Another step is to produce the cause and effect with different samples of people, settings, and times. Systematic reviews of the literature, perhaps aided by what is referred to as meta-analysis of studies (as illustrated in Chapter 5), are also helpful. Beyond these steps, a thorough theoretical understanding of causal processes, which is a fundamental goal of science, permits increased practical control.

Our point, perhaps seeming obvious to many but nonetheless needing emphasis here, is that a planned or existing program for implementing an enhancement technique is much more likely to bear fruit if evidence for the technique's effectiveness is properly derived from basic research.
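Two of the procedures named in this section, random assignment of subjects to treatments and a check of statistical significance for the contrast between experimental and control groups, can be made concrete with a minimal sketch. Everything specific in it is an assumption made for illustration: the function names, the equal split into two groups, and the use of a permutation test (one of several reasonable significance tests) are not prescribed by the committee.

```python
import random
from statistics import mean


def randomly_assign(subjects, seed=0):
    """Shuffle the subjects and split them into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(treated, control, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means.

    The group labels are reshuffled many times; the p-value is the share of
    reshuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical subject identifiers and hypothetical performance scores.
    treatment_ids, control_ids = randomly_assign(range(1, 21), seed=42)
    print("treatment group:", sorted(treatment_ids))
    print("control group:  ", sorted(control_ids))
    treated_scores = [72, 75, 71, 78, 74, 77, 73, 76, 79, 70]
    control_scores = [68, 71, 66, 70, 69, 72, 67, 65, 70, 68]
    print("p-value:", permutation_p_value(treated_scores, control_scores))
```

Because chance alone decides who receives the technique, selection and maturation affect both groups alike. The sketch does nothing about experimenter expectations or attention effects, which call for the blinding and placebo procedures described above.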
A complex set of ground rules exists for conducting and drawing inferences from basic research, and waiving those rules greatly increases the chances of incorrect conclusions.

STANDARDS FOR EVALUATING FIELD TESTS OF PROGRAMS

An adequate appraisal of an actual enhancement program requires attention to three general factors. First, the organizational (i.e., political, administrative) context in which the program is embedded should be described. That context strongly influences the choice of evaluation criteria, the types of evaluations considered feasible, and the extent to which evaluation results will be used. Second, the program's consequences should be described and explained, including planned and unplanned, short-term and long-term consequences. The way the program is construed influences the claims resulting from an evaluation and the degree of confidence that can be placed in what was learned. Third, value or merit should be explicitly assigned to a program. Valuing relates an enhancement technique to an Army need and to feasible alternatives. In the following sections we comment on these three factors in turn.

THE ORGANIZATIONAL CONTEXT

A description of the broader context of an enhancement program would include an assessment both of the various constituencies with a stake in its implementation and of the priorities of the larger institution. We do not discuss stakeholder interests in general at this point because we refer to some specifically later in this chapter, in the section on the committee's impressions of current Army evaluation practices. We do comment here on the Army's institutional priorities as they may relate to scientific standards.

We understand that the Army, like other organizations in society, may have (and quite possibly should have) different standards for evaluating knowledge claims, or technique effectiveness, than science has. The scientific establishment is conservative in the tests it administers to discipline its conjectures; in particular, its goal is to reduce uncertainty as far as possible, no matter how long that takes. In the Army, by contrast, the need for timely information and decisions may lead to an acceptance of greater uncertainty and a higher risk of being wrong. There is no Army doctrine of which we are aware concerning the degree of risk that is acceptable in evaluations of pilot programs. Yet surely one objective of evaluations of pilot programs should be to describe the costs to the Army of drawing incorrect conclusions so that inferential standards can be made commensurate with those costs.
If the costs are relatively low, the riskier approach of most commercial research (as, for example, in management consulting or marketing) may be preferred to the more conservative approach of basic science.

DESCRIBING A PROGRAM'S CONSEQUENCES

In evaluating a program, it is desirable to present an analysis and defense of the questions probed and not probed, together with justification for the priorities accorded to various issues. Primary issues usually include the program's immediate effects and its organizational side effects.

Immediate Effects

A primary problem in evaluation is to decide on the criteria by which a program is to be assessed. The major sources for identifying potential criteria include program goals, interviews with interested persons, consideration of plausible consequences found in the literature, and insights gained from preliminary field work.

Such criteria specify only potential effects, however. They do not speak to the matter of whether the relation between a supposed cause and effect is truly causal. In this respect, a fundamental issue of methodology is the use of randomized experiments. Although logistic reasons abound in any practical context for not going to the trouble to use such research designs, one might nonetheless argue that the Army is in a better position to conduct randomized experiments than are organizations in such fields as education, job training, and public health. The reason for going to such trouble is that randomized experiments give a lower risk of incorrect causal conclusions than the alternatives.

Alternatives at the next level of confidence are quasi-experimental designs that include pretest measures and comparison (control) groups. Relatively little confidence can be placed either in before-after measurements of a single group exposed to a technique without an external comparison, or in comparisons of nonequivalent intact groups for which pretest measures are not available.

Side Effects

Unintended side effects include impacts on the broader organization, and these should be monitored. For example, trainers from other (nonexperimental) units may copy what they think is going on, or they may simply be upset by the implementation of new instructional packages in the experimental units. Units not treated in the same way as the experimental units may be unwilling to cooperate when cooperation would seem to be in their best interest. They may also suffer by comparison, as is thought to be the case, for example, when COHORT units are introduced into a division (see Chapter 8).
Evaluators should strive to see any program as fitting into a wider system of Army activities on which it may have unintended positive or negative effects.

ASSIGNING VALUE TO PILOT PROGRAMS

The described consequences of a program tell us what a program has achieved but not how valuable it is. Three other factors are important in inferring value: Does the new technique meet a demonstrable Army need to the extent that without it the organization would be less effective? How likely is it that the program can be transferred to other Army settings, either as a total package or in part? How well does the new program fare when compared with current practice and with alternatives for bringing about the same results?

Meeting Needs

Representatives of the commercial world who seek outlets for their products often confound wants with needs, enthusiasm with proof, and hope with reality. While it is axiomatic that all field tests should aim to meet genuine Army needs, it is not clear how needs are now assessed when the developers of new products approach Army personnel for permission to do general research or field tests. It is clear that a needs analysis should be part of the documentation about every field test.

What should a needs analysis look like? At the minimum, it should document the current level of performance at some task, why the level is inadequate, what reason there is to believe that performance can change, and what the Armywide impacts would probably be if the performance in question were improved. In addition, an analysis should question why a particular program is needed for solving the problem. Such an analysis would describe the program, critically examine its justification in basic research, identify the financial and human resources required to make the program work, relate the resources required to the funds available, examine other ways of bringing about the same intended results, and justify the program at hand in terms of its anticipated cost-effectiveness. To facilitate critical feedback, such reports should be independent of the persons who sponsor a program, though based on a thorough, firsthand acquaintance with the program and its developers and sponsors.

As just described, needs analysis is a planning exercise to justify mounting a pilot program. It is not a review of program achievements relative to needs, for which a description of a program's consequences is required.
At that later stage in evaluation a judgment is required about whether the magnitude of a program's effects is sufficient to reduce needs to a degree that makes a practical difference. More is at stake than whether the program makes a statistically reliable difference in performance. Size of effect relative to need is the crucial concern.

When the magnitude of change required for practical significance has been specified in advance, it is easy to use such a specification to probe how well a need has been met. But the level of change required to alleviate need is not usually predetermined, and there are political reasons why developers are not always eager to have their programs evaluated in terms of effect sizes they themselves have clearly promised or that others have set for them.

Needs can be specified only by Army officials, and it is vital that such officials inspect the results a program has achieved, relating them to their perception of need. Since the Army is heterogeneous, it would be naive to believe that there are no significant differences within it about how important various needs are and how far a particular effect goes in meeting a particular need. Some theorists relate needs primarily to the number of persons performing below a desired level, while others emphasize the seriousness of consequences for unit performance, for which deficiencies in only one or two persons may be crucial. Some practitioners are likely to think a deficit in skill X is worse than a deficit in skill Y, while others may believe the opposite. Evaluators who take the concept of need seriously have to take cognizance of such heterogeneity, perhaps using group approaches like the Delphi technique to bring about consensus on both the level of need and the extent to which a particular pattern of evaluative results helps meet that need.

Likelihood of Transfer

Although some local commanders may sponsor field trials for the benefit of their command alone, the more widely a successful new practice can be implemented within the Army, the more important it is likely to be. Consequently, evaluations of pilot programs should seek to draw conclusions about the likelihood that findings will transfer to populations and settings different from those studied.

In this regard, it is particularly important to probe the extent to which any findings from a pilot study might depend on the special knowledge and enthusiasm of those persons who deliver or sponsor the program. Such persons are often strongly committed to a program, treating it with a concern and intensity that most regular Army personnel could not be expected to match. While it is sometimes possible to transfer such committed persons from one Army site to another in order to implement a program, in many instances this cannot be done.
Transfer is partly a question of the psychology of ownership; authorities who did not sponsor a product will sometimes reject out of hand what others have developed, including their immediate predecessors. Since Army leaders in any position turn over with some regularity due to transfers, promotions, and retirement, successors will probably not identify with a program as strongly as the original sponsors and developers did.

The likelihood of transfer also affects the degree to which program implementation is monitored. Pilot programs are likely to be more obtrusively monitored than other programs. Not only is this obtrusiveness due to developers' and evaluators' fussing over their charge, it is also due to teams of experts brought in to inspect what is novel and to responsible officers wanting to show others the unique programs they are leading (and on which the success of their careers may depend). For at least these reasons pilot programs tend to stand out more than the regular programs they may engender. Research suggests that the quality with which programs are delivered may in fact increase when outside personnel are obviously monitoring individual and group performance.

It is naive to believe that one can go confidently from a single pilot program to full-blown Armywide implementation. Even if this were feasible politically, it would not be technically advisable unless there were compelling evidence from a great deal of prior research indicating that the program was indeed built on valid substantive foundations. Given a single pilot program, decisions about transfer are best made if the program is tested again, at a larger but still restricted set of sites and under conditions that more closely approximate those that would pertain if the new enhancement technique were implemented as routine policy. Only then might serious plans for Armywide implementation be feasible.

Contrast with Alternatives

Most of the evaluation we have discussed contrasts a novel program with standard practices that are believed worth improving; yet rational models of decision making are usually predicated on managers' having to choose among several different options for performing a particular task. One would hope that every sponsor of a novel performance enhancement technique is conversant with the practical alternatives to it and has cogent arguments for rejecting them.

Many novel techniques have some components that are already in standard practice or can be clearly derived from established theories. Upon close inspection, pilot programs often turn out to be less novel than their developers and sponsors claim.
Of course, the Army may often find it convenient to order complete packages in the form offered and may not have much latitude to interact with developers in order to modify package contents to emphasize what is truly a novel alternative and to downplay that which is merely standard practice.

Ultimately, alternatives have to do with costs. Although many forms of cost are at issue, including those associated with how much a new practice disrupts normal Army activities and how much stress it puts on personnel, the major cost usually considered is financial. Cost analysis is always difficult, nowhere more so than in the Army, which uses many ways to calculate personnel costs. Nonetheless, in planning an evaluation, some evidence about the total cost of a pilot program to the Army will usually be available and can be critically scrutinized. It is also useful, as far as possible, to ascribe accurate Army costs to each of the major components of such an intervention.

In our view, what is called cost-effectiveness analysis lends itself better than what is called cost-benefit analysis to the comparison of different programs. The purpose of cost-effectiveness research is to express the total cost for each program in dollar terms and to relate this to the amount of effect as expressed in its original metrics, unlike cost-benefit research, in which even the effects have to be expressed in dollar terms. Sophisticated consumers of evaluation should want something akin to cost-effectiveness knowledge, for it reflects decisions they should be making. Is it not useful to know, for example, that the best available computer-assisted instruction packages are much less cost-effective than peer tutoring?

CURRENT STATUS OF ARMY EVALUATIONS

We set forth here some of our impressions of the way in which the Army currently manages the solicitation and evaluation of novel techniques to enhance performance. We must stress that these are only impressions, gained through the limited investigative capabilities of a committee such as ours, not hard conclusions based on systematic research directed at the particular question. Furthermore, although the opinions that follow are largely critical of Army procedures, they are not accompanied by much detail. As noted earlier, the focus here is on the identification of the various Army constituencies that have a stake in enhancement programs and on the role they play in evaluation.

How the Army decides which among competing proposals should be sponsored for development or for field tests is not clear. What is clear is that decision making is diffuse both geographically and institutionally. Sponsorship may come from senior managers in the Pentagon or from local personnel of varying rank. While differences in the quality of program design, implementation, or evaluation may be correlated with the source of sponsorship, such a correlation is not clear at present in the Army context.
A particular concern is that Army sponsors of pilot programs may base their judgment about the value of a program either on their own ideas about what is desirable or effective or on the persuasiveness of the arguments presented to them by program developers, who stand to gain financially if the Army adopts their program. Judgments of value should depend on broader analysis of Army needs and resources, as well as on realistic assessment of the quality of proposed ideas based on a thorough and independent knowledge of the relevant research literatures. Sponsors should examine what is being advocated at every stage: proposal, testing, and implementation. Also of concern when pilot programs are planned is how decisions are reached about funding and about the quality of implementation expected
from them. Although systematic evidence is lacking, it seemed to committee members that pilot programs are not generally implemented well and, except for fiscal accountability, are not closely monitored by their Army sponsors. Evaluations of pilot programs should try to characterize resources required by the program and the resources actually available.

We found little evidence that sponsors, advocates, or local implementers had aspirations to evaluations that use state-of-the-art methods. We found no guidelines about the standards expected for evaluative work, whether in the form of published minimal standards or published statements of preferred practices. When it comes to field trials of novel ideas for enhancing human performance, the monitoring of evaluation quality does not seem to be part of the organizational context. Given the absence of formal expectations in these regards, it is not surprising that the pilot programs we saw and the evaluation materials we read were usually disappointing in the technical quality of the research conducted.

In settings in which program sponsors or advocates control an evaluation, weaker evaluations (e.g., based on testimony) will sometimes be preferred to stronger methods (e.g., experiments) because the latter are usually more disruptive when implemented and are more likely to result in effects that are disappointing, however much more accurate they may be. The weaker methods are easier to implement when few units are available, are less disruptive of ongoing activities, are easier to manipulate for self-interested ends, and need not be as expensive for data collection.

We saw little evidence that the Army requires evaluations by persons independent of the pilot program under review. Moreover, the nonindependent evaluations we saw did not seem to have been subjected to any of the peer review procedures to which research results (and plans) are subjected not only in academic sciences, but also in much of the corporate world, as with, say, pharmaceutical testing. While in-house evaluation is highly valuable for gaining feedback for program improvement, many experienced evaluators contend that it is inadequate for assigning overall value because in-house evaluators cannot divorce themselves from their own stake in the program under examination.

Although it is not easy to specify organizational standards adequate for a high-quality field test of some novel technique, it is also not difficult to detect the inadequacies associated with local program sponsors' having few clear expectations about the desirable qualities of program operations or evaluative practices. In the absence of such expectations, program developers and evaluators may believe that few officials care about the small-scale field tests of techniques on which the developers' (and, all too often, the evaluators') own welfare depends.

Since the organizational climate we have just described is not optimal for gaining trustworthy information about program value, future evaluators of Army field trials might do well to characterize: (1) what program managers expect in terms of the quality of the program and its evaluation; (2) who is paying attention to the trials; and (3) for what purposes they want to use any information provided by the evaluation. This kind of information, as mentioned above, contributes to a description of the organizational context of a program, which is a major part of an adequate evaluation.

QUALITATIVE APPROACHES

Alternatives to experimentation are the largely qualitative traditions, which rely mostly on direct observation, sometimes supplemented by archival data. Investigative journalists operate in this mode; so do many cultural anthropologists, political scientists, and historians. These professions use clues to suggest hypotheses about possible causes and investigate the empirical evidence in ever-greater detail in an attempt to rule out hypotheses until they are left with just one. A critical aspect of their work is the use of substantive theories and ad hoc findings from the past to help in ruling out alternative explanations. Also working in this tradition are committees of psychologists who seek to make statements about the causes of enhanced human performance. Rarely conducting studies themselves, they instead sift through historical evidence provided by reviews of the literature and make on-site observations in the manner of detectives, pathologists, investigative journalists, and cultural anthropologists.

These traditions rely strongly on personal testimony. Respondents' reports are taken seriously and, indeed, should be. Any method can, in principle, generate strong causal evidence, provided that plausible alternatives to a preferred hypothesis have been ruled out. The general issues are: Can personal testimony usually rule out all the plausible alternative interpretations?
Does use of it engender the very threats to validity that militate against strong inferences? Dale Griffin, in a paper prepared for the committee (see Appendix B), suggests "no" to the first question and "yes" to the second. His analysis of biases that operate when people attempt to explain how and why they changed after an experience reveals many of the shortcomings associated with relying on testimony as a major means of testing causal hypotheses.

While testimony can be regarded as a form of confirmatory evidence, it does not provide any of the disconfirming evidence needed to reduce uncertainty. Rarely are there the kinds of comprehensive probes needed to discover why respondents believe that the effects are due to a treatment rather than to maturation, statistical regression, or the pleasant feelings aroused by the experiences. People are typically weak at identifying the range of such alternatives, however simply they may be described, and at distinguishing the different ways in which the causal forces might operate. How can people know how they would have matured over time in the absence of an intervention (technique) that is being assessed? How can people disentangle effects due to a pleasant experience, a dynamic leader, or a sense of doing something important from effects due to the critical components of the treatment per se?

Much research has shown that individuals are poor intuitive scientists and that they recreate a set of known cognitive biases (Nisbett and Ross, 1980; Griffin). These include belief perseverance, selective memory, errors of attribution, and overconfidence. These biases influence experts and nonexperts alike, usually without one's awareness of them. Scientists hold these biases in partial check by using random assignment instead of testimony and by the tradition of public scrutiny to identify and analyze alternative interpretations for observed events. Such methodological traditions can be transmitted to consumers and producers of enhancement techniques through courses on statistical inference and formal decision making. These courses would have the salutary effect of calling attention to the shortcomings of testimony as evidence.

We submit that experimental methods facilitate causal inferences better than the alternatives. They reduce more uncertainty by ruling out more of the contending interpretations for observed effects. However, we refer here to the relative superiority of experimentation; such superiority should not be confused with either the perfection or even the adequacy of experimentation. Its problems include the facts that experiments cannot be implemented under all conditions and that experimentation has its own set of unintended side effects.
Thus, experimental methods do not guarantee causal inferences and so cannot obviate the need for critical analysis that, on a case-by-case basis, is sensitive to the contexts and traditions of particular institutions or communities, such as the Army, on one hand, and the various promoters of new enhancement techniques, on the other. Moreover, well-conceived research is costly: it requires specially trained investigators, equipped facilities, and programs that may need extensive collaborations and review panels. It is also a demanding craft that requires sensitivity to detail and precision in order to ensure results that are interpretable. On balance, the benefits derived from careful experimentation outweigh the costs just mentioned. All other things being equal, experimentation is much the preferred strategy for judging the efficacy of techniques that purport to enhance performance, and it should be used whenever possible.
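
The contrast drawn in this chapter between before-after impressions and controlled comparison can be illustrated with a small simulation of statistical regression. It is a sketch under stated assumptions, not an analysis of any Army program: the score scale, the noise levels, the selection of the lowest-scoring quarter, and the function name `simulate_pilot_program` are all hypothetical. The simulated technique has no true effect at all.

```python
import random
from statistics import mean


def simulate_pilot_program(n_soldiers=4000, true_effect=0.0, seed=1):
    """Return (mean gain of treated, mean gain of controls) in test points."""
    rng = random.Random(seed)
    # Each score is a stable skill plus independent noise on that occasion.
    skill = [rng.gauss(50, 10) for _ in range(n_soldiers)]
    pretest = [s + rng.gauss(0, 10) for s in skill]

    # Select the lowest-scoring quarter, as a remedial program might.
    cutoff = sorted(pretest)[n_soldiers // 4]
    selected = [i for i in range(n_soldiers) if pretest[i] < cutoff]

    # Randomly assign the selected soldiers to the technique or to a control group.
    rng.shuffle(selected)
    half = len(selected) // 2
    treated, control = selected[:half], selected[half:]

    def posttest(i, is_treated):
        # Fresh measurement noise; the technique adds true_effect if treated.
        return skill[i] + rng.gauss(0, 10) + (true_effect if is_treated else 0.0)

    gain_treated = mean(posttest(i, True) - pretest[i] for i in treated)
    gain_control = mean(posttest(i, False) - pretest[i] for i in control)
    return gain_treated, gain_control


if __name__ == "__main__":
    gain_treated, gain_control = simulate_pilot_program()
    print(f"before-after gain, treated group: {gain_treated:.1f}")
    print(f"before-after gain, control group: {gain_control:.1f}")
    print(f"treated minus control:            {gain_treated - gain_control:.1f}")
```

Under these assumptions both groups show a sizable before-after gain even though the technique does nothing, because soldiers chosen for unusually low pretest scores tend to score closer to their true level on retest. Participants and sponsors observing only the treated group could sincerely testify to improvement; the randomized control group is what shows that the gain cannot be credited to the technique.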