Evaluation Issues
Implementation of an enhancement technique, in the committee's view,
should depend on two general kinds, or levels, of evaluation. The first
examines primarily the scientific justification for the effectiveness of the
technique and the potential of the technique for improving performance
in practice. The second kind examines field tests of a pilot program
incorporating the technique to determine how feasible it is and to what
extent it brings about effects that Army officials consider useful.
Convincing scientific justification can come only from basic research,
that is, from carefully controlled studies that usually take place in
laboratory settings and that preferably are related to a body of theory.
Such research can provide evidence for the existence of the causal effect
on which a technique is based and can help explain, or indicate a
mechanism for, the effect. Analysis in connection with basic research
should go beyond scientific justification to operational potential and likely
cost-effectiveness. Only field tests can assess a program's actual
operations and effects, however, and for such tests a broader array of
evaluative criteria is needed, related primarily to the technique's utility.
Because strong claims of support from basic research have been made
for some of the techniques the committee examined, we review here
what it takes to justify a scientific claim; specifically, we review some
standards for evaluating basic research. We then examine in more detail
some standards for evaluating field tests of pilot programs. In the third
section of this chapter, we set forth briefly some of our impressions of
how the Army now manages the solicitation and evaluation of new
performance-enhancing techniques. This chapter concludes with a note
on informal, qualitative approaches to evaluation, which are sometimes
suggested as alternatives to basic research and field tests.
This chapter does not aspire to a comprehensive treatment of evaluation
issues, and it barely touches on research methods. Articles, journals,
books, and handbooks testify to the scope and complexity of this
burgeoning field (e.g., Barber, 1976; Cook and Campbell, 1979). Our
objective here is to highlight the topics that have impressed us as most
germane. The various sources just mentioned would need to be consulted
for even a minimal elaboration of these topics, and other committees
would be required if recipes for evaluation of the Army's enhancement
programs were sought as extensions of our work. Still, we believe this
chapter will help the Army set general evaluation standards.
STANDARDS FOR EVALUATING BASIC RESEARCH
The purpose of basic research is to permit inferences to be drawn in
accordance with scientific standards, including inferences about novel
concepts, about causation, about alternative explanations of causal
relations, and about the generalizability of causal relations.
For novel concepts, evidence must be gathered that both the purported
enhancement technique and the relevant performance have been (1)
defined in a way to highlight their critical elements, (2) differentiated
from related variables that might bring about similar effects, and (3) put
into operation (manipulated or measured) in ways that include the critical
parts. The burden is on the evaluator to analyze how the components of
each new technique differ from concepts already in the literature. The
need for this standard is illustrated well by packages for accelerated
learning, as discussed in Chapter 4.
Evidence needs also to be adduced that supposed cause and effect
variables vary together in a systematic manner. Relevant procedures
include comparison of performance before and after introduction of the
technique, contrasts of experimental and control groups in an experimental
design, and calculation of statistical significance. Illusory covariation can
occur more easily in nonstatistical studies, which are used often to support
the existence of paranormal effects, as discussed in Chapter 9.
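
The contrast of experimental and control groups, together with a calculation of statistical significance, can be illustrated with a small permutation test. What follows is only a minimal sketch in Python, not a procedure taken from this chapter; the function name `permutation_p_value` and the scores are hypothetical and were invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Reshuffle group labels to see what chance alone would produce.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical performance scores, invented for illustration only.
treated_scores = [78, 85, 81, 90, 88, 84]
control_scores = [72, 75, 80, 70, 77, 74]
p = permutation_p_value(treated_scores, control_scores)
```

A small value of `p` indicates only that the observed covariation is unlikely under chance relabeling of subjects; it does not by itself rule out the alternative explanations discussed next.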
Especially demanding is the need for evidence that the performance
effect observed is due to the postulated cause and not to some other
variable. Ruling out alternative explanations or mechanisms requires
intimate knowledge of a research area. Historical findings and critical
commentary are needed to identify alternatives, determine their
plausibility, and judge how well they have been ruled out in particular sets of
experiments. Common threats to the validity of any presumed cause-effect
relation include effects stemming from subject selection, unexpected
changes in organizational forces, the spontaneous maturation of subjects,
and the sensitizing effects of a pretest measurement on a posttest
assessment. Experiments with random assignment of subjects to treat-
ments are preferred, but some of the better quasi-experimental designs
are also useful. Another class of threats to validity is associated with
subject reactions to such conceptual irrelevancies as experimenter ex-
pectations about how subjects should perform or subjects' performing
better merely because they are receiving attention. Procedures that have
evolved to reduce this sort of threat include double-blind experiments,
placebo control groups, mechanical delivery of treatments, and the
elimination of all communication between experimenters and subjects or
among subjects. These safeguards, however, are not certain, and imple-
menting them is not a simple matter.
Finally, for a technique to be of value, one must ascertain that a causal
relation observed in one setting is likely to be observed in other settings
in which the technique is to be employed. Replication of an experiment
by an independent investigator is a first step. Another step is to produce
the cause and effect with different samples of people, settings, and times.
Systematic reviews of the literature, perhaps aided by what is referred
to as meta-analysis of studies (as illustrated in Chapter 5), are also helpful.
Beyond these steps, a thorough theoretical understanding of causal
processes, which is a fundamental goal of science, permits increased
practical control.
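
One common way of combining results across replications in a meta-analysis is inverse-variance weighting under a fixed-effect assumption. The sketch below, in Python, shows that calculation only; it is an assumption that this is a suitable model for any particular literature, and nothing here is drawn from Chapter 5. The function name `pooled_effect` and the study values are hypothetical.

```python
import math


def pooled_effect(effects, variances):
    """Fixed-effect, inverse-variance weighted mean of study effects."""
    if len(effects) != len(variances):
        raise ValueError("effects and variances must have the same length")
    # More precise studies (smaller variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return estimate, standard_error


# Hypothetical standardized effects and variances from three studies.
study_effects = [0.2, 0.5, 0.3]
study_variances = [0.04, 0.01, 0.02]
estimate, standard_error = pooled_effect(study_effects, study_variances)
```

If the studies differ substantially in samples, settings, or times, a fixed-effect summary may understate uncertainty, which is one reason such summaries aid rather than replace a critical review.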
Our point, perhaps seeming obvious to many but nonetheless needing
emphasis here, is that a planned or existing program for implementing
an enhancement technique is much more likely to bear fruit if evidence
for the technique's effectiveness is properly derived from basic research.
A complex set of ground rules exists for conducting and drawing inferences
from basic research, and waiving those rules greatly increases the chances
of incorrect conclusions.
STANDARDS FOR EVALUATING
FIELD TESTS OF PROGRAMS
An adequate appraisal of an actual enhancement program requires
attention to three general factors. First, the organizational (i.e., political,
administrative) context in which the program is embedded should be
described. That context strongly influences the choice of evaluation
criteria, the types of evaluations considered feasible, and the extent to
which evaluation results will be used. Second, the program's
consequences should be described and explained, including planned and
unplanned, short-term and long-term consequences. The way the program
is construed influences the claims resulting from an evaluation and the
degree of confidence that can be placed in what was learned. Third, value
or merit should be explicitly assigned to a program. Valuing relates an
enhancement technique to an Army need and to feasible alternatives. In
the following sections we comment on these three factors in turn.
THE ORGANIZATIONAL CONTEXT
A description of the broader context of an enhancement program would
include an assessment both of the various constituencies with a stake in
its implementation and of the priorities of the larger institution. We do
not discuss stakeholder interests in general at this point because we refer
to some specifically later in this chapter, in the section on the committee's
impressions of current Army evaluation practices. We do comment here
on the Army's institutional priorities as they may relate to scientific
standards.
We understand that the Army, like other organizations in society, may
have, and quite possibly should have, different standards for evaluating
knowledge claims, or technique effectiveness, than science has. The
scientific establishment is conservative in the tests it administers to
discipline its conjectures; in particular, its goal is to reduce uncertainty
as far as possible, no matter how long that takes. In the Army, by
contrast, the need for timely information and decisions may lead to an
acceptance of greater uncertainty and a higher risk of being wrong.
There is no Army doctrine of which we are aware concerning the
degree of risk that is acceptable in evaluations of pilot programs. Yet
surely one objective of evaluations of pilot programs should be to describe
the costs to the Army of drawing incorrect conclusions so that inferential
standards can be made commensurate with those costs. If the costs are
relatively low, the riskier approach of most commercial research (as, for
example, in management consulting or marketing) may be preferred to
the more conservative approach of basic science.
DESCRIBING A PROGRAM'S CONSEQUENCES
In evaluating a program, it is desirable to present an analysis and
defense of the questions probed and not probed, together with justification
for the priorities accorded to various issues. Primary issues usually
include the program's immediate effects and its organizational side effects.
Immediate Effects
A primary problem in evaluation is to decide on the criteria by which
a program is to be assessed. The major sources for identifying potential
criteria include program goals, interviews with interested persons, con-
sideration of plausible consequences found in the literature, and insights
gained from preliminary field work.
Such criteria specify only potential effects, however. They do not
speak to the matter of whether the relation between a supposed cause
and effect is truly causal. In this respect, a fundamental issue of
methodology is the use of randomized experiments. Although logistic
reasons abound in any practical context for not going to the trouble to
use such research designs, one might nonetheless argue that the Army is
in a better position to conduct randomized experiments than are organi-
zations in such fields as education, job training, and public health. The
reason for going to such trouble is that randomized experiments give a
lower risk of incorrect causal conclusions than the alternatives.
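
The core of a randomized experiment is that chance alone decides which units receive the technique. A minimal sketch of simple random assignment in Python follows; the function name `randomly_assign` and the unit identifiers are hypothetical, and a real field test might instead need blocking or stratification, which is not shown.

```python
import random


def randomly_assign(units, seed=None):
    """Shuffle units and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd number of units, the control group gets the extra one.
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical unit identifiers, invented for illustration only.
unit_ids = [f"unit-{i:02d}" for i in range(1, 11)]
groups = randomly_assign(unit_ids, seed=42)
```

Recording the seed makes the assignment reproducible, so that others can later verify that no one chose which units were treated.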
Alternatives at the next level of confidence are quasi-experimental
designs that include pretest measures and comparison (control) groups.
Relatively little confidence can be placed either in before-after measure-
ments of a single group exposed to a technique without an external
comparison, or in comparisons of nonequivalent intact groups for which
pretest measures are not available.
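
For a quasi-experimental design with pretest measures and a comparison group, one simple summary is the difference between the two groups' pretest-to-posttest gains. The Python sketch below illustrates that arithmetic under the assumption that the groups would have changed by the same amount without the technique; the function name `difference_in_gains` and all scores are hypothetical.

```python
from statistics import mean


def difference_in_gains(treated_pre, treated_post,
                        comparison_pre, comparison_post):
    """Treated group's mean gain minus the comparison group's mean gain."""
    treated_gain = mean(treated_post) - mean(treated_pre)
    comparison_gain = mean(comparison_post) - mean(comparison_pre)
    return treated_gain - comparison_gain


# Hypothetical pretest and posttest scores, invented for illustration only.
estimate = difference_in_gains(
    treated_pre=[60, 62, 58, 64],
    treated_post=[70, 73, 69, 72],
    comparison_pre=[61, 59, 63, 61],
    comparison_post=[65, 64, 66, 65],
)
```

Here the treated group gains 10 points and the comparison group 4, so the estimate is 6. Without random assignment, that figure remains open to threats such as differential selection or maturation.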
Side Effects
Unintended side effects include impacts on the broader organization,
and these should be monitored. For example, trainers from other (non-
experimental) units may copy what they think is going on, or they may
simply be upset by the implementation of new instructional packages in
the experimental units. Units not treated in the same way as the
experimental units may be unwilling to cooperate when cooperation
would seem to be in their best interest. They may also suffer by
comparison, as is thought to be the case, for example, when COHORT
units are introduced into a division (see Chapter 8). Evaluators should
strive to see any program as fitting into a wider system of Army activities
on which it may have unintended positive or negative effects.
ASSIGNING VALUE TO PILOT PROGRAMS
The described consequences of a program tell us what a program has
achieved but not how valuable it is. Three other factors are important in
inferring value: Does the new technique meet a demonstrable Army need
to the extent that without it the organization would be less effective?
How likely is it that the program can be transferred to other Army
settings, either as a total package or in part? How well does the new
program fare when compared with current practice and with alternatives
for bringing about the same results?
Meeting Needs
Representatives of the commercial world who seek outlets for their
products often confound wants with needs, enthusiasm with proof, and
hope with reality. While it is axiomatic that all field tests should aim to
meet genuine Army needs, it is not clear how needs are now assessed
when the developers of new products approach Army personnel for
permission to do general research or field tests. It is clear that a needs
analysis should be part of the documentation about every field test.
What should a needs analysis look like? At the minimum, it should
document the current level of performance at some task, why the level
is inadequate, what reason there is to believe that performance can
change, and what the Armywide impacts would probably be if the
performance in question were improved. In addition, an analysis should
question why a particular program is needed for solving the problem.
Such an analysis would describe the program, critically examine its
justification in basic research, identify the financial and human resources
required to make the program work, relate the resources required to the
funds available, examine other ways of bringing about the same intended
results, and justify the program at hand in terms of its anticipated cost-
effectiveness. To facilitate critical feedback, such reports should be
independent of the persons who sponsor a program, though based on a
thorough, firsthand acquaintance with the program and its developers
and sponsors.
As just described, needs analysis is a planning exercise to justify
mounting a pilot program. It is not a review of program achievements
relative to needs, for which a description of a program's consequences
is required. At that later stage in evaluation a judgment is required about
whether the magnitude of a program's effects is sufficient to reduce needs
to a degree that makes a practical difference. More is at stake than
whether the program makes a statistically reliable difference in perform-
ance. Size of effect relative to need is the crucial concern. When the
magnitude of change required for practical significance has been specified
in advance, it is easy to use such a specification to probe how well a
need has been met. But the level of change required to alleviate need is
not usually predetermined, and there are political reasons why developers
are not always eager to have their programs evaluated in terms of effect
sizes they themselves have clearly promised or that others have set for
them.
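
When a required level of performance has been specified in advance, the size of an effect relative to need can be expressed as the share of the performance gap that the program closed. The following Python sketch is one hypothetical way to make that comparison; the function name `share_of_need_met` and the numbers are assumptions for illustration, not figures from any Army program.

```python
def share_of_need_met(baseline, observed, required_level):
    """Fraction of the gap between baseline and required level that was closed."""
    gap = required_level - baseline
    if gap <= 0:
        raise ValueError("required level must exceed the baseline")
    return (observed - baseline) / gap


# Hypothetical scores: baseline 60, observed after the program 66,
# and a level of 75 specified in advance as practically sufficient.
share = share_of_need_met(baseline=60.0, observed=66.0, required_level=75.0)
```

In this invented case the program closes 40 percent of the gap, a result that could be statistically reliable and still fall well short of the stated need.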
Needs can be specified only by Army officials, and it is vital that such
officials inspect the results a program has achieved, relating them to their
perception of need. Since the Army is heterogeneous, it would be naive
to believe that there are no significant differences within it about how
important various needs are and how far a particular effect goes in
meeting a particular need. Some theorists relate needs primarily to the
number of persons performing below a desired level, while others
emphasize the seriousness of consequences for unit performance, for
which deficiencies in only one or two persons may be crucial. Some
practitioners are likely to think a deficit in skill X is worse than a deficit
in skill Y, while others may believe the opposite. Evaluators who take
the concept of need seriously have to take cognizance of such hetero-
geneity, perhaps using group approaches like the Delphi technique to
bring about consensus on both the level of need and the extent to which
a particular pattern of evaluative results helps meet that need.
Likelihood of Transfer
Although some local commanders may sponsor field trials for the benefit
of their command alone, the more widely a successful new practice can
be implemented within the Army, the more important it is likely to be.
Consequently, evaluations of pilot programs should seek to draw conclu-
sions about the likelihood that findings will transfer to populations and
settings different from those studied.
In this regard, it is particularly important to probe the extent to which
any findings from a pilot study might depend on the special knowledge
and enthusiasm of those persons who deliver or sponsor the program.
Such persons are often strongly committed to a program, treating it with
a concern and intensity that most regular Army personnel could not be
expected to match. While it is sometimes possible to transfer such
committed persons from one Army site to another in order to implement
a program, in many instances this cannot be done. Transfer is partly a
question of the psychology of ownership; authorities who did not sponsor
a product will sometimes reject out of hand what others have developed,
including their immediate predecessors. Since Army leaders in any
position turn over with some regularity due to transfers, promotions, and
retirement, successors will probably not identify with a program as
strongly as the original sponsors and developers did.
The likelihood of transfer also affects the degree to which program
implementation is monitored. Pilot programs are likely to be more
obtrusively monitored than other programs. Not only is this obtrusiveness
due to developers' and evaluators' fussing over their charge, it is also
due to teams of experts brought in to inspect what is novel and to
responsible officers wanting to show others the unique programs they
are leading (and on which the success of their careers may depend). For
at least these reasons pilot programs tend to stand out more than the
regular programs they may engender. Research suggests that the quality
with which programs are delivered may in fact increase when outside
personnel are obviously monitoring individual and group performance.
It is naive to believe that one can go confidently from a single pilot
program to full-blown Armywide implementation. Even if this were
feasible politically, it would not be technically advisable unless there
were compelling evidence from a great deal of prior research indicating
that the program was indeed built on valid substantive foundations. Given
a single pilot program, decisions about transfer are best made if the
program is tested again, at a larger but still restricted set of sites and
under conditions that more closely approximate those that would pertain
if the new enhancement technique were implemented as routine policy.
Only then might serious plans for Armywide implementation be feasible.
Contrast with Alternatives
Most of the evaluation we have discussed contrasts a novel program
with standard practices that are believed worth improving; yet rational
models of decision making are usually predicated on managers' having
to choose among several different options for performing a particular
task. One would hope that every sponsor of a novel performance
enhancement technique is conversant with the practical alternatives to it
and has cogent arguments for rejecting them.
Many novel techniques have some components that are already in
standard practice or can be clearly derived from established theories.
Upon close inspection, pilot programs often turn out to be less novel
than their developers and sponsors claim. Of course, the Army may often
find it convenient to order complete packages in the form offered and
may not have much latitude to interact with developers in order to modify
package contents to emphasize what is truly a novel alternative and to
downplay that which is merely standard practice.
Ultimately, alternatives have to do with costs. Although many forms
of cost are at issue, including those associated with how much a new
practice disrupts normal Army activities and how much stress it puts on
personnel, the major cost usually considered is financial. Cost analysis
is always difficult, nowhere more so than in the Army, which uses many
ways to calculate personnel costs. Nonetheless, in planning an evaluation,
some evidence about the total cost of a pilot program to the Army will
usually be available and can be critically scrutinized. It is also useful, as
far as possible, to ascribe accurate Army costs to each of the major
components of such an intervention. In our view, what is called
cost-effectiveness analysis lends itself better than what is called cost-benefit
analysis to the comparison of different programs. The purpose of cost-
effectiveness research is to express the total cost for each program in
dollar terms and to relate this to the amount of effect as expressed in its
original metrics, unlike cost-benefit research, in which even the effects
have to be expressed in dollar terms. Sophisticated consumers of eval-
uation should want something akin to cost-effectiveness knowledge, for
it reflects decisions they should be making. Is it not useful to know, for
example, that the best available computer-assisted instruction packages
are much less cost-effective than peer tutoring?
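
The comparison that cost-effectiveness analysis supports can be sketched as dollars spent per unit of effect, with the effect left in its original metric. The Python example below is a minimal illustration; the function name `cost_per_unit_effect`, the program labels, and every cost and effect figure are hypothetical and say nothing about any actual instructional package.

```python
def cost_per_unit_effect(total_cost_dollars, effect):
    """Dollars spent per unit of effect, with effect in its original metric."""
    if effect <= 0:
        raise ValueError("effect must be positive to form a ratio")
    return total_cost_dollars / effect


# Hypothetical programs: (total cost in dollars, mean gain in test points).
programs = {
    "program_a": (120000.0, 8.0),
    "program_b": (30000.0, 5.0),
}
ratios = {
    name: cost_per_unit_effect(cost, effect)
    for name, (cost, effect) in programs.items()
}
# The lowest ratio marks the program that buys a unit of effect most cheaply.
most_cost_effective = min(ratios, key=ratios.get)
```

In this invented case program_a produces the larger gain but costs 15,000 dollars per point against 6,000 for program_b, which is the kind of contrast a decision maker choosing among alternatives would want to see.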
CURRENT STATUS OF ARMY EVALUATIONS
We set forth here some of our impressions of the way in which the
Army currently manages the solicitation and evaluation of novel tech-
niques to enhance performance. We must stress that these are only
impressions, gained through the limited investigative capabilities of a
committee such as ours, not hard conclusions based on systematic
research directed at the particular question. Furthermore, although the
opinions that follow are largely critical of Army procedures, they are not
accompanied by much detail. As noted earlier, the focus here is on the
identification of the various Army constituencies that have a stake in
enhancement programs and on the role they play in evaluation.
How the Army decides which among competing proposals should be
sponsored for development or for field tests is not clear. What is clear is
that decision making is diffuse both geographically and institutionally.
Sponsorship may come from senior managers in the Pentagon or from
local personnel of varying rank. While differences in the quality of
program design, implementation, or evaluation may be correlated with
the source of sponsorship, such a correlation is not clear at present in
the Army context.
A particular concern is that Army sponsors of pilot programs may base
their judgment about the value of a program either on their own ideas
about what is desirable or effective or on the persuasiveness of the
arguments presented to them by program developers, who stand to gain
financially if the Army adopts their program. Judgments of value should
depend on broader analysis of Army needs and resources, as well as on
realistic assessment of the quality of proposed ideas based on a thorough
and independent knowledge of the relevant research literatures. Sponsors
should examine what is being advocated at every stage: proposal, testing,
and implementation.
Also of concern when pilot programs are planned is how decisions are
reached about funding and about the quality of implementation expected
from them. Although systematic evidence is lacking, it seemed to
committee members that pilot programs are not generally implemented
well and, except for fiscal accountability, are not closely monitored by
their Army sponsors. Evaluations of pilot programs should try to char-
acterize resources required by the program and the resources actually
available.
We found little evidence that sponsors, advocates, or local implementers
had aspirations to evaluations that use state-of-the-art methods. We found
no guidelines about the standards expected for evaluative work, whether
in the form of published minimal standards or published statements of
preferred practices. When it comes to field trials of novel ideas for
enhancing human performance, the monitoring of evaluation quality does
not seem to be part of the organizational context. Given the absence of
formal expectations in these regards, it is not surprising that the pilot
programs we saw and the evaluation materials we read were usually
disappointing in the technical quality of the research conducted. In
settings in which program sponsors or advocates control an evaluation,
weaker evaluations (e.g., based on testimony) will sometimes be preferred
to stronger methods (e.g., experiments) because the latter are usually
more disruptive when implemented and are more likely to result in effects
that are disappointing, however much more accurate they may be. The
weaker methods are easier to implement when few units are available,
are less disruptive of ongoing activities, are easier to manipulate for self-
interested ends, and need not be as expensive for data collection.
We saw little evidence that the Army requires evaluations by persons
independent of the pilot program under review. Moreover, the noninde-
pendent evaluations we saw did not seem to have been subjected to any
of the peer review procedures to which research results (and plans) are
subjected not only in academic sciences, but also in much of the corporate
world, as with, say, pharmaceutical testing. While in-house evaluation is
highly valuable for gaining feedback for program improvement, many
experienced evaluators contend that it is inadequate for assigning overall
value because in-house evaluators cannot divorce themselves from their
own stake in the program under examination. Although it is not easy to
specify organizational standards adequate for a high-quality field test of
some novel technique, it is also not difficult to detect the inadequacies
associated with local program sponsors' having few clear expectations
about the desirable qualities of program operations or evaluative practices.
In the absence of such expectations, program developers and evaluators
may believe that few officials care about the small-scale field tests of
techniques on which the developers' (and, all too often, the evaluators')
own welfare depends.
Since the organizational climate we have just described is not optimal
for gaining trustworthy information about program value, future evaluators
of Army field trials might do well to characterize: (1) what program
managers expect in terms of the quality of the program and its evaluation;
(2) who is paying attention to the trials; and (3) for what purposes they
want to use any information provided by the evaluation. This kind of
information, as mentioned above, contributes to a description of the
organizational context of a program, which is a major part of an adequate
evaluation.
QUALITATIVE APPROACHES
Alternatives to experimentation are the largely qualitative traditions,
which rely mostly on direct observation, sometimes supplemented by
archival data. Investigative journalists operate in this mode; so do many
cultural anthropologists, political scientists, and historians. These profes-
sions use clues to suggest hypotheses about possible causes and investigate
the empirical evidence in ever-greater detail in an attempt to rule out
hypotheses until they are left with just one. A critical aspect of their
work is the use of substantive theories and ad hoc findings from the past
to help in ruling out alternative explanations. Also working in this tradition
are committees of psychologists who seek to make statements about the
causes of enhanced human performance. Rarely conducting studies
themselves, they instead sift through historical evidence provided by
reviews of the literature and make on-site observations in the manner of
detectives, pathologists, investigative journalists, and cultural anthropol-
ogists.
These traditions rely strongly on personal testimony. Respondents'
reports are taken seriously and, indeed, should be. Any method can, in
principle, generate strong causal evidence, provided that plausible alter-
natives to a preferred hypothesis have been ruled out. The general issues
are: Can personal testimony usually rule out all the plausible alternative
interpretations? Does use of it engender the very threats to validity that
militate against strong inferences? Dale Griffin, in a paper prepared for
the committee (see Appendix B), suggests "no" to the first question and
"yes" to the second. His analysis of biases that operate when people
attempt to explain how and why they changed after an experience reveals
many of the shortcomings associated with relying on testimony as a major
means of testing causal hypotheses.
While testimony can be regarded as a form of confirmatory evidence,
it does not provide any of the disconfirming evidence needed to reduce
uncertainty. Rarely are there the kinds of comprehensive probes needed
to discover why respondents believe that the effects are due to a treatment
rather than to maturation, statistical regression, or the pleasant feelings
aroused by the experiences. People are typically weak at identifying the
range of such alternatives, however simply they may be described, and
at distinguishing the different ways in which the causal forces might
operate. How can people know how they would have matured over time
in the absence of an intervention (technique) that is being assessed? How
can people disentangle effects due to a pleasant experience, a dynamic
leader, or a sense of doing something important from effects due to the
critical components of the treatment per se? Much research has shown
that individuals are poor intuitive scientists and that they recreate a set
of known cognitive biases (Nisbett and Ross, 1980; Griffin). These include
belief perseverance, selective memory, errors of attribution, and over-
confidence. These biases influence experts and nonexperts alike, usually
without one's awareness of them. Scientists hold these biases in partial
check by using random assignment instead of testimony and by the
tradition of public scrutiny to identify and analyze alternative interpre-
tations for observed events. Such methodological traditions can be
transmitted to consumers and producers of enhancement techniques
through courses on statistical inference and formal decision making.
These courses would have the salutary effect of calling attention to the
shortcomings of testimony as evidence.
We submit that experimental methods facilitate causal inferences better
than the alternatives. They reduce more uncertainty by ruling out more
of the contending interpretations for observed effects. However, we refer
here to the relative superiority of experimentation; such superiority
should not be confused with either the perfection or even the adequacy
of experimentation. Its problems include the facts that experiments
cannot be implemented under all conditions and that experimentation has
its own set of unintended side effects. Thus, experimental methods do
not guarantee causal inferences and so cannot obviate the need for critical
analysis that, on a case-by-case basis, is sensitive to the contexts and
traditions of particular institutions or communities, such as the Army,
on one hand, and the various promoters of new enhancement techniques,
on the other. Moreover, well-conceived research is costly: it requires
specially trained investigators, equipped facilities, and programs that may
need extensive collaborations and review panels. It is also a demanding
craft that requires sensitivity to detail and precision in order to ensure
results that are interpretable.
On balance, the benefits derived from careful experimentation outweigh
the costs just mentioned. All other things being equal, experimentation
is much the preferred strategy for judging the efficacy of techniques that
purport to enhance performance, and it should be used whenever possible.