Copyright © 2009. National Academy of Sciences. All rights reserved.
Executive Summary
This report provides an assessment of the U.S. Army's planned initial
operational test (IOT) of the Stryker family of vehicles. Stryker
is intended to provide mobility and "situation awareness" for the
Interim Brigade Combat Team (IBCT). For this reason, the Army Test and
Evaluation Command (ATEC) has been asked to take on the unusual re-
sponsibility of testing both the vehicle and the IBCT concept.
Building on the recommendations of an earlier National Research
Council study and report (National Research Council, 1998a), the Panel
on Operational Test Design and Evaluation of the Interim Armored Ve-
hicle considers the Stryker IOT an excellent opportunity to examine how
the defense community might effectively use test resources and analyze test
data. The panel's judgments are based on information gathered during a
series of open forums and meetings involving ATEC personnel and experts
in the test and evaluation of systems. Perhaps equally important, in our
view the assessment process itself has had a salutary influence on the IOT
design for the IBCT/Stryker system.
We focus in this report on two aspects of the operational test design
and evaluation of the Stryker: (1) the measures of performance and effec-
tiveness used to compare the IBCT equipped with the Stryker against the
baseline force, the Light Infantry Brigade (LIB), and (2) whether the cur-
rent operational test design is consistent with state-of-the-art methods.
Our next report will discuss combining information obtained from the
IMPROVED OPERATIONAL TESTING AND EVALUATION
IOT with other tests, engineering judgment, experience, and the like. The
panel's final report will encompass both earlier reports and any additional
developments.
OVERALL TEST PLANNING
Two specific purposes of the IOT are to determine whether the IBCT/
Stryker performs more effectively than the baseline force, and whether the
Stryker family of vehicles meets its capability and performance require-
ments. Our primary recommendation is to supplement these purposes:
when evaluating a large, complex, and critical weapon system such as
the Stryker, operational tests should be designed, carried out, and evalu-
ated with a view toward improving the capabilities and performance of the
system.
MEASURES OF EFFECTIVENESS
We begin by considering the definition and analysis of measures of
effectiveness (MOEs). In particular, we address problems associated with
rolling up disparate MOEs into a single overall number, the use of untested
or ad hoc force ratio measures, and the requirements for calibration and
scaling of subjective evaluations made by subject-matter experts (SMEs).
We also identify a need to develop scenario-specific MOEs for noncombat
missions, and we suggest some possible candidates for these. Studying the
question of whether a single measure for the "value" of situation awareness
can be devised, we reached the tentative conclusion that there is no single
appropriate MOE for this multidimensional capability. Modeling and
simulation tools can be used to this end by augmenting test data during the
evaluation. These tools should also be used, however, to develop a better
understanding of the capabilities and limitations of the system in general
and the value of situation awareness in particular.
With respect to determining critical measures of reliability and main-
tainability (RAM), we observe that the IOT will provide a relatively small
amount of vehicle operating data (compared with that obtained in training
exercises and developmental testing) and thus may not be sufficient to ad-
dress all of the reliability and maintainability concerns of ATEC. This lack
of useful RAM information will be exacerbated by the fact that the IOT is
to be performed without using add-on armor.
For this reason, we stress that RAM data collection should be an ongo-
ing enterprise, with failure times, failure modes, and maintenance informa-
tion tracked for the entire life of the vehicle (and its parts), including data
from developmental testing and training, and recorded in appropriate data-
bases. Failure modes should be considered separately, rather than assign-
ing a single failure rate for a vehicle using simple exponential models.
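As an illustration of why separating failure modes matters, the following is a minimal sketch and not part of the panel's analysis. It assumes a single vehicle observed to a fixed mileage and uses the Laplace trend statistic to check, mode by mode, whether a constant failure rate is plausible. The records, mode names, and function names are hypothetical.

```python
import math
from collections import defaultdict

def laplace_trend(failure_times, observation_end):
    """Laplace trend statistic for failures observed on (0, observation_end].

    Values near 0 are consistent with a constant failure rate. Large positive
    values suggest failures are becoming more frequent (for example, wear-out);
    large negative values suggest they are becoming less frequent.
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure")
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2) / (observation_end * math.sqrt(1 / (12 * n)))

def summarize_by_mode(records, observation_end):
    """Group (mode, miles_at_failure) records and report count, rate, and trend per mode."""
    by_mode = defaultdict(list)
    for mode, miles in records:
        by_mode[mode].append(miles)
    return {
        mode: {
            "failures": len(times),
            "rate_per_1000_miles": 1000 * len(times) / observation_end,
            "laplace_u": laplace_trend(times, observation_end),
        }
        for mode, times in by_mode.items()
    }

# Hypothetical records: (failure mode, odometer miles at failure) over 10,000 miles.
records = [
    ("suspension", 6200), ("suspension", 7900), ("suspension", 8800), ("suspension", 9600),
    ("electrical", 1500), ("electrical", 4100), ("electrical", 5800), ("electrical", 8600),
]
summary = summarize_by_mode(records, observation_end=10_000)
```

In this made-up data both modes have the same average rate, so a single pooled exponential rate would treat them alike, yet the suspension failures cluster late in the observation period while the electrical failures do not. Only a mode-by-mode view shows that difference.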
EXPERIMENTAL DESIGN
With respect to the experimental design itself, we are very concerned
that observed differences will be confounded by important sources of un-
controlled variation. In particular, as pointed out in the panel's letter re-
port (Appendix A), the current test design calls for the IBCT/Stryker trials
to be run at a different time from the baseline trials. This design may
confound time of year with the primary measure of interest: the difference
in effectiveness between the baseline force and the IBCT/Stryker force. We
therefore recommend that these events be scheduled as closely together in
time as possible, and interspersed if feasible. Also, additional potential
sources of confounding, including player learning and nighttime versus
daytime operations, should be addressed with alternative designs. One
alternative to address confounding due to player learning is to use four
separate groups of players, one for each of the two opposing forces
(OPFORs), one for the IBCT/Stryker, and one for the baseline system.
Intergroup variability appears likely to be a lesser problem than player learn-
ing. Also, alternating teams from test replication to test replication be-
tween the two systems under test would be a reasonable way to address
differences in learning, training, fatigue, and competence.
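The interspersing and team-alternation ideas above can be sketched as a small schedule generator. This is an illustrative sketch under stated assumptions, not the test design itself: it assumes two player teams (hypothetically named "Team A" and "Team B"), pairs one IBCT/Stryker trial with one baseline trial in each replication so that the two systems run close together in time, randomizes the order within each pair, and swaps the teams between systems on alternate replications.

```python
import random

def interspersed_schedule(n_replications, teams=("Team A", "Team B"), seed=0):
    """Build a trial schedule in which each replication is a back-to-back pair
    of trials (one per system), the order within the pair is randomized, and
    the two player teams swap systems from one replication to the next."""
    rng = random.Random(seed)
    systems = ["IBCT/Stryker", "Baseline"]
    schedule = []
    for rep in range(n_replications):
        # Swap team assignments on alternate replications.
        assigned = teams if rep % 2 == 0 else teams[::-1]
        pair = list(zip(systems, assigned))
        rng.shuffle(pair)  # randomize which system runs first in this pair
        for slot, (system, team) in enumerate(pair):
            schedule.append({"replication": rep + 1, "slot": slot + 1,
                             "system": system, "team": team})
    return schedule

schedule = interspersed_schedule(n_replications=4)
```

With an even number of replications, each team operates each system equally often, so a difference between teams does not line up with the difference between systems. With an odd number the balance is only approximate.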
We also point out the difficulty in identifying a test design that is
simultaneously "optimized" with respect to determining how various fac-
tors affect system performance for dozens of measures, and also confirming
performance either against a baseline system or against a set of require-
ments. For example, the current test design, constructed to compare IBCT/
Stryker with the baseline, is balanced for a limited number of factors. How-
ever, it does not provide as much information about the system's advan-
tages as other approaches could. In particular, the current design allocates
test samples to missions and environments in approximately the same
proportion as would be expected in field use. This precludes focusing test
samples on environments in which Stryker is designed to have advantages
over the baseline system, and it allocates numerous test samples to environ-
ments for which Stryker is anticipated to provide no benefits over the
baseline system. This reduces the opportunity to learn the size of the ben-
efit that Stryker provides in various environments, as well as the reasons
underlying its advantages. In support of such an approach, we present a
number of specific technical suggestions for test design, including making
use of test design in learning and confirming stages as well as small-scale
pilot tests. Staged testing, presented as an alternative to the current design,
would be particularly useful in coming to grips with the difficult problem
of understanding the contribution of situation awareness to system perfor-
mance. For example, it would be informative to run pilot tests with the
Stryker situation awareness capabilities intentionally degraded or turned
off, to determine the value they provide in particular missions or scenarios.
We make technical suggestions in several areas, including statistical
power calculations, identifying the appropriate test unit of analysis, com-
bining SME ratings, aggregation, and graphical methods.
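To make the reference to statistical power calculations concrete, here is a minimal sketch using a normal approximation. It is an assumption-laden illustration rather than the calculation ATEC performed: it assumes a two-sided comparison of two systems with equal numbers of trials, a known common standard deviation, and hypothetical effect sizes.

```python
from statistics import NormalDist

def two_sample_power(effect, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect    : true difference in mean scores between the two systems
    sigma     : trial-to-trial standard deviation (assumed equal and known)
    n_per_arm : number of trials run with each system
    """
    z = NormalDist()
    se = sigma * (2 / n_per_arm) ** 0.5       # std. error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)         # two-sided critical value
    shift = effect / se
    # Probability of rejecting in either tail when the true difference is `effect`.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical example: a true difference of one standard deviation.
power_by_n = {n: two_sample_power(effect=1.0, sigma=1.0, n_per_arm=n)
              for n in (6, 12, 24, 48)}
```

Under these assumptions, power rises with the number of trials per system, and a handful of replications leaves a substantial chance of missing even a difference as large as one standard deviation.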
SYSTEM EVALUATION AND IMPROVEMENT
More generally, we examined the implications of this particular IOT
for future tests of similar systems, particularly those that operationally in-
teract so strongly with a novel force concept. Since the size of the opera-
tional test (i.e., number of test replications) for this complex system (or
systems of systems) will be inadequate to support hypothesis tests leading
to a decision on whether Stryker should be passed to full-rate production,
ATEC should augment this decision with other techniques. At the very
least, estimates and associated measures of precision (e.g., confidence inter-
vals) should be reported for various MOEs. In addition, the reporting and
use of numerical and graphical assessments, based on data from other tests
and trials, should be explored. In general, complex systems should not be
forwarded to operational testing, absent strategic considerations, until the
system design is relatively mature. Forwarding an immature system to op-
erational test is an expensive way to discover errors that could have been
detected in developmental testing, and it reduces the ability of an opera-
tional test to carry out its proper function.
As pointed out in the panel's letter report (Appendix A), it is extremely
important, when testing complex systems, to prepare a straw man test evalu-
ation report (TER), as if the IOT had been completed. It should include
examples of how the representative data will be analyzed, specific presenta-
tion formats (including graphs) with expected results, insights to develop
from the data, draft recommendations, and so on. The content of this
straw man report should be based on the experience and intuition of the
analysts and what they think the results of the IOT might look like. To do
this and to ensure the validity and persuasiveness of evaluations drawn from
the testing, ATEC needs a cadre of statistically trained personnel with "own-
ership" of the design and the subsequent test and evaluation. Thus, the
Department of Defense in general and ATEC in particular should give a
high priority to developing a contractual relationship with leading practi-
tioners in the fields of reliability estimation, experimental design, and data
analysis to help them with future IOTs.
In summary, the panel has a substantial concern about confounding in
the current test design for the IBCT/Stryker IOT that needs to be ad-
dressed. If the confounding issues were reduced or eliminated, the remain-
der of the test design, aside from the power calculations, has been compe-
tently developed from a statistical point of view. Furthermore, this report
provides a number of evaluations and resulting conclusions and recom-
mendations for improvement of the design, the selection and validation of
MOEs, the evaluation process, and the conduct of future tests of highly
complex systems. We attach greater priority to several of these recommen-
dations and therefore highlight them here, organized by chapters to assist
those interested in locating the supporting arguments.
RECOMMENDATIONS
Chapter 3
· Different MOEs should not be rolled up into a single overall num-
ber that tries to capture effectiveness or suitability.
· To help in the calibration of SMEs, each should be asked to review
his or her own assessment of the Stryker IOT missions, for each scenario,
immediately before he or she assesses the baseline missions (or vice versa).
· ATEC should review the opportunities and possibilities for SMEs
to contribute to the collection of objective data, such as times to complete
certain subtasks, distances at critical times, etc.
· ATEC should consider using two separate SME rating scales: one
for "failures" and another for "successes."
· FER (and the LER when appropriate), but not the RLR, should be
used as the primary mission-level MOE for analyses of engagement results.
· ATEC should use fratricide frequency and civilian casualty fre-
quency to measure the amount of fratricide and collateral damage in a
mission.
· Scenario-specific MOPs should be added for SOSE missions.
· Situation awareness should be introduced as an explicit test
condition.
· RAM data collection should be an ongoing enterprise. Failure and
maintenance information should be tracked on a vehicle or part/system
basis for the entire life of the vehicle or part/system. Appropriate databases
should be set up. This was probably not done with those Stryker vehicles
already in existence, but it could be implemented for future maintenance
actions on all Stryker vehicles.
· With respect to the difficulty of reaching a decision regarding reli-
ability, given limited miles and absence of add-on armor, weight packs
should be used to provide information about the impact of additional
weight on reliability.
· Failure modes should be considered separately rather than trying to
develop failure rates for the entire vehicle using simple exponential mod-
els. The data reporting requirements vary depending on the failure rate
function.
Chapter 4
· Given either a learning or a confirmatory objective, ignoring various
tactical considerations, a requisite for operational testing is that it should not
commence until the system design is mature.
· ATEC should consider, for future test designs, relaxing various rules
of test design that it adheres to, by (a) not allocating sample size to sce-
narios according to the OMS/MP, but instead using principles from opti-
mal experimental design theory to allocate sample size to scenarios, (b)
testing under somewhat more extreme conditions than typically will be
faced in the field, (c) using information from developmental testing to
improve test design, and (d) separating the operational test into at least two
stages, learning and confirmatory.
· ATEC should consider applying to future operational testing in
general a two-phase test design that involves, first, learning phase studies
that examine the test object under different conditions, thereby helping
testers design further tests to elucidate areas of greatest uncertainty and
importance, and, second, a phase involving confirmatory tests to address
hypotheses concerning performance vis-a-vis a baseline system or in com-
parison with requirements. ATEC should consider taking advantage of
this approach for the IBCT/Stryker IOT. That is, examining in the first
phase IBCT/Stryker under different conditions, to assess when this system
works best, and why, and conducting a second phase to compare IBCT/
Stryker to a baseline, using this confirmation experiment to support the
decision to proceed to full-rate production. An important feature of the
learning phase is to test with factors at high stress levels in order to develop
a complete understanding of the system's capabilities and limitations.
· When specific performance or capability problems come up in the
early part of operational testing, small-scale pilot tests, focused on the analy-
sis of these problems, should be seriously considered. For example, ATEC
should consider test conditions that involve using Stryker with situation
awareness degraded or turned off to determine the value that it provides in
particular missions.
· ATEC should eliminate from the IBCT/Stryker IOT one signifi-
cant potential source of confounding, seasonal variation, in accordance with
the recommendation provided earlier in the October 2002 letter report
from the panel to ATEC (see Appendix A). In addition, ATEC should also
seriously consider ways to reduce or eliminate possible confounding from
player learning, and day/night imbalance.
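The effect of how trials are allocated across scenarios, raised in the Chapter 4 recommendations above, can be illustrated with a small calculation. This sketch does not implement optimal experimental design. It only compares the precision of a within-scenario comparison under two allocations, and it assumes a hypothetical scenario mix, a hypothetical emphasis on urban terrain, and a common trial-to-trial standard deviation.

```python
def allocate(total_trials, weights):
    """Split total_trials across scenarios in proportion to weights, using
    largest-remainder rounding so the counts sum to total_trials."""
    total_weight = sum(weights.values())
    raw = {k: total_trials * w / total_weight for k, w in weights.items()}
    counts = {k: int(v) for k, v in raw.items()}
    leftover = total_trials - sum(counts.values())
    # Hand out any remaining trials to the scenarios with the largest remainders.
    for k in sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

def se_of_difference(sigma, n_per_system):
    """Standard error of a difference in mean scores between two systems, each
    run n_per_system times with common trial-to-trial std. dev. sigma."""
    return sigma * (2 / n_per_system) ** 0.5

# Hypothetical weights: expected field-use mix versus an emphasis on urban terrain.
usage_mix = {"rural": 6, "mixed": 3, "urban": 1}
focused = {"rural": 1, "mixed": 1, "urban": 2}

proportional = allocate(20, usage_mix)   # trials per system in each scenario
emphasized = allocate(20, focused)

se_urban_proportional = se_of_difference(1.0, proportional["urban"])
se_urban_emphasized = se_of_difference(1.0, emphasized["urban"])
```

With these made-up numbers, the proportional allocation leaves only two trials per system in the urban scenario, so the estimated difference there is far less precise than under the allocation that concentrates trials where a difference is expected.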
Chapter 5
· The IOT provides little vehicle operating data and thus may not be
sufficient to address all of the reliability and maintainability concerns of
ATEC. This highlights the need for improved data collection regarding
vehicle usage. In particular, data should be maintained for each vehicle
over that vehicle's entire life, including training, testing, and ultimately
field use; data should also be gathered separately for different failure modes.
· The panel reaffirms the recommendation of the 1998 NRC panel
that more use should be made of estimates and associated measures of pre-
cision (or confidence intervals) in addition to significance tests, because the
former enable the judging of the practical significance of observed effects.
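As one way to report an estimate together with a measure of precision, the sketch below computes a percentile bootstrap interval for a mean difference in an MOE. It is illustrative only: the data are hypothetical paired differences (new system minus baseline, one per matched scenario), the bootstrap is a swapped-in technique rather than a method named in the report, and percentile intervals can be too narrow when the number of trials is this small.

```python
import random
from statistics import mean

def bootstrap_ci(differences, level=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(differences)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(differences, k=n)) for _ in range(n_resamples))
    lo_index = round((1 - level) / 2 * n_resamples)
    hi_index = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_index], means[hi_index]

# Hypothetical per-scenario differences in an MOE (new system minus baseline).
differences = [0.8, -0.2, 1.1, 0.4, 0.9, -0.5, 1.3, 0.6]
estimate = mean(differences)
low, high = bootstrap_ci(differences)
```

Reporting the estimate with the interval (low, high) lets a reader judge whether the plausible range of the difference is large enough to matter in practice, which a significance test alone does not convey.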
Chapter 6
· Operational tests should not be strongly geared toward estimation
of system suitability, since they cannot be expected to run long enough to
estimate fatigue life, estimate repair and replacement times, identify failure
modes, etc. Therefore, developmental testing should give greater priority
to measurement of system (operational) suitability and should be struc-
tured to provide its test events with greater operational realism.
· In general, complex systems should not be forwarded to operational
testing, in the absence of strategic considerations, until the system design is
relatively mature. Forwarding an immature system to operational test is an
expensive way to discover errors that could have been detected in develop-
mental testing, and it reduces the ability of an operational test to carry out
its proper function. System maturation should be expedited through previ-
ous testing that incorporates various aspects of operational realism in addi-
tion to the usual developmental testing.
· Because it is not yet clear that the test design and the subsequent
test analysis have been linked, ATEC should prepare a straw man test evalu-
ation report in advance of test design, as recommended in the panel's Octo-
ber 2002 letter to ATEC (see Appendix A).
· The goals of the initial operational test need to be more clearly
specified. Two important types of goals for operational test are learning
about system performance and confirming system performance in com-
parison to requirements and in comparison to the performance of baseline
systems. These two different types of goals argue for different stages of
operational test. Furthermore, to improve test designs that address these
different types of goals, information from previous stages of system devel-
opment needs to be utilized.
Finally, we wish to make clear that the panel was constituted to address
the statistical questions raised by the selection of measures of performance
and measures of effectiveness, and the selection of an experimental design,
given the need to evaluate Stryker and the IBCT in scenarios identified in
the OMS/MP. A number of other important issues (about which the panel
provides some commentary) lie outside the panel's charge and expertise.
These include an assessment of (a) the selection of the baseline system to
compare with Stryker, (b) the problems raised by the simultaneous evalua-
tion of the Stryker vehicle and the IBCT system that incorporates it, (c)
whether the operational test can definitively answer specific tactical ques-
tions, such as the degree to which the increased vulnerability of Stryker is
offset by the availability of greater situational awareness, and (d) whether or
not scenarios to be acted out by OPFOR represent a legitimate test suite.
Let us elaborate each of these ancillary but important issues.
The first is whether the current choice of a baseline system (or multiple
baselines) is best from a military point of view, including whether a baseline
system could have been tested taking advantage of the IBCT infrastructure,
to help understand the value of Stryker without the IBCT system. It does
not seem to be necessary to require that only a system that could be trans-
ported as quickly as Stryker could serve as a baseline for comparison.
The second issue (related to the first) is the extent to which the current
test provides information not only about comparison of the IBCT/Stryker
system with a baseline system, but also about comparison of the Stryker
suite of vehicles with those used in the baseline. For example, how much
more or less maneuverable is Stryker in rural versus urban terrain and what
impact does that have on its utility in those environments? These questions
require considerable military expertise to address.
The third issue is whether the current operational test design can pro-
vide adequate information on how to tactically employ the IBCT/Stryker
system. For example, how should the greater situational awareness be taken
advantage of, and how should the greater situational awareness be balanced
against greater vulnerability for various types of environments and against
various threats? Clearly, this issue is not fundamentally a technical statisti-
cal one, but is rather an essential feature of scenario design that the panel
was not constituted to evaluate.
The final issue (related to the third) is whether the various missions,
types of terrain, and intensity of conflict are the correct choices for opera-
tional testing to support the decision on whether to pass Stryker to full-rate
production. One can imagine other missions, types of terrain, intensities,
and other factors that are not varied in the current test design that might
have an impact on the performance of Stryker, the baseline system, or both.
These factors include temperature, precipitation, the density of buildings,
the height of buildings, types of roads, etc. Moreover, there are the serious
problems raised by the unavailability of add-on armor for the early stages of
the operational test. The panel has been obligated to take the OMS/MP as
given, but it is not clear whether additional factors that might have an
important impact on performance should have been included as test fac-
tors. All of these issues are raised here in order to emphasize their impor-
tance and worthiness for consideration by other groups better constituted
to address them.
Thus, the panel wishes to make very clear that this assessment of the
operational test as currently designed reflects only its statistical merits. It is
certainly possible that the IBCT/Stryker operational test may be deficient
in other respects, some of them listed above, that may subordinate the
statistical aspects of the test. Even if the statistical issues addressed in this
report were to be mitigated, we cannot determine whether the resulting
operational test design would be fully informative as to whether Stryker
should be promoted to full-rate production.