Improved Operational Testing and Evaluation: Better Measurement and Test Design for the Interim Brigade Combat Team with Stryker Vehicles: Phase I Report


Suggested Citation:"Executive Summary." National Research Council. 2003. Improved Operational Testing and Evaluation: Better Measurement and Test Design for the Interim Brigade Combat Team with Stryker Vehicles: Phase I Report. Washington, DC: The National Academies Press. doi: 10.17226/10710.



Executive Summary

This report provides an assessment of the U.S. Army's planned initial operational test (IOT) of the Stryker family of vehicles. Stryker is intended to provide mobility and "situation awareness" for the Interim Brigade Combat Team (IBCT). For this reason, the Army Test and Evaluation Command (ATEC) has been asked to take on the unusual responsibility of testing both the vehicle and the IBCT concept.

Building on the recommendations of an earlier National Research Council study and report (National Research Council, 1998a), the Panel on Operational Test Design and Evaluation of the Interim Armored Vehicle considers the Stryker IOT an excellent opportunity to examine how the defense community might effectively use test resources and analyze test data. The panel's judgments are based on information gathered during a series of open forums and meetings involving ATEC personnel and experts in the test and evaluation of systems. Perhaps equally important, in our view the assessment process itself has had a salutary influence on the IOT design for the IBCT/Stryker system.

We focus in this report on two aspects of the operational test design and evaluation of the Stryker: (1) the measures of performance and effectiveness used to compare the IBCT equipped with the Stryker against the baseline force, the Light Infantry Brigade (LIB), and (2) whether the current operational test design is consistent with state-of-the-art methods. Our next report will discuss combining information obtained from the IOT with other tests, engineering judgment, experience, and the like. The panel's final report will encompass both earlier reports and any additional developments.

OVERALL TEST PLANNING

Two specific purposes of the IOT are to determine whether the IBCT/Stryker performs more effectively than the baseline force, and whether the Stryker family of vehicles meets its capability and performance requirements. Our primary recommendation is to supplement these purposes: when evaluating a large, complex, and critical weapon system such as the Stryker, operational tests should be designed, carried out, and evaluated with a view toward improving the capabilities and performance of the system.

MEASURES OF EFFECTIVENESS

We begin by considering the definition and analysis of measures of effectiveness (MOEs). In particular, we address problems associated with rolling up disparate MOEs into a single overall number, the use of untested or ad hoc force ratio measures, and the requirements for calibration and scaling of subjective evaluations made by subject-matter experts (SMEs). We also identify a need to develop scenario-specific MOEs for noncombat missions, and we suggest some possible candidates for these.

Studying the question of whether a single measure for the "value" of situation awareness can be devised, we reached the tentative conclusion that there is no single appropriate MOE for this multidimensional capability. Modeling and simulation tools can be used to this end by augmenting test data during the evaluation. These tools should also be used, however, to develop a better understanding of the capabilities and limitations of the system in general and the value of situation awareness in particular.
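The concern about rolling up disparate MOEs into a single number can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three MOE names, the normalized scores, and the two weighting schemes are hypothetical and are not taken from the report or from any test data. The sketch only shows that the force that "wins" on a rolled-up score can change when the analyst's weights change.

```python
# Hypothetical, normalized MOE scores (0 = worst, 1 = best); not test data.
SCORES = {
    "IBCT/Stryker": {"mission_success": 0.80, "survivability": 0.50, "timeliness": 0.90},
    "baseline": {"mission_success": 0.70, "survivability": 0.75, "timeliness": 0.60},
}


def rolled_up(scores, weights):
    """Weighted average of MOE scores; the weights are an analyst's assumption."""
    total = sum(weights.values())
    return sum(scores[moe] * w for moe, w in weights.items()) / total


def preferred(weights):
    """Name of the force with the larger rolled-up score under these weights."""
    return max(SCORES, key=lambda force: rolled_up(SCORES[force], weights))


# Two equally defensible (and equally arbitrary) weighting schemes.
equal_weights = {"mission_success": 1, "survivability": 1, "timeliness": 1}
survivability_heavy = {"mission_success": 1, "survivability": 3, "timeliness": 1}

if __name__ == "__main__":
    for label, weights in [("equal", equal_weights),
                           ("survivability-heavy", survivability_heavy)]:
        print(label, "->", preferred(weights))
```

Because the ranking depends on weights that no test can validate, reporting each MOE separately keeps the trade-offs visible to the decision maker.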
With respect to determining critical measures of reliability and maintainability (RAM), we observe that the IOT will provide a relatively small amount of vehicle operating data (compared with that obtained in training exercises and developmental testing) and thus may not be sufficient to address all of the reliability and maintainability concerns of ATEC. This lack of useful RAM information will be exacerbated by the fact that the IOT is to be performed without using add-on armor.

For this reason, we stress that RAM data collection should be an ongoing enterprise, with failure times, failure modes, and maintenance information tracked for the entire life of the vehicle (and its parts), including data from developmental testing and training, and recorded in appropriate databases. Failure modes should be considered separately, rather than assigning a single failure rate for a vehicle using simple exponential models.

EXPERIMENTAL DESIGN

With respect to the experimental design itself, we are very concerned that observed differences will be confounded by important sources of uncontrolled variation. In particular, as pointed out in the panel's letter report (Appendix A), the current test design calls for the IBCT/Stryker trials to be run at a different time from the baseline trials. This design may confound time of year with the primary measure of interest: the difference in effectiveness between the baseline force and the IBCT/Stryker force. We therefore recommend that these events be scheduled as closely together in time as possible, and interspersed if feasible. Also, additional potential sources of confounding, including player learning and nighttime versus daytime operations, should be addressed with alternative designs. One alternative to address confounding due to player learning is to use four separate groups of players, one for each of the two opposing forces (OPFORs), one for the IBCT/Stryker, and one for the baseline system. Intergroup variability appears likely to be a lesser problem than player learning. Also, alternating teams from test replication to test replication between the two systems under test would be a reasonable way to address differences in learning, training, fatigue, and competence.
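The ideas of interspersing the two sets of trials and alternating teams between systems can be sketched as a simple schedule generator. This is a minimal illustration under stated assumptions, not the panel's design: the function name `interleaved_schedule`, the two team labels, and the choice of strict alternation (rather than randomization) are all hypothetical, and the OPFOR groups are left out.

```python
from collections import Counter


def interleaved_schedule(n_replications,
                         systems=("IBCT/Stryker", "baseline"),
                         teams=("Team A", "Team B")):
    """Return a list of (replication, slot, system, team) trials.

    Each replication runs both systems back to back, so the two systems are
    tested close together in time. The order of the systems and the team
    assigned to each system both alternate from one replication to the next.
    """
    schedule = []
    for rep in range(n_replications):
        flip = rep % 2  # 0 on even replications, 1 on odd ones
        order = systems if flip == 0 else systems[::-1]
        for slot, system in enumerate(order):
            # Swap team assignments every replication so that each team
            # plays each system equally often over an even number of runs.
            team = teams[(systems.index(system) + flip) % 2]
            schedule.append((rep + 1, slot + 1, system, team))
    return schedule


if __name__ == "__main__":
    plan = interleaved_schedule(4)
    for trial in plan:
        print(trial)
    # How often each team is paired with each system.
    print(Counter((system, team) for _, _, system, team in plan))
```

With an even number of replications, each team operates each system the same number of times and each system goes first equally often, so team competence and within-replication order are balanced rather than confounded with the system under test.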
We also point out the difficulty in identifying a test design that is simultaneously "optimized" with respect to determining how various factors affect system performance for dozens of measures, and also confirming performance either against a baseline system or against a set of requirements. For example, the current test design, constructed to compare IBCT/Stryker with the baseline, is balanced for a limited number of factors. However, it does not provide as much information about the system's advantages as other approaches could. In particular, the current design allocates test samples to missions and environments in approximately the same proportion as would be expected in field use. This precludes focusing test samples on environments in which Stryker is designed to have advantages over the baseline system, and it allocates numerous test samples to environments for which Stryker is anticipated to provide no benefits over the baseline system. This reduces the opportunity to learn the size of the benefit that Stryker provides in various environments, as well as the reasons underlying its advantages. In support of such an approach, we present a number of specific technical suggestions for test design, including making use of test design in learning and confirming stages as well as small-scale pilot tests. Staged testing, presented as an alternative to the current design, would be particularly useful in coming to grips with the difficult problem of understanding the contribution of situation awareness to system performance. For example, it would be informative to run pilot tests with the Stryker situation awareness capabilities intentionally degraded or turned off, to determine the value they provide in particular missions or scenarios.

We make technical suggestions in several areas, including statistical power calculations, identifying the appropriate test unit of analysis, combining SME ratings, aggregation, and graphical methods.

SYSTEM EVALUATION AND IMPROVEMENT

More generally, we examined the implications of this particular IOT for future tests of similar systems, particularly those that operationally interact so strongly with a novel force concept. Since the size of the operational test (i.e., number of test replications) for this complex system (or systems of systems) will be inadequate to support hypothesis tests leading to a decision on whether Stryker should be passed to full-rate production, ATEC should augment this decision with other techniques. At the very least, estimates and associated measures of precision (e.g., confidence intervals) should be reported for various MOEs. In addition, the reporting and use of numerical and graphical assessments, based on data from other tests and trials, should be explored.
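The point that a small number of test replications cannot support a decisive hypothesis test can be made concrete with a rough power calculation. The sketch below is an assumption-laden illustration rather than the panel's method: it uses a normal approximation for a two-sided, two-sample comparison, the function name `approx_power` is hypothetical, and the effect size and replication counts are made up. For very small samples a t-based calculation would give somewhat lower power still.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true mean difference divided by the common standard
    deviation; n_per_arm is the number of replications for each force.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = effect_size / sqrt(2 / n_per_arm)
    # Probability that the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical moderate effect (half a standard deviation).
    for n in (6, 12, 24, 64):
        print(f"n per arm = {n:3d}  power = {approx_power(0.5, n):.2f}")
```

Under these assumptions, a handful of replications per force gives little chance of detecting a moderate difference, which is why estimates with confidence intervals are more informative than a bare pass/fail significance test.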
In general, complex systems should not be forwarded to operational testing, absent strategic considerations, until the system design is relatively mature. Forwarding an immature system to operational test is an expensive way to discover errors that could have been detected in developmental testing, and it reduces the ability of an operational test to carry out its proper function.

As pointed out in the panel's letter report (Appendix A), it is extremely important, when testing complex systems, to prepare a straw man test evaluation report (TER), as if the IOT had been completed. It should include examples of how the representative data will be analyzed, specific presentation formats (including graphs) with expected results, insights to develop from the data, draft recommendations, and so on. The content of this straw man report should be based on the experience and intuition of the analysts and what they think the results of the IOT might look like. To do this and to ensure the validity and persuasiveness of evaluations drawn from the testing, ATEC needs a cadre of statistically trained personnel with "ownership" of the design and the subsequent test and evaluation. Thus, the Department of Defense in general and ATEC in particular should give a high priority to developing a contractual relationship with leading practitioners in the fields of reliability estimation, experimental design, and data analysis to help them with future IOTs.

In summary, the panel has a substantial concern about confounding in the current test design for the IBCT/Stryker IOT that needs to be addressed. If the confounding issues were reduced or eliminated, the remainder of the test design, aside from the power calculations, has been competently developed from a statistical point of view. Furthermore, this report provides a number of evaluations and resulting conclusions and recommendations for improvement of the design, the selection and validation of MOEs, the evaluation process, and the conduct of future tests of highly complex systems. We attach greater priority to several of these recommendations and therefore highlight them here, organized by chapters to assist those interested in locating the supporting arguments.

RECOMMENDATIONS

Chapter 3

· Different MOEs should not be rolled up into a single overall number that tries to capture effectiveness or suitability.
· To help in the calibration of SMEs, each should be asked to review his or her own assessment of the Stryker IOT missions, for each scenario, immediately before he or she assesses the baseline missions (or vice versa).
· ATEC should review the opportunities and possibilities for SMEs to contribute to the collection of objective data, such as times to complete certain subtasks, distances at critical times, etc.
· ATEC should consider using two separate SME rating scales: one for "failures" and another for "successes."
· FER (and the LER when appropriate), but not the RLR, should be used as the primary mission-level MOE for analyses of engagement results.
· ATEC should use fratricide frequency and civilian casualty frequency to measure the amount of fratricide and collateral damage in a mission.

· Scenario-specific MOPs should be added for SOSE missions.
· Situation awareness should be introduced as an explicit test condition.
· RAM data collection should be an ongoing enterprise. Failure and maintenance information should be tracked on a vehicle or part/system basis for the entire life of the vehicle or part/system. Appropriate databases should be set up. This was probably not done with those Stryker vehicles already in existence, but it could be implemented for future maintenance actions on all Stryker vehicles.
· With respect to the difficulty of reaching a decision regarding reliability, given limited miles and absence of add-on armor, weight packs should be used to provide information about the impact of additional weight on reliability.
· Failure modes should be considered separately rather than trying to develop failure rates for the entire vehicle using simple exponential models. The data reporting requirements vary depending on the failure rate function.

Chapter 4

· Given either a learning or a confirmatory objective, ignoring various tactical considerations, a requisite for operational testing is that it should not commence until the system design is mature.
· ATEC should consider, for future test designs, relaxing various rules of test design that it adheres to, by (a) not allocating sample size to scenarios according to the OMS/MP, but instead using principles from optimal experimental design theory to allocate sample size to scenarios, (b) testing under somewhat more extreme conditions than typically will be faced in the field, (c) using information from developmental testing to improve test design, and (d) separating the operational test into at least two stages, learning and confirmatory.
· ATEC should consider applying to future operational testing in general a two-phase test design that involves, first, learning phase studies that examine the test object under different conditions, thereby helping testers design further tests to elucidate areas of greatest uncertainty and importance, and, second, a phase involving confirmatory tests to address hypotheses concerning performance vis-a-vis a baseline system or in comparison with requirements. ATEC should consider taking advantage of this approach for the IBCT/Stryker IOT. That is, ATEC should examine IBCT/Stryker under different conditions in the first phase, to assess when this system works best, and why, and then conduct a second phase to compare IBCT/Stryker to a baseline, using this confirmation experiment to support the decision to proceed to full-rate production. An important feature of the learning phase is to test with factors at high stress levels in order to develop a complete understanding of the system's capabilities and limitations.
· When specific performance or capability problems come up in the early part of operational testing, small-scale pilot tests, focused on the analysis of these problems, should be seriously considered. For example, ATEC should consider test conditions that involve using Stryker with situation awareness degraded or turned off to determine the value that it provides in particular missions.
· ATEC should eliminate from the IBCT/Stryker IOT one significant potential source of confounding, seasonal variation, in accordance with the recommendation provided earlier in the October 2002 letter report from the panel to ATEC (see Appendix A). In addition, ATEC should also seriously consider ways to reduce or eliminate possible confounding from player learning, and day/night imbalance.

Chapter 5

· The IOT provides little vehicle operating data and thus may not be sufficient to address all of the reliability and maintainability concerns of ATEC. This highlights the need for improved data collection regarding vehicle usage. In particular, data should be maintained for each vehicle over that vehicle's entire life, including training, testing, and ultimately field use; data should also be gathered separately for different failure modes.
· The panel reaffirms the recommendation of the 1998 NRC panel that more use should be made of estimates and associated measures of precision (or confidence intervals) in addition to significance tests, because the former enable the judging of the practical significance of observed effects.
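The recommendation to report estimates with measures of precision can be sketched as follows. This is a minimal illustration under stated assumptions: the data are invented MOE scores, the function name `diff_with_ci` is hypothetical, and the interval is the textbook pooled-variance two-sample t interval, which assumes roughly normal data with equal variances. The small lookup table of 95% t critical values covers only the degrees of freedom listed.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided 95% critical values of Student's t for selected degrees of freedom.
T_975 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365,
         8: 2.306, 9: 2.262, 10: 2.228, 12: 2.179, 14: 2.145, 16: 2.120,
         18: 2.101, 20: 2.086}


def diff_with_ci(treatment, control):
    """Point estimate and 95% interval for mean(treatment) - mean(control).

    Raises KeyError if the pooled degrees of freedom are not in T_975.
    """
    n1, n2 = len(treatment), len(control)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    estimate = mean(treatment) - mean(control)
    half_width = T_975[df] * std_err
    return estimate, estimate - half_width, estimate + half_width


# Hypothetical MOE scores from six replications per force (not real data).
stryker = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74]
baseline = [0.58, 0.60, 0.49, 0.72, 0.63, 0.57]

if __name__ == "__main__":
    est, low, high = diff_with_ci(stryker, baseline)
    print(f"difference = {est:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

For these made-up numbers the interval spans zero, so a significance test alone would report "no significant difference," while the interval also shows that a practically important advantage has not been ruled out.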
Chapter 6

· Operational tests should not be strongly geared toward estimation of system suitability, since they cannot be expected to run long enough to estimate fatigue life, estimate repair and replacement times, identify failure modes, etc. Therefore, developmental testing should give greater priority to measurement of system (operational) suitability and should be structured to provide its test events with greater operational realism.
· In general, complex systems should not be forwarded to operational testing, in the absence of strategic considerations, until the system design is relatively mature. Forwarding an immature system to operational test is an expensive way to discover errors that could have been detected in developmental testing, and it reduces the ability of an operational test to carry out its proper function. System maturation should be expedited through previous testing that incorporates various aspects of operational realism in addition to the usual developmental testing.
· Because it is not yet clear that the test design and the subsequent test analysis have been linked, ATEC should prepare a straw man test evaluation report in advance of test design, as recommended in the panel's October 2002 letter to ATEC (see Appendix A).
· The goals of the initial operational test need to be more clearly specified. Two important types of goals for operational test are learning about system performance and confirming system performance in comparison to requirements and in comparison to the performance of baseline systems. These two different types of goals argue for different stages of operational test. Furthermore, to improve test designs that address these different types of goals, information from previous stages of system development needs to be utilized.

Finally, we wish to make clear that the panel was constituted to address the statistical questions raised by the selection of measures of performance and measures of effectiveness, and the selection of an experimental design, given the need to evaluate Stryker and the IBCT in scenarios identified in the OMS/MP. A number of other important issues (about which the panel provides some commentary) lie outside the panel's charge and expertise.
These include an assessment of (a) the selection of the baseline system to compare with Stryker, (b) the problems raised by the simultaneous evaluation of the Stryker vehicle and the IBCT system that incorporates it, (c) whether the operational test can definitively answer specific tactical questions, such as the degree to which the increased vulnerability of Stryker is offset by the availability of greater situational awareness, and (d) whether or not scenarios to be acted out by OPFOR represent a legitimate test suite.

Let us elaborate each of these ancillary but important issues. The first is whether the current choice of a baseline system (or multiple baselines) is best from a military point of view, including whether a baseline system could have been tested taking advantage of the IBCT infrastructure, to help understand the value of Stryker without the IBCT system. It does not seem to be necessary to require that only a system that could be transported as quickly as Stryker could serve as a baseline for comparison.

The second issue (related to the first) is the extent to which the current test provides information not only about comparison of the IBCT/Stryker system with a baseline system, but also about comparison of the Stryker suite of vehicles with those used in the baseline. For example, how much more or less maneuverable is Stryker in rural versus urban terrain and what impact does that have on its utility in those environments? These questions require considerable military expertise to address.

The third issue is whether the current operational test design can provide adequate information on how to tactically employ the IBCT/Stryker system. For example, how should the greater situational awareness be taken advantage of, and how should the greater situational awareness be balanced against greater vulnerability for various types of environments and against various threats? Clearly, this issue is not fundamentally a technical statistical one, but is rather an essential feature of scenario design that the panel was not constituted to evaluate.

The final issue (related to the third) is whether the various missions, types of terrain, and intensity of conflict are the correct choices for operational testing to support the decision on whether to pass Stryker to full-rate production. One can imagine other missions, types of terrain, intensities, and other factors that are not varied in the current test design that might have an impact on the performance of Stryker, the baseline system, or both. These factors include temperature, precipitation, the density of buildings, the height of buildings, types of roads, etc. Moreover, there are the serious problems raised by the unavailability of add-on armor for the early stages of the operational test. The panel has been obligated to take the OMS/MP as given, but it is not clear whether additional factors that might have an important impact on performance should have been included as test factors.
All of these issues are raised here in order to emphasize their importance and worthiness for consideration by other groups better constituted to address them.

Thus, the panel wishes to make very clear that this assessment of the operational test as currently designed reflects only its statistical merits. It is certainly possible that the IBCT/Stryker operational test may be deficient in other respects, some of them listed above, that may subordinate the statistical aspects of the test. Even if the statistical issues addressed in this report were to be mitigated, we cannot determine whether the resulting operational test design would be fully informative as to whether Stryker should be promoted to full-rate production.
