Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 103
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 104
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 105
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 106
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 107
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 108
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 109
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 110
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 111
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 112
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 113
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 114
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 115
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 116
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 117
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 118
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 119
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 120
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 121
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 122
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 123
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 124
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 125
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 126
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 127
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 128
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 129
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 130
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 131
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 132
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 133
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 134
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 135
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 136
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 137
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 138
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 139
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 140
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 141
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 142
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 143
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 144
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 145
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 146
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 147
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 148
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 149
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 150
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 151
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 152
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 153
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 154
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 155
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 156
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 157
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 158
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 159
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 160
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 161
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 162
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 163
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 164
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 165
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 166
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 167
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 168
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 169
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 170
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 171
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 172
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 173
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 174
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 175
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 176
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 177
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 178
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 179
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.
×
Page 180
Suggested Citation:"Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.


Phase I Report: Operational Test Design and Evaluation of the Interim Armored Vehicle

Executive Summary

This report provides an assessment of the U.S. Army's planned initial operational test (IOT) of the Stryker family of vehicles. Stryker is intended to provide mobility and "situation awareness" for the Interim Brigade Combat Team (IBCT). For this reason, the Army Test and Evaluation Command (ATEC) has been asked to take on the unusual responsibility of testing both the vehicle and the IBCT concept.

Building on the recommendations of an earlier National Research Council study and report (National Research Council, 1998a), the Panel on Operational Test Design and Evaluation of the Interim Armored Vehicle considers the Stryker IOT an excellent opportunity to examine how the defense community might effectively use test resources and analyze test data. The panel's judgments are based on information gathered during a series of open forums and meetings involving ATEC personnel and experts in the test and evaluation of systems. Perhaps equally important, in our view the assessment process itself has had a salutary influence on the IOT design for the IBCT/Stryker system.

We focus in this report on two aspects of the operational test design and evaluation of the Stryker: (1) the measures of performance and effectiveness used to compare the IBCT equipped with the Stryker against the baseline force, the Light Infantry Brigade (LIB), and (2) whether the current operational test design is consistent with state-of-the-art methods. Our next report will discuss combining information obtained from the

IOT with other tests, engineering judgment, experience, and the like. The panel's final report will encompass both earlier reports and any additional developments.

OVERALL TEST PLANNING

Two specific purposes of the IOT are to determine whether the IBCT/Stryker performs more effectively than the baseline force, and whether the Stryker family of vehicles meets its capability and performance requirements. Our primary recommendation is to supplement these purposes: when evaluating a large, complex, and critical weapon system such as the Stryker, operational tests should be designed, carried out, and evaluated with a view toward improving the capabilities and performance of the system.

MEASURES OF EFFECTIVENESS

We begin by considering the definition and analysis of measures of effectiveness (MOEs). In particular, we address problems associated with rolling up disparate MOEs into a single overall number, the use of untested or ad hoc force ratio measures, and the requirements for calibration and scaling of subjective evaluations made by subject-matter experts (SMEs). We also identify a need to develop scenario-specific MOEs for noncombat missions, and we suggest some possible candidates for these. Studying the question of whether a single measure for the "value" of situation awareness can be devised, we reached the tentative conclusion that there is no single appropriate MOE for this multidimensional capability. Modeling and simulation tools can be used to this end by augmenting test data during the evaluation. These tools should also be used, however, to develop a better understanding of the capabilities and limitations of the system in general and the value of situation awareness in particular.
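The hazard of rolling disparate MOEs into a single overall number can be illustrated with a small numerical sketch. The MOE names, weights, and scores below are hypothetical inventions for illustration, not values from the test plan: two systems can receive identical roll-up scores while having very different performance profiles.

```python
# Hypothetical illustration: a weighted roll-up can hide large
# differences between two systems' MOE profiles.
weights = {"mission_success": 0.5, "survivability": 0.3, "mobility": 0.2}

system_a = {"mission_success": 0.9, "survivability": 0.4, "mobility": 0.8}
system_b = {"mission_success": 0.7, "survivability": 0.8, "mobility": 0.7}

def rollup(scores, weights):
    """Weighted average across MOEs (the aggregation the panel cautions against)."""
    return sum(weights[k] * scores[k] for k in weights)

# Both systems roll up to the same number despite very different
# survivability scores, so the roll-up obscures the trade-off.
print(round(rollup(system_a, weights), 2))
print(round(rollup(system_b, weights), 2))
```

The roll-up gives 0.73 for both systems, even though system A is far weaker on survivability; reporting the MOEs separately preserves that distinction.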
With respect to determining critical measures of reliability and maintainability (RAM), we observe that the IOT will provide a relatively small amount of vehicle operating data (compared with that obtained in training exercises and developmental testing) and thus may not be sufficient to address all of the reliability and maintainability concerns of ATEC. This lack of useful RAM information will be exacerbated by the fact that the IOT is to be performed without using add-on armor.

For this reason, we stress that RAM data collection should be an ongoing enterprise, with failure times, failure modes, and maintenance information tracked for the entire life of the vehicle (and its parts), including data from developmental testing and training, and recorded in appropriate databases. Failure modes should be considered separately, rather than assigning a single failure rate for a vehicle using simple exponential models.

EXPERIMENTAL DESIGN

With respect to the experimental design itself, we are very concerned that observed differences will be confounded by important sources of uncontrolled variation. In particular, as pointed out in the panel's letter report (Appendix A), the current test design calls for the IBCT/Stryker trials to be run at a different time from the baseline trials. This design may confound time of year with the primary measure of interest: the difference in effectiveness between the baseline force and the IBCT/Stryker force. We therefore recommend that these events be scheduled as closely together in time as possible, and interspersed if feasible. Also, additional potential sources of confounding, including player learning and nighttime versus daytime operations, should be addressed with alternative designs. One alternative to address confounding due to player learning is to use four separate groups of players, one for each of the two opposing forces (OPFORs), one for the IBCT/Stryker, and one for the baseline system. Intergroup variability appears likely to be a lesser problem than player learning. Also, alternating teams from test replication to test replication between the two systems under test would be a reasonable way to address differences in learning, training, fatigue, and competence.
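The alternating-team idea can be sketched as a simple counterbalanced schedule. This is a minimal illustration under assumed conditions, not ATEC's actual design; the team labels and number of replications are hypothetical.

```python
import random

def counterbalanced_schedule(n_replications, seed=0):
    """Alternate two player teams between the two systems under test,
    randomizing which team starts on which system, so that learning,
    fatigue, and team skill are balanced across systems over time."""
    rng = random.Random(seed)
    teams = ["Team 1", "Team 2"]
    rng.shuffle(teams)  # randomize the starting assignment
    schedule = []
    for rep in range(n_replications):
        # Swap assignments every replication (ABBA-style counterbalancing).
        if rep % 2 == 0:
            pairing = {"IBCT/Stryker": teams[0], "Baseline": teams[1]}
        else:
            pairing = {"IBCT/Stryker": teams[1], "Baseline": teams[0]}
        schedule.append((rep + 1, pairing))
    return schedule

for rep, pairing in counterbalanced_schedule(4):
    print(rep, pairing)
```

Over an even number of replications, each team operates each system equally often, so any systematic team-to-team difference cancels out of the system comparison rather than confounding it.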
We also point out the difficulty in identifying a test design that is simultaneously "optimized" with respect to determining how various factors affect system performance for dozens of measures, and also confirming performance either against a baseline system or against a set of requirements. For example, the current test design, constructed to compare IBCT/Stryker with the baseline, is balanced for a limited number of factors. However, it does not provide as much information about the system's advantages as other approaches could. In particular, the current design allocates test samples to missions and environments in approximately the same proportion as would be expected in field use. This precludes focusing test samples on environments in which Stryker is designed to have advantages over the baseline system, and it allocates numerous test samples to environments for which Stryker is anticipated to provide no benefits over the

baseline system. This reduces the opportunity to learn the size of the benefit that Stryker provides in various environments, as well as the reasons underlying its advantages. In support of such an approach, we present a number of specific technical suggestions for test design, including making use of test design in learning and confirming stages as well as small-scale pilot tests. Staged testing, presented as an alternative to the current design, would be particularly useful in coming to grips with the difficult problem of understanding the contribution of situation awareness to system performance. For example, it would be informative to run pilot tests with the Stryker situation awareness capabilities intentionally degraded or turned off, to determine the value they provide in particular missions or scenarios. We make technical suggestions in several areas, including statistical power calculations, identifying the appropriate test unit of analysis, combining SME ratings, aggregation, and graphical methods.

SYSTEM EVALUATION AND IMPROVEMENT

More generally, we examined the implications of this particular IOT for future tests of similar systems, particularly those that operationally interact so strongly with a novel force concept. Since the size of the operational test (i.e., number of test replications) for this complex system (or systems of systems) will be inadequate to support hypothesis tests leading to a decision on whether Stryker should be passed to full-rate production, ATEC should augment this decision with other techniques. At the very least, estimates and associated measures of precision (e.g., confidence intervals) should be reported for various MOEs. In addition, the reporting and use of numerical and graphical assessments, based on data from other tests and trials, should be explored.
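The recommendation to report estimates with measures of precision can be sketched as follows. The mission counts and success totals below are invented for illustration; they are not test data, and the normal-approximation interval is only one of several standard choices.

```python
import math

def diff_in_proportions_ci(successes_a, n_a, successes_b, n_b, z=1.96):
    """Approximate 95% Wald confidence interval for the difference in
    mission success rates between two forces (normal approximation)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff, (diff - z * se, diff + z * se)

# Hypothetical example: 14 of 18 successful missions vs. 10 of 18.
diff, (lo, hi) = diff_in_proportions_ci(14, 18, 10, 18)
print(f"difference = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With so few replications the interval is wide and includes zero, which is precisely the panel's point: a bare significance test would report "no difference," while the interval shows both the estimated benefit and how imprecisely it is known.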
In general, complex systems should not be forwarded to operational testing, absent strategic considerations, until the system design is relatively mature. Forwarding an immature system to operational test is an expensive way to discover errors that could have been detected in developmental testing, and it reduces the ability of an operational test to carry out its proper function.

As pointed out in the panel's letter report (Appendix A), it is extremely important, when testing complex systems, to prepare a straw man test evaluation report (TER), as if the IOT had been completed. It should include examples of how the representative data will be analyzed, specific presentation formats (including graphs) with expected results, insights to develop from the data, draft recommendations, and so on. The content of this straw man report should be based on the experience and intuition of the

analysts and what they think the results of the IOT might look like. To do this and to ensure the validity and persuasiveness of evaluations drawn from the testing, ATEC needs a cadre of statistically trained personnel with "ownership" of the design and the subsequent test and evaluation. Thus, the Department of Defense in general and ATEC in particular should give a high priority to developing a contractual relationship with leading practitioners in the fields of reliability estimation, experimental design, and data analysis to help them with future IOTs.

In summary, the panel has a substantial concern about confounding in the current test design for the IBCT/Stryker IOT that needs to be addressed. If the confounding issues were reduced or eliminated, the remainder of the test design, aside from the power calculations, has been competently developed from a statistical point of view. Furthermore, this report provides a number of evaluations and resulting conclusions and recommendations for improvement of the design, the selection and validation of MOEs, the evaluation process, and the conduct of future tests of highly complex systems. We attach greater priority to several of these recommendations and therefore highlight them here, organized by chapters to assist those interested in locating the supporting arguments.

RECOMMENDATIONS

Chapter 3

· Different MOEs should not be rolled up into a single overall number that tries to capture effectiveness or suitability.
· To help in the calibration of SMEs, each should be asked to review his or her own assessment of the Stryker IOT missions, for each scenario, immediately before he or she assesses the baseline missions (or vice versa).
· ATEC should review the opportunities and possibilities for SMEs to contribute to the collection of objective data, such as times to complete certain subtasks, distances at critical times, etc.
· ATEC should consider using two separate SME rating scales: one for "failures" and another for "successes."
· FER (and the LER when appropriate), but not the RLR, should be used as the primary mission-level MOE for analyses of engagement results.
· ATEC should use fratricide frequency and civilian casualty frequency to measure the amount of fratricide and collateral damage in a mission.

110 IMPROVED OPERATIONAL TESTING AND EVALUATION · Scenario-specific MOPs shoulcl be aclclecl for SOSE missions. · Situation awareness shoulcl be introduced as an explicit test . . cone ration. · RAM data collection shoulcl be an ongoing enterprise. Failure and maintenance information shoulcl be trackocl on a vehicle or part/system basis for the entire life of the vehicle or part/system. Appropriate databases shoulcl be set up. This was probably not clone with those Stryker vehicles already in existence, but it could be implemented for future maintenance actions on all Stryker vehicles. · With respect to the difficulty of reaching a decision regarding reli- ability, given limited miles and absence of aclcl-on-armor, weight packs shoulcl be used to provide information about the impact of additional weight on reliability. · Failure modes shoulcl be considered separately rather than trying to develop failure rates for the entire vehicle using simple exponential mocl- els. The data reporting requirements vary depending on the failure rate r tunctlon. Chapter 4 · Given either a learning or a confirmatory objective, ignoring various tactical considerations, a requisite for operational testing is that it shoulcl not commence until the system design is mature. · ATEC shoulcl consicler, for future test clesigns, relaxing various rules of test design that it adheres to, by (a) not allocating sample size to sce- narios according to the OMS/MP, but instead using principles from opti- mal experimental design theory to allocate sample size to scenarios, (b) testing under somewhat more extreme conditions than typically will be faced in the fielcl, (c) using information from developmental testing to improve test clesign, and (cl) separating the operational test into at least two stages, learning and confirmatory. 
· ATEC should consider applying to future operational testing in general a two-phase test design that involves, first, learning phase studies that examine the test object under different conditions, thereby helping testers design further tests to elucidate areas of greatest uncertainty and importance, and, second, a phase involving confirmatory tests to address hypotheses concerning performance vis-a-vis a baseline system or in comparison with requirements. ATEC should consider taking advantage of this approach for the IBCT/Stryker IOT. That is, examining in the first phase IBCT/Stryker under different conditions, to assess when this system

works best, and why, and conducting a second phase to compare IBCT/Stryker to a baseline, using this confirmation experiment to support the decision to proceed to full-rate production. An important feature of the learning phase is to test with factors at high stress levels in order to develop a complete understanding of the system's capabilities and limitations.
· When specific performance or capability problems come up in the early part of operational testing, small-scale pilot tests, focused on the analysis of these problems, should be seriously considered. For example, ATEC should consider test conditions that involve using Stryker with situation awareness degraded or turned off to determine the value that it provides in particular missions.
· ATEC should eliminate from the IBCT/Stryker IOT one significant potential source of confounding, seasonal variation, in accordance with the recommendation provided earlier in the October 2002 letter report from the panel to ATEC (see Appendix A). In addition, ATEC should also seriously consider ways to reduce or eliminate possible confounding from player learning, and day/night imbalance.

Chapter 5

· The IOT provides little vehicle operating data and thus may not be sufficient to address all of the reliability and maintainability concerns of ATEC. This highlights the need for improved data collection regarding vehicle usage. In particular, data should be maintained for each vehicle over that vehicle's entire life, including training, testing, and ultimately field use; data should also be gathered separately for different failure modes.
· The panel reaffirms the recommendation of the 1998 NRC panel that more use should be made of estimates and associated measures of precision (or confidence intervals) in addition to significance tests, because the former enable the judging of the practical significance of observed effects.
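The point about gathering data separately for different failure modes can be sketched with per-mode exponential rate estimates. The failure counts, mode names, and fleet mileage below are invented for illustration, not Stryker data.

```python
# Hypothetical sketch: estimating a separate exponential failure rate
# (failures per 1,000 miles) for each failure mode, rather than one
# pooled rate for the whole vehicle.
fleet_miles = 12_000  # total miles accumulated across the tracked fleet

failures_by_mode = {
    "suspension": 6,
    "electronics": 3,
    "powertrain": 1,
}

rates = {mode: n / fleet_miles * 1000 for mode, n in failures_by_mode.items()}
for mode, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{mode}: {rate:.2f} failures per 1,000 miles")

# A single pooled rate hides the fact that one mode dominates:
pooled = sum(failures_by_mode.values()) / fleet_miles * 1000
print(f"pooled: {pooled:.2f} failures per 1,000 miles")
```

Keeping the modes separate shows where reliability improvement effort should go, and it allows each mode to be modeled with its own failure-rate function rather than forcing a single exponential model on the whole vehicle.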
Chapter 6

· Operational tests should not be strongly geared toward estimation of system suitability, since they cannot be expected to run long enough to estimate fatigue life, estimate repair and replacement times, identify failure modes, etc. Therefore, developmental testing should give greater priority to measurement of system (operational) suitability and should be structured to provide its test events with greater operational realism.
· In general, complex systems should not be forwarded to operational

testing, in the absence of strategic considerations, until the system design is relatively mature. Forwarding an immature system to operational test is an expensive way to discover errors that could have been detected in developmental testing, and it reduces the ability of an operational test to carry out its proper function. System maturation should be expedited through previous testing that incorporates various aspects of operational realism in addition to the usual developmental testing.
· Because it is not yet clear that the test design and the subsequent test analysis have been linked, ATEC should prepare a straw man test evaluation report in advance of test design, as recommended in the panel's October 2002 letter to ATEC (see Appendix A).
· The goals of the initial operational test need to be more clearly specified. Two important types of goals for operational test are learning about system performance and confirming system performance in comparison to requirements and in comparison to the performance of baseline systems. These two different types of goals argue for different stages of operational test. Furthermore, to improve test designs that address these different types of goals, information from previous stages of system development needs to be utilized.

Finally, we wish to make clear that the panel was constituted to address the statistical questions raised by the selection of measures of performance and measures of effectiveness, and the selection of an experimental design, given the need to evaluate Stryker and the IBCT in scenarios identified in the OMS/MP. A number of other important issues (about which the panel provides some commentary) lie outside the panel's charge and expertise. These include an assessment of (a) the selection of the baseline system to compare with Stryker, (b) the problems raised by the simultaneous evaluation of the Stryker vehicle and the IBCT system that incorporates it, (c) whether the operational test can definitively answer specific tactical questions, such as the degree to which the increased vulnerability of Stryker is offset by the availability of greater situational awareness, and (d) whether or not scenarios to be acted out by OPFOR represent a legitimate test suite.

Let us elaborate each of these ancillary but important issues. The first is whether the current choice of a baseline system (or multiple baselines) is best from a military point of view, including whether a baseline system could have been tested taking advantage of the IBCT infrastructure, to help understand the value of Stryker without the IBCT system. It does not seem to be necessary to require that only a system that could be transported as quickly as Stryker could serve as a baseline for comparison.

The second issue (related to the first) is the extent to which the current test provides information not only about comparison of the IBCT/Stryker system with a baseline system, but also about comparison of the Stryker suite of vehicles with those used in the baseline. For example, how much more or less maneuverable is Stryker in rural versus urban terrain and what impact does that have on its utility in those environments? These questions require considerable military expertise to address.

The third issue is whether the current operational test design can provide adequate information on how to tactically employ the IBCT/Stryker system. For example, how should the greater situational awareness be taken advantage of, and how should the greater situational awareness be balanced against greater vulnerability for various types of environments and against various threats? Clearly, this issue is not fundamentally a technical statistical one, but is rather an essential feature of scenario design that the panel was not constituted to evaluate.

The final issue (related to the third) is whether the various missions, types of terrain, and intensity of conflict are the correct choices for operational testing to support the decision on whether to pass Stryker to full-rate production. One can imagine other missions, types of terrain, intensities, and other factors that are not varied in the current test design that might have an impact on the performance of Stryker, the baseline system, or both. These factors include temperature, precipitation, the density of buildings, the height of buildings, types of roads, etc. Moreover, there are the serious problems raised by the unavailability of add-on armor for the early stages of the operational test.
The panel has been obligated to take the OMS/MP as given, but it is not clear whether additional factors that might have an important impact on performance should have been included as test factors. All of these issues are raised here in order to emphasize their importance and worthiness for consideration by other groups better constituted to address them.

Thus, the panel wishes to make very clear that this assessment of the operational test as currently designed reflects only its statistical merits. It is certainly possible that the IBCT/Stryker operational test may be deficient in other respects, some of them listed above, that may subordinate the statistical aspects of the test. Even if the statistical issues addressed in this report were to be mitigated, we cannot determine whether the resulting operational test design would be fully informative as to whether Stryker should be promoted to full-rate production.

1

Introduction

This report provides an assessment of the U.S. Army's planned initial operational test of the Stryker family of vehicles. It focuses on two aspects of the test design and evaluation: (1) the measures of performance and effectiveness used to compare the force equipped with the Stryker with a baseline force and (2) whether the current operational test design is consistent with state-of-the-art methods.

ARMY'S NEED FOR AN INTERIM ARMORED VEHICLE (STRYKER)

The United States Army anticipates increases in the number and types of asymmetric threats and will be required to support an expanding variety of missions (including military operations in urban terrain and operations other than war) that demand an effective combination of rapid deployability, information superiority, and coordination of awareness and action. In order to respond to these threats and mission requirements, the Army has identified the need for a future combat system that leverages the capabilities of advancing technologies in such areas as vehicle power, sensors, weaponry, and information gathering and sharing. It will take years to develop and integrate these technologies into weapon systems that meet the needs of the Army of the future. The Army has therefore established a three-pronged plan to guide the transition of its weapons and forces, as illustrated in Figure 1-1.

[Figure 1-1, a timeline graphic of the Legacy, Interim, and Objective Forces, appears here; only its caption is reproduced.]

FIGURE 1-1 Army plans for transition to the Objective Force equipped with the future combat system. Acronyms: BCT, brigade combat team; R&D, research and development; S&T, science and technology; T&E, test and evaluation. SOURCE: ATEC briefing to the panel.

The Army intends to use this transition plan to sustain and upgrade (but not expand) its existing weapon systems, which are characterized as its Legacy Force. Heavy armored units, a key element of the Legacy Force, are not, however, adequate to address the challenge of rapid deployability around the globe. An immediate and urgent need exists for an air-transportable Interim Force capable of deployment to anywhere on the globe in a combat-ready configuration. Until the future combat system is developed and fully fielded to support the Objective Force (the Army of the future), the Army intends to rely on an Interim Force, the critical warfighting component of which is the Interim Brigade Combat Team (IBCT). The mobility of the IBCT is to be provided by the Stryker vehicle (until recently referred to as the interim armored vehicle).

Stryker Configurations

The range of tasks to be accomplished by the IBCT calls for the Stryker to be not just a single vehicle, but a family of vehicles that are air transportable, are capable of immediate employment upon arrival in the area of operations, and have a great degree of commonality in order to decrease its logistical "footprint." Table 1-1 identifies the various Stryker vehicle configurations (as identified in the Stryker system evaluation plan), the key government-furnished equipment (GFE) items that will be integrated into the configuration, and the role of each configuration.

Stryker Capabilities

The Army has identified two different but clearly dependent capability requirements for the Stryker-supported IBCT: operational capabilities for the IBCT force that will rely on the Stryker, and system capabilities for the Stryker family of vehicles.

IBCT Operational Capabilities

The Army's Operational Requirements Document for the Stryker (ORD, 2000) defines the following top-level requirement for the IBCT:

The IBCT is a full spectrum, combat force. It has utility, confirmed through extensive analysis, in all operational environments against all projected future threats, but it is designed and optimized primarily for employment in small scale contingency (SSC) operations in complex and urban terrain, confronting low-end and mid-range threats that may employ both conventional and asymmetric capabilities. The IBCT deploys very rapidly, executes early entry, and conducts effective combat operations immediately on arrival to prevent, contain, stabilize, or resolve a conflict through shaping and decisive operations (section 1.a.(3)) . . . As a full spectrum combat force, the IBCT is capable of conducting all major doctrinal operations including offensive, defensive, stability, and support actions. . . . Properly integrated through a mobile robust C4ISR network, these core capabilities compensate for platform limitations that may exist in the close fight, leading to enhanced force effectiveness (section 1.a.(4)).
System Capabilities

Each configuration of the Stryker vehicle must be properly integrated with sensing, information processing, communications, weapons, and other essential GFE that has been developed independently of the Stryker. The Army notes that the Stryker-GFE system must provide a particular capability, termed situation awareness, to offset the above-mentioned platform limitations:

TABLE 1-1 Stryker Configurations

Infantry carrier vehicle
  Government-furnished equipment: Force XXI Battle Command Brigade and Below (FBCB2), Enhanced Position Location and Reporting System (EPLRS), Global Positioning System (GPS), Thermal Weapon Sight (TWS-H)
  Role: Carry and protect a nine-man infantry squad and a crew of two personnel.

Mortar carrier
  Government-furnished equipment: M121 120-mm mortar, XM95 Mortar Fire Control System
  Role: Provide fire support to maneuver forces.

Antitank guided missile vehicle
  Government-furnished equipment: TOW II missile system, TOW acquisition system
  Role: Defeat armored threats.

Reconnaissance vehicle
  Government-furnished equipment: Long Range Advanced Scout Surveillance System (LRAS3)
  Role: Enable scouts to perform reconnaissance and surveillance.

Fire support vehicle
  Government-furnished equipment: Mission Equipment Package (MEP), Handheld Terminal Unit (HTU), Common Hardware-Software Lightweight Computer Unit (CHS-LCU)
  Role: Provide automation-enhanced target acquisition, identification, and detection; communicate fire support information.

Engineer squad vehicle
  Government-furnished equipment: FBCB2, EPLRS, GPS, TWS-H
  Role: Neutralize and mark obstacles, detect mines, transport engineering squad.

Commander's vehicle
  Government-furnished equipment: All Source Analysis System (ASAS), Advanced Field Artillery Tactical Data System (AFATDS)
  Role: Provide command, control, communications, and computer attributes to enable commanders to direct the battle.

Medical evacuation vehicle
  Government-furnished equipment: MC-4 Medic's Aide
  Role: Enable recovery and evacuation of casualties.

Nuclear, biological, chemical (NBC) reconnaissance vehicle
  Government-furnished equipment: NBC Sensor Suite
  Role: Perform reconnaissance missions in NBC environment; detect NBC conditions.

Mobile gun system
  Government-furnished equipment: Eyesafe Laser Range Finder (ELRF)
  Role: Provide weapons fire to support assaulting infantry.

The IBCT will offset the lethality and survivability limitations of its platforms through the holistic integration of all other force capabilities, particularly the internetted actions of the combined arms company teams. The mounted systems equipped with Force XXI Battle Command, Brigade and Below (FBCB2) and other enhancements provide the IBCT a larger internetted web of situational awareness extending throughout the IBCT area of operations. The synergistic effects achieved by internetting highly trained soldiers and leaders with platforms and organizational design enable the force to avoid surprise, develop rapid decisions, control the time and place to engage in combat, conduct precision maneuver, shape the battlespace with precision fires and effects, and achieve decisive outcomes. (ORD section 1.a.(7))

The Stryker ORD specifies several performance requirements that apply across configurations of the Stryker system. Key performance requirements, defined in more detail in the ORD, are that the Stryker vehicles must:

· maximize commonality of components and subcomponents across the configurations;
· possess an "internetted interoperable capability" that enables it to host and integrate existing and planned Army command, control, communications, computers, intelligence, surveillance, and reconnaissance (C4ISR) systems;
· be transportable in a C-130 aircraft;

· have the ability to operate effectively 24 hours per day, including at night, in inclement weather, and during other periods of limited visibility in hot, temperate, and cold climates;
· be mobile, demonstrable in terms of specified hard surface speeds, cross-country mobility, cruising range, ability to traverse obstacles and gaps, and maneuverability;
· possess the capability of sustainability, indicated by specified abilities to tow and be towed, to be started with assistance, to be refueled rapidly, and to provide auxiliary power in the event of loss of primary power;
· be survivable, as evidenced by the ability to achieve specified acceleration, to provide protection to its crew and equipment, to accept add-on armor, to mount weapons, and to rapidly self-obscure;
· permit the survivability of its crew in external environments with nuclear, biological, and chemical contamination, by providing warning or protection;
· possess the capability of lethality, demonstrated by its ability to inflict lethal damage on opposing forces and weapon systems;
· satisfy specified requirements for logistics and readiness, which contribute to its fundamental requirement to maintain the force in the field;
· be transportable by air and by rail;
· operate reliably (i.e., without critical failures) for a specified period of time and be maintainable within a specified period of time when failures do occur.

The Army Test and Evaluation Command (ATEC) has been assigned the mission of testing, under operationally realistic conditions, and evaluating the extent to which the IBCT force equipped with the Stryker system (IBCT/Stryker) meets its requirements, compared with a baseline force, the current Light Infantry Brigade (LIB), which will be augmented with transportation assets appropriate to assigned missions.
Although we have major concerns about the appropriateness of using the LIB as an alternative comparison force, because our primary responsibility is to address broader statistical and test design issues, we have taken this choice of baseline to be a firm constraint.

PANEL CHARGE

The Stryker will soon be entering an 18-month period of operational test and evaluation to determine whether it is effective and suitable to enter

into full-rate production, and how the vehicles already purchased can best be used to address Army needs. Typical of the operational test and evaluation of a major defense acquisition program, the test and evaluation of the Stryker is an extremely complicated undertaking involving several separate test events, the use of modeling and simulation, and a wide variety of requirements that need to be satisfied.

Reacting to a very high level of congressional interest in the Stryker program, ATEC must develop and use an evaluation approach that applies statistical rigor to determining the contribution of the Stryker to the IBCT mission, as well as the effectiveness of the IBCT itself. Affirming the value of obtaining an independent assessment of its approach, and desiring assistance in developing innovative measures of effectiveness, ATEC requested that the National Research Council's (NRC) Committee on National Statistics convene a panel of experts to examine its plans for the operational test design and subsequent test and evaluation for the IBCT/Stryker. This resulted in the formation of the Panel on Operational Test Design and Evaluation of the Interim Armored Vehicle. The panel was specifically charged to examine three aspects of the operational test and evaluation of the IBCT/Stryker:

1. the measures of performance and effectiveness used to compare the IBCT force equipped with the Stryker system to the baseline, the LIB, and the measures used to assess the extent to which the Stryker system meets its requirements;
2. the design of the operational test, to determine the extent to which the design is consistent with state-of-the-art methods in statistical experimental design; and
3. the applicability of combining information models, as well as of combining information from testing and field use of related systems and from developmental test results for the Stryker, with operational test results for the Stryker.
The panel was also asked to identify alternative measures (e.g., of situation awareness) and experimental designs that could better reflect the advantages and disadvantages of the IBCT force equipped with the Stryker system relative to the LIB force. In addition, the panel was asked to address the use of modeling and simulation as part of the program evaluation and the analysis strategy proposed for the evaluation of the IBCT/Stryker.

PANEL APPROACH

In its 1998 report Statistics, Testing, and Defense Acquisition: New Approaches and Methodological Improvements, the NRC Committee on National Statistics' Panel on Testing and Evaluating Defense Systems established broad perspectives and fundamental principles applicable to the examination of statistical aspects of operational testing (National Research Council, 1998a). Our panel has adopted the findings, conclusions, and recommendations of that report as a key starting point for our deliberations.

We also reviewed in detail all key government documents pertaining to the operational testing of the IBCT/Stryker, including:

· the Operational Requirements Document,
· the System Evaluation Plan,
· the Test and Evaluation Master Plan,
· the Organizational and Operational Description,
· the Failure Definition and Scoring Document,
· the Mission Needs Statement,
· the Operational Mode Summary and Mission Profile, and
· sample Operational Orders applicable to operational tests.

With the cooperation of the management and staff of ATEC, the panel conducted two forums and two subgroup meetings at which ATEC staff presented, in response to panel queries: descriptions of measures of effectiveness, suitability, and survivability under consideration for the initial operational test; details of the proposed experimental design; planned use of modeling and simulation; and planned methods of analysis of data that will result from the testing. At these forums, panel members and ATEC staff engaged in interactive discussion of proposed and alternative measures, test designs, and analytical methods. At the invitation of the panel, the two forums (one on measures and one on test designs) were attended by representatives from the Office of the Director, Operational Test and Evaluation; the Institute for Defense Analyses; the U.S. General Accounting Office; and the U.S. Military Academy at West Point.
Beyond the specific recommendations and conclusions presented here, it is our view that the open and pointed discussions have created a process that in itself has had a salutary influence on the decision making and design for the testing of the Stryker system of vehicles and the IBCT.

This report summarizes the panel's assessment regarding (1) the measures of performance and effectiveness used to compare the IBCT force equipped with the Stryker system with the baseline force and the extent to which the Stryker system meets its requirements and (2) the experimental design of the operational test. This report also addresses measures for situation awareness, alternative measures for force effectiveness, analysis strategies, and some issues pertaining to modeling and simulation. Box 1-1 presents a number of terms used in operational testing.

After additional forums and deliberations, the panel intends to prepare a second report that addresses the applicability of combining information from other sources with that from the IBCT/Stryker initial operational test. Those sources include developmental tests for the Stryker, testing and field experience with related systems, and modeling and simulation. Our final report will incorporate both Phase I and II reports and any additional developments.

Finally, we wish to make clear that the panel was constituted to address the statistical questions raised by the selection of measures of performance and measures of effectiveness, and the selection of an experimental design, given the need to evaluate Stryker and the IBCT in scenarios identified in the OMS/MP. A number of other important issues (about which the panel provides some commentary) lie outside the panel's charge and expertise. These include an assessment of (a) the selection of the baseline system to compare with Stryker, (b) the problems raised by the simultaneous evaluation of the Stryker vehicle and the IBCT system that incorporates it, (c) whether the operational test can definitively answer specific tactical questions, such as the degree to which the increased vulnerability of Stryker is offset by the availability of greater situational awareness, and (d) whether or not scenarios to be acted out by OPFOR represent a legitimate test suite. Let us elaborate each of these ancillary but important issues.

The first is whether the current choice of a baseline system (or multiple baselines) is best from a military point of view, including whether a baseline system could have been tested taking advantage of the IBCT infrastructure, to help understand the value of Stryker without the IBCT system. It does not seem to be necessary to require that only a system that could be transported as quickly as Stryker could serve as a baseline for comparison.

The second issue (related to the first) is the extent to which the current test provides information not only about comparison of the IBCT/Stryker system with a baseline system, but also about comparison of the Stryker suite of vehicles with those used in the baseline.
For example, how much more or less maneuverable is Stryker in rural versus urban terrain, and what impact does that have on its utility in those environments? These questions require considerable military expertise to address.

The third issue is whether the current operational test design can provide adequate information on how to tactically employ the IBCT/Stryker system. For example, how should the greater situational awareness be taken advantage of, and how should it be balanced against greater vulnerability for various types of environments and against various threats? Clearly, this issue is not fundamentally a technical statistical one, but is rather an essential feature of scenario design that the panel was not constituted to evaluate.

The final issue (related to the third) is whether the various missions, types of terrain, and intensity of conflict are the correct choices for operational testing to support the decision on whether to pass Stryker to full-rate production. One can imagine other missions, types of terrain, intensities, and other factors that are not varied in the current test design that might have an impact on the performance of Stryker, the baseline system, or both. These factors include temperature, precipitation, the density of buildings, the height of buildings, types of roads, etc. Moreover, there are the serious problems raised by the unavailability of add-on armor for the early stages of the operational test. The panel has been obligated to take the OMS/MP as given, but it is not clear whether additional factors that might have an important impact on performance should have been included as test factors. All of these issues are raised here in order to emphasize their importance and worthiness for consideration by other groups better constituted to address them.

Thus, the panel wishes to make very clear that this assessment of the operational test as currently designed reflects only its statistical merits. It is certainly possible that the IBCT/Stryker operational test may be deficient in other respects, some of them listed above, that may subordinate the statistical aspects of the test. Even if the statistical issues addressed in this report were to be mitigated, we cannot determine whether the resulting operational test design would be fully informative as to whether Stryker should be promoted to full-rate production.

Test Process

The Army's interim armored combat vehicle, now called the Stryker, is in the latter stages of development. As is the case in all major acquisitions, it is necessary for the Army to subject the vehicle to a set of tests and evaluations to be sure it understands what it is buying and how well it works in the hands of the users. The Army Test and Evaluation Command (ATEC) has been charged to do these tests and evaluations. A key element in this series of tests is the operational testing, commencing with the initial operational test (IOT).

The basic thought in the development of any operational test plan is to test the equipment in an environment as similar as possible to the environment in which the equipment will actually operate. For combat vehicles such as the Stryker, the standard practice is to create combat situations similar to those in which the test vehicle would be expected to perform. The system is then inserted into the combat situations with trained operators and with an opposition force (OPFOR) of the type expected. Preplanned scenarios and training schedules for the players in the test are developed, the nature of the force in which the test vehicle will be embedded is identified, and the test plans are developed by ATEC.

Testing and evaluation of the Stryker is especially challenging, because several issues must be addressed together:

· To what extent does the Stryker in various configurations, equipped with integrated government-furnished equipment (GFE), meet (or fail to meet) its requirements (e.g., for suitability and survivability)?

· How effective is the Interim Brigade Combat Team (IBCT), equipped with the Stryker system, and how does its effectiveness compare with that of a baseline force?
· What factors (in the forces and in the systems) account for successes and failures in performance and for any performance differences between the forces and the systems?

Thus, a primary objective of the IOT is to compare an organization that includes the Stryker with a baseline organization that does not include the Stryker. This makes the evaluation of the data particularly important and challenging, because the effects of the differences in organizations will be confounded with the differences in their supporting equipment systems.

The planning for the test of the IBCT/Stryker is particularly difficult because of the complex interactions among these issues, the varying missions in which the IBCT/Stryker will be tested, the number of variants of the vehicle itself, time and budget constraints (which affect the feasible size and length of tests), uncertainty about the characteristics of the planned add-on armor, and the times of year at which the test will be run. Factors that must be considered in the evaluation of the test data are:

1. Modeling and simulation using the results of the live test, accounting for the uncertainty due to small sample size in the live tests.
2. The incorporation of developmental test data, manufacturer test data, and historical data in the evaluation.
3. Extrapolation of IOT field data to higher echelons that, due to resource constraints, will not be tested by a live, representative force in the Stryker IOT.
In particular, one of the three companies in the battalion that will be played in the Stryker IOT is notional (i.e., its communications, disposition, and effects will be simulated); the battalion headquarters has no other companies to worry about; and the brigade headquarters is played as a "white force" (a neutral entity that directs and monitors operations).
4. The relative weight to give to "hard" instrument-gathered data vis-à-vis the observations and judgments of subject-matter experts (SMEs).

OVERALL TESTING AND EVALUATION PLAN

Two organizations, the government and the contractors that build the systems, are involved in the test and evaluation of Army systems. The

Army's Developmental Test Command (within the ATEC organization) conducts, with contractor support, the production verification test. The purpose of this test is to ensure that the system, as manufactured, meets all of the specifications given in the contract. This information can be valuable in the design and evaluation of results of subsequent developmental tests, particularly the testing of reliability, availability, and maintainability (RAM).

Within the Army, responsibility for test and evaluation is given to ATEC. When ATEC is assigned the responsibility for performing test and evaluation for a given system, several documents are developed:

· the test and evaluation master plan,
· the test design plan,
· the detailed test plan,
· the system evaluation plan, and
· the failure definition and scoring criteria.

Testers perform a developmental test on the early production items in order to verify that the specifications have been met or exceeded (e.g., a confirmation, by noncontractor personnel, of the product verification test results on delivered systems). Following the developmental test, ATEC designs and executes one or more operational tests, commencing with the IOT. Modeling and simulation are often used to assist in test design, to verify test results, and to add information that cannot be obtained from the IOT.

EVALUATION OF THE DATA

When the results of the product verification test, the developmental test, the initial operational test, the modeling and simulation, and the history of use have been gathered, ATEC is responsible for compiling all data relevant to the system into a final evaluation report. The Director, Operational Test and Evaluation, approves the ATEC IOT event design plan and conducts an independent evaluation.
As noted above, IBCT/Stryker testing will address two fundamental questions: (1) To what extent does the Stryker system (i.e., integration of the Stryker vehicle and its GFE in various configurations) meet its requirements? (2) How well does the IBCT force, equipped with the Stryker system, perform and meet its requirements, compared with the baseline Light Infantry Brigade (LIB) force? Evaluators will also assess the ways in which the IBCT force employs the Stryker and the extent to which the Stryker GFE provides situation awareness to the IBCT. They will also use the test data to help develop an understanding of why the IBCT and Stryker perform as well (or poorly) as they do.

The ATEC evaluator is often asked about the most effective way to employ the system. If the test has been designed properly, extrapolation from the test data can often shed light on this question. In tests like those planned for the IBCT/Stryker, the judgment of the SMEs is highly valuable. In the design, the SMEs may be asked to recommend, after early trials of the Stryker, changes to make the force more effective. This process of testing, recommending improvements, and implementing the recommendations can be done iteratively. Clearly, although the baseline trials can provide helpful insights, they are not intended primarily to support this kind of analysis.

With the outcome of each test event recorded and with the aid of modeling, the evaluator will also extrapolate from the outcome of the IOT trials to what the outcome would have been under different circumstances. This extrapolation can involve the expected outcomes at different locations, with different force sizes, or with a full brigade being present, for example.

TEST PROCESS

Scripting

In the test design documents, each activity is scripted, or planned in advance. The IOT consists of two sets of operational trials, currently planned to be separated by approximately three weeks: one using the IBCT/Stryker, the other the baseline LIB. Each trial is scheduled to have a nine-day duration, incorporating three types of mission events (raid, perimeter defense, and security operations in a stability environment) scripted during the nine days.
The scripting indicates where and when each of these events in the test period occurs. It also establishes starting and stopping criteria for each event. There will be three separate nine-day trials using the IBCT test force and three separate nine-day trials using the LIB baseline force.
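The trial bookkeeping described in this chapter can be checked directly. The sketch below uses only counts stated in the report (three nine-day scenarios per force, seven missions per scenario, and two live companies, as given in the design summary later in this chapter); it simply verifies the arithmetic rather than modeling the test itself.

```python
# Mission bookkeeping for one force, using counts stated in the report:
# three nine-day scenarios, seven missions per scenario, two live companies.
scenarios = 3
days_per_scenario = 9
missions_per_scenario = 7
live_companies = 2

missions_per_company = scenarios * missions_per_scenario      # 21 missions
total_live_missions = missions_per_company * live_companies   # 42 missions
reserved_days = scenarios * days_per_scenario                 # 27 test days

print(missions_per_company, total_live_missions, reserved_days)  # 21 42 27
```

These totals match the figures quoted in the design summary: 21 missions per company, 42 across the two live companies, within a 27-day block per force.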

Integrated Logistics Support

Logistics is always a consideration during an IOT. It will be especially important in Stryker, since the length of a trial (days) is longer than for typical weapon system IOT trials (hours). The supporting unit will be assigned in advance, and its actions will be controlled. It will be predetermined whether a unit can continue based on the logistical problems encountered. The handling of repairs and replacements will be scripted.

The role of the contractor in logistics support is always a key issue: contractors often maintain systems during introduction to a force, and both the level of training of Army maintenance personnel and the extent of contractor involvement in maintenance can affect force and system performance during operational testing. The contractor will not be present during actual combat, so it could be argued that the contractor should not be permitted in the areas reserved for the IOT. A counterargument is that the IOT can represent an opportunity for the contractor to learn where and how system failures occur in a combat environment.

Safety

A safety officer, present at all times, attempts to ensure that safety rules are followed and is allowed to stop the trial if it becomes apparent that an unsafe condition exists.

Constraints on Test and Evaluation

The test and evaluation design and execution are greatly influenced by constraints on time, money, availability of trained participants, and availability of test vehicles, as well as by demands by the contractor, the project manager, the director of operational test and evaluation, and Congress. In the IBCT/Stryker IOT, the time constraint is especially critical. The availability of test units, test assets, test players, and test sites has created constraints on test design; implications of these constraints are discussed later in this report. One key constraint is the selection of the baseline force (LIB) and its equipment. We note that there are alternative baselines (e.g., the Mechanized Infantry Brigade and variations to the baseline equipment configurations) that could have been selected for the Stryker IOT but consider it beyond the scope of the panel's charge to assess the choice of baseline.

One possibility that might be considered by ATEC would be to have the subject-matter experts also tasked to identify test results that might have been affected had the baseline force been different. Although this might involve more speculation than would be typical for SMEs given their training, their responses could provide (with suitable caveats) valuable insights.

One salient example of the effects of resource constraints on the Stryker IOT is the limited number of options available to test the situation awareness features of the Stryker's C4ISR (command, control, communications, computers, intelligence, surveillance, and reconnaissance). The evaluation of the C4ISR and the ensuing situation awareness is difficult. If time permitted, it would be valuable to run one full trial with complete information and communication and a matched trial with the information and the transmission thereof degraded. It is unlikely that this will be feasible in the Stryker IOT. In fact, it will be feasible to do only a few of the possible treatment combinations needed to consider the quality of the intelligence, the countermeasures against it, the quality of transmission, and how much information should be given to whom.

CURRENT STATISTICAL DESIGN

The IBCT/Stryker IOT will be conducted using two live companies operating simultaneously with a simulated company. These companies will carry out three types of missions: raid, perimeter defense, and security operations in a stable environment. The stated objective of the operational test is to compare Stryker-equipped companies with a baseline of light infantry companies for these three types of missions.

The operational test consists of three nine-day scenarios of seven missions per scenario for each of two live companies, generating a total of 42 missions, 21 for each company, carried out in 27 days.
These missions are to be carried out by both the IBCT/Stryker and the baseline force/system. We have been informed that only one force (IBCT or the baseline) can carry out these missions at Fort Knox at one time, so two separate blocks of

1The following material is taken from a slide presentation to the panel, April 15, 2002 (U.S. Department of Defense, 2002a). A number of details, for example about the treatment of failed equipment and simulated casualties, are omitted in this very brief design summary.

PHASE I REPORT: TEST PROCESS 133

27 days have been reserved for testing there. The baseline LIB portion of the test will be conducted first, followed by the IBCT/Stryker portion.

ATEC has identified the four design variables to be controlled during the operational test: mission type (raid, perimeter defense, security operations in a stable environment), terrain (urban, rural), time of day (day, night), and opposition force intensity (low: civilians and partisans; medium: civilians, partisans, and paramilitary units; high: civilians, partisans, paramilitary units, and conventional units). The scenario (a scripted description of what the OPFOR will do, the objectives and tasks for test units, etc.) for each test is also controlled for and is essentially a replication, since both the IBCT and the baseline force will execute the same scenarios. The panel commented in our October 2002 letter report (Appendix A) that the use of the same OPFOR for both the IBCT and the baseline trials (though in different roles), which are conducted in sequence, will introduce a learning effect. We suggested in that letter that interspersing the trials for the two forces could, if feasible, reduce or eliminate that confounding effect.

ATEC has conducted previous analyses that demonstrate that a test sample size of 36 missions for the IBCT/Stryker and for the baseline would provide acceptable statistical power for overall comparisons between them and for some more focused comparisons, for example, in urban environments. We comment on these power calculations in Chapter 4.

The current design has the structure shown in Table 2-1. The variable "time of day," which refers to whether the mission is mainly carried out during daylight or nighttime, is not explicitly mentioned in the design matrix.
Although we assume that efforts will be made in real time, opportunistically, to begin missions so that a roughly constant percentage of test events by mission type, terrain, and intensity, and for both the IBCT/Stryker and the baseline companies, are carried out during daylight and nighttime, time of day should be formalized as a test factor. It is also our understanding that, except for the allocation of the six extra missions, ATEC considers it infeasible at this point to modify the general layout of the design matrix shown in the table.
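For reference, fully crossing the four stated design variables gives 3 mission types x 2 terrain types x 2 times of day x 3 intensity levels = 36 treatment combinations, the same number as the mission sample size used in ATEC's power calculations. A minimal sketch enumerating that crossing (the level labels are ours, abbreviated from the text, not ATEC's):

```python
from itertools import product

# Factor levels as stated in the test design (labels abbreviated by us).
missions = ["raid", "perimeter defense", "SOSE"]
terrain = ["urban", "rural"]
time_of_day = ["day", "night"]
intensity = ["low", "medium", "high"]

# Full factorial crossing of the four controlled design variables.
design = list(product(missions, terrain, time_of_day, intensity))
print(len(design))  # 3 * 2 * 2 * 3 = 36 treatment combinations
```

With only 36 missions per force, each treatment combination can be observed at most about once, which is why the six extra missions and the formal treatment of time of day matter for the comparisons of interest.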

134

[TABLE 2-1: current test design matrix; the table content is not legible in the source scan.]

3

Test Measures

The Interim Brigade Combat Team (IBCT) equipped with the Stryker is intended to provide more combat capability than the current Light Infantry Brigade (LIB) and to be significantly more strategically deployable than a heavy Mechanized Infantry Brigade (MIB). It is anticipated that the IBCT will be used in at least two roles:

1. as part of an early entry combat capability against armed threats in small-scale contingencies (SSC). These IBCT engagements are likely to be against comparable forces, forces that can inflict meaningful casualties on each other.

2. in stability and support operations against significantly smaller and less capable adversaries than anticipated in SSC. The Stryker system evaluation plan (SEP) uses the term security operations in a stability environment (SOSE); that term will be used here.

The IBCT/Stryker initial operational test (IOT) will include elements of both types of IBCT missions to address many of the issues described in the Stryker SEP. This chapter provides an assessment of ATEC's plans for measures to use in analyzing results of the IOT. We begin by offering some definitions and general information about measures as background for specific comments in subsequent sections.

INTRODUCTION TO MEASURES

Using the IBCT and the IOT as context, the following definitions are used as a basis for subsequent discussions.

The objective of the IBCT is synonymous with the mission it is assigned to perform. For example:

· "Attack to seize and secure the opposition force's (OPFOR) defended position" (SSC mission)
· "Defend the perimeter around . . . for x hours" (SSC mission)
· "Provide area presence to . . ." (SOSE mission)

Objectives will clearly vary at different levels in the IBCT organization (brigade, battalion, company, platoon), and several objectives may exist at one level and may in fact conflict (e.g., "Attack to seize the position and minimize friendly casualties").

Effectiveness is the extent to which the objectives of the IBCT in a mission are attained. Performance is the extent to which the IBCT demonstrates a capability needed to fulfill its missions effectively. Thus, performance could include, for example, the Stryker vehicle's survivability, reliability, and lethality; the IBCT's C4ISR (command, control, communications, computers, intelligence, surveillance, and reconnaissance); and situation awareness, among other things.

A measure of performance (MOP) is a metric that describes the amount (or level) of a performance capability that exists in the IBCT or some of its systems. A measure of effectiveness (MOE) is a quantitative index that indicates the degree to which a mission objective of the IBCT is attained. Often many MOEs are used in an analysis because the mission may have multiple objectives or, more likely, there is a single objective with more than one MOE. For example, in a perimeter defense mission, these may include the probability that no penetration occurs, the expected value of the time until a penetration occurs, and the expected value of the number of friendly casualties, all of which are of interest to the analyst.
For the IBCT IOT, mission-level MOEs can provide useful information to:

1. evaluate how well a particular mission or operation was (or will be) performed. Given appropriate data collection, they provide an objective and quantitative means of indicating to appropriate decision makers the degree of mission accomplishment;

PHASE I REPORT: TEST MEASURES 137

2. provide a means of quantitatively comparing alternative forces (IBCT versus LIB); and

3. provide a means of determining the contribution of various incommensurate IBCT performance capabilities (survivability, lethality, C4ISR, etc.) to mission success (if they are varied during experiments) and therefore information about the utility of changing the level of particular capabilities.

Although numerical values of mission-level MOEs provide quantitative information about the degree of mission success, the analysis of operational test results should also be a diagnostic process, involving the use of various MOEs, MOPs, and other information to determine why certain mission results occurred. Using only summary MOE values as a rationale for decision recommendations (e.g., select A over B because MOE_A = 3.2 > MOE_B = 2.9) can lead to a tyranny of numbers, in which precisely stated values can be used to reach inappropriate decisions. The most important role of the analyst is to develop a causal understanding of the various factors (force size, force design, tactics, specific performance capabilities, environmental conditions, etc.) that appear to drive mission results and to report on these as well as highlight potential problem areas.

Much has been written about pitfalls and caveats in developing and using MOEs in military analyses. We mention two here because of their relevance to MOEs and analysis concepts presented in the IBCT/Stryker test and evaluation master plan (TEMP) documentation.

1. As noted above, multiple MOEs may be used to describe how well a specific mission was accomplished. Some analysts often combine these into a single overall number for presentation to decision makers. In our view, this is inappropriate, for a number of reasons.
More often than not, the different component MOEs will have incommensurate dimensions (e.g., casualties, cost, time) that cannot be combined without using an explicit formula that implicitly weights them. For example, the most common formula is a linear additive weighting scheme. Such a weighting scheme assigns importance (or value) to each of the individual component MOEs, a task that is more appropriately done by the decision maker and not the analyst. Moreover, the many-to-one transformation of the formula may well mask information that is likely to be useful to the decision maker's deliberations.
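The masking effect of such a roll-up is easy to demonstrate with invented numbers: two alternatives with sharply different component MOEs can receive identical weighted scores, so the roll-up erases exactly the trade-off a decision maker would want to see. (All weights and scores below are hypothetical illustrations, not values from the Stryker evaluation.)

```python
# Hypothetical component MOEs rescaled to a common 0-10 "goodness" score.
weights = {"casualties": 0.5, "time": 0.3, "penetration": 0.2}

alt_a = {"casualties": 2.0, "time": 9.0, "penetration": 8.0}  # weak on casualties
alt_b = {"casualties": 8.0, "time": 3.0, "penetration": 2.0}  # strong on casualties

def rollup(scores, weights):
    """Linear additive weighting: the many-to-one transformation."""
    return sum(weights[k] * scores[k] for k in weights)

# Both alternatives receive the same overall score, hiding the trade-off.
print(round(rollup(alt_a, weights), 2), round(rollup(alt_b, weights), 2))  # 5.3 5.3
```

Reporting the component MOEs alongside any summary avoids this loss of information.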

2. Some MOEs are the ratio of two values, each of which, by itself, is useful in analyzing mission success. However, since both the numerator and the denominator affect the ratio, changes in (or errors in estimating) the numerator have linear effects on the ratio value, while changes (or errors) in the denominator affect the ratio hyperbolically. This effect makes the use of such measures particularly suspect when the denominator can become very small, perhaps even zero. In addition, using a ratio measure to compare a proposed organization or system with an existing one implies a specific value relationship between dimensions of the numerator and the denominator.

Although ratio MOE values may be useful in assessing degrees of mission success, reporting only this ratio may be misleading. Analysis of each of its components will usually be required to interpret the results and develop an understanding of why the mission was successful.

ATEC plans to use IOT data to calculate MOEs and MOPs for the IBCT/Stryker. These data will be collected in two ways: subjectively, using subject-matter experts (SMEs), and objectively, using instrumentation. Our assessment of these plans is presented in the remainder of this chapter, which discusses subjective measures (garnered through the use of SMEs) and objective measures of mission effectiveness and of reliability, availability, and maintainability. SMEs are used to subjectively collect data for MOEs (and MOPs) to assess the performance and effectiveness of a force in both SSC missions (e.g., raid and perimeter defense) and SOSE missions.
Objective measures of effectiveness (including casualty-related measures, scenario-specific measures, and system degradation measures) may also be applied across these mission types, although objective casualty-related MOEs are especially useful for evaluating SSC engagements, in which both the IBCT and the OPFOR casualties are indicators of mission success. Casualty-related measures are less commonly applied to SOSE missions, in which enemy losses may have little to do with mission success. Objective measures of reliability, availability, and maintainability are applied to assess the performance and effectiveness of the system.

SUBJECTIVE SUBJECT-MATTER EXPERT MEASURES

Military judgment is an important part of the operational evaluation and will provide the bulk of numerical MOEs for the Stryker IOT. Trained SMEs observe mission tasks and subtasks and grade the results, according
to agreed-upon standards and rating scales. The SMEs observe and follow each platoon throughout its mission set. Although two SMEs are assigned to each platoon and make independent assessments, they are not necessarily at the same point at the same time.

SME ratings can be binary (pass/fail, yes/no) judgments, comparisons (e.g., against baseline), or indicators on a numerical task performance rating scale. In addition to assigning a rating, the SME keeps notes with the reasoning behind the assessment. The mix of binary and continuous measures, as well as the fact that the rating scales are not on a cardinal (much less a ratio) scale, makes it inappropriate to combine them in any meaningful way.

Moreover, since close combat tactical training data show that the conventional 10-point rating scale provides values that were rarely (if ever) used by SMEs, ATEC has proposed using an 8-point scale. However, it has also been observed in pretesting that the substantive difference between task performance ratings of 4 and 5 is very much greater than between 3 and 4. This is because, by agreement, ratings between 1 and 4 indicate various levels of task "failure" and ratings between 5 and 8 indicate levels of task "success." The resulting bimodal distribution has been identified by ATEC analysts as representing a technical challenge with respect to traditional statistical analysis. We prefer to regard this phenomenon as indicative of a more fundamental psychometric issue, having to do with rating scale development and validation. Although there has also been some discussion of using two or three separate rating scales, this would be a useful approach only if there were no attempt to then combine (roll up) these separate scales by means of some arbitrary weighting scheme.
SME judgments are clearly subjective: they combine experience with observations, so that two SMEs could easily come up with different ratings based on the same observations, or a single SME, presented twice with the same observation, could produce different ratings. Using subjective data is by itself no barrier to making sound statistical or operational inferences (National Research Council, 1998b; Veit, 1996). However, to do so, care must be taken to ensure that the SME ratings have the usual properties of subjective data used in other scientific studies, that is, that they can be calibrated, are repeatable, and have been validated. One good way to support the use of SME ratings in an IOT is to present a careful analysis of SME training data, with particular attention paid to demonstrating small inter-SME variance.
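One simple way to present such an analysis is to compare the variance between SMEs rating the same observation with the variance across observations; well-calibrated SMEs should show the former much smaller than the latter. A minimal sketch with invented training ratings (this is an illustration of the idea, not ATEC's procedure):

```python
from statistics import mean, pvariance

# Hypothetical training data: each tuple is one observed task, holding the
# two SMEs' independent ratings on the 8-point scale.
ratings = [(6, 7), (3, 3), (8, 7), (2, 3), (5, 5), (7, 6)]

# Within-task (inter-SME) variance: disagreement between raters observing
# the same event. Small values indicate well-calibrated SMEs.
within = mean(pvariance(pair) for pair in ratings)

# Between-task variance of the per-task mean ratings: real differences
# in observed performance across tasks.
between = pvariance([mean(pair) for pair in ratings])

print(round(within, 3), round(between, 3))  # 0.167 3.472
```

Here the inter-SME variance is an order of magnitude smaller than the between-task variance, the pattern one would hope to demonstrate from actual SME training data.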

OBJECTIVE MEASURES OF EFFECTIVENESS

In this section we discuss objective measures of effectiveness. Although these involve "objective" data, in the sense that two different observers will agree as to their values, experts do apply judgment in selecting the particular variables to be measured in specific test scenarios. While it is useful to provide summary statistics (e.g., for casualty measures, as discussed below), decision makers should also be provided (as we suggest earlier in this chapter) with the values of the component statistics used to calculate summary statistics, since these component statistics may (depending on analytical methods) provide important information in themselves. For example, summary brigade-level casualties (discussed below) are computed by aggregating company and squad-level casualties, which by themselves can be of use in understanding complicated situations, events, and scenarios. There are many thousands of objective component statistics that must support complex analyses that depend on specific test scenarios. Our discussion below of casualty-related measures and of scenario-specific measures is intended to illustrate fruitful analyses.

Casualty-Related Measures

In this section we discuss some of the casualty-related MOEs for evaluating IBCT mission success, appropriate for both combat and SOSE missions, but particularly appropriate for SSC-like engagements in which both sides can inflict significant casualties on each other. Specifically, we discuss the motivation and utility of three casualty ratio MOEs presented by ATEC in its operational test plan.

Ideally, an operational test with unlimited resources would produce estimates of the probability of mission "success" (or any given degree of success), or the distribution of the number of casualties, as a function of force ratios, assets committed and lost, etc.
However, given the limited replications of any particular scenario, producing such estimates is infeasible. Still, a variety of casualty-related proxy MOEs can be used, as long as they can be shown to correlate (empirically or theoretically) with these ultimate performance measures. We begin by introducing some notation and conventions.2 The conventions are based on analyses of the cold war security environment that led to the development and rationale underlying two of the ratio MOEs.

Let:

N = initial number of enemy forces (OPFOR) in an engagement (battle, campaign) against friendly forces
M = initial number of friendly forces in an engagement
FR_0 = N/M = initial force ratio
n(t) = number of surviving enemy forces at time t in the engagement
m(t) = number of surviving friendly forces at time t in the engagement
FR_t = n(t)/m(t) = force ratio at time t
C_n(t) = N - n(t) = number of enemy casualties by time t
C_m(t) = M - m(t) = number of friendly casualties by time t

Although survivors and casualties vary over time during the engagement, we will drop the time notation for ease in subsequent discussions of casualty-related MOEs. In addition, we use the term "casualties" as personnel losses, even though much of the motivation for using ratio measures has been to assess losses of weapon systems (tanks, etc.). It is relatively straightforward to convert a system loss to personnel casualties by knowing the kind of system and type of system kill.

Loss Exchange Ratio

A measure of force imbalance, the loss exchange ratio (LER) is defined to be the ratio of enemy (usually the attacker) losses to friendly (usually defender) losses. That is,3

LER = (N - n)/(M - m) = C_n/C_m     (1)

2During the cold war era, measures of warfighting capability were needed to help the Army make resource allocation decisions. The LER measure was created a number of decades ago for use in simulation-based analyses of war between the Soviet-led Warsaw Pact (WP) and the U.S.-led NATO alliance. The WP possessed an overall strategic advantage in armored systems of 2:1 over NATO and a much greater operational-tactical advantage of up to 6:1. Prior to the demise of the Soviet Union in 1989-1991, NATO's warfighting objective was to reduce the conventional force imbalance in campaigns, battles, and engagements to preclude penetration of the Inter-German Border.
3Enemy losses will always be counted in the numerator and friendly losses in the denominator regardless of who is attacking.
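Equation (1) and the imbalance-reduction condition discussed next can be illustrated with a small numerical example (all force sizes invented): with N = 60 and M = 30 the initial force ratio is FR_0 = 2; if the enemy loses 20 while friendly forces lose 8, then LER = 20/8 = 2.5 > FR_0, so the initial imbalance is being reduced. A minimal sketch:

```python
def ler(N, M, n, m):
    """Loss exchange ratio, equation (1): enemy casualties over friendly
    casualties, with (N, M) initial forces and (n, m) survivors."""
    return (N - n) / (M - m)

N, M = 60, 30       # hypothetical initial enemy and friendly force sizes
n, m = 40, 22       # hypothetical survivors at some time t
fr0 = N / M         # initial force ratio FR_0 = 2.0

print(ler(N, M, n, m))        # 2.5
print(ler(N, M, n, m) > fr0)  # True: the force imbalance is being reduced
```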

Thus, LER is an indicator of the degree to which the force imbalance is reduced in an engagement: the force imbalance is clearly being reduced while the condition LER > FR_0 = N/M holds.

Since casualty-producing capability varies throughout a battle, it is often useful to examine the instantaneous LER, the ratio of the rates of enemy attacker and defender losses as a function of battle time t, in order to develop a causal understanding of the battle dynamics. Early in the battle, the instantaneous LER is high and relatively independent of the initial force ratio (and particularly threat size) because of concealment and first-shot advantages held by the defender. The LER advantage moves to the attacker as the forces become decisively engaged, because more attackers find and engage targets, and concentration and saturation phenomena come into play for the attacker. However, this pattern is not relevant in today's security environment, with new technologies (e.g., precision munitions, second-generation night vision devices, and FBCB2); more U.S. offensive engagements; and threats that employ asymmetric warfare tactics. The utility of the LER is further evidenced by its current use by analysts of the TRADOC Analysis Command and the Center for Army Analysis (CAA, formerly the Army's Concepts Analysis Agency) in studies of the Army's Interim Force and Objective Force.

Force Exchange Ratios

The LER indicates the degree of mission success in tactical-level engagements and allows an examination of the impact of different weapon systems, weapon mixes, tactics, etc. At this level, each alternative in a study traditionally has the same initial U.S. force size (e.g., a battalion, a company). As analysis moves to operational-level issues (e.g., force design/structure, operational concepts) with nonlinear battlefields, alternatives in a study often have different initial force sizes. This suggests considering a measure that "normalizes" casualties with respect to initial force size, which gives rise to the force exchange ratio (FER):

FER = [(N - n)/N] / [(M - m)/M] = (C_n/C_m)(M/N) = LER/FR_0     (2)

The FER and the LER are equally effective as indicators of the degree by which force imbalance is reduced in a campaign: an enemy's initial force size advantage is being reduced as long as the FER > 1. Some of the history behind the use of FER is summarized in Appendix B.

4The FER is usually measured at the time during an engagement when either the attacker or defender reaches a breakpoint level of casualties.

5This MOE is also referred to as the fractional loss exchange ratio and the fractional exchange ratio.

Relative Loss Ratio

ATEC has proposed using a third casualty ratio, referred to as the relative loss ratio (RLR) and, at times, the "odds ratio." They briefly define and demonstrate its computation in a number of documents (e.g., TEMP, December 2001; briefing to USA-OR, June 2002) and (equally briefly) argue for its potential advantages over the LER and the FER. The basic RLR is defined by ATEC to be the ratio of [enemy to friendly casualty ratio] to [enemy to friendly survivor ratio] at some time t in the battle:

RLR = [(N - n)/(M - m)] / (n/m) = (C_n/C_m)(m/n) = LER x SVER     (3)

where SVER = m/n is referred to as the "survivor ratio." Since the reciprocal of SVER is the force ratio FR_t = n/m at time t in the battle, RLR can be expressed as

RLR = LER/FR_t     (4)

which is structurally similar to the FER given by equation (2). It is interesting to note that the condition
RLR = [(N - n)/(M - m)] (m/n) > 1

implies that FR_0 > FR_t, i.e., that the initial force imbalance is being reduced at time t. However, the condition FER > 1 also implies the same thing.

ATEC also proposes to use a normalized RLR, a relative loss ratio normalized for initial force ratios. That is

RLR' = {[(N - n)/N] / [(M - m)/M]} (m/n) = (C_n/C_m)(M/N)(m/n) = FER x SVER = FER/FR_t     (5)

ATEC does not discuss any specific properties or implications of using the RLR but does suggest a number of advantages of its use relative to the LER and the FER. These are listed below (in italics) with the panel's comments.

1. The RLR addresses casualties and survivors whereas the LER and the FER address only casualties. When calculating LER and FER, the number of casualties is in fact the initial number of forces minus survivors.

2. The RLR can aggregate over different levels of force structure (e.g., platoons, companies, battalions) while the LER and the FER cannot. The initial numbers of forces and casualties for multiple platoon engagements in a company can be aggregated to compute company-level LERs and FERs, and they can be aggregated again over all company engagements to compute battalion-level LERs and FERs. Indeed, this is how they are regularly computed in Army studies of battalion-level engagements.

3. The RLR can aggregate different kinds of casualties (vehicles, personnel, civilians, fratricide) to present a decision maker with a single RLR measure of merit, while the LER and the FER cannot. Arbitrary linear additive functions combining these levels of measures are not useful for the reasons given in the section on introduction to measures above. In any event, personnel casualties associated with system/vehicle losses can be readily calculated
using information from the Ballistics Research Laboratories/U.S. Army Materiel Systems Analysis Activity (BRL/AMSAA). It is not clear why the geometric mean computed for the RLR (p. 48 of December 2001 TEMP) could not be computed for the LER or the FER if such a computation were thought to be useful.

4. The RLR motivates commanders "to seek an optimum trade-off between friendly survivors and enemy casualties." This does not appear germane to selecting an MOE that is intended to measure the degree of mission success in the IBCT IOT.

5. The RLR has numerous attractive statistical properties. ATEC has not delineated these advantages, and we have not been able to determine what they are.

6. The RLR has many good statistical properties of a "maximum likelihood statistic," including being most precise among other attractive measures of attrition (LER and FER). It is not clear what advantage is suggested here. Maximum likelihood estimation is a technique for estimating parameters that has some useful properties, especially with large samples, but maximum likelihood estimation does not appear to address the relative merits of LER, FER, and RLR.

7. The IAV/Stryker IOT is a designed experiment. To take advantage of it, there is a standard log-linear modeling approach for analyzing attrition data that uses RLR statistics. There are equally good statistical approaches that can be used with the FER and the LER.

Fratricide and Civilian Casualties

ATEC has correctly raised the importance of developing suitable MOEs for fratricide (friendly casualties caused by friendly forces) and civilian casualties caused by friendly fires. It is hypothesized that the IBCT/Stryker weapon capabilities and the capabilities of its C4ISR suite will reduce its potential for fratricide and civilian casualties compared with the baseline.
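The algebraic relationships among the three ratios in equations (1)-(4), FER = LER/FR_0, RLR = LER x SVER = LER/FR_t, and the equivalence of RLR > 1 with FR_0 > FR_t, can be checked numerically; a minimal sketch with invented force sizes:

```python
def ratios(N, M, n, m):
    """Casualty ratio MOEs from initial forces (N, M) and survivors (n, m)."""
    ler = (N - n) / (M - m)              # loss exchange ratio, eq. (1)
    fer = ((N - n) / N) / ((M - m) / M)  # force exchange ratio, eq. (2)
    rlr = ler * (m / n)                  # relative loss ratio, eq. (3): LER * SVER
    return ler, fer, rlr

N, M, n, m = 60, 30, 40, 22              # hypothetical engagement
ler, fer, rlr = ratios(N, M, n, m)
fr0, frt = N / M, n / m                  # initial and time-t force ratios

assert abs(fer - ler / fr0) < 1e-12      # eq. (2): FER = LER / FR_0
assert abs(rlr - ler / frt) < 1e-12      # eq. (4): RLR = LER / FR_t
assert (rlr > 1) == (fr0 > frt)          # RLR > 1 iff the imbalance is shrinking
print(round(ler, 3), round(fer, 3), round(rlr, 3))  # 2.5 1.25 1.375
```

Note that in this example all three measures point the same way, consistent with the panel's observation that the RLR adds no information beyond the casualty and survivor quantities already present in the LER and FER.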
The June 2002 SEP states that in order to test this hypothesis, the "standard" RLR and fratricide RLR (where casualties caused by friendly forces are used in place of OPFOR casualties) will be compared for both the IBCT and the LIB. A similar comparison would be done using a civilian casualties RLR.

However, the RLR (as well as the LER and the FER) is not an appropriate MOE to use, not only for the reasons noted above, but also because it does not consider the appropriate fundamental phenomena that lead to
fratricide (or civilian) casualties. These casualties occur when rounds fired at the enemy go astray (for a variety of possible reasons, including erroneous intelligence information, false detections, target location errors, aiming errors, weapons malfunction, etc.). Accordingly, we recommend that ATEC report, as one MOE, the number of such casualties for IBCT/Stryker and the baseline force and also compute a fratricide frequency (FF) defined as

FF = (number of fratricide casualties) / (number of rounds fired at the enemy)

and a similarly defined civilian casualty frequency (CF). The denominator could be replaced by any number of other measures of the intensity (or level) of friendly fire.

Advantages of FER and LER Over RLR

The FER and the LER have served the Army analysis community well for many decades as mission-level MOEs for campaigns, battles, and engagements. Numerous studies have evidenced their utility and correlation to mission success. Accordingly, until similar studies show that the RLR is demonstrably superior in these dimensions, ATEC should use FER (and LER when appropriate), but not the RLR, as the primary mission-level MOE for analyses of engagement results. Our preference for using the FER, instead of the RLR, is based on the following reasons:

· The FER has been historically correlated with the probability of mission success (i.e., winning an engagement/battle), and the RLR has not.

· There is strong historical and simulation-based evidence that the FER is a valid measure of a force's warfighting capability given its strong correlation with win probability and casualties. It has been useful as a measure of defining "decisive force" for victory.

· The Army analysis community has used, and found useful, FER and LER as the principal MOEs in thousands of studies involving major theater war and SSC combat between forces that can inflict noticeable casualties on each other.
There is no similar experience with the RLR.

· There is no compelling evidence that the purported advantages of the RLR presented by ATEC and summarized above are valid. There is little understanding of or support for its properties or value for analysis.

· Using ratio measures such as FER and LER is already a challenge to the interpretation of results when seeking causal insights. The RLR adds
another variable (survivors) to the LER ratio (making it more difficult to interpret the results) but does not add any new information, since it is perfectly (albeit negatively) correlated with the casualty variables already included in the FER and the LER.

Scenario-Specific and System Degradation Measures

ATEC states that the main Army and Department of Defense (DoD) question that needs to be answered during the Stryker operational test is: Is a Stryker-equipped force more effective than the current baseline force? The TEMP states that:

The Stryker has utility in all operational environments against all projected future threats; however, it is designed and optimized for contingency employment in urban or complex terrain while confronting low- and mid-range threats that may display both conventional and asymmetric warfare capabilities.

This statement points directly to the factors that have been used in the current test design: terrain (rural and urban), OPFOR intensity (low, medium, high), and mission type (raid, perimeter defense, security operations in a stability environment). These factors are the ones ATEC wants to use to characterize if and when the Stryker-equipped force is better than the baseline and to help explain why.

The Stryker SEP defines effectiveness and performance criteria and assigns a numbering scheme to these criteria and their associated measures. In the discussion below, the numbering of criteria adheres to the Stryker SEP format (U.S. Department of Defense, 2002c).

There are three sets of measures that are appropriate for assessing each of the three mission types. These are detailed in the measures associated with Criterion 4-1: Stryker systems must successfully support the accomplishment of required operations and missions based on standards of performance matrices and associated mobility and performance requirements.
In particular, the measures of effectiveness for Criterion 4-1 are:

MOE 4-1-1 Mission accomplishment.
MOE 4-1-2 Performance ratings on selected tasks and subtasks from the applicable performance assessment matrices while conducting operations at company, platoon, squad, and section level.
MOE 4-1-3 Relative attrition.

These measures of effectiveness have been addressed in the previous sections.

In addition, however, ATEC would like to know why there are differences in performance between the Stryker-equipped force and the baseline force. The reasons for performance differences can be divided into two categories: Stryker capabilities and test factors.

Stryker capabilities include situation awareness (which contributes to survival by avoidance), responsiveness, maneuverability, reliability-availability-maintainability (RAM), lethality, survivability (both ballistic and nonballistic), deployability, transportability, and logistics supportability. Test factors include time of day, time of year, weather, nuclear/biological/chemical (NBC) environment, personnel, and training. Measures for reliability are addressed in detail later in this chapter; test factors are addressed in Chapter 4.

With the exception of situation awareness, responsiveness, maneuverability, and RAM, the current SEP addresses each capability using more of a technical than an operational assessment. The IBCT/Stryker IOT is not designed to address (and cannot be redesigned to address) differences in performance due to lethality, survivability, deployability, transportability, or logistics supportability. Any difference in performance that might be attributed to these factors can only be assessed using the military judgment of the evaluator supported by technical and developmental testing and modeling and simulation.
The current capability measures for situation awareness, responsiveness, and maneuverability are associated with Criterion 4-2 (the Stryker systems must be capable of surviving by avoidance of contact through integration of system speed, maneuverability, protection, and situation awareness during the conduct of operations) and Criterion 4-3 (the Stryker must be capable of hosting and effectively integrating existing and planned Army command, control, communications, computers, intelligence, surveillance, and reconnaissance or C4ISR systems). The associated MOEs are:

MOE 4-2-1 Improvement of force protection
MOE 4-2-2 Improvement in mission success attributed to information
MOE 4-2-3 Contributions of Army battle command systems (ABCS) information to Stryker survival

MOE 4-2-4 How well did the ABCS allow the commander and staff to gain and maintain situation awareness/understanding?
MOE 4-3-1 Ability to host C4ISR equipment and its components
MOE 4-3-2 Integration effectiveness of C4ISR demonstrated during the product verification test
MOE 4-3-3 Interoperability performance for the Stryker C4ISR in technical testing
MOE 4-3-4 Capability of the Stryker C4ISR to withstand external and internal environmental effects IAW MIL-STD 810F and/or DTC Test Operation Procedures (TOP)
MOE 4-3-5 Capability to integrate MEP and FBCB2 data

The measures associated with Criterion 4-3 are primarily technical and address the ability of the existing hardware to be integrated onto the Stryker platforms. As with many of the other capabilities, any difference in performance that might be attributed to hardware integration will be assessed using the military judgment of the evaluator supported by technical and developmental testing.

The problem with most of the MOPs associated with Criterion 4-2 (see Table 3-1) is that they are not unambiguously measurable. For example, consider MOP 4-2-2-2, communications success. The definition of success is, of course, very subjective, even with the most rigorous and validated SME training. Moreover, the distinction between transfer of information and the value of the information is important: communications can be successful in that there is a timely and complete transfer of critical information, but at the same time unsuccessful if that information is irrelevant or misleading. Or, for another example, consider MOP 4-2-1-3, incidents of BLUFOR successful avoidance of the adversary. Whether this criterion has been met can be answered only by anecdote, which is not usually considered a reliable source of data.
Note that there is no clear numerator or denominator for this measure, and merely counting the frequency of incidents does not provide a reference point for assessment.

Two other categories of measures that could be more useful in assessing performance differences attributable to situation awareness, responsiveness, and maneuverability are scenario-specific and system degradation measures.

TABLE 3-1 MOPs for Criterion 4-2

MOE 4-2-1 Improvement in force protection
  MOP 4-2-1-1 Relative attrition
  MOP 4-2-1-2 Mission success rating
  MOP 4-2-1-3 Incidents of BLUFOR successful avoidance of the adversary
  MOP 4-2-1-4 Incidents where OPFOR surprises the BLUFOR
MOE 4-2-2 Improvement in mission success attributed to information
  MOP 4-2-2-1 Initial mission, commander's intent and concept of the operations contained in the battalion and company operations and fragmentary orders
  MOP 4-2-2-2 Communications success (use MOE 4-3-5: Capability to integrate MEP and FBCB2 data)
MOE 4-2-3 Contributions of ABCS information (C2, situation awareness, etc.) to Stryker survival
  MOP 4-2-3-1 What were the ABCS message/data transfer completion rates (MCR)?
  MOP 4-2-3-2 What were the ABCS message/data transfer completion times (speed of service)?
  MOP 4-2-3-3 How timely and relevant/useful was the battlefield information (C2 messages, targeting information, friendly and enemy situation awareness updates, dissemination of orders and plans, alerts and warnings) provided by ABCS to commanders and staffs?
MOE 4-2-4 How well did the ABCS allow the commander and staff to gain and maintain situation awareness/understanding?
  MOP 4-2-4-1 Friendly force visibility
  MOP 4-2-4-2 Friendly position data distribution
  MOP 4-2-4-3 Survivability/entity data distribution

Scenario-Specific Measures

Scenario-specific measures are those that are tailored to the exigencies of the particular mission-script combinations used in the test. For example, in the perimeter defense mission, alternative measures could include answers to questions such as:

· Did the red force penetrate the perimeter? How many times?
· To what extent was the perimeter compromised (e.g., percentage of perimeter compromised, taking into account the perimeter shape)?
· How far in from the perimeter was the red force when the penetration was discovered?

· How long did it take the red force to penetrate the perimeter?
· What fraction of time was the force protected while the OPFOR was (or was not) actively engaged in attacking the perimeter?

For a raid (or assault) mission, measures might include:

· Was the objective achieved?
· How long did it take to move to the objective?
· How long did it take to secure the objective?
· How long was the objective held (if required)?

For SOSE missions, measures might include:

· For "show the flag" and convoy escort: How far did the convoy progress? How long did it take to reach the convoy? How much time transpired before losses occurred?
· For route and reconnaissance: How much information was acquired? What was the quality of the information? How long did it take to acquire the information?

We present here the principle that useful objective measures can be tied to the specific events, tasks, and objectives of missions (the unit of measurement need not always be at the mission level or at the level of the individual soldier), and so the measures suggested are intended as exemplary, not as exhaustive. Other measures could easily be tailored to such tasks as conducting presence patrols, reaching checkpoints, searching buildings, securing buildings, enforcing curfews, etc. These kinds of measures readily allow for direct comparison to the baseline, and definitions can be written so that they are measurable.

System Degradation Measures: Situation Awareness as an Experimental Factor

The other type of measure that would be useful in attributing differences to a specific capability results from degrading this capability in a controlled manner. The most extreme form of degradation is, of course, complete removal of the capability. One obvious Stryker capability to test in this way is situation awareness. The IBCT equipped with Stryker is intended to provide more combat effectiveness than the LIB and be more

strategically deployable than a heavy MIB. More combat effectiveness is achieved by providing the IBCT with significantly more firepower and tactical mobility (vehicles) than the LIB. Improved strategic mobility is provided by designing the IBCT systems with significantly less armor, thus making them lighter than systems in the heavy MIB, but at a potential price of being more vulnerable to enemy fire. The Army has hypothesized that this potential vulnerability will be mitigated by Stryker's significantly improved day and night situation awareness and C4ISR systems such as FBCB2,6 second-generation forward-looking infrared systems, unmanned aerial vehicles, and other assets.

If all C4ISR systems perform as expected and provide near-perfect situation awareness, the IBCT should have the following types of advantages in tactical engagements over the LIB (which is expected to have much less situation awareness):

· IBCT units should be able to move better (faster, more directly) by taking advantage of the terrain and having common knowledge of friendly and enemy forces.
· With better knowledge of the enemy, IBCT units should be able to get in better positions for attack engagements and to attack more advantageously day or night by making effective use of cover in approaches to avoid enemy fires. They could structure attacks against the enemy in two directions (thus making him fight in two directions) with little or no risk of surprise ambushes by threat forces.
· IBCT units and systems should be able to acquire more enemy targets accurately at longer ranges, especially at night, facilitating more effective long-range fire.
· IBCT systems should be able to rapidly "hand off" targets to enhance unit kill rates at all ranges.
· Using combinations of the above situation awareness advantages, IBCT units should be capable of changing traditional attacker-defender battle dynamics favoring the defender at long ranges and the attacker at shorter ranges. Attacking IBCT systems should be able to avoid long-range defender fires or attrit many of the defenders at long range before closing with them.

6FBCB2 is a top-down fed command and control system that is supposed to provide the IBCT with timely and accurate information regarding all friendly and enemy systems.

The Army has yet to test the underlying hypothesis that the enhanced situation awareness/C4ISR will in fact make the IBCT/Stryker less vulnerable and more effective. As currently designed, the IOT (which compares the effectiveness of IBCT/Stryker with the LIB in various missions) cannot test this hypothesis since the IBCT/Stryker is presumably more effective than the LIB for many criteria (mobility, lethality, survivability, etc.), not just in its situation awareness/C4ISR capability. To most effectively test the underlying hypothesis, the IOT design should make situation awareness/C4ISR an explicit factor in the experiment, preferably with multiple levels, but at a minimum using a binary comparison. That is, the design should be modified to explicitly incorporate trials of the IBCT/Stryker both with and without its improved situation awareness/C4ISR in both daytime and nighttime scenarios. It is not sufficient to rely on test conditions (e.g., the unreliability of the hardware itself) to provide opportunities to observe missions without situation awareness. There must be a scripted turning off of the situation awareness hardware. This kind of controlled test condition leads to results that can be directly attributed to the situation awareness capability.

If this type of test modification is not feasible, then the underlying hypothesis should be tested using appropriate simulations at either the Intelligence School or TRAC-FLVN (Ft. Leavenworth). Although the hypothesis may not be testable in the IOT as currently designed, ATEC may be able to determine some of the value of good situation awareness/C4ISR by assessing the degree to which the situation awareness-related advantages noted above are achieved by the IBCT/IAV in combat missions. To accomplish this:

· SMEs should assess whether the IBCT/Stryker units move through the terrain better (because of better information, not better mobility) than LIB units.
· SMEs should assess whether IBCT/Stryker units get in better positions (relative to enemy locations) for attack engagements than LIB units and are able to design and implement attack plans with more covered attack routes to avoid enemy fires (i.e., reduce their vulnerability).
· ATEC should collect target acquisition data by range and by type (visual, pinpoint) for day and night missions to determine whether IBCT/Stryker systems have the potential for more long-range fires than LIB systems. ATEC should also record the time and range distribution of actual fire during missions.

· ATEC should determine the number of hand-off targets during engagements to see if the IBCT force is really more "net-centric" than the LIB.
· From a broader perspective, ATEC should compute the instantaneous LER throughout engagements to see if improved situation awareness/C4ISR allows the IBCT force to advantageously change traditional attacker-defender battle dynamics.

OBJECTIVE MEASURES OF SUITABILITY

The overall goal of the IOT is to assess baseline force versus IBCT/Stryker force effectiveness. Because inadequate levels of reliability and maintainability (R&M) would degrade or limit force effectiveness, R&M performance is important in evaluating the Stryker system. We note in passing that R&M performance will affect both sides of the comparison. It is not clear whether an assessment of baseline R&M performance is envisioned in the IOT. Such an assessment would provide an important basis for comparison and might give insights on many differences in R&M effectiveness.

Reliability

Criterion 1-3 states: "The Stryker family of interim armored vehicles (excluding GFE components and systems) will have a reliability of 1,000 mean miles between critical failures (i.e., system aborts)." This requirement is raised to 2,000 mean miles for some less stressed vehicle types. These failures could be mechanical vehicle failures or failures due to vehicle/GFE interface issues. Although GFE failures themselves don't contribute to this measure, they should and will be tracked to assess their role in the force effectiveness comparison.

The IOT is not only key to decisions about meeting R&M criteria and systems comparisons, but it also should be viewed as a shakedown exercise. The IOT will provide the first view of the many mechanical and electronic pieces of equipment that can fail or go wrong in an operational environment.
Some failures may repeat, while others will take a fair amount of IOT exposure to manifest themselves for the first time. Thus the IOT provides an opportunity for finding out how likely it is that other new failure issues may crop up.

For this reason, failure incidents should be collected for all vehicles for their entire lives on a vehicle-by-vehicle basis, even though much of the

data may not serve the express purposes of the IOT. Currently it appears that only the Army test incident reporting system will be used. Suitable databases to maintain this information should be established.

In the remainder of this section we discuss four important aspects of reliability and maintainability assessment:

· failure modes (distinguishing between them and modeling their failure time characteristics separately);
· infant mortality, durability/wearout, and random failures (types and consequences of these three types of failure modes);
· durability accelerated testing and add-on armor; and
· random failures, GFE integration, and scoring criteria.

Failure Modes

Although the TEMP calls for reporting the number of identified failures and the number of distinct failure modes, these are not sufficient metrics for making assessments about systems' RAM. Failures need to be classified by failure mode. Those modes that are due to wearout have different data-recording requirements from those that are due to random causes or infant mortality. For wearout modes, the life lengths of the failed parts/systems should be observed, as well as the life lengths of all other equivalent parts that have not yet failed. Life lengths should be measured in the appropriate time scale (units of operating time, or operating miles, whichever is more relevant mechanistically). Failure times should be recorded both in terms of the life of the vehicle (time/miles) and in terms of time since last maintenance. If there are several instances of failure of the same part on a given vehicle, a record of this should be made. If, for example, the brake or tire that fails or wears out is always in the same position, this would be a significant finding that would serve as input for corrective action.

Different kinds of failure modes have different underlying hazard functions (e.g., constant, increasing, or decreasing).
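These three hazard shapes can be illustrated with the Weibull family, whose hazard is decreasing for shape parameters below 1 (defect-related/infant mortality failures), constant at 1 (the exponential, random-failure case), and increasing above 1 (wearout). The sketch below, using illustrative parameter values only, also shows why a single summary mean can hide these differences: two distributions with identical 1,000-mile means assign different probabilities to completing a hypothetical 100-mile mission.

```python
import math

def weibull_hazard(t, shape, scale=1000.0):
    # Weibull hazard: h(t) = (shape/scale) * (t/scale)**(shape - 1)
    return (shape / scale) * (t / scale) ** (shape - 1.0)

# Decreasing, constant, and increasing hazards from one family.
for label, k in [("infant mortality (shape < 1)", 0.5),
                 ("random (shape = 1)", 1.0),
                 ("wearout (shape > 1)", 2.0)]:
    print(f"{label}: h(100)={weibull_hazard(100, k):.5f}, "
          f"h(900)={weibull_hazard(900, k):.5f}")

# Two distributions with the SAME 1,000-mile mean but different hazard
# shapes give different probabilities of finishing a 100-mile mission.
mission, mean_miles = 100.0, 1000.0
p_exponential = math.exp(-mission / mean_miles)
scale = mean_miles / math.gamma(1.0 + 1.0 / 2.0)  # Weibull shape 2, mean 1,000
p_wearout = math.exp(-((mission / scale) ** 2.0))
print(f"P(complete mission): exponential {p_exponential:.3f}, "
      f"wearout Weibull {p_wearout:.3f}")
```

This is why the report argues below for mode-specific MOPs rather than a single fleet-wide mean: the mission-completion probability, not the mean alone, is the operationally relevant quantity.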
When considering the effect of RAM on system effectiveness, it is potentially misleading to report the reliability of a system or subsystem in terms of a MOP that is based on a particular but untested assumption. For example, reporting of only the "mean time to failure" is sufficiently informative only when the underlying failure time distribution has only a single unknown parameter, such as a constant hazard function (e.g., an exponential distribution). One alternative is to report reliability MOPs separately for random types of failure

modes (constant hazard function), wearout failure modes (increasing hazard function), and defect-related failure modes (decreasing hazard function). These MOPs can then be used to assess the critical reliability performance measure: the overall probability of vehicle failure during a particular future mission.

Wearout failures may well be underrepresented in the IOT, since most vehicles are relatively new. They also depend heavily on the age mix of the vehicles in the fleet. For that reason, and to correct for this underrepresentation, it is important to model wearout failures separately.

Some measure of criticality (not just "critical" or "not critical") should be assigned to each failure mode so as to better assess the effect(s) of that mode. Further subdivision (e.g., GFE versus non-GFE) may also be warranted.

Data on the arrival process of new failure modes should be carefully documented, so that they can be used in developing a model of when new failure modes occur as a function of fleet exposure time or miles. The presumably widening intervals7 between the occurrence of new failure modes will enable an assessment of the chance of encountering any further and as yet unseen failure modes. The use of these data to make projections about the remaining number of unseen failure modes should be done with great care and appreciation of the underlying assumptions used in the projection methodology.

Although the different Stryker vehicle variants will probably have different failure modes, there is a reasonable possibility that information across these modes can be combined when assessing the reliability of the family of vehicles. In the current TEMP, failure modes from developmental test (DT) and IOT are to be assessed across the variants and configurations to determine the impact that the operational mission summary/mission profile and unique vehicle characteristics have on reliability estimates.
This assessment can be handled by treating vehicle variant as a covariate. Other uncontrollable covariates, such as weather conditions, could certainly have an impact, but it is not clear whether these effects can be sorted out cleanly. For

7Of course, these widening intervals are not likely to be true in the immediate period of transferring from developmental test to operational test, given the distinct nature of these test activities.

example, one could record the degree of wetness of soil conditions on a daily basis. This might help in sorting out the potential confounding of weather conditions under which a given force (IBCT or baseline) is operating. For example, if the IBCT were to run into foul weather halfway through its testing, and if certain failures appeared only at that time, one would be able to make a better case for ascribing the failures to weather rather than to the difference in force, especially if the baseline force does not run into foul weather.

Infant Mortality

Operational tests, to some extent, serve the purpose of helping to uncover and identify unknown system design flaws and manufacturing problems and defects. Such "infant mortality" problems are normally corrected by making design or manufacturing changes or through the use of sufficient burn-in so that the discovered infant mortality failure modes will no longer be present in the mature system.

The SEP describes no specific MOPs for this type of reliability problem. Indeed, the SEP RAM MOPs (e.g., estimates of exponential distribution mean times) assume a steady-state operation. Separate measures of the effects of infant mortality failures and the ability to eliminate these failure modes would be useful for the evaluation of Stryker system effectiveness.

Durability and Wearout

The IOT currently has no durability requirement, but issues may come up in the evaluation. Vehicles used in the IOT will not have sufficient operating time to produce reliable RAM data in general and especially for durability. Although the SEP mentions an historical 20,000-mile durability requirement, the Stryker system itself does not have a specified durability requirement. ATEC technical testing will, however, look at durability of the high-cost components. In particular, in DT, the infantry carrier vehicle will be tested in duration tests to 20,000 miles.
Add-On Armor

Whether or not vehicles are outfitted with their add-on armor (AoA) can be expected to have an important impact on certain reliability metrics. The AoA package is expected to increase vehicle weight by 20 percent. The

added weight will put additional strain on many operating components, particularly the vehicle power train and related bearings and hydraulic systems. The additional weight can be expected to increase the failure rate for all types of failure modes: infant mortality, random, and, especially, durability/wear. Because product verification test (PVT) and DT will be done in understressed conditions (that is, without AoA), any long-term durability problems that do show up can be judged to be extremely serious, and other problems that may exist are unlikely to be detected in IOT. Although the IOT will proceed without AoA (because it will not be ready for the test), weight packs should be used even if there is currently imperfect knowledge about the final weight distribution of the AoA. Doing this with different weight packs will go a long way to assess the impact of the weight on the reliability metrics. The details of the actual AoA weight distribution will presumably amount to only a small effect compared with the effect of the presence or absence of armor.

There is a need to use PVT, DT, and IOT results to support an early fielding decision for Stryker. Because of the absence of valid long-term durability data under realistic operating conditions (i.e., with AoA installed), the planned tests will not provide a reasonable degree of assurance that Stryker will have durability that is sufficient to demonstrate long-term system effectiveness, given the potential for in-service failure of critical components.

Some wearout failure modes (not necessarily weight-related) may show up during the IOT, but they are likely to be underrepresented compared with steady-state operation of the Stryker fleet, because the vehicles used in the IOT will be relatively new.
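Because the test vehicles are new, most wearout observations will be right-censored (parts still running when the test ends), and an estimate that averages only the observed failures will be biased low. A minimal sketch of the standard correction under a constant-hazard (exponential) assumption, with hypothetical mileages:

```python
# Failure miles for parts that failed, and accumulated miles for equivalent
# parts still running at test end (right-censored). Values are hypothetical.
failure_miles  = [800.0, 1200.0]
censored_miles = [1500.0, 1500.0, 1500.0, 1500.0]

# Naive estimate: average the observed failures only (ignores survivors).
naive_mttf = sum(failure_miles) / len(failure_miles)

# Censored maximum-likelihood estimate under a constant-hazard assumption:
# total exposure of ALL equivalent parts divided by the number of failures.
total_exposure = sum(failure_miles) + sum(censored_miles)
mle_mttf = total_exposure / len(failure_miles)

print(f"naive: {naive_mttf:.0f} miles, censored MLE: {mle_mttf:.0f} miles")
```

Here the naive estimate is 1,000 miles while the censored estimate is 4,000 miles; the gap is exactly the unfailed exposure the naive calculation throws away, which is why the exposure of every equivalent unfailed part should be recorded.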
For such failure modes it is important to capture the time to failure for each failed part/system and the time exposed without failure for each other equivalent part/system. This will enable correction for the underreporting of such failure modes and could lead to design or maintenance changes.

Random Failures, GFE, and Scoring Criteria

Random failures are those failures that are not characterized as either infant mortality or durability/wearout failures. These should be tracked by vehicle type and failure mode. Random failures are generally caused by events external to the system itself (e.g., shocks or accidents). The excessive occurrence of random failures of a particular failure mode during IOT may indicate the need for system design changes to make one or more vehicle

types more robust to such failure modes. Because of such potential experiences, it is important to track all of these random failure modes separately, even though it is tempting to lump them together to reduce paperwork requirements.

The reliability of the GFE integration is of special concern. The blending of GFE with the new physical platform may introduce new failure modes at the interface, or it may introduce new failure modes for the GFE itself due to the rougher handling and environment. R&M data will be analyzed to determine the impact of GFE reliability on the system and the GFE interfaces. Although GFE reliability is not an issue to be studied by itself in IOT, it may have an impact on force effectiveness, and for this reason R&M GFE data should be tracked and analyzed separately. Since the GFE on Stryker is a software-intensive system, software failure modes can be expected to occur. To the extent possible, MOPs that distinguish among software-induced failures in the GFE, other problems with the GFE, and failures outside the GFE need to be used.

R&M test data (e.g., test incidents) will be evaluated and scored at an official R&M scoring conference in accordance with the Stryker failure definition/scoring criteria. R&M MOPs will be calculated from the resulting scores. Determination of mission-critical failure modes should not, however, be a binary decision. Scoring should be on an interval scale between 0 and 1 rather than being restricted to 0 (failure) or 1 (nonfailure). For example, reporting 10 scores of 0.6 and 10 scores of 0.4 sends a different message, and contains much more information, than reporting 10 scores of 1 and 10 scores of 0. We also suggest the use of standard language in recording events to make scoring the events easier and more consistent. The use of standard language also allows for combining textual information across events and analyzing the failure event database.
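The information difference between the two reporting schemes in the scoring example above is easy to quantify: both sets of 20 scores have the same mean, but very different spreads.

```python
from statistics import mean, pstdev

interval_scores = [0.6] * 10 + [0.4] * 10  # graded judgments near the middle
binary_scores   = [1.0] * 10 + [0.0] * 10  # the same events forced to 0 or 1

for name, scores in [("interval", interval_scores), ("binary", binary_scores)]:
    print(f"{name}: mean={mean(scores):.2f}, spread={pstdev(scores):.2f}")
# Both means are 0.50, but the interval scores show that every incident was
# marginal, while the binary scores falsely suggest 10 clear failures and
# 10 clear non-failures.
```

Any roll-up computed from the binary scores discards exactly this spread information, which is the statistical content of the recommendation.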
Availability and Maintainability

MOPs for availability/maintainability, described in the SEP, include mean time to repair; the chargeable maintenance ratio (the ratio of chargeable maintenance time to the total amount of operating time); and preventive maintenance, checks, and services time required. Although these MOPs will be evaluated primarily using data obtained during DT, IOT information should be collected and used to complement this information.

Given that some reliability criteria are expressed as number of failures

per 1,000 miles, and since repair time is not measured in miles, an attempt should be made to correlate time (operating time, mission time) with miles so that a supportable comparison or translation can take place.

Contractors do initial maintenance and repair and then train the soldiers to handle these tasks. MOPs computed on the basis of DT-developed contract maintainers and repairmen may not accurately reflect maintainability and repair when soldiers carry out these duties. Therefore, contractor and soldier maintenance and repair data should not be pooled until it has been established that repair time distributions are sufficiently close to one another.

SUMMARY

Reporting Values of Measures of Effectiveness

1. Different MOEs should not be rolled up into a single overall number that tries to capture effectiveness or suitability.
2. Although ratio MOE values may be useful in assessing degrees of mission success, both the numerator and the denominator should be reported.

Subject-Matter Expert Measures

3. To help in the calibration of SME measures, each should be asked to review his or her own assessment of the Stryker IOT missions, for each scenario, immediately before he or she assesses the baseline missions (or vice versa).
4. ATEC should review the opportunities and possibilities for SMEs to contribute to the collection of objective data, such as times to complete certain subtasks, distances at critical times, etc.
5. The inter-SME rating variances from training data should be considered to be the equivalent of instrument error when making statistical inferences using ratings obtained from IOT.
6. The correlation between SME results and objective measures should be reported for each mission.
7. ATEC should consider using two separate SME rating scales: one for "failures" and another for "successes".
8.
As an alternative to the preceding recommendation, SMEs could assign ratings on a qualitative scale (for example, the five-point scale: "excellent," "good," "fair," "poor," and "unsatisfactory"). Any subsequent statistical analysis, particularly involving comparisons, would then involve the use of techniques suitable for ordered categorical variables.
9. If resources are available, more than one SME should be assigned to each unit and trained to make independent evaluations of the same tasks and subtasks.

Objective Casualty-Related Measures

10. FER (and the LER when appropriate), but not the RLR, should be used as the primary mission-level MOE for analyses of engagement results.
11. ATEC should use fratricide frequency and civilian casualty frequency (as defined in this chapter) to measure the amount of fratricide and collateral damage in a mission.

Objective Scenario-Specific and System Degradation Measures

12. Only MOPs that are unambiguously measurable should be used.
13. Scenario-specific MOPs should be added for SOSE missions.
14. Situation awareness should be introduced as an explicit test condition.
15. If situation awareness cannot be added as an explicit test condition, additional MOPs (discussed in this chapter) should be added as indirect measures of situation awareness.
16. ATEC should use the "instantaneous LER" measure to determine changes in traditional attacker/defender engagement dynamics due to improved situation awareness.

Measures of Reliability and Maintainability

17. The IOT should be viewed as a shakedown process and an opportunity to learn as much as possible about the RAM of the Stryker.
18. RAM data collection should be an ongoing enterprise. Failure and maintenance information should be tracked on a vehicle or part/system basis for the entire life of the vehicle or part/system. Appropriate databases should be set up.
This was probably not done with those Stryker vehicles already in existence, but it could be implemented for future maintenance actions on all Stryker vehicles.
19. With respect to the difficulty of reaching a decision regarding reliability, given limited miles and absence of add-on armor, weight packs should be used to provide information about the impact of additional weight on reliability.
20. Accelerated testing of specific system components prior to operational testing should be considered in future contracts to enable testing in shorter and more realistic time frames.
21. Failure modes should be considered separately rather than trying to develop failure rates for the entire vehicle using simple exponential models. The data reporting requirements vary depending on the failure rate function.

4

Statistical Design

In this chapter we first discuss some broader perspectives and statistical issues associated with the design of any large-scale industrial experiment. We discuss the designs and design processes that could be implemented if a number of constraints in the operational test designs were either relaxed or abandoned. Since the operational test design for the IBCT/Stryker is now relatively fixed, the discussion is intended to demonstrate to ATEC the advantages of various alternative approaches to operational test design that could be adopted in the future, and therefore the need to reconsider these constraints. This is followed by a brief description of the current design of the IBCT/Stryker initial operational test (IOT), accompanied by a review of the design conditioned on adherence to the above-mentioned constraints.

BROAD PERSPECTIVE ON EXPERIMENTAL DESIGN OF OPERATIONAL TESTS

Constrained Designs of ATEC Operational Tests

ATEC has designed the IOT to be consistent with the following constraints:

1. Aside from statistical power calculations, little information on the performance of IBCT/Stryker or the baseline Light Infantry Brigade

164 . IMPROVED OPERATIONAL TESTING AND EVALUATION (LIB) from modeling or simulation, developmental testing, or the per- formance of similar systems is used to impact the allocation of test samples in the test design. In particular, this precludes increasing the test sample size for environments for which the IBCT/Stryker or the baseline has proved to be problematic in previous tests. 2. The allocation of test samples to environments is constrained to reflect the allocation of use detailed in the operational mission summary/ mission profile (OMS/MP). 3. Operational tests are designed to test the system for typical stresses that will be encountered in the field. This precludes testing systems in more extreme environments to provide information on the limitations of system performance. 4. Possibly most important, operational tests are, very roughly speak- ing, single test events. It is currently not typical for an operational test either to be carried out in stages or to include use of smaller-scale tests with operationally relevant features focused on specific issues of interest. Reconsidering Operational Test Design: Initial Operational Testing Should Not Commence Until System Design Is Mature The above constraints do not need strict adherence, which will result in designs that have substantial disadvantages compared with current meth- ods used in industrial settings. The following discussion provides some characteristics of operational test designs that could be implemented if these constraints were relaxed or removed. There are two broad goals of any operational test: to learn about a ~ r 1 · r 1- · · · · r systems performance and its performance llmltatlons In a variety or set- tings and to confirm either that a system meets its requirements or that it outperforms a baseline system (when this is with respect to average perfor- mance over a variety of environments). 
A fundamental concern with the current approach adopted by ATEC is that both of these objectives are unlikely to be well addressed by the same test design; as a result, ATEC has (understandably) focused on the confirmatory objective, with emphasis on designs that support significance testing.

Given either a learning or a confirmatory objective, a requisite for operational testing is that it should not commence until the system design is mature. Developmental testing should be used to find major design flaws, including many of those that would typically arise only in operationally realistic conditions. Even fine-tuning the system to improve performance should be carried out during developmental testing. This is especially true for suitability measures. Operational testing performs a difficult and crucial role in that it is the only test of the system as a whole in realistic operational conditions. Operational testing can be used to determine the limitations and value, relative to a baseline system, of a new system in realistic operational conditions in carrying out various types of missions. While operational testing can reveal problems that cannot be discovered, or discovered as easily, in other types of testing, the primary learning that should take place during operational test is the development of a better understanding of system limitations, i.e., the circumstances under which the system performs less well and those under which it excels (relative to a baseline system). Discovering major design flaws during an operational test that could have been discovered earlier compromises the ability of the operational test to carry out these important functions.

The benefits of waiting until a system design is mature before beginning operational testing do not argue against the use of spiral development. In that situation, for a given stage of acquisition, one should wait until that stage of development is mature before entering operational test. That does not preclude the use of evolutionary acquisition for subsequent stages of development. (This issue is touched on in Chapter 6.)

Multiple Objectives of Operational Testing and Operational Test Design: Moving Beyond Statistical Significance as a Goal

Operational test designs need to satisfy a number of objectives. Major defense systems are enormously complicated, with performance that can change in important ways as a result of changes in many factors of interest. Furthermore, there are typically dozens of measures for which information on performance is needed.
These measures usually come in two major types: those used to compare a new system with a baseline system and those used to compare a new system with its requirements, as provided in the Operational Requirements Document (ORD). In nearly all cases, it is impossible to identify a single operational test design that is simultaneously best for identifying how various factors affect system performance for dozens of measures of interest.1 Test designs that would be optimal for the task of comparing a system with requirements would not generally be as effective for comparing a system with a baseline, and test designs that would be optimal for measures of suitability would not generally be excellent for measures of effectiveness. In practice, one commonly selects a primary measure (one that is of greatest interest), and the design is selected to perform well for that measure. The hope is that the other measures of interest will be related in some fashion to the primary measure, and therefore the test design to evaluate the primary measure will be reasonably effective in evaluating most of the remaining measures of interest. (If there are two measures of greatest interest, a design can be found that strikes a balance between the performance for the two measures.)

In addition, operational tests can have a number of broader goals:

1. to understand not only how much the various measures differ for the two systems but also why the measures differ;
2. to identify additional unknown factors that affect system performance or that affect the difference between the operation of the system being tested and the baseline system;
3. to acquire a better strategic understanding of the system, for example, to develop a greater understanding of the value of information, mobility, and lethality for performance;
4. to understand the demands on training and the need for system expertise in operating the system in the field; and
5. to collect sufficient information to support models and simulations on system performance.

1Though not generally feasible, the use of multiple baselines should sometimes be considered, since for some environments some baselines would provide little information as comparison systems.
The test and evaluation master plan (TEMP) states that the Stryker:

has utility in all operational environments against all projected future threats; however, it is designed and optimized for contingency employment in urban or complex terrain while confronting low- and mid-range threats that may display both conventional and asymmetric warfare capabilities.

Clearly, the operational test for Stryker will be relied on for a number of widely varied purposes. As stated above, ATEC's general approach to this very challenging problem focuses on the objective of confirming performance and uses the statistical concept of significance testing: comparing the performance of IBCT/Stryker against the baseline (LIB) to establish that the former is preferred to the latter. In addition, there is some testing against specific requirements (e.g., Stryker has a requirement for 1,000 mean miles traveled between failures). This approach, which results in the balanced design described in Chapter 2 (for a selected number of test design factors), does not provide as much information as other approaches could in assessing the performance of the system over a wide variety of settings. To indicate what might be done differently in the IBCT/Stryker IOT (and for other systems in the future), we discuss here modifications to the sample size, test design, and test factor levels.

Sample Size

Given that operational tests have multiple goals (i.e., learning and confirming for multiple measures of interest), arguments for appropriate sample sizes for operational tests are complicated. Certainly, sample sizes that support minimal power at reasonable significance levels for testing primary measures of importance provide a starting point for sample size discussions. However, for complicated, expensive systems, given the dynamic nature of system performance as a function of a number of different factors of importance (e.g., environments, mission types), it is rare that one will have sufficient sample size to achieve adequate power. (Some benefits in decreasing test sample size for confirmatory purposes can be achieved through use of sequential testing, when feasible.) Therefore, budgetary limitations will generally drive sample size calculations for operational tests. However, when that is not the case, the objectives of learning about system performance, in addition to that of confirming improvement over a baseline, argue for additional sample size so that these additional objectives can be addressed. Therefore, rather than base sample size arguments solely on power calculations, the Army needs to allocate as much funding as various external constraints permit to support operational test design.
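The interplay between power and sample size described above can be sketched with a normal-approximation power calculation. The effect size, standard deviation, and sample sizes below are illustrative assumptions, not values taken from the ATEC test plan:

```python
from statistics import NormalDist

def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test for detecting
    a mean difference `delta` with common standard deviation `sigma`."""
    z = NormalDist()
    # Standard error of the difference between two group means
    se = sigma * (2 / n_per_group) ** 0.5
    return z.cdf(delta / se - z.inv_cdf(1 - alpha))

# Hypothetical numbers: a modest effect relative to mission-to-mission
# noise needs far more replications than a typical IOT can afford.
for n in (4, 12, 36, 100):
    print(n, round(power_two_sample(delta=0.75, sigma=2.0, n_per_group=n), 2))
```

Under these assumptions, power is still below 0.5 at 36 missions per side, which is the sense in which budget, rather than a power target, usually ends up determining the size of an operational test.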
Testing in Scenarios in Which Performance Differences Are Anticipated

As mentioned above, ATEC believes that it is constrained to allocate test samples to mission types and environments to reflect expected field use, as provided in the OMS/MP. This constraint is unnecessary, and it works against the more important goal of understanding the differences between the IBCT/Stryker and the baseline and the causes of these differences. If a specific average (one that reflects the OMS/MP) of performance across mission type is desired as part of the test evaluation, a reweighting of the estimated performance measures within scenario can provide the desired summary measures a posteriori. Therefore, the issue of providing specific averages in the evaluation needs to be separated from the allocation of test samples to scenarios.

As indicated above, test designs go hand-in-hand with test goals. If the primary goal for ATEC in carrying out an operational test is to confirm that, for a specific average over scenarios that conforms to the OMS/MP missions and environments, the new system significantly outperforms the baseline, then allocations that mimic the OMS/MP may be effective. However, if the goal is one of learning about system performance for each scenario, then, assuming equal variances of the performance measure across scenarios, allocating test samples equally to test scenarios would be preferable to allocations that mimic the OMS/MP.

More broadly, general objectives for operational test design could include: (1) testing the average performance across scenarios (reflecting the OMS/MP) of a new system against its requirements, (2) testing the average performance of a new system against the baseline, (3) testing performance of a new system against requirements or against a baseline for individual scenarios, or (4) understanding the types of scenarios in which the new system will outperform the baseline system, and by how much. Each of these goals would generally produce a different optimal test design.

In addition to test objectives, test designs are optimized using previous information on system performance, typically means and variances of performance measures for the system under test and for the baseline system. This is a catch-22 in that the better one is able to target the design based on estimates of these quantities, the less one would need to test, because the results would be known.
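The reweighting point made above can be made concrete: an OMS/MP-consistent summary is a deterministic function of the per-scenario estimates and so can be computed after the fact from any allocation. The scenario labels, estimates, and weights below are hypothetical:

```python
# Per-scenario performance estimates from, say, an equal-allocation test
# (hypothetical values on a 0-1 scale), reweighted a posteriori to match
# an OMS/MP-style usage profile.
estimates = {"urban_high": 0.62, "rural_mid": 0.71, "rural_low": 0.80}
oms_mp_weights = {"urban_high": 0.2, "rural_mid": 0.3, "rural_low": 0.5}

assert abs(sum(oms_mp_weights.values()) - 1.0) < 1e-9
weighted_avg = sum(estimates[s] * oms_mp_weights[s] for s in estimates)
print(round(weighted_avg, 3))  # OMS/MP-weighted summary measure
```

Because the summary can always be recovered this way, nothing about the evaluation forces the test allocation itself to mimic the OMS/MP.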
Nevertheless, previous information can be extremely helpful in designing an operational test to allocate test samples to scenarios to address test objectives. Specifically, if the goal is to obtain high power, within each scenario, for comparing the new system with the baseline system on an important measure, then a scenario in which previous knowledge indicated that the mean performance of the new system was close to that of the baseline would receive a large sample allocation, to identify which system is, in fact, superior. But if the goal is to better understand system performance within the scenarios for which the new system outperforms the baseline system, previous knowledge that the mean performances were close would argue that test samples should be allocated to other test scenarios in which the new system might have a clear advantage.

Information from developmental tests, modeling and simulation, and the performance of similar systems with similar components should be used to target the test design to help it meet its objectives. For IBCT/Stryker, background documents have indicated that the Army expects that differences at low combat intensity may not be practically important but that IBCT/Stryker will be clearly better than the baseline for urban and high-intensity missions. If the goal is to best understand the performance of IBCT/Stryker in scenarios in which it is expected to perform well, it would be sensible to test very little in low-intensity scenarios, since there are unlikely to be any practically and statistically detectable differences in performance between IBCT/Stryker and the baseline. Understanding the advantages of IBCT/Stryker is a key part of the decision whether to proceed to full-rate procurement; therefore, understanding the degree to which Stryker is better in urban, high-intensity environments is important, and so relatively more samples should be allocated to those situations. There may be other expectations concerning IBCT/Stryker that ATEC could comfortably rely on to adapt the design to achieve various goals.

Furthermore, because the baseline has been used for a considerable length of time, its performance characteristics are better understood than those of IBCT/Stryker. While this may be less clear for the specific scenarios under which IBCT/Stryker is being tested, allocating 42 scenarios to the baseline system may be inefficient compared with allocating greater test samples to IBCT/Stryker scenarios.

Testing with Factors at High Stress Levels

A general rule of thumb in test design is that testing at extremes is often more informative than testing at intermediate levels, because information from the extremes can often be used to estimate what would have happened at intermediate levels.
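Assuming a response that is roughly linear in a stress factor, the rule of thumb can be quantified: the variance of the fitted slope is sigma^2 divided by sum((x - xbar)^2), so pushing design points toward the extremes shrinks it. The six-run designs below are a made-up illustration:

```python
def slope_variance(xs, sigma2=1.0):
    """Variance of the least-squares slope estimate for design points xs:
    sigma^2 / sum((x - xbar)^2)."""
    xbar = sum(xs) / len(xs)
    return sigma2 / sum((x - xbar) ** 2 for x in xs)

# Six test runs on a 0-1 stress scale: all at the extremes versus
# spread evenly over intermediate levels.
extremes = [0, 0, 0, 1, 1, 1]
spread = [0, 0.2, 0.4, 0.6, 0.8, 1]
print(slope_variance(extremes), slope_variance(spread))
```

The extreme design estimates the stress effect about twice as precisely (2/3 versus roughly 1.43), though this advantage leans on the linearity assumption when interpolating back to typical conditions.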
In light of this, it is unclear how extreme the high-intensity conflict is, as currently scripted. For example, would the use of 300 OPFOR players be more informative than current levels? Our impression is that, in general, operational testing tests systems at typical stress levels. If testing were carried out in somewhat more stressful situations than are likely to occur, information would be obtained about when a system is likely to start breaking down, as well as about system performance at typical levels of stress (although interpolation from the stressful conditions back to typical conditions may be problematic). Such a trade-off should be considered in the operational test design for IBCT/Stryker. In the following section, a framework is suggested in which the operational test is separated into a learning component and a confirming component. Clearly, testing with factors at high stress levels naturally fits into the learning component of that framework, since it is an important element in developing a complete understanding of the system's capabilities and limitations.

Alternatives to One Large Operational Test

In the National Research Council's 1998 report Statistics, Testing, and Defense Acquisition, two possibilities were suggested as alternatives to large operational tests: operational testing carried out in stages and small-scale pilot tests. In this section, we discuss how these ideas might be implemented by ATEC.

We have classified the two basic objectives of operational testing as learning what a system is (and is not) capable of doing in a realistic operational setting, and confirming that a new system's performance is at a certain level or outperforms a baseline system. Addressing these two types of objectives in stages seems natural, with the objective at the first stage being to learn about system performance and the objective at the second stage being to confirm a level of system performance.

An operational test could be phased to take advantage of this approach: the first phase might be to examine IBCT/Stryker under different conditions, to assess when this system works best and why. The second phase would be used to compare IBCT/Stryker with a baseline; it would serve as the confirmation experiment used to support the decision to proceed to full-rate production. In the second phase, IBCT/Stryker would be compared with the baseline only in the best and worst scenarios. This broad testing strategy is used by many companies in the pharmaceutical industry and is more fully described in Box, Hunter, and Hunter (1978).

Some of the challenges now faced by ATEC result from an attempt to simultaneously address the two objectives of learning and confirming.
Clearly, the two objectives will often require very different designs. Although there are pragmatic reasons why a multistage test may not be feasible (e.g., difficulty reserving test facilities and scheduling soldiers to carry out the test missions), if these reasons can be addressed the multistage approach has substantial advantages. For example, since TRADOC already conducts some of the learning phase, its efforts could be better integrated with those of ATEC. Also, a multistage process would have implications for how developmental testing is carried out, especially with respect to the need to have developmental testing make use of as much operational realism as possible, and to have the specific circumstances of developmental test events documented and archived for use by ATEC. An important advantage of this overall approach is that the final operational test may turn out to be smaller than is currently the case.

When specific performance or capability questions come up in the early part of operational testing, small-scale pilot tests, focused on the analysis of these questions, should be seriously considered. For example, the value of situation awareness is not directly addressed by the current operational test for IBCT/Stryker (unless the six additional missions identified in the test plan are used for this purpose). It would be very informative to use Stryker with situation awareness degraded or "turned off" to determine the value that it provides in particular missions (see Chapter 3).

COMMENTS ON THE CURRENT DESIGN IN THE CONTEXT OF CURRENT ATEC CONSTRAINTS

Using the arguments developed above and referring to the current design of the operational test as described in Chapter 2 (and illustrated in Table 2-1), the discussion that follows takes into account the following constraints of the current test design:

1. Essentially no information about the performance of IBCT/Stryker or the baseline has been used to impact the allocation of test samples in the test design.
2. The allocation of test samples to scenarios is constrained to reflect the allocation of use detailed in the OMS/MP.
3. Operational tests are designed to test the system for typical stresses that will be encountered in the field.
4. Operational tests are single test events.

Balanced Design

The primary advantage of the current operational test design is that it is balanced. This means that the test design covers the design space in a systematic and relatively uniform manner (specifically, three security operations in a stable environment, SOSE, appear for every two perimeter defense missions).
It is a robust design, in that the test will provide direct, substantial information from all parts of the design space, reducing the need to extrapolate. Even with moderate amounts of missing data, resulting from an inability to carry out a few missions, some information will still be available from all design regions.

Furthermore, if there are no missing data, the balance will permit straightforward analysis and presentation of the results. More specifically, estimation of the effect of any individual factor can be accomplished by collapsing the test results over the remaining factors. And, since estimates of the design effects are uncorrelated in this situation, inference for one effect does not depend on others. However, many of these potential advantages of balance can be lost if there are missing data. If error variances turn out to be heterogeneous, the balanced design will be inefficient compared with a design that would have a priori accommodated the heterogeneity.

The primary disadvantage of the current design is that there is a very strong chance that observed differences will be confounded by important sources of uncontrolled variation. The panel discussed one potential source of confounding in its October 2002 letter report (see Appendix A), which recommends that the difference in starting time between the IBCT/Stryker test missions and the baseline test missions be sufficiently shortened to reduce any differences that seasonal changes (e.g., in foliage and temperature) might cause. Other potential sources of confounding include: (1) player differences due to learning, fatigue, training, and overall competence; (2) weather differences (e.g., amount of precipitation); and (3) differences between IBCT/Stryker and the baseline with respect to the number of daylight and nighttime missions.

In addition, the current design is not fully orthogonal (or balanced), which is evident when the current design is collapsed over scenarios. For example, for company B in the SOSE mission type, the urban missions have higher intensity than the rural missions.
(After this was brought to the attention of ATEC, they were able to develop a fully balanced design, but they were too far advanced in the design phase to implement this change.) While the difference between the two designs appears to be small in this particular case, we are nevertheless disappointed that the best possible techniques are not being used in such an important program. This is an indication of the need for access (in this case, earlier access) to better statistical expertise in the Army test community, discussed in Chapter 6 (as well as in National Research Council, 1998a).

During the operational test, the time of day at which each mission begins is recorded, providing some possibility of checking for player learning and player fatigue. One alternative to address confounding due to player learning is to use four separate groups of players: one for each of the two OPFORs, one for the IBCT/Stryker, and one for the baseline system. Intergroup variability appears likely to be a lesser problem than player learning. Alternating teams from test replication to test replication between the two systems under test would be a reasonable way to address differences in learning, training, fatigue, and competence. However, we understand that either idea might be very difficult to implement at this date.2

The confounding factor of extreme weather differences between Stryker and the baseline system can be partially addressed by postponing missions during heavy weather (although this would prevent gaining an understanding of how the system operates in those circumstances). Finally, the lack of control for daylight and nighttime missions remains a concern. It is not clear why this variable could not have been included as a design variable.

Aside from the troublesome confounding issue (and the power calculations commented on below), the current design is competent from a statistical perspective. However, measures to address the various sources of confounding need to be seriously considered before proceeding.

Comments on the Power Calculations3

ATEC designed the IBCT/Stryker IOT to support comparisons of the subject-matter expert (SME) ratings between IBCT/Stryker and the baseline for particular types of missions, for example, high-intensity urban missions and medium-intensity rural SOSE missions. In addition, ATEC designed the operational test for IBCT/Stryker to support comparisons relative to attrition at the company level. ATEC provided analyses to justify the assertion that the current test design has sufficient power to support some of these comparisons. We describe these analyses and provide brief comments below.

SME ratings are reported on a subjective scale that ranges from 1 to 8.
SMEs will be assigned randomly, with one SME assigned per company and two SMEs assigned to each platoon mission. SMEs will be used to evaluate mission completion, protection of the force, and avoidance of collateral damage, which results in 10 to 16 comparisons per test. Assuming that the size of an individual significance test was set equal to 0.01, and that there are 10 different comparisons that are likely to be made, a Bonferroni-type argument shows that the overall size of the significance tests would be at most 0.1. In our view this control of individual errors is not crucial, and ATEC should instead examine two or three key measures and carry out the relevant comparisons with the knowledge that the overall type I error may be somewhat higher than the stated significance level.

Using previous experience, ATEC determined that it was important to have sufficient power to detect an average SME rating difference of 1.0 for high-intensity missions, 0.75 for medium-intensity missions, 0.50 for low-intensity missions, and 0.75 overall. (We have taken these critical rating differences as given, because we do not know how these values were justified; we have questioned above the allocation of test resources to low-intensity missions.) ATEC carried out simulations of SME differences to assess the power of the current operational test design for IBCT/Stryker. While this is an excellent idea in general, we have some concerns as to how these particular simulations were carried out.

First, due to the finite range of the ratings difference distribution, ATEC expressed concern that the nonnormality of average SME ratings differences (in particular, the short tail of the distribution) might affect the coverage properties of any confidence intervals produced in the subsequent analysis. We are convinced that even with relatively small sample sizes, the means of SME rating differences will be well represented by a normal distribution as a result of the structure of the original distribution and the central limit theorem, and that taking differences counters skewness effects.

2It is even difficult to specify exactly what one would mean by "equal training," since the amount of training needed for the IBCT to operate Stryker is different from that for a Light Infantry Brigade.

3The source for this discussion is U.S. Department of Defense (2002b).
Therefore the nonnormality of SME ratings differences should not be a major concern.

ATEC reports that they collected historical task-rating differences and determined that the standard deviation of these differences was 1.98, which includes contributions from both random variation and variation in performance between systems. ATEC then modeled SME rating scores for both IBCT/Stryker and the baseline using linear functions of the controlled variables from the test design. These linear functions were chosen to produce SME scores in the range between 1 and 8. ATEC then added to these linear functions a simulated random error variation of +1, 0, and -1, each with probability 1/3. The resulting SME scores were then truncated to make them integral (and to lie between 1 and 8). The residual standard error of differences of these scores was then estimated, using simulation, to be 1.2.4 In addition, SME ratings differences (which include random variation as well as modeled performance differences) were simulated, with a resulting observed standard deviation of 2.04. Since this value was close to the historical value of 1.98, it supported ATEC's view that the amount of random variation added was similar to what would be observed for SMEs in the field.

The residual standard error of the mean is defined to be the residual standard error divided by the square root of the sample size. So, when the test sample size that can be used for a comparison is 36 (essentially the entire operational test minus the 6 additional missions), the residual standard error of the mean will be 0.20; twice that is 0.40. ATEC's analysis argues that since 0.75 is larger than 0.40, the operational test will have sufficient statistical power to find a difference of 0.75 in SME ratings. The same argument was used to show that interaction effects estimated using test sample sizes of 18 or 12 would also have sufficient statistical power, but interaction effects estimated using test sample sizes of 6 or 4 would not have sufficient statistical power to identify SME ratings differences of 0.75. Furthermore, if the residual standard error of ratings differences were as high as 1.4, a sample size of 12 would no longer provide sufficient power to identify a ratings difference of 0.75.

Our primary concern with this analysis is that the random variation of SME scores has not been estimated directly. It is not clear why SME rating differences would behave similarly to the various historic measures (see Chapter 3). It would have been preferable to run a small pilot study to provide preliminary estimates of these measures and their variance.
If that is too expensive, ATEC should identify those values of the residual standard errors that provide sufficient power at a number of test sample sizes, as a means of assessing the sensitivity of the analysis to the estimation of these standard errors. (ATEC's point about the result when the residual standard deviation is raised to 1.4 is a good start to this analysis.)

ATEC has suggested increasing statistical power by combining the ratings for a company mission, or by combining ratings for company and platoon missions. We are generally opposed to this idea if it implies that the uncombined ratings will not also be reported.

4For this easy example, simulation was not needed, but simulation might be required in more complicated situations.
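The arithmetic behind the 1.2 and 0.20 figures above can be checked with a small simulation of the stated noise model (+1, 0, or -1, each with probability 1/3, on each system's score). Truncation of scores to the 1-8 range, which ATEC's fuller simulation includes, is omitted here, so the simulated value comes out slightly below 1.2:

```python
import random
from statistics import stdev

random.seed(1)

def rating_noise():
    # ATEC's error model: +1, 0, or -1, each with probability 1/3
    return random.choice([-1, 0, 1])

# A rating *difference* carries one noise term from each system,
# so its standard deviation is sqrt(2 * 2/3), roughly 1.15.
diffs = [rating_noise() - rating_noise() for _ in range(100_000)]
sd_diff = stdev(diffs)

# Residual standard error of the mean at the report's n = 36,
# using the report's simulated residual standard error of 1.2.
se_mean = 1.2 / 36 ** 0.5
print(round(sd_diff, 2), round(se_mean, 2), round(2 * se_mean, 2))
```

Note that this reproduces only the error model ATEC postulated; it does nothing to address our concern that the random variation of SME scores was never estimated directly.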

During various missions in the IBCT/Stryker operational test, the number of OPFOR players ranges from 90 to 220, and the number of noncombatant or blue force players is constant at 120. Across 36 missions, there are 10,140 potential casualties. For a subset of these (e.g., blue force players), the potential casualties range from 500 to 4,320. ATEC offered analysis asserting that with casualty rates of 13 percent for the baseline and 10 percent for IBCT/Stryker, it will be possible to reject the null hypothesis of equal casualty rates for the two systems under test with statistical power greater than 75 percent. It is not clear what distribution ATEC has assumed for casualty counts, but likely candidates are binomial and Poisson models.

That analysis may be flawed in that it makes use of an assumption that is unlikely to hold: that individual casualties are independent of one another. Clearly, battles that go poorly initially are likely to result in more casualties, due to a common conditioning event that makes individual casualty events dependent. As a result, these statistical power calculations are unlikely to be reliable. Furthermore, not only are casualties not independent, but even if they were, they should not be rolled up across mission types. For example, with respect to assessment of the value of IBCT/Stryker, one casualty in a perimeter defense mission does not equal one casualty in a raid.

The unit of analysis appears to be a complicated issue in this test. For example, the unit of analysis is assumed by ATEC to be a mission or a task for SMEs, but it is assumed to be an individual casualty for the casualty rate measures. Both positions are somewhat extreme. The mission may in some cases be too large to use as the unit of analysis.
Individual skirmishes and other events occurring within a mission could be assumed to be relatively independent and objectively assessed or measured, either by SMEs or by instrumentation. In taking this intermediate approach, the operational test could be shown to have much greater power to identify various differences than the SME analysis discussed above indicates.

Finally, we note that although the current operational test tests only company-level operations, brigade-level testing could be accomplished by using one real brigade-level commander supported by (a) two real battalion commanders, each supported by one real company and two simulated companies and (b) one simulated battalion commander supported by three simulated companies.
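The dependence concern raised above can be made concrete with a small Monte Carlo sketch. The 13 and 10 percent casualty rates and the 120 players per mission come from the discussion above; the beta-binomial clustering model and its correlation value are illustrative assumptions, not ATEC's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def power(p_base, p_new, n_missions=36, players=120, rho=0.0, reps=2000):
    """Monte Carlo power of a two-sample z-test on per-mission casualty rates.

    rho > 0 induces within-mission dependence via a beta-binomial model:
    each mission draws its own casualty probability, so individual
    casualties within a mission are positively correlated.
    """
    def draw(p):
        if rho == 0.0:
            return rng.binomial(players, p, n_missions)
        a, b = p * (1 - rho) / rho, (1 - p) * (1 - rho) / rho
        return rng.binomial(players, rng.beta(a, b, n_missions))

    rejections = 0
    for _ in range(reps):
        rx = draw(p_base) / players
        ry = draw(p_new) / players
        se = np.sqrt(rx.var(ddof=1) / n_missions + ry.var(ddof=1) / n_missions)
        if abs(rx.mean() - ry.mean()) / se > 1.96:
            rejections += 1
    return rejections / reps

print(power(0.13, 0.10, rho=0.0))   # independent casualties: power near 1
print(power(0.13, 0.10, rho=0.05))  # modest clustering: power drops sharply
```

Even a small intraclass correlation sharply inflates the variance of mission-level casualty rates, which is why power computed under an independence assumption is optimistic.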

PHASE I REPORT: STATISTICAL DESIGN 177

SUMMARY

It is inefficient to discover major design flaws during an operational test that could have been discovered earlier in developmental test. Operational test should instead focus its limited sample size on providing operationally relevant information sufficient to support the decision of whether to proceed to full-rate production, and sufficient to refine the system design to address operationally relevant deficiencies. The current design for the IBCT/Stryker operational test is driven by the overall goal of testing the average difference, but it is not as effective at providing information for different scenarios of interest. The primary disadvantage of the current design, in the context of current ATEC constraints, is that there is a distinct possibility that observed differences will be confounded by important sources of uncontrolled variation (e.g., factors associated with seasonal differences).

In the panel's view, it would be worthwhile for ATEC to consider a number of changes in the IBCT/Stryker test design:

1. ATEC should consider, for future test designs, relaxing various rules of test design that it adheres to, by (a) not allocating sample size to scenarios to reflect the OMS/MP, but instead using principles from optimal experimental design theory to allocate sample size to scenarios, (b) testing under somewhat more extreme conditions than typically will be faced in the field, (c) using information from developmental testing to improve test design, and (d) separating the operational test into at least two stages, learning and confirmatory.

2.
ATEC should consider applying to future operational testing in general a two-phase test design that involves, first, learning phase studies that examine the test object under different conditions, thereby helping testers design further tests to elucidate areas of greatest uncertainty and importance, and, second, a phase involving confirmatory tests to address hypotheses concerning performance vis-à-vis a baseline system or in comparison with requirements. ATEC should consider taking advantage of this approach for the IBCT/Stryker IOT. That is, examining in the first phase IBCT/Stryker under different conditions, to assess when this system works best, and why, and conducting a second phase to compare IBCT/Stryker to a baseline, using this confirmation experiment to support the decision to proceed to full-rate production. An important feature of the learning phase

is to test with factors at high stress levels in order to develop a complete understanding of the system's capabilities and limitations.

3. When specific performance or capability problems come up in the early part of operational testing, small-scale pilot tests, focused on the analysis of these problems, should be seriously considered. For example, ATEC should consider test conditions that involve using Stryker with situation awareness degraded or turned off to determine the value that it provides in particular missions.

4. ATEC should eliminate from the IBCT/Stryker IOT one significant potential source of confounding, seasonal variation, in accordance with the recommendation provided earlier in the October 2002 letter report from the panel to ATEC (see Appendix A). In addition, ATEC should also seriously consider ways to reduce or eliminate possible confounding from player learning and day/night imbalance. One possible way of addressing the concern about player learning is to use four separate groups of players for the two OPFORs, the IBCT/Stryker, and the baseline system. Also, alternating teams from test replication to test replication between the two systems under test would be a reasonable way to address differences in learning, training, fatigue, and competence.

5. ATEC should reconsider for the IBCT/Stryker their assumption concerning the distribution of SME scores and should estimate the residual standard errors directly, for example, by running a small pilot study to provide preliminary estimates; or, if that is too expensive, by identifying those SME score differences for which residual standard errors provide sufficient power at a number of test sample sizes, as a means of assessing the sensitivity of their analysis to the estimation of these standard errors.

6.
ATEC should reexamine their statistical power calculations for the IBCT/Stryker IOT, taking into account the fact that individual casualties may not be independent of one another.

7. ATEC should reconsider the current units of analysis for the IBCT/Stryker testing: a mission or a task for SME ratings, but an individual casualty for the casualty rate measures. For example, individual skirmishes and other events that occur within a mission should be objectively assessed or measured, either by SMEs or by instrumentation.

8. Given either a learning or a confirmatory objective, ignoring various tactical considerations, a requisite for operational testing is that it should not commence until the system design is mature.

Finally, to address the limitation that the current IBCT/Stryker IOT tests only company-level operations, ATEC might consider brigade-level testing, for example, by using one real brigade-level commander supported by (a) two real battalion commanders, each supported by one real company and two simulated companies, and (b) one simulated battalion commander supported by three simulated companies.

Data Analysis

The panel has noted (see the October 2002 letter report in Appendix A) the importance of determining, prior to the collection of data, the types of results expected and the data analyses that will be carried out. This is necessary to ensure that the designed data collection effort will provide enough information of the right types to allow for a fruitful evaluation. Failure to think about the data analysis prior to data collection may result in omitted explanatory or response variables or inadequate sample size to provide statistical support for important decisions. Also, if the questions of interest are not identified in advance but instead are determined by looking at the data, then it is not possible to formally address the questions, using statistical arguments, until an independent confirmatory study is carried out.

An important characteristic of the IBCT/Stryker IOT, probably in common with other defense system evaluations, is that there are a large number of measures collected during the evaluation. This includes measures of a variety of types (e.g., counts of events, proportions, binary outcomes) related to a variety of subjects (e.g., mission performance, casualties, reliability). In addition, there are a large number of questions of interest. For the IBCT/Stryker IOT, these include: Does the Stryker-equipped force outperform a baseline force? In which situations does the Stryker-equipped force have the greatest advantage? Why does the Stryker-equipped force outperform the baseline force? It is important to avoid "rolling up" the many measures into a small number of summary measures

PHASE I REPORT: DATA ANALYSIS 181

focused only on certain preidentified critical issues. Instead, appropriate measures should be used to address each of the many possible questions. It will sometimes, but certainly not always, be useful to combine measures into an overall summary measure.

The design discussion in Chapter 4 introduced the important distinction between the learning phase of a study and the confirmatory phase of a study. There we recommend that the study proceed in steps or stages rather than as a single large evaluation. This section focuses on the analysis of the data collected. The comments here are relevant whether a single evaluation test is done (as proposed by ATEC) or a series of studies are carried out (as proposed by the panel).

Another dichotomy that is relevant when analyzing data is that between the use of formal statistical methods (like significance tests) and the use of exploratory methods (often graphical). Formal statistical tests and procedures often play a large role in confirmatory studies (or in the confirmatory phase described in Chapter 4). Less formal methods, known as exploratory analysis, are useful for probing the data to detect interesting or unanticipated data values or patterns. Exploratory analysis is used here in the broad sense, to include but not to be limited by the methods described in Tukey (1977). Exploratory methods often make extensive use of graphs to search for patterns in the data. Exploratory analysis of data is always a good thing, whether the data are collected as part of a confirmatory study to compare two forces or as part of a learning phase study to ascertain the limits of performance for a system.

The remainder of this chapter reviews the general principles behind the formal statistical procedures used in confirmatory studies and those methods used in exploratory statistical analyses and then presents some specific recommendations for data analysis for the IBCT/Stryker IOT.
PRINCIPLES OF DATA ANALYSIS

Formal Statistical Methods in Confirmatory Analyses

A key component of any defense system evaluation is the formal comparison of the new system with an appropriately chosen baseline. It is usually assumed that the new system will outperform the baseline; hence this portion of the analysis can be thought of as confirmatory. Depending on the number of factors incorporated in the design, the statistical assessment could be a two-sample comparison (if there are no other controlled experimental or measured covariate factors) or a regression analysis (if there are other factors). In either case, statistical significance tests or confidence intervals are often used to determine if the observed improvement provided by the new system is too large to have occurred by chance.

Statistical significance tests are commonly used in most scientific fields as an objective method for assessing the evidence provided by a study. The National Research Council (NRC) report Statistics, Testing, and Defense Acquisition reviews the role and limitations of significance testing in defense testing (National Research Council, 1998a). It is worthwhile to review some of the issues raised in that report. One of the limitations of significance testing is that it is focused on binary decisions: the null hypothesis (which usually states that there is no difference between the experimental and baseline systems) is rejected or not. If it is rejected, then the main goal of the evaluation is achieved, and the data analysis may move to an exploratory phase to better understand when and why the new system is better. A difficulty with the binary decision is that it obscures information about the size of the improvement afforded by the new system, and it does not recognize the difference between statistical significance and practical significance. The outcome of a significance test is determined both by the amount of improvement observed and by the sample size. Failure to find a statistically significant difference may be because the observed improvement is less than anticipated or because the sample size was not sufficient. Confidence intervals that combine an estimate of the improvement provided by the new system with an estimate of the uncertainty or variability associated with the estimate generally provide more information.
Confidence intervals provide information about whether the hypothesis of "no difference" is plausible given the data (as do significance tests) but also inform about the likely size of the improvement provided by the system and its practical significance. Thus confidence intervals should be used with or in place of significance tests.

Other difficulties in using and interpreting the results of significance tests are related to the fact that the two hypotheses are not treated equally. Most significance test calculations are computed under the assumption that the null hypothesis is correct. Tests are typically constructed so that a rejection of the null hypothesis confirms the alternative that we believe (or hope) to be true. The alternative hypothesis is used to suggest the nature of the test and to define the region of values for which the null hypothesis is rejected. Occasionally the alternative hypothesis also figures in statistical

power calculations to determine the minimum sample size required in order to be able to detect differences of practical significance. Carrying out tests in this way requires trading off the chances of making two possible errors: rejecting the null hypothesis when it is true and failing to reject the null hypothesis when it is false. Often in practice, little time is spent determining the relative cost of these two types of errors, and as a consequence only the first is taken into account and reported.

The large number of outcomes being assessed can further complicate carrying out significance tests. Traditional significance tests often are designed with a 5 or 10 percent error rate, so that significant differences are declared to be in error only infrequently. However, this also means that if formal comparisons are made for each of 20 or more outcome measures, then the probability of an error in one or more of the decisions can become quite high. Multiple comparison procedures allow for control of the experiment-wide error rate by reducing the acceptable error rate for each individual comparison. Because this makes the individual tests more conservative, it is important to determine whether formal significance tests are required for the many outcome measures. If we think of the analysis as comprising a confirmatory and exploratory phase, then it should be possible to restrict significance testing to a small number of outcomes in the confirmatory phase. The exploratory phase can focus on investigating the scenarios for which improvement seems greatest using confidence intervals and graphical techniques. In fact, we may know in advance that there are some scenarios for which the IBCT/Stryker and baseline performance will not differ, for example, in low-intensity military operations; it does not make sense to carry out significance tests when we expect that the null hypothesis is true or nearly true.
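The multiplicity problem can be quantified with a short calculation (a sketch: the 20-measure count and the 5 percent level are taken from the discussion above, and the tests are assumed independent for simplicity):

```python
# Chance of at least one false rejection across m independent tests,
# each carried out at level alpha, when every null hypothesis is true.
def familywise_error(alpha, m):
    return 1 - (1 - alpha) ** m

print(familywise_error(0.05, 1))        # a single test: 5 percent
print(familywise_error(0.05, 20))       # 20 measures: roughly 64 percent
print(familywise_error(0.05 / 20, 20))  # Bonferroni: back near 5 percent
```

Bonferroni is conservative when the tests are correlated, but the qualitative point stands: unrestricted testing of every measure makes at least one spurious "significant" difference quite likely.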
It is also clearly important to identify the proper unit of analysis in carrying out statistical analyses. Often data are collected at several different levels in a study. For example, one might collect data about individual soldiers (especially casualty status), platoons, companies, etc. For many outcome measures, the data about individual soldiers will not be independent, because they share the same assignment. This has important implications for data analysis in that most statistical methods require independent observations. This point is discussed in Chapter 4 in the context of study design and is revisited below in discussing data analysis specifics for the IBCT/Stryker IOT.
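The earlier point that a confidence interval conveys both statistical and practical significance, where a bare significance test gives only a binary verdict, can be sketched as follows (the per-mission difference scores here are invented for illustration):

```python
import math

# Invented per-mission difference scores (Stryker minus baseline).
diffs = [1.2, 0.4, 0.9, -0.2, 1.5, 0.7, 0.3, 1.1, 0.6, 0.8]

n = len(diffs)
mean = sum(diffs) / n
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
se = sd / math.sqrt(n)

# 95 percent interval (the t critical value for 9 df is about 2.262).
lo, hi = mean - 2.262 * se, mean + 2.262 * se
print(f"mean improvement {mean:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# The interval excludes zero (statistical significance) and also shows
# the plausible size of the improvement (practical significance).
```

A significance test on the same data would report only "reject at the 5 percent level"; the interval additionally tells the evaluator how large the improvement plausibly is.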

Exploratory Analyses

Conclusions obtained from the IOT should not stop with the confirmation that the new system performs better than the baseline. Operational tests also provide an opportunity to learn about the operating characteristics of new systems/forces. Exploratory analyses facilitate learning by making use of graphical techniques to examine the large number of variables and scenarios. For the IBCT/Stryker IOT, it is of interest to determine the factors (mission intensity, environment, mission type, and force) that impact IBCT/Stryker and the effects of these factors. Given the large number of factors and the many outcome measures, the importance of the exploratory phase of the data analysis should not be underestimated.

In fact, it is not even correct to assume (as has been done in this chapter) that formal confirmatory tests will be done prior to exploratory data analysis. Examination of data, especially using graphs, can allow investigators to determine whether the assumptions required for formal statistical procedures are satisfied and identify incorrect or suspect observations. This ensures that appropriate methodology is used in the important confirmatory analyses. The remainder of this section assumes that this important part of exploratory analysis has been carried out prior to the use of formal statistical tests and procedures. The focus here is on another crucial use of exploratory methods, namely, to identify data patterns that may suggest previously unseen advantages or disadvantages for one force or the other.

Tukey (1977) and Chambers et al. (1983) describe an extensive collection of tools and examples for using graphical methods in exploratory data analysis. These methods provide a mechanism for looking at the data to identify interesting results and patterns that provide insight about the system under study.
Graphs displaying a single outcome measure against a variety of factors can identify subsets of the design space (i.e., combinations of factors) for which the improvement provided by a new system is noticeably high or low. Such graphs can also identify data collection or recording errors and unexpected aspects of system performance.

Another type of graphical display presents several measures in a single graph (for example, parallel box plots for the different measures or the same measures for different groups). Such graphs can identify sets of outcome measures that show the same pattern of responses to the factors, and so can help confirm either that these measures are all correlated with mission success as expected, or may identify new combinations of measures worthy of consideration. When an exploratory analysis of many independent measures shows results consistent with a priori expectations but not statistically significant, these results might in combination reinforce one another if they could all be attributed to the same underlying cause.

It should be pointed out that exploratory analysis can include formal multivariate statistical methods, such as principal components analysis, to determine which measures appear to correlate highly across mission scenarios (see, for example, Johnson and Wichern, 1992). One might identify combinations of measures that appear to correlate well with the ratings of SMEs, in this way providing a form of objective confirmation of the implicit combination of information done by the experts.

Reliability and Maintainability

These general comments above regarding confirmatory and exploratory analysis apply to all types of outcome measures, including those associated with reliability and maintainability, although the actual statistical techniques used may vary. For example, the use of exponential or Weibull data models is common in reliability work, while normal data models are often dominant in other fields. Meeker and Escobar (1998) provide an excellent discussion of statistical methods for reliability.

A key aspect of complex systems like Stryker that impacts reliability, availability, and maintainability data analysis is the large number of failure modes that affect reliability and availability (discussed also in Chapter 3). These failure modes can be expected to have different behavior. Failure modes due to wear would have increasing hazard over time, whereas other modes would have decreasing hazard over time (as defects are fixed). Rather than using statistical models to directly model system-wide failures, each of the major failure modes should be modeled. Inferences about system-wide reliability would then be obtained by combining information from the different modes.
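The mode-by-mode modeling idea can be sketched with two hypothetical, independent Weibull failure modes (the shape and scale parameters here are invented for illustration and are not Stryker estimates):

```python
import math

def weibull_survival(t, shape, scale):
    """P(no failure of this mode by time t) under a Weibull model.

    shape > 1 gives an increasing hazard (wear-out);
    shape < 1 gives a decreasing hazard (infant-mortality defects).
    """
    return math.exp(-((t / scale) ** shape))

def system_survival(t, modes):
    # Independent competing failure modes: the system survives only
    # if every mode survives, so the mode survivals multiply.
    s = 1.0
    for shape, scale in modes:
        s *= weibull_survival(t, shape, scale)
    return s

modes = [(2.5, 900.0),   # wear-out mode, increasing hazard
         (0.7, 4000.0)]  # defect mode, decreasing hazard
for t in (100, 500, 1000):
    print(t, round(system_survival(t, modes), 3))
```

Fitting each mode separately and then combining survival functions keeps the increasing-hazard and decreasing-hazard behavior distinct, which a single pooled failure model would blur.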
Thinking about exploratory analysis for reliability and maintainability data raises important issues about data collection. Data regarding the reliability of a vehicle or system should be collected from the start of operations and tracked through the lifetime of the vehicle, including training uses of the vehicle, operational tests, and ultimately operational use. It is a challenge to collect data in this way and maintain it in a common database, but the ability to do so has important ramifications for reliability modeling. It is also important to keep maintenance records as well, so that the times between maintenance and failures are available.

Modeling and Simulation

Evaluation plans often rely on modeling and simulation to address several aspects of the system being evaluated. Data from the operational test may be needed to run the simulation models that address some issues, but certainly not all; for example, no new data are needed for studying transportability of the system. Information from an operational test may also identify an issue that was not anticipated in pretest simulation work, and this could then be used to refine or improve the simulation models.

In addition, modeling and simulation can be used to better understand operational test results and to extrapolate to larger units. This is done by using data from the operational test to recreate and/or visualize test events. The recreated events may then be further probed via simulation. In addition, data (e.g., on the distributions of events) can be used to run through simulation programs and assess factors likely to be important at the brigade level. Care should be taken to assess the uncertainty effect of the limited sample size results from the IOT on the scaled-up simulations.

ANALYSIS OF DATA FROM THE IBCT/STRYKER IOT

This section addresses more specifically the analysis of data to be collected from the IBCT/Stryker IOT. Comments here are based primarily on information provided to the panel in various documents (see Chapter 1) and briefings by ATEC that describe the test and evaluation plans for the IBCT/Stryker.

Confirmatory Analysis

ATEC has provided us with detailed plans describing the intended analysis of the SME scores of mission outcomes and mission casualty rates. These plans are discussed here. The discussion of general principles in the preceding section comments on the importance of defining the appropriate unit for data analysis.
The ATEC-designed evaluation consists basically of 36 missions for the Stryker-equipped force and 36 missions for the baseline force (and the 6 additional missions in the ATEC design reserved for special studies). These missions are defined by a mix of factors, including mission type (raid, perimeter defense, area presence), mission intensity (high, medium, low), location (rural, urban), and company pair (B, C). The planned analysis of SME

mission scores uses the mission as the basic unit. This seems reasonable, although it may be possible to carry out some data analysis using company-level or platoon-level data or using events within missions (as described in Chapter 4). The planned analysis of casualty rates appears to work with the individual soldier as the unit of analysis. In the panel's view this is incorrect because there is sure to be dependence among the outcomes for different soldiers. Therefore, a single casualty rate should be computed for each mission (or for other units that might be deemed to yield independent information) and these should be analyzed in the manner currently planned for the SME scores.

Several important data issues should be considered by ATEC analysts. These are primarily related to the SME scores. Confirmatory analyses are often based on the assumptions that there is a continuous or at least ordered categorical measurement scale (although they are often done with Poisson or binomial data) and that the measurements on that scale are subject to measurement error that has constant variance (independent of the measured value). The SME scores provide an ordinal scale such that a mission success score of 8 is better than a score of 7, which is better than a score of 6. It is not clear that the scale can be considered an interval scale in which the difference between an 8 and a 7 and between a 7 and a 6 are the same. In fact, anecdotal evidence was presented to the panel suggesting that scores 5 through 8 are viewed as successes, and scores 1 through 4 are viewed as failures, which would imply a large gap between 4 and 5. One might also expect differences in the level of variation observed at different points along the scale, for two reasons. First, data values near either end of a scale (e.g., 1 or 8 in the present case) tend to have less measurement variation than those in the middle of the scale.
One way to argue this is to note that all observers are likely to agree on judgments of missions with scores of 7 or 8, while there may be more variation on judgments about missions in the middle of the scoring scale (one expert's 3 might be another's 5). Second, the missions are of differing length and complexity. It is quite likely that the scores of longer missions may have more variability than those of shorter missions. Casualty rates, as proportions, are also likely to exhibit nonconstant variance. There is less variation in a low casualty rate (or an extremely high one) and more variation for a casualty rate away from the extremes. Transformations of SME scores or casualty rates should be considered if nonconstant variance is determined to be a problem.

The intended ATEC analysis focuses on the difference between IBCT/Stryker force outcomes and baseline force outcomes for the 36 missions.
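The classical remedy for the nonconstant variance of proportions is a variance-stabilizing transformation such as the arcsine square root; a small simulation illustrates the effect (the 120 players per mission and the example casualty rates are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120  # players at risk in a mission (illustrative)

def sd_of(transform, p, reps=20000):
    # Simulate observed casualty rates under a binomial model and
    # report the standard deviation of the (possibly transformed) rate.
    rates = rng.binomial(n, p, reps) / n
    return transform(rates).std(ddof=1)

for p in (0.05, 0.13, 0.50):
    raw = sd_of(lambda r: r, p)
    stabilized = sd_of(lambda r: np.arcsin(np.sqrt(r)), p)
    print(p, round(raw, 4), round(stabilized, 4))
```

The raw rate's standard deviation grows as p moves toward 0.5, while on the transformed scale it is approximately constant at 1/(2 sqrt(n)), making a constant-variance linear model more defensible.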

By working with differences, the main effects of the various factors are eliminated, providing for more precise measurement of system effectiveness. Note that variation due to interactions, that is to say variation in the benefits provided by IBCT/Stryker over different scenarios, must be addressed through a statistical model. The appropriate analysis, which appears to be part of ATEC plans, is a linear model that relates the difference scores (that is, the difference between the IBCT and baseline performance measures on the same mission) to the effects of the various factors. The estimated residual variance from such a model provides the best estimate of the amount of variation in outcome that would be expected if missions were repeated under the same conditions. This is not the same as simply computing the variance of the 36 differences, as that variance would be inflated by the degree to which the IBCT/Stryker advantage varies across scenarios. The model would be likely to be of the form

Di = difference score for mission i
   = overall mean + mission type effect + mission intensity effect
     + location effect + company effect + other desired interactions + error

The estimated overall mean is the average improvement afforded by IBCT/Stryker relative to the baseline. The null hypothesis of no difference (overall mean = 0) would be tested using traditional methods. Additional parameters measure the degree to which IBCT/Stryker improvement varies by mission type, mission intensity, location, company, etc. These additional parameters can be tested for significance or, as suggested above, estimates for the various factor effects can be reported along with estimates of their precision to aid in the judgment of practically significant results. This same basic model can be applied to other continuous measures, including casualty rate, subject to earlier concerns about homogeneity of variance.
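A minimal sketch of fitting a model of this form by least squares follows. The data are simulated: the factor levels mirror the design described above, while the effect sizes and noise level are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated 36-mission design crossing mission type, intensity, location.
types = ["raid", "perimeter defense", "area presence"]
intensities = ["high", "medium", "low"]
locations = ["rural", "urban"]
missions = [(t, i, l) for t in types for i in intensities
            for l in locations for _ in range(2)]  # 3 * 3 * 2 * 2 = 36

# Invented truth: a Stryker advantage of 0.8, larger for raids.
def true_diff(mtype, intensity, location):
    return 0.8 + (0.5 if mtype == "raid" else 0.0)

y = np.array([true_diff(*m) + rng.normal(0, 0.6) for m in missions])

# Dummy coding: an intercept plus (k - 1) indicator columns per factor.
def dummies(values, levels):
    return np.column_stack([[v == lev for v in values] for lev in levels[1:]])

X = np.column_stack([
    np.ones(len(missions)),
    dummies([m[0] for m in missions], types),
    dummies([m[1] for m in missions], intensities),
    dummies([m[2] for m in missions], locations),
]).astype(float)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid_var = ((y - X @ beta) ** 2).sum() / (len(y) - X.shape[1])
print("reference-cell mean difference:", round(beta[0], 2))
print("residual variance estimate:", round(resid_var, 2))
```

The residual variance from the fit, not the raw variance of the 36 differences, is the right yardstick for repeat-mission variability, exactly as argued above.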
This discussion ignores the six additional missions for each force. These can also be included and would provide additional degrees of freedom and improved error variance estimates.

Exploratory Analysis

It is anticipated that IBCT/Stryker will outperform the baseline. Assuming that result is obtained, the focus will shift to determining under which scenarios Stryker helps most and why. This is likely to be determined by careful analysis of the many measures and scenarios. In particular, it seems valuable to examine the IBCT unit scores, baseline unit scores, and differences graphically to identify any unusual values or scenarios. Such graphical displays will complement the results of the confirmatory analyses described above.

In addition, the exploratory analysis provides an opportunity to consider the wide range of measures available. Thus, in addition to SME scores of mission success, other measures (as described in Chapter 3) could be used. By looking at graphs showing the relationship of mission outcome and factors like intensity simultaneously for multiple outcomes, it should be possible to learn more about IBCT/Stryker's strengths and vulnerabilities. However, the real significance of any such insights would need to be confirmed by additional testing.

Reliability and Maintainability

Reliability and maintainability analyses are likely to be focused on assessing the degree to which Stryker meets the design specifications. Traditional reliability methods will be useful in this regard. The general principle discussed earlier concerning separate modeling for different failure modes is important. It is also important to explore the reliability data across vehicle types to identify groups of vehicles that may share common reliability profiles or, conversely, those with unique reliability problems.

Modeling and Simulation

ATEC has provided little detail about how the IBCT/Stryker IOT data might be used in post-IOT simulations, so we do not discuss this issue. This leaves open the question of whether and how operational test data can be extrapolated to yield information about larger scale operations.
SUMMARY

The IBCT/Stryker IOT is designed to serve two major purposes: (1) confirmation that the Stryker-equipped force will outperform the Light Infantry Brigade baseline, and estimation of the amount by which it will outperform, and (2) exploring the performance of the IBCT to learn about the performance capabilities and limitations of Stryker. Statistical significance tests are useful in the confirmatory analysis comparing the Stryker-equipped and baseline forces. In general, however, the issues raised by the 1998 NRC panel suggest that more use should be made of estimates and

associated measures of precision (or confidence intervals) in addition to significance tests because the former enable the judging of the practical significance of observed effects. There is a great deal to be learned by exploratory analysis of the IOT data, especially using graphical methods. The data may instruct ATEC about the relative advantage of IBCT/Stryker in different scenarios as well as any unusual events during the operational test.

We call attention to several key issues:

1. The IBCT/Stryker IOT involves the collection of a large number of measures intended to address a wide variety of issues. The measures should be used to address relevant issues without being rolled up into overall summaries until necessary.

2. The statistical methods to be used by ATEC are designed for independent study units. In particular, it is not appropriate to compare casualty rates by simply aggregating indicators for each soldier over a set of missions. Casualty rates should be calculated for each mission (or possibly for discrete events of shorter duration) and these used in subsequent data analyses.

3. The IOT provides little vehicle operating data and thus may not be sufficient to address all of the reliability and maintainability concerns of ATEC. This highlights the need for improved data collection regarding vehicle usage. In particular, data should be maintained for each vehicle over that vehicle's entire life, including training, testing, and ultimately field use; data should also be gathered separately for different failure modes.

4. The panel reaffirms the recommendation of the 1998 NRC panel that more use should be made of estimates and associated measures of precision (or confidence intervals) in addition to significance tests, because the former enable the judging of the practical significance of observed effects.

6
Assessing the IBCT/Stryker Operational Test in a Broad Context

In our work reported here, the Panel on the Operational Test Design and Evaluation of the Interim Armored Vehicle has used the report of the Panel on Statistical Methods for Testing and Evaluating Defense Systems (National Research Council, 1998a, referred to in this chapter as NRC 1998) to guide our thinking about evaluating the IBCT/Stryker Initial Operational Test (IOT). Consistent with our charge, we view our work as a case study of how the principles and practices put forward by the previous panel apply to the operational test design and evaluation of IBCT/Stryker. In this context, we have examined the measures, design, and evaluation strategy of IBCT/Stryker in light of the conclusions and recommendations put forward in NRC 1998, with the goal of deriving more general findings of broad applicability in the defense test and evaluation community.

From a practical point of view, it is clear that several of the ideas put forward in NRC 1998 for improvement of the measures and test design cannot be implemented in the IBCT/Stryker IOT due to various constraints, especially time limitations. However, by viewing the Stryker test as an opportunity to gain additional insights into how to do good operational test design and evaluation, our panel hopes to further sharpen and disseminate the ideas contained in NRC 1998. In addition, this perspective will demonstrate that nearly all of the recommendations contained in this report are based on generally accepted principles of test design and evaluation.

Although we note that many of the recommendations contained in NRC 1998 have not been fully acted on by ATEC or by the broader defense test and evaluation community, this is not meant as criticism. The paradigm shift called for in that report could not have been implemented in the short amount of time since it has been available. Instead, our aim is to more clearly communicate the principles and practices contained in NRC 1998 to the broad defense acquisition community, so that the changes suggested will be more widely understood and adopted.

A RECOMMENDED PARADIGM FOR TESTING AND EVALUATION

Operational tests, by necessity, are often large, very complicated, and expensive undertakings. The primary contribution of an operational test to the accumulated evidence about a defense system's operational suitability and effectiveness that exists a priori is that it is the only objective assessment of the interaction between the soldier and the complete system as it will be used in the field. It is well known that a number of failure modes and other considerations that affect a system's performance are best (or even uniquely) exhibited under these conditions. For this reason, Conclusion 2.3 of NRC 1998 states: "operational testing is essential for defense system evaluation." Operational tests have been put forward as tests that can, in isolation from other sources of information, provide confirmatory statistical "proof" that specific operational requirements have been met.
However, a major finding of NRC 1998 is that, given the test size that is typical of the operational tests of large Acquisition Category I (ACAT I) systems and the heterogeneity of the performance of these systems across environments of use, users, tactics, and doctrine, operational tests cannot, generally speaking, satisfy this role.1 Instead, the test and evaluation process should be viewed as a continuous process of information collection, analysis, and decision making, starting with information collected from field experience of the

1Conclusion 2.2 of the NRC 1998 report states: "The operational test and evaluation requirement, stated in law, that the Director, Operational Test and Evaluation certify that a system is operationally effective and suitable often cannot be supported solely by the use of standard statistical measures of confidence for complex defense systems with reasonable amounts of testing resources" (p. 33).

PHASE I REPORT: ASSESSING THE IBCT/STRYKER OPERATIONAL TEST

baseline and similar systems, and systems with similar or identical components, through contractor testing of the system in question, and then through developmental testing and operational testing (and in some sense continuing after fielding with field performance).

Perhaps the most widely used statistical method for supporting decisions made from operational test results is significance testing. Significance testing is flawed in this application because of inadequate test sample size to detect differences of practical importance (see NRC, 1998:88-91), and because it focuses attention inappropriately on a pass/fail decision rather than on learning about the system's performance in a variety of settings. Also, significance testing answers the wrong question: not whether the system's performance satisfies its requirements, but whether the system's performance is inconsistent with failure to meet its requirements. And significance testing fails to balance the risk of accepting a "bad" system against the risk of rejecting a "good" system. Significance tests are designed to detect statistically significant differences from requirements, but they do not address whether any differences that may be detected are practically significant. The DoD milestone process must be rethought, in order to replace the fundamental role that significance testing currently plays in the pass/fail decision with a fuller exploration of the consequences of the various possible decisions. Significance tests and confidence intervals2 provide useful information, but they should be augmented by other numeric and analytic assessments using all information available, especially from other tests and trials.
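As a sketch of why interval estimates carry more information than a bare pass/fail verdict, the fragment below contrasts the two; all numbers are hypothetical, and the normal approximation is used purely for illustration:

```python
# Hypothetical mission-outcome scores for a new force vs. a baseline force,
# with only six replications per force. Invented numbers, not Stryker data.
from math import sqrt
from statistics import NormalDist

def diff_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Normal-approximation confidence interval for mean1 - mean2."""
    diff = mean1 - mean2
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = diff_ci(3.2, 1.0, 6, 2.8, 1.0, 6)

# A significance test reports only "not significant at the 5% level" here,
# because the interval covers zero. The interval itself says more: the data
# are consistent with anything from a moderate disadvantage to a large
# advantage, i.e., the test was too small to resolve practical significance.
print(round(diff, 2), (round(lo, 2), round(hi, 2)))  # 0.4 (-0.73, 1.53)
```

The interval makes visible exactly the distinction the text draws: failing to reject is not evidence that the requirement is met, and a detected difference may still be practically negligible.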
An effective formal decision-making framework could use, for example, significance testing augmented by assessments of the likelihood of various hypotheses about the performance of the system under test (and the baseline system), as well as the costs of making various decisions based on whether the various alternatives are true. Moreover, designs used in operational testing are not usually constructed to inform the actual decisions that the operational test is intended to support. For example, if a new system is supposed to outperform a baseline in specific types of environments, the test should provide sufficient test sample in those environments to determine whether the advantages have been realized, if necessary at the cost of test sample in environments where the system is only supposed to equal the baseline.

2Producing confidence intervals for sophisticated estimates often requires resampling methods.

Testing the IBCT/Stryker is even more complicated than testing many ACAT I systems in that it is really a test of a system of systems, not simply a test of what Stryker itself is capable of. It is therefore no surprise that the size of the operational test (i.e., the number of test replications) for IBCT/Stryker will be inadequate to support many of the significance tests that could be used to decide whether Stryker should be passed to full-rate production. Such decisions therefore need to be supplemented with information from the other sources mentioned above.

This argument about the role of significance testing is even more important for systems such as the Stryker that are placed into operational testing before the system's performance (much less its physical characteristics) has matured, since then the test size needs to be larger to achieve reasonable power levels. When a fully mature system is placed into operational testing, the test is more of a confirmatory exercise, a shakedown test, since it is essentially understood that the requirements are very likely to be met, and the test can then focus on achieving a greater understanding of how the system performs in various environments.

Recommendation 3.3 of NRC 1998 argued strongly that information should be used and appropriately combined from all phases of system development and testing, and that this information needs to be properly archived to facilitate retrieval and use. In the case of the IBCT/Stryker IOT, it is clear that this has not occurred, as evidenced by the difficulty ATEC has had in accessing relevant information from contractor testing and, indeed, operational experiences from allies using predecessor systems (e.g., the Canadian LAV-III).
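The point about power levels can be made concrete with a rough normal-approximation calculation; the effect sizes and replication counts are invented for illustration, not drawn from ATEC's plans:

```python
# Rough power sketch for a two-sided, alpha = 0.05 comparison of two forces,
# using a normal approximation and ignoring the negligible opposite tail.
from math import sqrt
from statistics import NormalDist

def power(n, effect, sd, alpha=0.05):
    """Approximate power of a two-sample z test with n replications per arm."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n)
    return 1 - NormalDist().cdf(z_crit - effect / se)

# With a handful of replications, even a sizable advantage (half a standard
# deviation) is unlikely to reach statistical significance:
print(round(power(6, effect=0.5, sd=1.0), 2))   # 0.14
print(round(power(64, effect=0.5, sd=1.0), 2))  # 0.81
```

Roughly sixty-plus replications per force would be needed for conventional power on such an effect, which is far beyond what an IOT of this scale can provide, hence the need to combine the test data with other sources of information.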
HOW IBCT/STRYKER IOT CONFORMS WITH RECOMMENDATIONS FROM THE NRC 1998 REPORT

Preliminaries to Testing

The new paradigm articulated in NRC 1998 argues that defense systems should not enter into operational testing unless the system design is relatively mature. This maturation should be expedited through previous testing that incorporates various aspects of operational realism in addition to the usual developmental testing. The role, then, for operational testing would be to confirm the results from this earlier testing and to learn more

about how to operate the system in different environments and what the system's limitations are.

The panel believes that in some important respects Stryker is not likely to be fully ready for operational testing when that is scheduled to begin. This is because:

1. many of the vehicle types have not yet been examined for their suitability, having been driven only a fraction of the required mean miles to failure (1,000 miles);
2. the use of add-on armor has not been adequately tested prior to the operational test;
3. it is still not clear how IBCT/Stryker needs to be used in various types of scenarios, given the incomplete development of its tactics and doctrine; and
4. the GFE systems providing situation awareness have not been sufficiently tested to guarantee that the software has adequate reliability.

The role of operational test as a confirmatory exercise has therefore not been realized for IBCT/Stryker. This does not necessarily mean that the IOT should be postponed, since the decision to go to operational test is based on a number of additional considerations. However, it does mean that the operational test is being run with some additional complications that could reduce its effectiveness.

Besides system maturity, another prerequisite for an operational test is a full understanding of the factors that affect system performance. While ATEC clearly understands the most crucial factors that will contribute to variation in system performance (intensity, urban/rural, day/night, terrain, and mission type), it is not clear whether it has carried out a systematic test planning exercise, including (quoting from NRC, 1998a:64-65): "(1) defining the purpose of the test; . . . (4) using previous information to compare variation within and across environments, and to understand system performance as a function of test factors; . . .
and (6) use of small-scale screening or guiding tests for collecting information on test planning."

Also, as mentioned in Chapter 4, it is not yet clear that the test design and the subsequent test analysis have been linked. For example, if performance in a specific environment is key to the evaluation of IBCT/Stryker, more test replications will need to be allocated to that environment. In addition, while the main factors affecting performance have been identified, factors such as season, day versus night, and learning effects were not,

at least initially, explicitly controlled for. This issue was raised in the panel's letter report (Appendix A).

Test Design

This section discusses two issues relevant to test design: the basic test design and the allocation of test replications to design cells. First, ATEC has decided to use a balanced design to give it the most flexibility in estimating the variety of main effects of interest. As a result, the effects of terrain, intensity, mission, and scenario on the performance of these systems will be jointly estimated quite well, given the test sample size. However, at this point in system development, ATEC does not appear to know which of these factors matter more or less, or where the major uncertainties lie. Thus, it may be that there is only a minority of environments in which IBCT/Stryker offers distinct advantages, in which case those environments could be more thoroughly tested to achieve a better understanding of its advantages in those situations. Specific questions of interest, such as the value of situation awareness in explaining the advantage of IBCT/Stryker, can be addressed by designing and running small side experiments (which might also be addressed prior to a final operational test). This last suggestion is based on Recommendation 3.4 of the NRC 1998 report (p. 49): "All services should explore the adoption of the use of small-scale testing similar to the Army concept of force development test and experimentation."

Modeling and simulation are discussed in NRC 1998 as an important tool in test planning. ATEC should take better advantage of information from modeling and simulation, as well as from developmental testing, that could be very useful for IBCT/Stryker test planning. This includes information as to when the benefits of the IBCT/Stryker over the baseline are likely to be important but not well established.
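A balanced (full factorial) layout of the kind described above can be sketched in a few lines; the factor names and levels are placeholders, not ATEC's actual design:

```python
# Minimal illustration of a balanced full-factorial test layout. Factors and
# levels are invented placeholders for terrain, intensity, and mission type.
from itertools import product

factors = {
    "terrain":   ["urban", "rural"],
    "intensity": ["high", "low"],
    "mission":   ["raid", "perimeter defense"],
}

# One replication per cell: every main effect is estimated from the same
# number of trials, which is what makes the design "balanced".
cells = list(product(*factors.values()))
print(len(cells))  # 8 cells
print(cells[0])    # ('urban', 'high', 'raid')

# Weighting replications toward environments of special interest (here,
# urban) trades strict balance for precision where it matters most:
reps = {cell: (3 if cell[0] == "urban" else 1) for cell in cells}
print(sum(reps.values()))  # 16 total replications
```

The unbalanced allocation at the end corresponds to the text's suggestion of concentrating replications in the minority of environments where IBCT/Stryker is expected to offer distinct advantages.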
Finally, in designing a test, the goals of the test have to be kept in mind. If the goal of an operational test is to learn about system capabilities, then test replications should be focused on those environments in which the most can be learned about how the system's capabilities provide advantages. For example, if IBCT/Stryker is intended primarily as an urban system, more replications should be allocated to urban environments than to rural ones. We understand ATEC's view that its operational test designs must allocate, to the extent possible, replications to environments in accordance with the allocation of expected field use, as presented in the OMS/MP. In our judgment the OMS/MP need only refer to the operational evaluation, and certainly once estimates of test performance in each environment are derived, they can be reweighted to correspond to summary measures defined by the OMS/MP (which may still be criticized for focusing too much on such summary measures in comparison to more detailed assessments).

Furthermore, there are substantial advantages, with respect to designing operational tests, in separating the two goals of confirming that various requirements have been met and of learning as much as possible about the capabilities and possible deficiencies of the system before going to full-rate production. That separation allows the designs for these two separate tests to target these two distinct objectives.

Given the recent emphasis in DoD acquisition on spiral development, it is interesting to speculate about how staged testing might be incorporated into this management concept. One possibility is a test strategy in which the learning phase makes use of early prototypes of the subsequent stage of development.

System Suitability

Recommendation 7.1 of NRC 1998 states (p. 105): "The Department of Defense and the military services should give increased attention to their reliability, availability, and maintainability data collection and analysis procedures because deficiencies continue to be responsible for many of the current field problems and concerns about military readiness."

While criticizing developmental and operational test design as being too focused on evaluation of system effectiveness at the expense of evaluation of system suitability, this recommendation is not meant to suggest that operational tests should be strongly geared toward estimation of system suitability, since these large-scale exercises cannot be expected to run long enough to estimate fatigue life, etc.
However, developmental testing can give measurement of system (operational) suitability a greater priority and can be structured to provide its test events with greater operational realism. Use of developmental test events with greater operational realism should also facilitate development of models for combining information, the topic of this panel's next report.

The NRC 1998 report also criticized the test and evaluation community for relying too heavily on the assumption that the interarrival time for initial failures follows an exponential distribution. The requirement for Stryker of 1,000 mean miles between failures makes sense as a relevant measure only if ATEC is relying on the assumption of exponentially distributed times to failure. Given that Stryker, being essentially a mechanical system, will not have exponentially distributed times to failure, due to wearout, the actual distribution of waiting times to failure needs to be estimated and presented to decision makers so that they understand its range of performance. Along the same lines, Stryker will, in all probability, be repaired during the operational test and returned to action. Understanding the variation in suitability between a repaired and a new system should be an important part of the operational test.

Testing of Software-Intensive Systems

The panel has been told that obtaining information about the performance of GFE is not a priority of the IOT: GFE will be assumed to have well-estimated performance parameters, so the test should focus on the non-GFE components of Stryker. One of the components of Stryker's GFE is the software providing Stryker with situation awareness. A primary assumption underlying the argument for the development of Stryker was that the increased vulnerability of IBCT/Stryker (due to its reduced armor) is offset by the benefits gained from the enhanced firepower and defensive positions that Stryker will have due to its greater awareness of the placement of friendly and enemy forces. There is some evidence (FBCB2 test results) that this situation awareness capability is not fully mature at this date. It would therefore not be surprising if this newly developed, complex software were to suffer reliability or other performance problems that are not fully resolved prior to the start of operational testing.
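The consequence of the exponential assumption can be sketched numerically: two failure-time distributions with the same 1,000-mile mean imply quite different chances of completing a 1,000-mile interval without failure. The Weibull shape parameter below is an arbitrary illustration of wearout, not an estimate for Stryker:

```python
# Sketch: same mean miles between failures, different survival probability.
# An exponential model has constant hazard; a Weibull model with shape > 1
# has increasing hazard (wearout). Numbers are illustrative only.
from math import exp, gamma

def surv_exponential(miles, mean):
    """P(no failure by `miles`) under an exponential model with given mean."""
    return exp(-miles / mean)

def surv_weibull(miles, mean, shape):
    """Weibull survival, with the scale chosen to match the given mean."""
    scale = mean / gamma(1 + 1 / shape)
    return exp(-((miles / scale) ** shape))

print(round(surv_exponential(1000, 1000), 3))         # 0.368
print(round(surv_weibull(1000, 1000, shape=2.0), 3))  # 0.456
```

Because the two models disagree even at the requirement's own horizon, the summary "1,000 mean miles between failures" underdetermines reliability, which is why the text calls for estimating the actual failure-time distribution.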
NRC 1998 details procedures that need to be more widely adopted for the development and testing of software-intensive systems, including usage-based testing. Further, Recommendation 8.4 of that report urges that software failures in the field be collected and analyzed. Making use, in the operational test design, of the information on situation awareness collected during training exercises and in contractor and developmental testing would have helped produce a more comprehensive assessment of the performance of IBCT/Stryker. For example, allocating test replications to situations in which previous difficulties in situation awareness had been experienced would have been very informative as to whether the system is effective enough to pass to full-rate production.

Greater Access to Statistical Expertise in Operational Test and Evaluation

Stryker, if fully procured, will be a multibillion-dollar system. Clearly, the decision on whether to pass Stryker to full-rate production is extremely important. Therefore, the operational test design and evaluation for Stryker need to be representative of the best possible current practice. The statistical resources allocated to this task were extremely limited. The enlistment of the National Research Council for high-level review of the test design and evaluation plans is commendable. However, this does not substitute for detailed, hands-on, expert attention by a cadre of personnel trained in statistics with "ownership" of the design and subsequent test and evaluation. ATEC should give a high priority to developing a contractual relationship with leading practitioners in the fields of reliability estimation, experimental design, and methods for combining information to help them in future IOTs. (Chapter 10 of NRC 1998 discusses this issue.)

SUMMARY

The role of operational testing as a confirmatory exercise evaluating a mature system design has not been realized for IBCT/Stryker. This does not necessarily mean that the IOT should be postponed, since the decision to go to operational testing is based on a number of additional considerations. However, it does mean that the operational test is being asked to provide more information than can be expected. The IOT may illuminate potential problems with the IBCT and Stryker, but it may not be able to convincingly demonstrate system effectiveness.

We understand ATEC's view that its operational test designs must allocate, to the extent possible, replications to environments in accordance with the allocation of expected field use, as presented in the OMS/MP.
In the panel's judgment, the OMS/MP need only refer to the operational evaluation, and once estimates of test performance in each environment are derived, they can be reweighted to correspond to summary measures defined by the OMS/MP.

We call attention to a number of key points:

1. Operational tests should not be strongly geared toward estimation of system suitability, since they cannot be expected to run long enough to estimate fatigue life, estimate repair and replacement times, identify failure modes, etc. Therefore, developmental testing should give greater priority to measurement of system (operational) suitability and should be structured to provide its test events with greater operational realism.

2. Since the size of the operational test (i.e., the number of test replications) for IBCT/Stryker will be inadequate to support significance tests leading to a decision on whether Stryker should be passed to full-rate production, ATEC should augment this decision with other numerical and graphical assessments from this IOT and other tests and trials.

3. In general, complex systems should not be forwarded to operational testing, absent strategic considerations, until the system design is relatively mature. Forwarding an immature system to operational test is an expensive way to discover errors that could have been detected in developmental testing, and it reduces the ability of an operational test to carry out its proper function. System maturation should be expedited through previous testing that incorporates various aspects of operational realism in addition to the usual developmental testing.

4. Because it is not yet clear that the test design and the subsequent test analysis have been linked, ATEC should prepare a straw man test evaluation report in advance of test design, as recommended in the panel's October 2002 letter to ATEC (see Appendix A).

5. The goals of the initial operational test need to be more clearly specified. Two important types of goals for operational test are learning about system performance and confirming system performance in comparison to requirements and in comparison to the performance of baseline systems. These two different types of goals argue for different stages of operational test. Furthermore, to improve test designs that address these different types of goals, information from previous stages of system development needs to be utilized.

6. To achieve needed detailed, hands-on, expert attention by a cadre of statistically trained personnel with "ownership" of the design and subsequent test and evaluation, the Department of Defense, and ATEC in particular, should give a high priority to developing a contractual relationship with leading practitioners in the fields of reliability estimation, experimental design, and methods for combining information to help them with future IOTs.

References

Box, G.E.P., Hunter, W.G., and Hunter, J.S. 1978 Statistics for Experimenters. New York: John Wiley & Sons.

Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. 1983 Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.

Helmbold, R.L. 1992 Casualty Fractions and Casualty Exchange Ratio. Unpublished memorandum to J. Riente, February 12, 1992.

Johnson, R.A., and Wichern, D.W. 1992 Applied Multivariate Statistical Analysis, 3rd edition. Englewood Cliffs, NJ: Prentice Hall.

Meeker, W.Q., and Escobar, L.A. 1998 Statistical Methods for Reliability Data. New York: John Wiley & Sons.

National Research Council 1998a Statistics, Testing, and Defense Acquisition: New Approaches and Methodological Improvements. Michael L. Cohen, John E. Rolph, and Duane L. Steffey, Eds. Panel on Statistical Methods for Testing and Evaluating Defense Systems, Committee on National Statistics. Washington, DC: National Academy Press.

1998b Modeling Human and Organizational Behavior. R.W. Pew and A.S. Mavor, Eds. Panel on Modeling Human Behavior and Command Decision Making: Representations for Military Simulations. Washington, DC: National Academy Press.

Thompson, D. 1992 The Casualty-FER Curve of the Force Structure Reduction Study: A Comparison to Historical Data. Vector Research, Incorporated Document No. VRI-OHD WP92-1, March 10, 1992. Ann Arbor, MI: Vector Research, Incorporated.

Tukey, J.W. 1977 Exploratory Data Analysis. New York: Addison-Wesley.

U.S. Department of Defense 2000 Operational Requirements Document (ORD) for a Family of Interim Armored Vehicles (IAV), ACAT I, Prepared for the Milestone I Decision, April 6. U.S. Army Training and Doctrine Command, Fort Monroe, Virginia.

2001 Test and Evaluation Master Plan (TEMP): Stryker Family of Interim Armored Vehicles (IAV). Revision 1, November 12. U.S. Army Test and Evaluation Command, Alexandria, Virginia.

2002a Interim Armored Vehicle IOTE: Test Design Review with NAS Panel. Unpublished presentation, Nancy Dunn and Bruce Grigg, April 15.

2002b Interim Armored Vehicle IOTE: Test Design Review with NAS Panel; Power and Sample Size Considerations. Unpublished presentation, Nancy Dunn and Bruce Grigg, May 6.

2002c System Evaluation Plan (SEP) for the Stryker Family of Interim Armored Vehicles (IAV), May 22. U.S. Army Test and Evaluation Command, Alexandria, Virginia.

Veit, C.T. 1996 Judgments in military research: The elusive validity issue. Phalanx.

Appendix A
Letter Report of the Panel to the Army Test and Evaluation Command

THE NATIONAL ACADEMIES
Advisers to the Nation on Science, Engineering, and Medicine

Division of Behavioral and Social Sciences and Education
Committee on National Statistics
Panel on Operational Test Design and Evaluation of the Interim Armored Vehicle (IAV)

500 Fifth Street, NW
Washington, DC 20001
Phone: 202 334 3408
Fax: 202 334 3584
Email: jmcgee@nas.edu

October 10, 2002

Frank John Apicella
Technical Director
Army Evaluation Center
U.S. Army Test and Evaluation Command
4501 Ford Avenue
Alexandria, VA 22302-1458

Dear Mr. Apicella:

As you know, at the request of the Army Test and Evaluation Command (ATEC), the Committee on National Statistics has convened a panel to examine ATEC's plans for the operational test design and evaluation of the Interim Armored Vehicle, now referred to as the Stryker. The panel is currently engaged in its tasks of focusing on three aspects of the operational test design and evaluation of the Stryker: (1) the measures of performance and effectiveness used to compare the Interim Brigade Combat Team (IBCT), equipped with the Stryker, against a baseline force; (2) whether the current operational test design is consistent with state-of-the-art methods in statistical experimental design; and (3) the applicability of models for combining information from testing and field use of related systems and from developmental test results for the Stryker with operational test results for the Stryker. ATEC has asked the panel to comment on ATEC's current plans and to suggest alternatives. The work performance plan includes the preparation of three reports:

· The first interim report (due in November 2002) will address two

topics: (1) the measures of performance and effectiveness used to compare the Stryker-equipped IBCT against the baseline force, and (2) whether the current operational test design is consistent with state-of-the-art methods in statistical experimental design.

· The second interim report (due in March 2003) will address the topic of the applicability of models for combining information from testing and field use of related systems and from developmental test results for the Stryker with operational test results for the Stryker.

· The final report (due in July 2003) will integrate the two interim reports and add any additional findings of the panel.

The reports have been sequenced and timed for delivery to support ATEC's time-critical schedule for developing plans for designing and implementing operational tests and for performing analyses and evaluations of the test results.

Two specific purposes of the initial operational test of the Stryker are to determine whether the Interim Brigade Combat Team (IBCT) equipped with the Stryker performs more effectively than a baseline force (Light Infantry Brigade), and whether the Stryker meets its performance requirements. The results of the initial operational test contribute to the Army's decisions of whether and how to employ the Stryker and the IBCT.
The panel's first interim report will address in detail factors relating to the effectiveness and performance of the Stryker-equipped IBCT and of the Stryker; effective experimental designs and procedures for testing these forces and their systems under relevant operational conditions, missions, and scenarios; subjective and objective measures of performance and effectiveness for criteria of suitability, force effectiveness, and survivability; and analytical procedures and methods appropriate to assessing whether and why the Stryker-equipped IBCT compares well (or not well) against the baseline force, and whether and why the Stryker meets (or does not meet) its performance requirements.

In the process of deliberations toward producing the first interim report that will address this broad sweep of issues relevant to operational test design and to measures of performance and effectiveness, the panel has discerned two issues with long lead times to which, in the opinion of the panel, ATEC should begin attending immediately, so that resources can be identified, mustered, and applied in time to address them: early development of a "straw man" (hypothetical draft) Test and Evaluation Report (which will support the development of measures and the test design as

well as the subsequent analytical efforts) and the scheduling of test participation by the Stryker-equipped force and the baseline force so as to remove an obvious test confounder of different seasonal conditions.

The general purpose of the initial operational test (IOT) is to provide information to decision makers about the utility of, and the remaining challenges to, the IBCT and the Stryker system. This information is to be generated through the analysis of IOT results. In order to highlight areas for which data are lacking, the panel strongly recommends that immediate effort be focused on specifying how the test data will be analyzed to address relevant decision issues and questions. Specifically, a straw man Test Evaluation Report (TER) should be prepared as if the IOT had been completed. It should include examples of how the representative data will be analyzed, specific presentation formats (including graphs) with expected results, insights one might develop from the data, draft recommendations, etc. The content of this straw man report should be based on the experience and intuition of the analysts and what they think the results of the IOT might look like. Overall, this could serve to provide a set of hypotheses that would be tested against the actual results. Preparation of this straw man TER will help ATEC assess those issues that cannot be informed by the operational tests as currently planned, will expose areas for which needed data are lacking, and will allow appropriate revision of the current operational test plan.

The current test design calls for the execution of the IBCT/Stryker vs. the opposing force (OPFOR) trials and the baseline vs. the OPFOR trials to be scheduled for different seasons. This design totally confounds time of year with the primary measure of interest: the difference in effectiveness between the baseline force and the IBCT/Stryker force.
The panel believes that the factors that are present in seasonal variations (weather, foliage density, light level, temperature, etc.) may have a greater effect on the differences between the measures of the two forces than the abilities of the two forces themselves. We therefore recommend that serious consideration be given to scheduling these events as closely in time as possible. One way to address the potential confounding of seasonal effects, as well as possible effects of learning by blue forces and by the OPFOR, would be to intersperse activities of the baseline force and the IBCT/Stryker force over time.

The panel remains eager to assist ATEC in improving its plans and processes for operational test and evaluation of the IBCT/Stryker. We are grateful for the support and information you and your staff have consistently provided during our efforts to date. It is the panel's hope that delivering to you the recommendations in this brief letter in a timely fashion will encourage ATEC to begin drafting a straw man Test Evaluation Report in time to influence operational test activities and to implement the change in test plan to allow the compared forces to undergo testing in the same season.

Sincerely yours,

Stephen Pollock, Chair
Panel on Operational Test Design and Evaluation of the Interim Armored Vehicle
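The seasonal-confounding concern raised in the letter can be illustrated with a small simulation. All numbers here (effect sizes, noise level, trial counts) are hypothetical, chosen only to show why testing each force in a different season makes the force effect inseparable from the season effect, while interspersing trials lets the season effect cancel:

```python
import random

random.seed(1)

FORCE_EFFECT = 2.0    # hypothetical true advantage of the IBCT/Stryker force
SEASON_EFFECT = 3.0   # hypothetical effect of season (weather, foliage, light)
NOISE_SD = 0.5        # trial-to-trial noise

def trial(is_stryker, is_winter):
    """Score of one mock trial: force effect plus season effect plus noise."""
    return (FORCE_EFFECT * is_stryker
            + SEASON_EFFECT * is_winter
            + random.gauss(0, NOISE_SD))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

n = 200
# Confounded design: baseline tested only in summer, Stryker only in winter.
confounded_diff = (mean(trial(1, 1) for _ in range(n))
                   - mean(trial(0, 0) for _ in range(n)))

# Interspersed design: each force tested equally in both seasons.
stryker = [trial(1, s) for s in (0, 1) for _ in range(n // 2)]
baseline = [trial(0, s) for s in (0, 1) for _ in range(n // 2)]
interspersed_diff = mean(stryker) - mean(baseline)

print(round(confounded_diff, 1))    # near 5.0: force and season inseparable
print(round(interspersed_diff, 1))  # near 2.0: season effect cancels out
```

The confounded design reports roughly the sum of the two effects; only the interspersed design recovers the force effect alone.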

Appendix B

Force Exchange Ratio, Historical Win Probability, and Winning with Decisive Force

FORCE EXCHANGE RATIO AND HISTORICAL WIN PROBABILITY

For a number of years the Center for Army Analysis (CAA) analyzed historical combat data to determine the relationship between victory and casualties in land combat. The historical data, contained in the CAA Data Base of Battles (1991 version, CDB91), is from a wide range of battle types: durations ranging from hours to weeks, dates ranging from the 1600s to the late 20th century, and forces involving a variety of nationalities. Based on the analysis of these data (and some motivation from Lanchester's square law formulation), it has been demonstrated (see Center for Army Analysis, 1987, and its references) that:

· the probability of an attacker victory¹ is related to a variable called the "defender's advantage" or ADV, where ADV is a function of force strengths and final survivors; and
· ADV ~ ln(FER).

¹Probability of a defender victory is the complement.

With N = threat forces and M = friendly coalition forces in our definition of the force exchange ratio (FER), Figure B-1 depicts the historical relationship between the FER and the probability of winning, regardless of

whether the coalition is in defense or attack mode. Additionally, the relation between FER and friendly fractional casualties is depicted in Figure B-2 (see CAA, 1992, and VRI, 1992).

FIGURE B-1 Historical relationship between force exchange ratio and Pr(win). SOURCE: Adapted from Thompson (1992) and Helmbold (1992).

FIGURE B-2 Force exchange ratio/casualty relationship. SOURCE: Adapted from Thompson (1992) and Helmbold (1992).

FER is not only a useful measure of effectiveness (MOE) to indicate the degree to which force imbalance is reduced, but it is also a useful historical measure of a force's warfighting capability for mission success.

FER AND "DECISIVE FORCE"

Following the demise of the Soviet Union and Operation Desert Storm, the U.S. National Military Strategy (NMS) codified a new military success objective: "Apply decisive force to win swiftly and minimize casualties." The NMS also implied that decisive force will be used to minimize risks associated with regional conflicts. The FER is a MOE that is useful in defining and quantifying the level of warfighting capability needed to meet this objective.

Figure B-3 has been derived from a scatterplot of results from a large number of simulated regional conflicts involving joint U.S. forces and coalition partners against a Southwest Asia regional aggressor. The coalition's objectives are to conduct a defense to prevent the aggressor from capturing critical ports and airfields in Saudi Arabia and to then conduct a counteroffensive to regain lost territory and restore national boundaries.

The relationship between FER and coalition casualties shown in the figure is based on simulation results, in which the FER is the ratio of the percentage of enemy losses to the percentage of coalition losses. Except for the region in which the coalition military objectives were not achieved (FER < 1.3) because insufficient forces arrived in the theater, the relationship between FER and coalition casualties is similar to that shown in Figure B-2, which is based on historical data. The relationship between FER and the probability of a win in Figure B-3 is based on the analysis of historical data.

As shown in Figure B-3, a FER = 5.0 is defined to be a "decisive" warfighting capability. This level comes close to achieving the criterion of minimizing casualties, since improvements above that level only slightly reduce casualties further. This level of FER also minimizes risk in that a force with a FER of 2.5 will win approximately 90 out of 100 conflicts (lose 10 percent of the time) but will lose less than 2 percent of the time with a FER = 5.0.
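The FER and ADV quantities can be computed in a few lines. The function names and engagement numbers below are hypothetical, and ADV is taken simply as ln(FER) per the approximation quoted in this appendix; the full CAA formulation is a function of initial strengths and final survivors:

```python
import math

def force_exchange_ratio(n0, n_surv, m0, m_surv):
    """FER: fractional threat (N) losses divided by fractional
    coalition (M) losses, per the definition in this appendix."""
    threat_frac_lost = (n0 - n_surv) / n0
    coalition_frac_lost = (m0 - m_surv) / m0
    return threat_frac_lost / coalition_frac_lost

def advantage(fer):
    """ADV approximated as ln(FER), per the relation ADV ~ ln(FER)."""
    return math.log(fer)

# Hypothetical engagement: threat loses 400 of 1,000; coalition loses 40 of 500.
fer = force_exchange_ratio(1000, 600, 500, 460)
print(round(fer, 2))             # 5.0  (0.40 / 0.08)
print(round(advantage(fer), 2))  # 1.61
```

Note that FER is a ratio of fractional losses, not absolute ones, so a small force that trades casualties efficiently can still post a high FER.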

FIGURE B-3 Force exchange ratio and decisive warfighting capability. Regions shown: aggressor achieves military objectives; coalition objectives achieved, casualties high; coalition objectives achieved, casualties reduced; decisive. Pr(win) = .505, .846, .937, .968, and .981 at FER = 1.0, 2.0, 3.0, 4.0, and 5.0, respectively; the FER axis extends to 100.0, with ODS marked beyond the decisive level. SOURCE: Adapted from Thompson (1992) and Helmbold (1992).
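The Pr(win) values from Figure B-3 can be used to sketch the risk argument numerically. The points below are read off the figure, and the piecewise-linear interpolation is our own illustration, not the CAA curve fit:

```python
# (FER, Pr(win)) points as read from Figure B-3.
POINTS = [(1.0, 0.505), (2.0, 0.846), (3.0, 0.937), (4.0, 0.968), (5.0, 0.981)]

def pr_win(fer):
    """Interpolate Pr(win) between the tabulated points, clamping at the ends."""
    if fer <= POINTS[0][0]:
        return POINTS[0][1]
    if fer >= POINTS[-1][0]:
        return POINTS[-1][1]
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if x0 <= fer <= x1:
            return y0 + (y1 - y0) * (fer - x0) / (x1 - x0)

print(round(pr_win(2.5), 2))      # 0.89: the "win ~90 out of 100" case
print(round(1 - pr_win(5.0), 3))  # 0.019: lose less than 2 percent of the time
```

The flattening of the curve above FER = 4.0 is what makes FER = 5.0 a natural "decisive" threshold: further gains buy only marginal reductions in loss probability.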

The U.S. Army Test and Evaluation Command (ATEC) is responsible for the operational testing and evaluation of Army systems in development. ATEC requested that the National Research Council form the Panel on Operational Test Design and Evaluation of the Interim Armored Vehicle (Stryker). The charge to this panel was to explore three issues concerning the IOT plans for the Stryker/SBCT. First, the panel was asked to examine the measures selected to assess the performance and effectiveness of the Stryker/SBCT in comparison both to requirements and to the baseline system. Second, the panel was asked to review the test design for the Stryker/SBCT initial operational test to see whether it is consistent with best practices. Third, the panel was asked to identify the advantages and disadvantages of techniques for combining operational test data with data from other sources and types of use. In a previous report (appended to the current report) the panel presented findings, conclusions, and recommendations pertaining to the first two issues: measures of performance and effectiveness, and test design. In the current report, the panel discusses techniques for combining information.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!