Suggested Citation: "4. Statistical Design." National Research Council. 2003. Improved Operational Testing and Evaluation: Better Measurement and Test Design for the Interim Brigade Combat Team with Stryker Vehicles: Phase I Report. Washington, DC: The National Academies Press. doi: 10.17226/10710.

4 Statistical Design

In this chapter we first discuss some broader perspectives and statistical issues associated with the design of any large-scale industrial experiment. We discuss the designs and design processes that could be implemented if a number of constraints in the operational test designs were either relaxed or abandoned. Since the operational test design for the IBCT/Stryker is now relatively fixed, the discussion is intended to demonstrate to ATEC the advantages of various alternative approaches to operational test design that could be adopted in the future, and therefore the need to reconsider these constraints. This is followed by a brief description of the current design of the IBCT/Stryker initial operational test (IOT), accompanied by a review of the design conditioned on adherence to the above-mentioned constraints.

BROAD PERSPECTIVE ON EXPERIMENTAL DESIGN OF OPERATIONAL TESTS

Constrained Designs of ATEC Operational Tests

ATEC has designed the IOT to be consistent with the following constraints:

1. Aside from statistical power calculations, little information on the performance of IBCT/Stryker or the baseline Light Infantry Brigade (LIB) from modeling or simulation, developmental testing, or the performance of similar systems is used to impact the allocation of test samples in the test design. In particular, this precludes increasing the test sample size for environments for which the IBCT/Stryker or the baseline has proved to be problematic in previous tests.

2. The allocation of test samples to environments is constrained to reflect the allocation of use detailed in the operational mission summary/mission profile (OMS/MP).

3. Operational tests are designed to test the system for typical stresses that will be encountered in the field. This precludes testing systems in more extreme environments to provide information on the limitations of system performance.

4. Possibly most important, operational tests are, very roughly speaking, single test events. It is currently not typical for an operational test either to be carried out in stages or to include use of smaller-scale tests with operationally relevant features focused on specific issues of interest.

Reconsidering Operational Test Design: Initial Operational Testing Should Not Commence Until System Design Is Mature

Strict adherence to the above constraints results in designs that have substantial disadvantages compared with current methods used in industrial settings. The following discussion provides some characteristics of operational test designs that could be implemented if these constraints were relaxed or removed.

There are two broad goals of any operational test: to learn about a system's performance and its performance limitations in a variety of settings, and to confirm either that a system meets its requirements or that it outperforms a baseline system (with respect to average performance over a variety of environments). A fundamental concern with the current approach adopted by ATEC is that both of these objectives are unlikely to be well addressed by the same test design; as a result, ATEC has (understandably) focused on the confirmatory objective, with emphasis on designs that support significance testing.

Given either a learning or a confirmatory objective, a requisite for operational testing is that it should not commence until the system design is mature. Developmental testing should be used to find major design flaws, including many of those that would typically arise only in operationally realistic conditions. Even fine-tuning the system to improve performance should be carried out during developmental testing. This is especially true for suitability measures. Operational testing performs a difficult and crucial role in that it is the only test of the system as a whole in realistic operational conditions. Operational testing can be used to determine the limitations and value, relative to a baseline system, of a new system in realistic operational conditions in carrying out various types of missions. While operational testing can reveal problems that cannot be discovered, or discovered as easily, in other types of testing, the primary learning that should take place during operational test should be the development of a better understanding of system limitations, i.e., the circumstances under which the system performs less well and under which it excels (relative to a baseline system). Discovering major design flaws during an operational test that could have been discovered earlier compromises the ability of the operational test to carry out these important functions.

The benefit of waiting until a system design is mature before beginning operational testing does not argue against the use of spiral development. In that situation, for a given stage of acquisition, one should wait until that stage of development is mature before entering operational test. That does not then preclude the use of evolutionary acquisition for subsequent stages of development. (This issue is touched on in Chapter 6.)

Multiple Objectives of Operational Testing and Operational Test Design: Moving Beyond Statistical Significance as a Goal

Operational test designs need to satisfy a number of objectives. Major defense systems are enormously complicated, with performance that can change in important ways as a result of changes in many factors of interest. Furthermore, there are typically dozens of measures for which information on performance is needed. These measures usually come in two major types: those used to compare a new system with a baseline system[1] and those used to compare a new system with its requirements, as provided in the Operational Requirements Document (ORD).

[1] Though not generally feasible, the use of multiple baselines should sometimes be considered, since for some environments some baselines would provide little information as comparison systems.

In nearly all cases, it is impossible to identify a single operational test design that is simultaneously best for identifying how various factors affect system performance for dozens of measures of interest. Test designs that would be optimal for the task of comparing a system with requirements would not generally be as effective for comparing a system with a baseline, and test designs that would be optimal for measures of suitability would not generally be excellent for measures of effectiveness. In practice, one commonly selects a primary measure (one that is of greatest interest), and the design is selected to perform well for that measure. The hope is that the other measures of interest will be related in some fashion to the primary measure, and therefore the test design to evaluate the primary measure will be reasonably effective in evaluating most of the remaining measures of interest. (If there are two measures of greatest interest, a design can be found that strikes a balance between the performance for the two measures.)

In addition, operational tests can have a number of broader goals:

1. to understand not only how much the various measures differ for the two systems but also why the measures differ;

2. to identify additional unknown factors that affect system performance or that affect the difference between the operation of the system being tested and the baseline system;

3. to acquire a better strategic understanding of the system, for example, to develop a greater understanding of the value of information, mobility, and lethality for performance;

4. to understand the demands on training and the need for system expertise in operating the system in the field; and

5. to collect sufficient information to support models and simulations on system performance.

The test and evaluation master plan (TEMP) states that the Stryker

    has utility in all operational environments against all projected future threats; however, it is designed and optimized for contingency employment in urban or complex terrain while confronting low- and mid-range threats that may display both conventional and asymmetric warfare capabilities.

Clearly, the operational test for Stryker will be relied on for a number of widely varied purposes.

As stated above, ATEC's general approach to this very challenging problem focuses on the objective of confirming performance and uses the statistical concept of significance testing: comparing the performance of IBCT/Stryker against the baseline (LIB) to establish that the former is preferred to the latter.

In addition, there is some testing against specific requirements (e.g., Stryker has a requirement for 1,000 mean miles traveled between failures). This approach, which results in the balanced design described in Chapter 2 (for a selected number of test design factors), does not provide as much information as other approaches could in assessing the performance of the system over a wide variety of settings. To indicate what might be done differently in the IBCT/Stryker IOT (and for other systems in the future), we discuss here modifications to the sample size, test design, and test factor levels.

Sample Size

Given that operational tests have multiple goals (i.e., learning and confirming for multiple measures of interest), arguments for appropriate sample sizes for operational tests are complicated. Certainly, sample sizes that support minimal power at reasonable significance levels for testing primary measures of importance provide a starting point for sample size discussions. However, for complicated, expensive systems, given the dynamic nature of system performance as a function of a number of different factors of importance (e.g., environments, mission types), it is rare that one will have sufficient sample size to be able to achieve adequate power. (Some benefits in decreasing test sample size for confirmatory purposes can be achieved through use of sequential testing, when feasible.) Therefore, budgetary limitations will generally drive sample size calculations for operational tests. However, when that is not the case, the objectives of learning about system performance, in addition to that of confirming improvement over a baseline, argue for additional sample size so that these additional objectives can be addressed. Therefore, rather than base sample size arguments solely on power calculations, the Army needs to allocate as much funding as various external constraints permit to support operational test design.

Testing in Scenarios in Which Performance Differences Are Anticipated

As mentioned above, ATEC believes that it is constrained to allocate test samples to mission types and environments to reflect expected field use, as provided in the OMS/MP. This constraint is unnecessary, and it works against the more important goal of understanding the differences between the IBCT/Stryker and the baseline and the causes of these differences.

If a specific average (one that reflects the OMS/MP) of performance across mission type is desired as part of the test evaluation, a reweighting of the estimated performance measures within scenario can provide the desired summary measures a posteriori. Therefore, the issue of providing specific averages in the evaluation needs to be separated from allocation of test samples to scenarios.

As indicated above, test designs go hand-in-hand with test goals. If the primary goal for ATEC in carrying out an operational test is to confirm that, for a specific average over scenarios that conforms to the OMS/MP missions and environments, the new system significantly outperforms the baseline, then allocations that mimic the OMS/MP may be effective. However, if the goal is one of learning about system performance for each scenario, then, assuming equal variances of the performance measure across scenarios, the allocation of test samples equally to test scenarios would be preferable to allocations that mimic the OMS/MP.

More broadly, general objectives for operational test design could include: (1) testing the average performance across scenarios (reflecting the OMS/MP) of a new system against its requirements, (2) testing the average performance of a new system against the baseline, (3) testing performance of a new system against requirements or against a baseline for individual scenarios, or (4) understanding the types of scenarios in which the new system will outperform the baseline system, and by how much. Each of these goals would generally produce a different optimal test design.

In addition to test objectives, test designs are optimized using previous information on system performance, which is typically means and variances of performance measures for the system under test and for the baseline system. This is a catch-22 in that the better one is able to target the design based on estimates of these quantities, the less one would clearly need to test, because the results would be known. Nevertheless, previous information can be extremely helpful in designing an operational test to allocate test samples to scenarios to address test objectives.

Specifically, if the goal is to obtain high power, within each scenario, for comparing the new system with the baseline system on an important measure, then a scenario in which the previous knowledge was that the mean performance of the new system was close to that for the baseline would result in a large sample allocation to that scenario to identify which system is, in fact, superior. But if the goal is to better understand system performance within the scenarios for which the new system outperforms the baseline system, previous knowledge that the mean performances were close would argue that test samples should be allocated to other test scenarios in which the new system might have a clear advantage.

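The separation of reporting weights from test allocation can be made concrete with a small numerical sketch. Every number below (the scenario labels, per-scenario estimates and standard errors, OMS/MP weights, and the common within-scenario standard deviation) is a hypothetical placeholder rather than a value from the IOT plan; the point is only that an OMS/MP-weighted summary can be computed after the fact from per-scenario estimates, whatever allocation was used to collect them, while the allocation itself determines how precisely each individual scenario is estimated.

```python
import numpy as np

# Hypothetical per-scenario results (NOT actual IOT data): estimated mean
# performance difference (IBCT/Stryker minus baseline) and its standard error.
scenarios = ["urban/high intensity", "rural/medium intensity", "rural/low intensity"]
est_diff = np.array([1.1, 0.6, 0.2])
std_err = np.array([0.35, 0.30, 0.25])

# Hypothetical OMS/MP usage weights for the same scenarios.
oms_mp_weights = np.array([0.5, 0.3, 0.2])

# A posteriori OMS/MP-weighted summary, computed by reweighting the
# per-scenario estimates regardless of how test samples were allocated.
weighted_diff = np.sum(oms_mp_weights * est_diff)
weighted_se = np.sqrt(np.sum((oms_mp_weights * std_err) ** 2))
print(f"OMS/MP-weighted difference: {weighted_diff:.2f} (SE {weighted_se:.2f})")

# Allocation comparison: with a fixed number of missions per system and a
# common within-scenario standard deviation, equal allocation gives every
# scenario the same precision, while an OMS/MP-proportional allocation
# concentrates precision on the heavily weighted scenarios.
n_total, sigma = 36, 1.2
for name, alloc in [("equal allocation", np.full(3, n_total / 3)),
                    ("OMS/MP-proportional", n_total * oms_mp_weights)]:
    print(name, "per-scenario SEs:", np.round(sigma / np.sqrt(alloc), 2))
```

Under the equal-variance assumption discussed above, equal allocation is the better choice for learning about each scenario, and the OMS/MP-weighted roll-up remains available afterward by reweighting.
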
Information from developmental tests, modeling and simulation, and the performance of similar systems with similar components should be used to target the test design to help it meet its objectives. For IBCT/Stryker, background documents have indicated that the Army expects that differences at low combat intensity may not be practically important but that IBCT/Stryker will be clearly better than the baseline for urban and high-intensity missions. If the goal is to best understand the performance of IBCT/Stryker in scenarios in which it is expected to perform well, it would be sensible to test very little in low-intensity scenarios, since there are unlikely to be any practical and statistically detectable differences in performance between IBCT/Stryker and the baseline. Understanding the advantages of IBCT/Stryker is a key part of the decision whether to proceed to full-rate procurement; therefore, understanding the degree to which Stryker is better in urban, high-intensity environments is important, and so relatively more samples should be allocated to those situations. There may be other expectations concerning IBCT/Stryker that ATEC could comfortably rely on to adapt the design to achieve various goals.

Furthermore, because the baseline has been used for a considerable length of time, its performance characteristics are better understood than those of IBCT/Stryker. While this may be less clear for the specific scenarios under which IBCT/Stryker is being tested, allocating 42 scenarios to the baseline system may be inefficient compared with the allocation of greater test samples to IBCT/Stryker scenarios.

Testing with Factors at High Stress Levels

A general rule of thumb in test design is that testing at extremes is often more informative than testing at intermediate levels, because information from the extremes can often be used to estimate what would have happened at intermediate levels. In light of this, it is unclear how extreme the high-intensity conflict is, as currently scripted. For example, would the use of 300 OPFOR players be more informative than current levels? Our impression is that, in general, operational testing tests systems at typical stress levels. If testing were carried out in somewhat more stressful situations than are likely to occur, information would be obtained about when a system is likely to start breaking down, as well as on system performance for typical levels of stress (although interpolation from the stressful conditions back to typical conditions may be problematic). Such a trade-off should be considered in the operational test design for IBCT/Stryker. In the following section, a framework is suggested in which the operational test is separated into a learning component and a confirming component. Clearly, testing with factors at high stress levels naturally fits into the learning component of that framework, since it is an important element in developing a complete understanding of the system's capabilities and limitations.

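The rule of thumb about testing at extremes has a simple design-theory illustration. The sketch below, using purely notional numbers and assuming a straight-line relationship between a single stress factor and performance over the range considered, compares the variance of a fitted-line prediction at intermediate (typical) stress levels for two designs with the same number of runs: one placing runs at the extremes of the stress range and one clustering them near the middle.

```python
import numpy as np

def prediction_variance(x_design, x0, sigma2=1.0):
    """Variance of the fitted straight-line prediction at x0, for a simple
    linear regression with design points x_design and error variance sigma2."""
    x = np.asarray(x_design, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    return sigma2 * (1.0 / len(x) + (x0 - x.mean()) ** 2 / sxx)

# Stress factor scaled to [0, 1]; eight runs under each notional design.
extreme_design = [0, 0, 0, 0, 1, 1, 1, 1]                 # runs at the extremes
middle_design = [0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7]  # runs near the middle

for x0 in (0.25, 0.5, 0.75):   # typical, intermediate stress levels
    v_ext = prediction_variance(extreme_design, x0)
    v_mid = prediction_variance(middle_design, x0)
    print(f"x0={x0:.2f}: extreme-design variance={v_ext:.3f}, "
          f"middle-design variance={v_mid:.3f}")
```

With the same number of runs and the same error variance, the extreme-point design predicts intermediate stress levels at least as precisely as the clustered design and far more precisely away from the center; the caveat, as noted above, is that this relies on the assumed model holding across the range.
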
Alternatives to One Large Operational Test

In the National Research Council's 1998 report Statistics, Testing, and Defense Acquisition, two possibilities were suggested as alternatives to large operational tests: operational testing carried out in stages and small-scale pilot tests. In this section, we discuss how these ideas might be implemented by ATEC.

We have classified the two basic objectives of operational testing as learning what a system is (and is not) capable of doing in a realistic operational setting, and confirming that a new system's performance is at a certain level or outperforms a baseline system. Addressing these two types of objectives in stages seems natural, with the objective at the first stage being to learn about system performance and the objective at the second stage to confirm a level of system performance.

An operational test could be phased to take advantage of this approach: the first phase might be to examine IBCT/Stryker under different conditions, to assess when this system works best and why. The second phase would be used to compare IBCT/Stryker with a baseline; it would serve as the confirmation experiment used to support the decision to proceed to full-rate production. In the second phase, IBCT/Stryker would be compared with the baseline only in the best and worst scenarios. This broad testing strategy is used by many companies in the pharmaceutical industry and is more fully described in Box, Hunter, and Hunter (1978).

Some of the challenges now faced by ATEC result from an attempt to simultaneously address the two objectives of learning and confirming. Clearly, they will often require very different designs. Although there are pragmatic reasons why a multistage test may not be feasible (e.g., difficulty reserving test facilities and scheduling soldiers to carry out the test missions), if these reasons can be addressed the multistage approach has substantial advantages. For example, since TRADOC already conducts some of the learning phase, their efforts could be better integrated with those of ATEC.

Also, a multistage process would have implications for how development testing is carried out, especially with respect to the need to have developmental testing make use of as much operational realism as possible, and to have the specific circumstances of developmental test events documented and archived for use by ATEC. An important advantage of this overall approach is that the final operational test may turn out to be smaller than is currently the case.

When specific performance or capability questions come up in the early part of operational testing, small-scale pilot tests, focused on the analysis of these questions, should be seriously considered. For example, the value of situation awareness is not directly addressed by the current operational test for IBCT/Stryker (unless the six additional missions identified in the test plan are used for this purpose). It would be very informative to use Stryker with situation awareness degraded or "turned off" to determine the value that it provides in particular missions (see Chapter 3).

COMMENTS ON THE CURRENT DESIGN IN THE CONTEXT OF CURRENT ATEC CONSTRAINTS

Using the arguments developed above and referring to the current design of the operational test as described in Chapter 2 (and illustrated in Table 2-1), the discussion that follows takes into account the following constraints of the current test design:

1. Essentially no information about the performance of IBCT/Stryker or the baseline has been used to impact the allocation of test samples in the test design.

2. The allocation of test samples to scenarios is constrained to reflect the allocation of use detailed in the OMS/MP.

3. Operational tests are designed to test the system for typical stresses that will be encountered in the field.

4. Operational tests are single test events.

Balanced Design

The primary advantage of the current operational test design is that it is balanced. This means that the test design covers the design space in a systematic and relatively uniform manner (specifically, three security operations in a stable environment, SOSE, appear for every two perimeter defense missions). It is a robust design, in that the test will provide direct, substantial information from all parts of the design space, reducing the need to extrapolate. Even with moderate amounts of missing data, resulting from an inability to carry out a few missions, some information will still be available from all design regions. Furthermore, if there are no missing data, the balance will permit straightforward analysis and presentation of the results.

More specifically, estimation of the effect of any individual factor can be accomplished by collapsing the test results over the remaining factors. And, since estimates of the design effects are uncorrelated in this situation, inference for one effect does not depend on others. However, many of these potential advantages of balance can be lost if there are missing data. If error variances turn out to be heterogeneous, the balanced design will be inefficient compared with a design that would have a priori accommodated the heterogeneity.

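The analytical payoff of balance can be seen in a small sketch. The two-factor layout below is notional (two coded factors at plus/minus one with equal replication, not the actual IOT factors); it shows that in a balanced design the least-squares estimates of the factor effects are uncorrelated, so each effect can be read off by collapsing over the other factor, and that losing even a couple of runs from one cell introduces correlation among the estimates.

```python
import numpy as np

# Notional balanced design: two factors (say, intensity and terrain) at coded
# levels -1/+1, with three replicates per cell -> 12 runs in all.
levels = [-1, 1]
runs = np.array([(a, b) for a in levels for b in levels for _ in range(3)])

def effect_correlation(design):
    """Correlation matrix of the least-squares estimates for the model
    intercept + factor A + factor B, given the coded design matrix."""
    X = np.column_stack([np.ones(len(design)), design[:, 0], design[:, 1]])
    cov = np.linalg.inv(X.T @ X)          # proportional to Cov(beta-hat)
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

print("balanced design:\n", np.round(effect_correlation(runs), 3))

# Drop two runs from one cell (two missions not carried out): orthogonality
# is lost and the effect estimates become correlated.
unbalanced = np.delete(runs, [0, 1], axis=0)
print("after losing two runs:\n", np.round(effect_correlation(unbalanced), 3))
```
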
The primary disadvantage of the current design is that there is a very strong chance that observed differences will be confounded by important sources of uncontrolled variation. The panel discussed one potential source of confounding in its October 2002 letter report (see Appendix A), which recommends that the difference in starting time between the IBCT/Stryker test missions and the baseline test missions be sufficiently shortened to reduce any differences that seasonal changes (e.g., in foliage and temperature) might cause. Other potential sources of confounding include: (1) player differences due to learning, fatigue, training, and overall competence; (2) weather differences (e.g., amount of precipitation); and (3) differences between IBCT/Stryker and the baseline with respect to the number of daylight and nighttime missions.

In addition, the current design is not fully orthogonal (or balanced), which is evident when the current design is collapsed over scenarios. For example, for company B in the SOSE mission type, the urban missions have higher intensity than the rural missions. (After this was brought to the attention of ATEC, they were able to develop a fully balanced design, but they were too far advanced in the design phase to implement this change.) While the difference between the two designs appears to be small in this particular case, we are nevertheless disappointed that the best possible techniques are not being used in such an important program. This is an indication of the need for access (in this case earlier access) to better statistical expertise in the Army test community, discussed in Chapter 6 (as well as in National Research Council, 1998a).

During the operational test, the time of day at which each mission begins is recorded, providing some possibility of checking for player learning and player fatigue.

One alternative to address confounding due to player learning is to use four separate groups of players, one for each of the two OPFORs, one for the IBCT/Stryker, and one for the baseline system. Intergroup variability appears likely to be a lesser problem than player learning. Alternating teams from test replication to test replication between the two systems under test would be a reasonable way to address differences in learning, training, fatigue, and competence. However, we understand that either idea might be very difficult to implement at this date.[2]

[2] It is even difficult to specify exactly what one would mean by "equal training," since the amount of training needed for the IBCT to operate Stryker is different from that for a Light Infantry Brigade.

The confounding factor of extreme weather differences between Stryker and the baseline system can be partially addressed by postponing missions during heavy weather (although this would prevent gaining an understanding of how the system operates in those circumstances). Finally, the lack of control for daylight and nighttime missions remains a concern. It is not clear why this variable could not have been included as a design variable.

Aside from the troublesome confounding issue (and the power calculations commented on below), the current design is competent from a statistical perspective. However, measures to address the various sources of confounding need to be seriously considered before proceeding.

Comments on the Power Calculations[3]

ATEC designed the IBCT/Stryker IOT to support comparisons of the subject-matter expert (SME) ratings between IBCT/Stryker and the baseline for particular types of missions, for example, high-intensity urban missions and medium-intensity rural SOSE missions. In addition, ATEC designed the operational test for IBCT/Stryker to support comparisons relative to attrition at the company level. ATEC provided analyses to justify the assertion that the current test design has sufficient power to support some of these comparisons. We describe these analyses and provide brief comments below.

[3] The source for this discussion is U.S. Department of Defense (2002b).

SME ratings are reported on a subjective scale that ranges from 1 to 8. SMEs will be assigned randomly, with one SME assigned per company, and two SMEs assigned to each platoon mission. SMEs will be used to evaluate mission completion, protection of the force, and avoidance of collateral damage, which results in 10 to 16 comparisons per test. Assuming that the size of an individual significance test was set equal to 0.01, and that there are 10 different comparisons that are likely to be made, from a Bonferroni-type argument the overall size of the significance tests would be at most 0.1. In our view this control of individual errors is not crucial, and ATEC should instead examine two or three key measures and carry out the relevant comparisons with the knowledge that the overall type I error may be somewhat higher than the stated significance level.

Using previous experience, ATEC determined that it was important to have sufficient power to detect an average SME rating difference of 1.0 for high-intensity missions, 0.75 for medium-intensity missions, 0.50 for low-intensity missions, and a 0.75 difference overall. (We have taken these critical rating differences as given, because we do not know how these values were justified; we have questioned above the allocation of test resources to low-intensity missions.) ATEC carried out simulations of SME differences to assess the power of the current operational test design for IBCT/Stryker. While this is an excellent idea in general, we have some concerns as to how these particular simulations were carried out.

First, due to the finite range of the ratings difference distribution, ATEC expressed concern that the nonnormality of average SME ratings differences (in particular, the short tail of its distribution) may affect the coverage properties of any confidence intervals produced in the subsequent analysis. We are convinced that even with relatively small sample sizes, the means of SME rating differences will be well represented by a normal distribution as a result of the structure of the original distribution and the central limit theorem, and that taking differences counters skewness effects. Therefore the nonnormality of SME ratings differences should not be a major concern.

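The central limit theorem argument can be checked directly by simulation. The 1-to-8 rating distributions below are invented for illustration (skewed toward the top of the scale) and are not taken from any test data; the point is only how the mean of a modest number of bounded, integer-valued rating differences behaves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented marginal distributions on the 1-8 scale for the two systems,
# skewed toward high ratings; illustrative only, not IOT values.
scale = np.arange(1, 9)
p_new = np.array([0.01, 0.02, 0.04, 0.08, 0.15, 0.25, 0.30, 0.15])
p_base = np.array([0.02, 0.04, 0.08, 0.15, 0.25, 0.25, 0.15, 0.06])

def mean_rating_difference(n_missions):
    """Average of n per-mission rating differences (new minus baseline)."""
    new = rng.choice(scale, size=n_missions, p=p_new)
    base = rng.choice(scale, size=n_missions, p=p_base)
    return (new - base).mean()

# Sampling distribution of the mean difference for a modest sample size.
n, reps = 18, 20_000
means = np.array([mean_rating_difference(n) for _ in range(reps)])

# How normal does it look? Standardized third and fourth moments should be
# near 0, and the 1.96-sigma tail probability near 0.05, for a normal law.
z = (means - means.mean()) / means.std()
print("skewness:", round((z**3).mean(), 3))
print("excess kurtosis:", round((z**4).mean() - 3, 3))
print("P(|Z| > 1.96):", round((np.abs(z) > 1.96).mean(), 3), "(normal: 0.05)")
```

Even with 18 simulated mission differences per average, the sampling distribution of the mean has standardized moments and tail behavior close to those of a normal distribution, supporting the view that nonnormality of the ratings differences is a secondary concern.
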
ATEC reports that they collected historical task-rating differences and determined that the standard deviation of these differences was 1.98, which includes contributions from both random variation and variation in performance between systems. Then ATEC modeled SME rating scores for both IBCT/Stryker and the baseline using linear functions of the controlled variables from the test design. These linear functions were chosen to produce SME scores in the range between 1 and 8. ATEC then added to these linear functions a simulated random error variation of +1, 0, and -1, each with probability 1/3. The resulting SME scores were then truncated to make them integral (and to lie between 1 and 8). The residual standard error of differences of these scores was then estimated, using simulation, to be 1.2.[4] In addition, SME ratings differences (which include random variation as well as modeled performance differences) were simulated, with a resulting observed standard deviation of 2.04. Since this value was close enough to the historical value of 1.98, it supported their view that the amount of random variation added was similar to what would be observed for SMEs in the field.

[4] For this easy example, simulation was not needed, but simulation might be required in more complicated situations.

The residual standard error of the mean is defined to be the residual standard error divided by the square root of the sample size. So, when the test sample size that can be used for a comparison is 36 (essentially the entire operational test minus the 6 additional missions), the residual standard error of the mean will be 0.20; twice that is 0.40. ATEC's analysis argues that since 0.75 is larger than 0.40, the operational test will have sufficient statistical power to find a difference of 0.75 in SME ratings. The same argument was used to show that interaction effects estimated using test sample sizes of 18 or 12 would also have sufficient statistical power, but interaction effects estimated using test sample sizes of 6 or 4 would not have sufficient statistical power to identify SME ratings differences of 0.75. Furthermore, if the residual standard error of ratings differences were as high as 1.4, a sample size of 12 would no longer provide sufficient power to identify a ratings difference of 0.75.

Our primary concern with this analysis is that the random variation of SME scores has not been estimated directly. It is not clear why SME rating differences would behave similarly to the various historic measures (see Chapter 3). It would have been preferable to run a small pilot study to provide preliminary estimates of these measures and their variance. If that is too expensive, ATEC should identify those values for which residual standard errors provide sufficient power at a number of test sample sizes, as a means of assessing the sensitivity of their analysis to the estimation of these standard errors. (ATEC's point about the result when the residual standard deviation is raised to 1.4 is a good start to this analysis.)

ATEC has suggested increasing statistical power by combining the ratings for a company mission, or by combining ratings for company and platoon missions. We are generally opposed to this idea if it implies that the uncombined ratings will not also be reported.

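The arithmetic in ATEC's argument, as described above, can be laid out and stress-tested in a few lines. The sketch below takes the simulated residual standard deviation of rating differences (1.2, with 1.4 as the sensitivity value mentioned in the text) and, for each of the sample sizes cited, compares twice the standard error of the mean difference, the rough detectability criterion used in the argument, against the target rating differences.

```python
import math

targets = {"high intensity": 1.0, "medium intensity": 0.75, "low intensity": 0.5}
sample_sizes = [36, 18, 12, 6, 4]

for resid_sd in (1.2, 1.4):              # simulated value and sensitivity check
    print(f"residual SD of rating differences = {resid_sd}")
    for n in sample_sizes:
        se_mean = resid_sd / math.sqrt(n)    # SE of the mean rating difference
        detectable = 2 * se_mean             # the rough "2 x SE" criterion
        flags = [name for name, d in targets.items() if d > detectable]
        print(f"  n={n:2d}: SE={se_mean:.2f}, 2*SE={detectable:.2f}, "
              f"targets clearing the criterion: {flags or 'none'}")
```

This reproduces the pattern described in the text: at a residual standard deviation of 1.2 the 0.75 target clears the criterion down to a sample size of 12 but not at 6 or 4, and at 1.4 it no longer clears it at 12. Note that this is only the back-of-the-envelope criterion, not a formal power calculation at stated type I and type II error rates.
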
During various missions in the IBCT/Stryker operational test, the number of OPFOR players ranges from 90 to 220, and the number of noncombatant or blue forces is constant at 120. Across 36 missions, there are 10,140 potential casualties. For a subset of these (e.g., blue force players), the potential casualties range from 500 to 4,320. ATEC offered analysis asserting that with casualty rates of 13 percent for the baseline and 10 percent for IBCT/Stryker, it will be possible to reject the null hypothesis of equal casualty rates for the two systems under test with statistical power greater than 75 percent. It is not clear what distribution ATEC has assumed for casualty counts, but likely candidates are binomial and Poisson models.

That analysis may be flawed in that it makes use of an assumption that is unlikely to hold: that individual casualties are independent of one another. Clearly, battles that go poorly initially are likely to result in more casualties, due to a common conditioning event that makes individual casualty events dependent. As a result, these statistical power calculations are unlikely to be reliable. Furthermore, not only are casualties not independent, but even if they were, they should not be rolled up across mission types. For example, with respect to assessment of the value of IBCT/Stryker, one casualty in a perimeter defense mission does not equal one casualty in a raid.

The unit of analysis appears to be a complicated issue in this test. For example, the unit of analysis is assumed by ATEC to be a mission or a task for SMEs, but it is assumed to be an individual casualty for the casualty rate measures. Both positions are somewhat extreme. The mission may in some cases be too large to use as the unit of analysis. Individual skirmishes and other events occurring within a mission could be assumed to be relatively independent and objectively assessed or measured, either by SMEs or by instrumentation. In taking this intermediate approach, the operational test could be shown to have much greater power to identify various differences than the SME analysis discussed above indicates.

Finally, we note that although the current operational test tests only company-level operations, brigade-level testing could be accomplished by using one real brigade-level commander supported by (a) two real battalion commanders, each supported by one real company and two simulated companies, and (b) one simulated battalion commander supported by three simulated companies.

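The sensitivity of the casualty-rate power claim to the independence assumption, raised above, can be illustrated with a small simulation. The sketch below uses figures consistent with the text (36 missions, 120 blue players per mission taken as the population at risk, and casualty rates of 13 versus 10 percent) together with invented values for the within-mission correlation of casualty outcomes, modeled here with a beta-binomial; it applies the usual two-proportion z-test, which treats every casualty as an independent observation, to data generated with and without that correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
n_missions, players = 36, 120   # 36 missions; 120 blue players per mission at risk
reps = 4_000

def total_casualties(p, icc):
    """Total casualties over all missions. icc > 0 lets the per-mission
    casualty probability vary (beta-binomial clustering); icc = 0 is the
    pure binomial, independent-casualty case."""
    if icc == 0.0:
        return rng.binomial(n_missions * players, p)
    a = p * (1 - icc) / icc
    b = (1 - p) * (1 - icc) / icc
    return rng.binomial(players, rng.beta(a, b, size=n_missions)).sum()

def rejection_rate(p_base, p_new, icc):
    """How often a naive two-proportion z-test, computed as if casualties
    were independent, rejects equal rates at the one-sided 5 percent level."""
    n = n_missions * players
    hits = 0
    for _ in range(reps):
        x_base = total_casualties(p_base, icc)
        x_new = total_casualties(p_new, icc)
        p_pool = (x_base + x_new) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        hits += (x_base - x_new) / n / se > 1.645
    return hits / reps

for icc in (0.0, 0.02, 0.05):   # invented within-mission correlations
    power = rejection_rate(0.13, 0.10, icc)  # casualty rates cited by ATEC
    size = rejection_rate(0.13, 0.13, icc)   # no true difference
    print(f"within-mission correlation {icc:.2f}: "
          f"rejection rate with a real 13%-vs-10% difference = {power:.2f}, "
          f"false-alarm rate with no difference = {size:.2f}")
```

Even modest within-mission correlation inflates the false-alarm rate of the naive test well above its nominal 5 percent level, so the quoted power figure no longer has its usual meaning; this is one concrete sense in which power calculations that treat the individual casualty as the unit of analysis are unlikely to be reliable.
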
SUMMARY

It is inefficient to discover major design flaws during an operational test that could have been discovered earlier in developmental test. Operational test should instead focus its limited sample size on providing operationally relevant information sufficient to support the decision of whether to proceed to full-rate production, and sufficient to refine the system design to address operationally relevant deficiencies. The current design for the IBCT/Stryker operational test is driven by the overall goal of testing the average difference, but it is not as effective at providing information for different scenarios of interest. The primary disadvantage of the current design, in the context of current ATEC constraints, is that there is a distinct possibility that observed differences will be confounded by important sources of uncontrolled variation (e.g., factors associated with seasonal differences).

In the panel's view, it would be worthwhile for ATEC to consider a number of changes in the IBCT/Stryker test design:

1. ATEC should consider, for future test designs, relaxing various rules of test design that it adheres to, by (a) not allocating sample size to scenarios to reflect the OMS/MP, but instead using principles from optimal experimental design theory to allocate sample size to scenarios, (b) testing under somewhat more extreme conditions than typically will be faced in the field, (c) using information from developmental testing to improve test design, and (d) separating the operational test into at least two stages, learning and confirmatory.

2. ATEC should consider applying to future operational testing in general a two-phase test design that involves, first, learning-phase studies that examine the test object under different conditions, thereby helping testers design further tests to elucidate areas of greatest uncertainty and importance, and, second, a phase involving confirmatory tests to address hypotheses concerning performance vis-à-vis a baseline system or in comparison with requirements. ATEC should consider taking advantage of this approach for the IBCT/Stryker IOT, that is, examining in the first phase IBCT/Stryker under different conditions, to assess when this system works best, and why, and conducting a second phase to compare IBCT/Stryker to a baseline, using this confirmation experiment to support the decision to proceed to full-rate production. An important feature of the learning phase is to test with factors at high stress levels in order to develop a complete understanding of the system's capabilities and limitations.

3. When specific performance or capability problems come up in the early part of operational testing, small-scale pilot tests, focused on the analysis of these problems, should be seriously considered. For example, ATEC should consider test conditions that involve using Stryker with situation awareness degraded or turned off to determine the value that it provides in particular missions.

4. ATEC should eliminate from the IBCT/Stryker IOT one significant potential source of confounding, seasonal variation, in accordance with the recommendation provided earlier in the October 2002 letter report from the panel to ATEC (see Appendix A). In addition, ATEC should also seriously consider ways to reduce or eliminate possible confounding from player learning and day/night imbalance. One possible way of addressing the concern about player learning is to use four separate groups of players for the two OPFORs, the IBCT/Stryker, and the baseline system. Also, alternating teams from test replication to test replication between the two systems under test would be a reasonable way to address differences in learning, training, fatigue, and competence.

5. ATEC should reconsider for the IBCT/Stryker their assumption concerning the distribution of SME scores and should estimate the residual standard errors directly, for example, by running a small pilot study to provide preliminary estimates; or, if that is too expensive, by identifying those SME score differences for which residual standard errors provide sufficient power at a number of test sample sizes, as a means of assessing the sensitivity of their analysis to the estimation of these standard errors.

6. ATEC should reexamine their statistical power calculations for the IBCT/Stryker IOT, taking into account the fact that individual casualties may not be independent of one another.

7. ATEC should reconsider the current units of analysis for the IBCT/Stryker testing: a mission or a task for SME ratings, but an individual casualty for the casualty rate measures. For example, individual skirmishes and other events that occur within a mission should be objectively assessed or measured, either by SMEs or by instrumentation.

8. Given either a learning or a confirmatory objective, ignoring various tactical considerations, a requisite for operational testing is that it should not commence until the system design is mature.

Finally, to address the limitation that the current IBCT/Stryker IOT tests only company-level operations, ATEC might consider brigade-level testing, for example, by using one real brigade-level commander supported by (a) two real battalion commanders, each supported by one real company and two simulated companies, and (b) one simulated battalion commander supported by three simulated companies.
