Read "Improved Operational Testing and Evaluation: Better Measurement and Test Design for the Interim Brigade Combat Team with Stryker Vehicles: Phase I Report" at NAP.edu


Suggested Citation:"5. Data Analysis." National Research Council. 2003. Improved Operational Testing and Evaluation: Better Measurement and Test Design for the Interim Brigade Combat Team with Stryker Vehicles: Phase I Report. Washington, DC: The National Academies Press. doi: 10.17226/10710.



5 Data Analysis

The panel has noted (see the October 2002 letter report in Appendix A) the importance of determining, prior to the collection of data, the types of results expected and the data analyses that will be carried out. This is necessary to ensure that the designed data collection effort will provide enough information of the right types to allow for a fruitful evaluation. Failure to think about the data analysis prior to data collection may result in omitted explanatory or response variables or inadequate sample size to provide statistical support for important decisions. Also, if the questions of interest are not identified in advance but instead are determined by looking at the data, then it is not possible to formally address the questions, using statistical arguments, until an independent confirmatory study is carried out.

An important characteristic of the IBCT/Stryker IOT, probably in common with other defense system evaluations, is that there are a large number of measures collected during the evaluation. This includes measures of a variety of types (e.g., counts of events, proportions, binary outcomes) related to a variety of subjects (e.g., mission performance, casualties, reliability). In addition, there are a large number of questions of interest. For the IBCT/Stryker IOT, these include: Does the Stryker-equipped force outperform a baseline force? In which situations does the Stryker-equipped force have the greatest advantage? Why does the Stryker-equipped force outperform the baseline force? It is important to avoid "rolling up" the many measures into a small number of summary measures focused only on certain preidentified critical issues. Instead, appropriate measures should be used to address each of the many possible questions. It will sometimes, but certainly not always, be useful to combine measures into an overall summary measure.

The design discussion in Chapter 4 introduced the important distinction between the learning phase of a study and the confirmatory phase of a study. There we recommend that the study proceed in steps or stages rather than as a single large evaluation. This section focuses on the analysis of the data collected. The comments here are relevant whether a single evaluation test is done (as proposed by ATEC) or a series of studies are carried out (as proposed by the panel).

Another dichotomy that is relevant when analyzing data is that between the use of formal statistical methods (like significance tests) and the use of exploratory methods (often graphical). Formal statistical tests and procedures often play a large role in confirmatory studies (or in the confirmatory phase described in Chapter 4). Less formal methods, known as exploratory analysis, are useful for probing the data to detect interesting or unanticipated data values or patterns. Exploratory analysis is used here in the broad sense, to include but not to be limited by the methods described in Tukey (1977). Exploratory methods often make extensive use of graphs to search for patterns in the data. Exploratory analysis of data is always a good thing, whether the data are collected as part of a confirmatory study to compare two forces or as part of a learning phase study to ascertain the limits of performance for a system.

The remainder of this chapter reviews the general principles behind the formal statistical procedures used in confirmatory studies and those methods used in exploratory statistical analyses and then presents some specific recommendations for data analysis for the IBCT/Stryker IOT.
PRINCIPLES OF DATA ANALYSIS

Formal Statistical Methods in Confirmatory Analyses

A key component of any defense system evaluation is the formal comparison of the new system with an appropriately chosen baseline. It is usually assumed that the new system will outperform the baseline; hence this portion of the analysis can be thought of as confirmatory. Depending on the number of factors incorporated in the design, the statistical assessment could be a two-sample comparison (if there are no other controlled experimental or measured covariate factors) or a regression analysis (if there are other factors). In either case, statistical significance tests or confidence intervals are often used to determine if the observed improvement provided by the new system is too large to have occurred by chance.

Statistical significance tests are commonly used in most scientific fields as an objective method for assessing the evidence provided by a study. The National Research Council (NRC) report Statistics, Testing, and Defense Acquisition reviews the role and limitations of significance testing in defense testing (National Research Council, 1998a). It is worthwhile to review some of the issues raised in that report. One of the limitations of significance testing is that it is focused on binary decisions: the null hypothesis (which usually states that there is no difference between the experimental and baseline systems) is rejected or not. If it is rejected, then the main goal of the evaluation is achieved, and the data analysis may move to an exploratory phase to better understand when and why the new system is better. A difficulty with the binary decision is that it obscures information about the size of the improvement afforded by the new system, and it does not recognize the difference between statistical significance and practical significance. The outcome of a significance test is determined both by the amount of improvement observed and by the sample size. Failure to find a statistically significant difference may be because the observed improvement is less than anticipated or because the sample size was not sufficient. Confidence intervals that combine an estimate of the improvement provided by the new system with an estimate of the uncertainty or variability associated with the estimate generally provide more information. Confidence intervals provide information about whether the hypothesis of "no difference" is plausible given the data (as do significance tests) but also inform about the likely size of the improvement provided by the system and its practical significance. Thus confidence intervals should be used with or in place of significance tests.

Other difficulties in using and interpreting the results of significance tests are related to the fact that the two hypotheses are not treated equally. Most significance test calculations are computed under the assumption that the null hypothesis is correct. Tests are typically constructed so that a rejection of the null hypothesis confirms the alternative that we believe (or hope) to be true. The alternative hypothesis is used to suggest the nature of the test and to define the region of values for which the null hypothesis is rejected. Occasionally the alternative hypothesis also figures in statistical power calculations to determine the minimum sample size required in order to be able to detect differences of practical significance. Carrying out tests in this way requires trading off the chances of making two possible errors: rejecting the null hypothesis when it is true and failing to reject the null hypothesis when it is false. Often in practice, little time is spent determining the relative cost of these two types of errors, and as a consequence only the first is taken into account and reported.

The large number of outcomes being assessed can further complicate carrying out significance tests. Traditional significance tests often are designed with a 5 or 10 percent error rate, so that significant differences are declared to be in error only infrequently. However, this also means that if formal comparisons are made for each of 20 or more outcome measures, then the probability of an error in one or more of the decisions can become quite high. Multiple comparison procedures allow for control of the experiment-wide error rate by reducing the acceptable error rate for each individual comparison. Because this makes the individual tests more conservative, it is important to determine whether formal significance tests are required for the many outcome measures. If we think of the analysis as comprising a confirmatory and exploratory phase, then it should be possible to restrict significance testing to a small number of outcomes in the confirmatory phase. The exploratory phase can focus on investigating the scenarios for which improvement seems greatest using confidence intervals and graphical techniques. In fact, we may know in advance that there are some scenarios for which the IBCT/Stryker and baseline performance will not differ, for example, in low-intensity military operations; it does not make sense to carry out significance tests when we expect that the null hypothesis is true or nearly true.
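
The arithmetic behind the multiple-comparison concern can be illustrated with a short sketch. It computes the chance of at least one false rejection across many outcome measures, and the per-test error rate that a Bonferroni correction (one common multiple comparison procedure, chosen here only as an example) would use. The sketch assumes that the tests are independent and that every null hypothesis is true; the function names are hypothetical and are not taken from any ATEC plan.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection among n_tests tests.

    Assumes (for illustration) that the tests are independent and that
    every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test error rate that keeps the experiment-wide rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_measures = 20  # "20 or more outcome measures", as in the text above
    for alpha in (0.05, 0.10):
        unadjusted = familywise_error_rate(alpha, n_measures)
        per_test = bonferroni_alpha(alpha, n_measures)
        adjusted = familywise_error_rate(per_test, n_measures)
        print(f"alpha={alpha:.2f}: unadjusted experiment-wide rate={unadjusted:.3f}, "
              f"Bonferroni per-test alpha={per_test:.4f}, "
              f"adjusted experiment-wide rate={adjusted:.3f}")
```

Under these assumptions, 20 tests at the 5 percent level give roughly a 64 percent chance of at least one false rejection, which is why restricting formal tests to a small number of confirmatory outcomes is attractive.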

It is also clearly important to identify the proper unit of analysis in carrying out statistical analyses. Often data are collected at several different levels in a study. For example, one might collect data about individual soldiers (especially casualty status), platoons, companies, etc. For many outcome measures, the data about individual soldiers will not be independent, because they share the same assignment. This has important implications for data analysis in that most statistical methods require independent observations. This point is discussed in Chapter 4 in the context of study design and is revisited below in discussing data analysis specifics for the IBCT/Stryker IOT.
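
The consequence of choosing the wrong unit of analysis can be sketched with a small example. The data below are invented for illustration (four missions of ten soldiers each), and the function names are hypothetical. The sketch compares the standard error obtained by treating each mission as one observation with the standard error obtained by pooling soldiers as if they were independent.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: one (mission_id, casualty indicator) pair per soldier.
RECORDS = (
    [("M1", 0)] * 10
    + [("M2", 1)] * 5 + [("M2", 0)] * 5
    + [("M3", 1)] * 1 + [("M3", 0)] * 9
    + [("M4", 1)] * 6 + [("M4", 0)] * 4
)


def mission_rates(records):
    """Return one casualty rate per mission (the mission is the unit of analysis)."""
    counts = defaultdict(lambda: [0, 0])  # mission -> [casualties, soldiers]
    for mission, casualty in records:
        counts[mission][0] += casualty
        counts[mission][1] += 1
    return {m: c / n for m, (c, n) in counts.items()}


def mission_level_summary(records):
    """Mean casualty rate and its standard error, using missions as observations."""
    rates = list(mission_rates(records).values())
    return mean(rates), stdev(rates) / sqrt(len(rates))


def pooled_soldier_summary(records):
    """Pooled rate and binomial standard error, wrongly treating soldiers as independent."""
    outcomes = [casualty for _, casualty in records]
    p = mean(outcomes)
    return p, sqrt(p * (1 - p) / len(outcomes))


if __name__ == "__main__":
    print("mission-level (mean, SE):", mission_level_summary(RECORDS))
    print("pooled soldiers (mean, SE):", pooled_soldier_summary(RECORDS))
```

In this invented data set both approaches give the same overall rate, but the pooled calculation reports a much smaller standard error because it ignores the mission-to-mission variation that soldiers on the same mission share.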

Exploratory Analyses

Conclusions obtained from the IOT should not stop with the confirmation that the new system performs better than the baseline. Operational tests also provide an opportunity to learn about the operating characteristics of new systems/forces. Exploratory analyses facilitate learning by making use of graphical techniques to examine the large number of variables and scenarios. For the IBCT/Stryker IOT, it is of interest to determine the factors (mission intensity, environment, mission type, and force) that impact IBCT/Stryker and the effects of these factors. Given the large number of factors and the many outcome measures, the importance of the exploratory phase of the data analysis should not be underestimated.

In fact, it is not even correct to assume (as has been done in this chapter) that formal confirmatory tests will be done prior to exploratory data analysis. Examination of data, especially using graphs, can allow investigators to determine whether the assumptions required for formal statistical procedures are satisfied and identify incorrect or suspect observations. This ensures that appropriate methodology is used in the important confirmatory analyses. The remainder of this section assumes that this important part of exploratory analysis has been carried out prior to the use of formal statistical tests and procedures. The focus here is on another crucial use of exploratory methods, namely, to identify data patterns that may suggest previously unseen advantages or disadvantages for one force or the other.

Tukey (1977) and Chambers et al. (1983) describe an extensive collection of tools and examples for using graphical methods in exploratory data analysis. These methods provide a mechanism for looking at the data to identify interesting results and patterns that provide insight about the system under study. Graphs displaying a single outcome measure against a variety of factors can identify subsets of the design space (i.e., combinations of factors) for which the improvement provided by a new system is noticeably high or low. Such graphs can also identify data collection or recording errors and unexpected aspects of system performance.

Another type of graphical display presents several measures in a single graph (for example, parallel box plots for the different measures or the same measures for different groups). Such graphs can identify sets of outcome measures that show the same pattern of responses to the factors, and so can help confirm either that these measures are all correlated with mission success as expected, or may identify new combinations of measures worthy of consideration. When an exploratory analysis of many independent measures shows results consistent with a priori expectations but not statistically significant, these results might in combination reinforce one another if they could all be attributed to the same underlying cause.

It should be pointed out that exploratory analysis can include formal multivariate statistical methods, such as principal components analysis, to determine which measures appear to correlate highly across mission scenarios (see, for example, Johnson and Wichern, 1992). One might identify combinations of measures that appear to correlate well with the ratings of SMEs, in this way providing a form of objective confirmation of the implicit combination of information done by the experts.

Reliability and Maintainability

The general comments above regarding confirmatory and exploratory analysis apply to all types of outcome measures, including those associated with reliability and maintainability, although the actual statistical techniques used may vary. For example, the use of exponential or Weibull data models is common in reliability work, while normal data models are often dominant in other fields. Meeker and Escobar (1998) provide an excellent discussion of statistical methods for reliability.

A key aspect of complex systems like Stryker that impacts reliability, availability, and maintainability data analysis is the large number of failure modes that affect reliability and availability (discussed also in Chapter 3). These failure modes can be expected to have different behavior. Failure modes due to wear would have increasing hazard over time, whereas other modes would have decreasing hazard over time (as defects are fixed). Rather than using statistical models to directly model system-wide failures, each of the major failure modes should be modeled. Inferences about system-wide reliability would then be obtained by combining information from the different modes.
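
One way to picture the mode-by-mode approach is the following sketch, which assigns each failure mode its own Weibull model and combines them into a system-wide reliability. Everything specific here is an assumption made for illustration: the two mode names, the shape and scale values, the use of a Weibull form for both modes, and the treatment of the modes as independent so that the system survives only if no mode has failed. A Weibull shape above 1 gives an increasing hazard (as with wear), and a shape below 1 gives a decreasing hazard.

```python
from math import exp

# Hypothetical failure modes: name -> (Weibull shape, Weibull scale in operating hours).
FAILURE_MODES = {
    "wear": (2.0, 1000.0),           # shape > 1: hazard increases with time
    "early_defects": (0.5, 5000.0),  # shape < 1: hazard decreases with time
}


def weibull_reliability(t, shape, scale):
    """Probability that a single failure mode has not occurred by time t."""
    return exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a single mode at time t (requires t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def system_reliability(t, modes):
    """Combine modes, assuming they act independently (an illustrative assumption)."""
    reliability = 1.0
    for shape, scale in modes.values():
        reliability *= weibull_reliability(t, shape, scale)
    return reliability


if __name__ == "__main__":
    for hours in (100.0, 500.0):
        print(hours, "hours: system reliability =",
              round(system_reliability(hours, FAILURE_MODES), 4))
```

Modeling the modes separately keeps their opposite hazard trends visible; a single model fitted to all failures together could mask both.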

Thinking about exploratory analysis for reliability and maintainability data raises important issues about data collection. Data regarding the reliability of a vehicle or system should be collected from the start of operations and tracked through the lifetime of the vehicle, including training uses of the vehicle, operational tests, and ultimately operational use. It is a challenge to collect data in this way and maintain it in a common database, but the ability to do so has important ramifications for reliability modeling. It is also important to keep maintenance records, so that the times between maintenance and failures are available.

Modeling and Simulation

Evaluation plans often rely on modeling and simulation to address several aspects of the system being evaluated. Data from the operational test may be needed to run the simulation models that address some issues, but certainly not all; for example, no new data are needed for studying transportability of the system. Information from an operational test may also identify an issue that was not anticipated in pretest simulation work, and this could then be used to refine or improve the simulation models.

In addition, modeling and simulation can be used to better understand operational test results and to extrapolate to larger units. This is done by using data from the operational test to recreate and/or visualize test events. The recreated events may then be further probed via simulation. In addition, data (e.g., on the distributions of events) can be used to run through simulation programs and assess factors likely to be important at the brigade level. Care should be taken to assess how the uncertainty arising from the limited sample size of the IOT affects the scaled-up simulations.

ANALYSIS OF DATA FROM THE IBCT/STRYKER IOT

This section addresses more specifically the analysis of data to be collected from the IBCT/Stryker IOT. Comments here are based primarily on information provided to the panel in various documents (see Chapter 1) and briefings by ATEC that describe the test and evaluation plans for the IBCT/Stryker.

Confirmatory Analysis

ATEC has provided us with detailed plans describing the intended analysis of the SME scores of mission outcomes and mission casualty rates. These plans are discussed here.

The discussion of general principles in the preceding section comments on the importance of defining the appropriate unit for data analysis. The ATEC-designed evaluation consists basically of 36 missions for the Stryker-equipped force and 36 missions for the baseline force (and the 6 additional missions in the ATEC design reserved for special studies). These missions are defined by a mix of factors, including mission type (raid, perimeter defense, area presence), mission intensity (high, medium, low), location (rural, urban), and company pair (B, C). The planned analysis of SME mission scores uses the mission as the basic unit. This seems reasonable, although it may be possible to carry out some data analysis using company-level or platoon-level data or using events within missions (as described in Chapter 4). The planned analysis of casualty rates appears to work with the individual soldier as the unit of analysis. In the panel's view this is incorrect because there is sure to be dependence among the outcomes for different soldiers. Therefore, a single casualty rate should be computed for each mission (or for other units that might be deemed to yield independent information) and these should be analyzed in the manner currently planned for the SME scores.

Several important data issues should be considered by ATEC analysts. These are primarily related to the SME scores. Confirmatory analyses are often based on the assumptions that there is a continuous or at least ordered categorical measurement scale (although they are often done with Poisson or binomial data) and that the measurements on that scale are subject to measurement error that has constant variance (independent of the measured value). The SME scores provide an ordinal scale such that a mission success score of 8 is better than a score of 7, which is better than a score of 6. It is not clear that the scale can be considered an interval scale in which the difference between an 8 and 7 and between a 7 and 6 are the same. In fact, anecdotal evidence was presented to the panel suggesting that scores 5 through 8 are viewed as successes, and scores 1 through 4 are viewed as failures, which would imply a large gap between 4 and 5. One might also expect differences in the level of variation observed at different points along the scale, for two reasons. First, data values near either end of a scale (e.g., 1 or 8 in the present case) tend to have less measurement variation than those in the middle of the scale.
One way to argue this is to note that all observers are likely to agree on judgments of missions with scores of 7 or 8, while there may be more variation on judgments about missions in the middle of the scoring scale (one expert's 3 might be another's 5). Second, the missions are of differing length and complexity. It is quite likely that the scores of longer missions may have more variability than those of shorter missions. Casualty rates, as proportions, are also likely to exhibit nonconstant variance. There is less variation in a low casualty rate (or an extremely high one) and more variation for a casualty rate away from the extremes. Transformations of SME scores or casualty rates should be considered if nonconstant variance is determined to be a problem.

The intended ATEC analysis focuses on the difference between IBCT/Stryker force outcomes and baseline force outcomes for the 36 missions. By working with differences, the main effects of the various factors are eliminated, providing for more precise measurement of system effectiveness. Note that variation due to interactions, that is to say variation in the benefits provided by IBCT/Stryker over different scenarios, must be addressed through a statistical model. The appropriate analysis, which appears to be part of ATEC plans, is a linear model that relates the difference scores (that is, the difference between the IBCT and baseline performance measures on the same mission) to the effects of the various factors. The estimated residual variance from such a model provides the best estimate of the amount of variation in outcome that would be expected if missions were repeated under the same conditions. This is not the same as simply computing the variance of the 36 differences, as that variance would be inflated by the degree to which the IBCT/Stryker advantage varies across scenarios. The model would be likely to be of the form

    D_i = difference score for mission i
        = overall mean + mission type effect + mission intensity effect + location effect + company effect + other desired interactions + error

The estimated overall mean is the average improvement afforded by IBCT/Stryker relative to the baseline. The null hypothesis of no difference (overall mean = 0) would be tested using traditional methods. Additional parameters measure the degree to which IBCT/Stryker improvement varies by mission type, mission intensity, location, company, etc. These additional parameters can be tested for significance or, as suggested above, estimates for the various factor effects can be reported along with estimates of their precision to aid in the judgment of practically significant results. This same basic model can be applied to other continuous measures, including casualty rate, subject to earlier concerns about homogeneity of variance.
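
The distinction between the residual variance from the model and the raw variance of the differences can be seen in a deliberately simplified sketch. It uses invented difference scores, a single factor (mission intensity) instead of the full set of factors, and a balanced layout, in which case the least-squares estimate of the overall mean is the grand mean and each effect is a level mean minus the grand mean. The function name and data are hypothetical and do not represent the ATEC analysis.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical difference scores (IBCT/Stryker minus baseline), grouped by intensity.
DIFFERENCES = {
    "high": [2.0, 3.0, 2.0, 3.0],
    "medium": [1.0, 2.0, 1.0, 2.0],
    "low": [0.0, 1.0, 0.0, 1.0],
}


def fit_one_factor_model(groups):
    """Fit D_i = overall mean + intensity effect + error for a balanced layout."""
    all_values = [d for values in groups.values() for d in values]
    overall = mean(all_values)
    effects = {level: mean(values) - overall for level, values in groups.items()}
    # Residuals are deviations from the fitted value (the level mean).
    rss = sum((d - mean(values)) ** 2 for values in groups.values() for d in values)
    residual_df = len(all_values) - len(groups)
    residual_variance = rss / residual_df
    se_overall = sqrt(residual_variance / len(all_values))
    return overall, effects, residual_variance, se_overall


if __name__ == "__main__":
    overall, effects, resid_var, se = fit_one_factor_model(DIFFERENCES)
    raw_var = variance([d for values in DIFFERENCES.values() for d in values])
    print("overall mean improvement:", overall, "standard error:", round(se, 3))
    print("intensity effects:", effects)
    print("residual variance:", round(resid_var, 3), "raw variance:", round(raw_var, 3))
```

In this invented example the raw variance of the differences is three times the residual variance, because the raw figure also absorbs the way the advantage changes with intensity. A confidence interval for the overall mean would multiply the reported standard error by the appropriate t quantile for the residual degrees of freedom.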

This discussion ignores the six additional missions for each force. These can also be included and would provide additional degrees of freedom and improved error variance estimates.

Exploratory Analysis

It is anticipated that IBCT/Stryker will outperform the baseline. Assuming that result is obtained, the focus will shift to determining under which scenarios Stryker helps most and why. This is likely to be determined by careful analysis of the many measures and scenarios. In particular, it seems valuable to examine the IBCT unit scores, baseline unit scores, and differences graphically to identify any unusual values or scenarios. Such graphical displays will complement the results of the confirmatory analyses described above.

In addition, the exploratory analysis provides an opportunity to consider the wide range of measures available. Thus, in addition to SME scores of mission success, other measures (as described in Chapter 3) could be used. By looking at graphs showing the relationship of mission outcome and factors like intensity simultaneously for multiple outcomes, it should be possible to learn more about IBCT/Stryker's strengths and vulnerabilities. However, the real significance of any such insights would need to be confirmed by additional testing.

Reliability and Maintainability

Reliability and maintainability analyses are likely to be focused on assessing the degree to which Stryker meets the design specifications. Traditional reliability methods will be useful in this regard. The general principles discussed earlier concerning separate modeling for different failure modes are important. It is also important to explore the reliability data across vehicle types to identify groups of vehicles that may share common reliability profiles or, conversely, those with unique reliability problems.

Modeling and Simulation

ATEC has provided little detail about how the IBCT/Stryker IOT data might be used in post-IOT simulations, so we do not discuss this issue. This leaves open the question of whether and how operational test data can be extrapolated to yield information about larger scale operations.
SUMMARY

The IBCT/Stryker IOT is designed to serve two major purposes: (1) confirming that the Stryker-equipped force will outperform the Light Infantry Brigade baseline and estimating the amount by which it does so, and (2) exploring the performance of the IBCT to learn about the performance capabilities and limitations of Stryker. Statistical significance tests are useful in the confirmatory analysis comparing the Stryker-equipped and baseline forces. In general, however, the issues raised by the 1998 NRC panel suggest that more use should be made of estimates and associated measures of precision (or confidence intervals) in addition to significance tests because the former enable the judging of the practical significance of observed effects. There is a great deal to be learned by exploratory analysis of the IOT data, especially using graphical methods. The data may instruct ATEC about the relative advantage of IBCT/Stryker in different scenarios as well as any unusual events during the operational test.

We call attention to several key issues:

1. The IBCT/Stryker IOT involves the collection of a large number of measures intended to address a wide variety of issues. The measures should be used to address relevant issues without being rolled up into overall summaries until necessary.

2. The statistical methods to be used by ATEC are designed for independent study units. In particular, it is not appropriate to compare casualty rates by simply aggregating indicators for each soldier over a set of missions. Casualty rates should be calculated for each mission (or possibly for discrete events of shorter duration) and these used in subsequent data analyses.

3. The IOT provides little vehicle operating data and thus may not be sufficient to address all of the reliability and maintainability concerns of ATEC. This highlights the need for improved data collection regarding vehicle usage. In particular, data should be maintained for each vehicle over that vehicle's entire life, including training, testing, and ultimately field use; data should also be gathered separately for different failure modes.

4. The panel reaffirms the recommendation of the 1998 NRC panel that more use should be made of estimates and associated measures of precision (or confidence intervals) in addition to significance tests, because the former enable the judging of the practical significance of observed effects.
