

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 87
6 Assessing the IBCT/Stryker Operational Test in a Broad Context

In our work reported here, the Panel on the Operational Test Design and Evaluation of the Interim Armored Vehicle has used the report of the Panel on Statistical Methods for Testing and Evaluating Defense Systems (National Research Council, 1998a, referred to in this chapter as NRC 1998) to guide our thinking about evaluating the IBCT/Stryker Initial Operational Test (IOT). Consistent with our charge, we view our work as a case study of how the principles and practices put forward by the previous panel apply to the operational test design and evaluation of IBCT/Stryker. In this context, we have examined the measures, design, and evaluation strategy of IBCT/Stryker in light of the conclusions and recommendations put forward in NRC 1998 with the goal of deriving more general findings of broad applicability in the defense test and evaluation community.

From a practical point of view, it is clear that several of the ideas put forward in NRC 1998 for improvement of the measures and test design cannot be implemented in the IBCT/Stryker IOT due to various constraints, especially time limitations. However, by viewing the Stryker test as an opportunity to gain additional insights into how to do good operational test design and evaluation, our panel hopes to further sharpen and disseminate the ideas contained in NRC 1998. In addition, this perspective will demonstrate that nearly all of the recommendations contained in this report are based on generally accepted principles of test design and evaluation.

Although we note that many of the recommendations contained in NRC 1998 have not been fully acted on by ATEC or by the broader defense test and evaluation community, this is not meant as criticism. The paradigm shift called for in that report could not have been implemented in the short amount of time since it has been available. Instead, our aim is to more clearly communicate the principles and practices contained in NRC 1998 to the broad defense acquisition community, so that the changes suggested will be more widely understood and adopted.

A RECOMMENDED PARADIGM FOR TESTING AND EVALUATION

Operational tests, by necessity, are often large, very complicated, and expensive undertakings. The primary contribution of an operational test to the accumulated evidence about a defense system's operational suitability and effectiveness that exists a priori is that it is the only objective assessment of the interaction between the soldier and the complete system as it will be used in the field. It is well known that a number of failure modes and other considerations that affect a system's performance are best (or even uniquely) exhibited under these conditions. For this reason, Conclusion 2.3 of NRC 1998 states: "operational testing is essential for defense system evaluation."

Operational tests have been put forward as tests that can, in isolation from other sources of information, provide confirmatory statistical "proof" that specific operational requirements have been met.
However, a major finding of NRC 1998 is that, given the test size that is typical of the operational tests of large Acquisition Category I (ACAT I) systems and the heterogeneity of the performance of these systems across environments of use, users, tactics, and doctrine, operational tests cannot, generally speaking, satisfy this role.[1] Instead, the test and evaluation process should be viewed as a continuous process of information collection, analysis, and decision making, starting with information collected from field experience of the baseline and similar systems, and systems with similar or identical components, through contractor testing of the system in question, and then through developmental testing and operational testing (and in some sense continued after fielding forward to field performance).

[1] Conclusion 2.2 of the NRC 1998 report states: "The operational test and evaluation requirement, stated in law, that the Director, Operational Test and Evaluation certify that a system is operationally effective and suitable often cannot be supported solely by the use of standard statistical measures of confidence for complex defense systems with reasonable amounts of testing resources" (p. 33).

Perhaps the most widely used statistical method for supporting decisions made from operational test results is significance testing. Significance testing is flawed in this application because of inadequate test sample size to detect differences of practical importance (see NRC, 1998:88-91), and because it focuses attention inappropriately on a pass/fail decision rather than on learning about the system's performance in a variety of settings. Also, significance testing answers the wrong question (not whether the system's performance satisfies its requirements, but whether the system's performance is inconsistent with failure to meet its requirements), and significance testing fails to balance the risk of accepting a "bad" system against the risk of rejecting a "good" system. Significance tests are designed to detect statistically significant differences from requirements, but they do not address whether any differences that may be detected are practically significant.

The DoD milestone process must be rethought, in order to replace the fundamental role that significance testing currently plays in the pass/fail decision with a fuller exploration of the consequences of the various possible decisions. Significance tests and confidence intervals[2] provide useful information, but they should be augmented by other numeric and analytic assessments using all information available, especially from other tests and trials.
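The sample-size concern raised above can be made concrete with a small calculation. The following is a minimal sketch, not a method taken from the report: it assumes each test replication is an independent success/failure trial and uses an exact one-sided binomial test. The success rates (0.70 as the level at which the system fails its requirement, 0.80 as the true performance) and the replication counts are hypothetical values chosen only for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest success count k with P(X >= k | p0) <= alpha.

    Returns n + 1 when no count can reject at level alpha.
    """
    for k in range(n + 1):
        if upper_tail(k, n, p0) <= alpha:
            return k
    return n + 1

def power(n, p0, p1, alpha=0.05):
    """Chance of rejecting 'success rate is only p0' when the true rate is p1."""
    k = critical_value(n, p0, alpha)
    return upper_tail(k, n, p1) if k <= n else 0.0

# Hypothetical values: p0 = 0.70 (fails requirement), p1 = 0.80 (true rate).
for n in (10, 20, 50, 100, 200):
    print(f"replications={n:4d}  power={power(n, 0.70, 0.80):.3f}")
```

Under these assumed numbers, a test with only a handful of replications has a small chance of demonstrating a ten-point improvement, which is the kind of limitation the text describes for tests of realistic size.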
An effective formal decision-making framework could use, for example, significance testing augmented by assessments of the likelihood of various hypotheses about the performance of the system under test (and the baseline system), as well as the costs of making various decisions based on whether the various alternatives are true. Moreover, designs used in operational testing are not usually constructed to inform the actual decisions that operational test is intended to support. For example, if a new system is supposed to outperform a baseline in specific types of environments, the test should provide sufficient test sample in those environments to determine whether the advantages have been realized, if necessary at the cost of test sample in environments where the system is only supposed to equal the baseline.

[2] Producing confidence intervals for sophisticated estimates often requires resampling methods.

Testing the IBCT/Stryker is even more complicated than many ACAT I systems in that it is really a test of a system of systems, not simply a test of what Stryker itself is capable of. It is therefore no surprise that the size of the operational test (i.e., the number of test replications) for IBCT/Stryker will be inadequate to support many significance tests that could be used to base decisions on whether Stryker should be passed to full-rate production. Such decisions therefore need to be supplemented with information from the other sources, mentioned above.

This argument about the role of significance testing is even more important for systems such as the Stryker that are placed into operational testing when the system's performance (much less its physical characteristics) has not matured, since then the test size needs to be larger to achieve reasonable power levels. When a fully mature system is placed into operational testing, the test is more of a confirmatory exercise, a shakedown test, since it is essentially understood that the requirements are very likely to be met, and the test can then focus on achieving a greater understanding of how the system performs in various environments.

Recommendation 3.3 of NRC 1998 argued strongly that information should be used and appropriately combined from all phases of system development and testing, and that this information needs to be properly archived to facilitate retrieval and use. In the case of the IBCT/Stryker IOT, it is clear that this has not occurred, as evidenced by the difficulty ATEC has had in accessing relevant information from contractor testing and, indeed, operational experiences from allies using predecessor systems (e.g., the Canadian LAV-III).
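The decision-making framework described earlier in this section (likelihoods of hypotheses about performance combined with the costs of each decision) can be sketched as a small expected-cost calculation. This is an illustrative sketch under stated assumptions, not a procedure given in the report: it assumes only two hypotheses about the system's success rate, a prior that stands in for information from earlier testing, independent success/failure replications, and made-up relative costs. Every name and number below is hypothetical.

```python
from math import comb

def binom_likelihood(successes, trials, p):
    """Probability of the observed success count if the true rate is p."""
    return comb(trials, successes) * p**successes * (1 - p)**(trials - successes)

def posterior(hypotheses, successes, trials):
    """Update prior weights on each hypothesis using the test results.

    hypotheses maps a name to (prior probability, assumed success rate).
    """
    weights = {name: prior * binom_likelihood(successes, trials, rate)
               for name, (prior, rate) in hypotheses.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def expected_costs(post, costs):
    """Expected cost of each decision, averaging over the hypotheses."""
    return {decision: sum(post[h] * cost_by_h[h] for h in post)
            for decision, cost_by_h in costs.items()}

def best_decision(hypotheses, costs, successes, trials):
    """Return the lowest-expected-cost decision with its supporting numbers."""
    post = posterior(hypotheses, successes, trials)
    cost_by_decision = expected_costs(post, costs)
    choice = min(cost_by_decision, key=cost_by_decision.get)
    return choice, post, cost_by_decision

# Hypothetical inputs: prior from earlier testing and relative decision costs.
HYPOTHESES = {"meets": (0.6, 0.85), "falls_short": (0.4, 0.65)}
COSTS = {
    "accept": {"meets": 0.0, "falls_short": 10.0},  # fielding a "bad" system
    "reject": {"meets": 4.0, "falls_short": 0.0},   # rejecting a "good" system
}

choice, post, cost_by_decision = best_decision(HYPOTHESES, COSTS, successes=15, trials=20)
print(choice, post, cost_by_decision)
```

Unlike a pass/fail significance test, this kind of calculation makes both risks explicit: changing the assumed cost of accepting a "bad" system relative to rejecting a "good" one changes the decision for the same data.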
HOW IBCT/STRYKER IOT CONFORMS WITH RECOMMENDATIONS FROM THE NRC 1998 REPORT

Preliminaries to Testing

The new paradigm articulated in NRC 1998 argues that defense systems should not enter into operational testing unless the system design is relatively mature. This maturation should be expedited through previous testing that incorporates various aspects of operational realism in addition to the usual developmental testing. The role, then, for operational testing would be to confirm the results from this earlier testing and to learn more about how to operate the system in different environments and what the system's limitations are.

The panel believes that in some important respects Stryker is not likely to be fully ready for operational testing when that is scheduled to begin. This is because:

1. many of the vehicle types have not yet been examined for their suitability, having been driven only a fraction of the required mean miles to failure (1,000 miles);
2. the use of add-on armor has not been adequately tested prior to the operational test;
3. it is still not clear how IBCT/Stryker needs to be used in various types of scenarios, given the incomplete development of its tactics and doctrine; and
4. the GFE systems providing situation awareness have not been sufficiently tested to guarantee that the software has adequate reliability.

The role of operational test as a confirmatory exercise has therefore not been realized for IBCT/Stryker. This does not necessarily mean that the IOT should be postponed, since the decision to go to operational test is based on a number of additional considerations. However, it does mean that the operational test is being run with some additional complications that could reduce its effectiveness.

Besides system maturity, another prerequisite for an operational test is a full understanding of the factors that affect system performance. While ATEC clearly understands the most crucial factors that will contribute to variation in system performance (intensity, urban/rural, day/night, terrain, and mission type), it is not clear whether they have carried out a systematic test planning exercise, including (quoting from NRC, 1998a:64-65): "(1) defining the purpose of the test; . . . (4) using previous information to compare variation within and across environments, and to understand system performance as a function of test factors; . . .
and (6) use of small-scale screening or guiding tests for collecting information on test planning."

Also, as mentioned in Chapter 4, it is not yet clear that the test design and the subsequent test analysis have been linked. For example, if performance in a specific environment is key to the evaluation of IBCT/Stryker, more test replications will need to be allocated to that environment. In addition, while the main factors affecting performance have been identified, factors such as season, day versus night, and learning effects were not, at least initially, explicitly controlled for. This issue was raised in the panel's letter report (Appendix A).

Test Design

This section discusses two issues relevant to test design: the basic test design and the allocation of test replications to design cells. First, ATEC has decided to use a balanced design to give it the most flexibility in estimating the variety of main effects of interest. As a result, the effects of terrain, intensity, mission, and scenario on the performance of these systems will be jointly estimated quite well, given the test sample size. However, at this point in system development, ATEC does not appear to know which of these factors matter more and/or less, or where the major uncertainties lie. Thus, it may be that there is only a minority of environments in which IBCT/Stryker offers distinct advantages, in which case those environments could be more thoroughly tested to achieve a better understanding of its advantages in those situations. Specific questions of interest, such as the value of situation awareness in explaining the advantage of IBCT/Stryker, can be addressed by designing and running small side experiments (which might also be addressed prior to a final operational test). This last suggestion is based on Recommendation 3.4 of the NRC 1998 report (p. 49): "All services should explore the adoption of the use of small-scale testing similar to the Army concept of force development test and experimentation."

Modeling and simulation are discussed in NRC 1998 as an important tool in test planning. ATEC should take better advantage of information from modeling and simulation, as well as from developmental testing, that could be very useful for the IBCT/Stryker test planning. This includes information as to when the benefits of the IBCT/Stryker over the baseline are likely to be important but not well established.
Finally, in designing a test, the goals of the test have to be kept in mind. If the goal of an operational test is to learn about system capabilities, then test replications should be focused on those environments in which the most can be learned about how the system's capabilities provide advantages. For example, if IBCT/Stryker is intended primarily as an urban system, more replications should be allocated to urban environments than to rural ones. We understand ATEC's view that its operational test designs must allocate, to the extent possible, replications to environments in accordance with the allocation of expected field use, as presented in the OMS/MP. In our judgment the OMS/MP need only refer to the operational evaluation, and certainly once estimates of test performance in each environment are derived, they can be reweighted to correspond to summary measures defined by the OMS/MP (which may still be criticized for focusing too much on such summary measures in comparison to more detailed assessments).

Furthermore, there are substantial advantages obtained with respect to designing operational tests by separating the two goals of confirming that various requirements have been met and of learning as much as possible about the capabilities and possible deficiencies of the system before going to full-rate production. That separation allows the designs for these two separate tests to target these two distinct objectives.

Given the recent emphasis in DoD acquisition on spiral development, it is interesting to speculate about how staged testing might be incorporated into this management concept. One possibility is a test strategy in which the learning phase makes use of early prototypes of the subsequent stage of development.

System Suitability

Recommendation 7.1 of NRC 1998 states (p. 105):

    The Department of Defense and the military services should give increased attention to their reliability, availability, and maintainability data collection and analysis procedures because deficiencies continue to be responsible for many of the current field problems and concerns about military readiness.

While criticizing developmental and operational test design as being too focused on evaluation of system effectiveness at the expense of evaluation of system suitability, this recommendation is not meant to suggest that operational tests should be strongly geared toward estimation of system suitability, since these large-scale exercises cannot be expected to run long enough to estimate fatigue life, etc.
However, developmental testing can give measurement of system (operational) suitability a greater priority and can be structured to provide its test events with greater operational realism. Use of developmental test events with greater operational realism also should facilitate development of models for combining information, the topic of this panel's next report.

The NRC 1998 report also criticized the test and evaluation community for relying too heavily on the assumption that the interarrival time for initial failures follows an exponential distribution. The requirement for Stryker of 1,000 mean miles between failures makes sense as a relevant measure only if ATEC is relying on the assumption of exponentially distributed times to failure. Given that Stryker, being essentially a mechanical system, will not have exponentially distributed times to failure, due to wearout, the actual distribution of waiting times to failure needs to be estimated and presented to decision makers so that they understand its range of performance. Along the same lines, Stryker will, in all probability, be repaired during the operational test and returned to action. Understanding the variation in suitability between a repaired and a new system should be an important part of the operational test.

Testing of Software-Intensive Systems

The panel has been told that obtaining information about the performance of GFE is not a priority of the IOT: GFE will be assumed to have well-estimated performance parameters, so the test should focus on the non-GFE components of Stryker. One of the components of Stryker's GFE is the software providing Stryker with situation awareness. A primary assumption underlying the argument for the development of Stryker was that the increased vulnerability of IBCT/Stryker (due to its reduced armor) is offset by the benefits gained from the enhanced firepower and defensive positions that Stryker will have due to its greater awareness of the placement of friendly and enemy forces. There is some evidence (FBCB2 test results) that this situation awareness capability is not fully mature at this date. It would therefore not be surprising if newly developed, complex software will suffer reliability or other performance problems that will not be fully resolved prior to the start of operational testing.
NRC 1998 details procedures that need to be more widely adopted for the development and testing of software-intensive systems, including usage-based testing. Further, Recommendation 8.4 of that report urges that software failures in the field should be collected and analyzed. Making use of the information on situation awareness collected during training exercises and in contractor and developmental testing in the operational test design would have helped in the more comprehensive assessment of the performance of IBCT/Stryker. For example, allocating test replications to situations in which previous difficulties in situation awareness had been experienced would have been very informative as to whether the system is effective enough to pass to full-rate production.

Greater Access to Statistical Expertise in Operational Test and Evaluation

Stryker, if fully procured, will be a multibillion dollar system. Clearly, the decision on whether to pass Stryker to full-rate production is extremely important. Therefore, the operational test design and evaluation for Stryker needs to be representative of the best possible current practice. The statistical resources allocated to this task were extremely limited. The enlistment of the National Research Council for high-level review of the test design and evaluation plans is commendable. However, this does not substitute for detailed, hands-on, expert attention by a cadre of personnel trained in statistics with "ownership" of the design and subsequent test and evaluation. ATEC should give a high priority to developing a contractual relationship with leading practitioners in the fields of reliability estimation, experimental design, and methods for combining information to help them in future IOTs. (Chapter 10 of NRC 1998 discusses this issue.)

SUMMARY

The role of operational testing as a confirmatory exercise evaluating a mature system design has not been realized for IBCT/Stryker. This does not necessarily mean that the IOT should be postponed, since the decision to go to operational testing is based on a number of additional considerations. However, it does mean that the operational test is being asked to provide more information than can be expected. The IOT may illuminate potential problems with the IBCT and Stryker, but it may not be able to convincingly demonstrate system effectiveness.

We understand ATEC's view that its operational test designs must allocate, to the extent possible, replications to environments in accordance with the allocation of expected field use, as presented in the OMS/MP.
In the panel's judgment, the OMS/MP need only refer to the operational evaluation, and once estimates of test performance in each environment are derived, they can be reweighted to correspond to summary measures defined by the OMS/MP.

We call attention to a number of key points:

1. Operational tests should not be strongly geared toward estimation of system suitability, since they cannot be expected to run long enough to estimate fatigue life, estimate repair and replacement times, identify failure modes, etc. Therefore, developmental testing should give greater priority to measurement of system (operational) suitability and should be structured to provide its test events with greater operational realism.
2. Since the size of the operational test (i.e., the number of test replications) for IBCT/Stryker will be inadequate to support significance tests leading to a decision on whether Stryker should be passed to full-rate production, ATEC should augment this decision by other numerical and graphical assessments from this IOT and other tests and trials.
3. In general, complex systems should not be forwarded to operational testing, absent strategic considerations, until the system design is relatively mature. Forwarding an immature system to operational test is an expensive way to discover errors that could have been detected in developmental testing, and it reduces the ability of an operational test to carry out its proper function. System maturation should be expedited through previous testing that incorporates various aspects of operational realism in addition to the usual developmental testing.
4. Because it is not yet clear that the test design and the subsequent test analysis have been linked, ATEC should prepare a straw man test evaluation report in advance of test design, as recommended in the panel's October 2002 letter to ATEC (see Appendix A).
5. The goals of the initial operational test need to be more clearly specified. Two important types of goals for operational test are learning about system performance and confirming system performance in comparison to requirements and in comparison to the performance of baseline systems. These two different types of goals argue for different stages of operational test.
Furthermore, to improve test designs that address these different types of goals, information from previous stages of system development needs to be utilized.
6. To achieve needed detailed, hands-on, expert attention by a cadre of statistically trained personnel with "ownership" of the design and subsequent test and evaluation, the Department of Defense and ATEC in particular should give a high priority to developing a contractual relationship with leading practitioners in the fields of reliability estimation, experimental design, and methods for combining information to help them with future IOTs.
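The reweighting mentioned in the Summary (deriving per-environment estimates first, then combining them with weights that reflect expected field use) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the environments, success rates, replication counts, and field-use weights are hypothetical and are not taken from the OMS/MP, and the standard error assumes independent success/failure replications within each environment.

```python
from math import sqrt

def reweighted_estimate(estimates, weights):
    """Weighted average of per-environment estimates (weights are normalized)."""
    if set(estimates) != set(weights):
        raise ValueError("estimates and weights must cover the same environments")
    total = sum(weights.values())
    return sum(estimates[env] * weights[env] / total for env in estimates)

def reweighted_std_error(estimates, replications, weights):
    """Standard error of the weighted average of independent binomial rates."""
    total = sum(weights.values())
    variance = sum(
        (weights[env] / total) ** 2
        * estimates[env] * (1 - estimates[env]) / replications[env]
        for env in estimates
    )
    return sqrt(variance)

# Hypothetical test results: replications concentrated where most can be learned.
SUCCESS_RATE = {"urban": 0.90, "rural": 0.70}
REPLICATIONS = {"urban": 12, "rural": 4}
# Hypothetical expected field-use mix used for the summary measure.
FIELD_USE = {"urban": 0.3, "rural": 0.7}

pooled = reweighted_estimate(SUCCESS_RATE, REPLICATIONS)  # weights by test allocation
summary = reweighted_estimate(SUCCESS_RATE, FIELD_USE)    # weights by field use
se = reweighted_std_error(SUCCESS_RATE, REPLICATIONS, FIELD_USE)
print(f"pooled={pooled:.3f}  field-use weighted={summary:.3f}  std error={se:.3f}")
```

The sketch also shows the trade-off behind such a design: environments that receive few replications but carry a large field-use weight dominate the uncertainty of the summary measure.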