Upgrading Statistical Methods for Testing and Evaluation
Chapter 3 outlined a new paradigm for integrating testing into defense system development. This new paradigm reflects state-of-the-art industrial models and is based on applying statistical principles throughout the system development process. Other improvements to the testing and evaluation process itself could be realized by applying current views of statistical methodology in a more widespread and appropriate way.
Conclusion 4.1: The current practice of statistics in defense testing design and evaluation does not take full advantage of the benefits available from the use of state-of-the-art statistical methodology.
This chapter presents an overview of that methodology so that test planning, design, and evaluation are as effective and efficient as possible; chapters 5-9 discuss these issues in greater detail. The adoption of many current techniques can be accomplished at minimal expense, and some discussion of how this can be accomplished is presented in Chapter 10. These changes are implementable in the short term and do not, generally speaking, require the institution of the new paradigm recommended here, although some of them would be more effective if implemented concurrently.
Detailed recommendations related to test design, test evaluation, design and evaluation for reliability, availability, and maintainability, software test methodology, and use of modeling and simulation are in the chapters that follow; this chapter presents a less technical review of the inadequacy of current statistical practice in defense testing and the benefits to be gained from use of the best current methods and practices. In this chapter we highlight issues in need of immediate attention, identifying areas in which current Department of Defense
(DoD) practice differs substantially from best practice, to the detriment of effective operational test and evaluation.
We focus on operational (rather than developmental) testing, especially for ACAT I systems. However, many of the issues raised and recommendations made apply to developmental (or other forms) of testing, and to systems in other acquisition categories.
KEY ISSUES ILLUSTRATING THE USES OF STATISTICAL METHODS IN OPERATIONAL TESTING AND EVALUATION
Test Planning and Design
Test planning consists of collecting specific information about various characteristics of a system and the anticipated test scenarios and environments and recognizing the implications of this information for test design. Test planning is crucial to a test's success. Test planning comprises several elements (see, e.g., Hahn, 1977; Coleman and Montgomery, 1993).
Defining the Purpose of the Test Operational tests often have multiple objectives, for example: to measure "average" or "typical" performance across relevant scenarios, to identify sources of the most important system flaws and limitations, or to measure system performance in the most demanding scenario. Each of these objectives could be applied to several performance measures. Different objectives and measures can require different tests. Some test designs that are extremely effective for one purpose can be quite ineffective for others; therefore, agreeing on the purpose of the test is necessary for test design. One must also identify those performance measures that are most important (however defined) so that the operational test can be designed to effectively measure them.
Handling Test Factors Test factors include the defining variables for: environments of interest (temperature, terrain, humidity, day/night, etc.), tactics and the use of countermeasures, the training and ability of the users, and the particular prototype used. Clearly, how a system's performance varies across different values of some factors (e.g., in day or night, or against various kinds of enemy tactics) is crucial to an informed decision about procuring the system. Some test factors are under the control of the test planner and some are not, and some test factors are influential in that varying them can cause substantial changes in system performance. Whether a test factor is controllable and whether it is influential may each call for a different approach to its use in testing. A serious problem arises from the failure to consider some influential factors in the test design: such a failure can make the test ineffective, since those factors may vary during the test, causing performance differences.
Specifying Test Constraints Budgets, environmental constraints, and various limitations concerning the scheduling of test participants and test facilities are just a few of the constraints on an operational test. The ability to change the level (or category) of test factors in time for successive test runs may also be constrained. To effectively design a test, one must fully understand the various test constraints.
Using Previous Information to Assess Variation Understanding the degree of variation in performance measures from repeated tests within a single scenario, compared with the variation in measures between scenarios (i.e., the sensitivity of the system to changes in environment, tactics, users, prototypes), is needed to decide how to allocate prototypes to scenarios and the number of replications needed for each scenario. This information is also needed to decide on the maximum number of scenarios to use if estimates of performance for individual scenarios are needed. Collecting information about the sources and degrees of variability is crucial for effective test design.
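As a sketch of the comparison just described, the following Python fragment contrasts within-scenario variance with between-scenario variance. The miss-distance data and scenario names are entirely hypothetical, invented for illustration; nothing here is drawn from an actual test.

```python
import statistics

# Hypothetical data: miss distances (meters) from repeated test runs,
# grouped by scenario.  Comparing the spread within a scenario to the
# spread between scenario means helps decide how many scenarios to cover
# versus how many replications to run in each.
runs_by_scenario = {
    "day/desert":   [4.1, 3.8, 4.5, 4.0],
    "night/desert": [6.9, 7.8, 6.2, 7.1],
    "day/forest":   [5.0, 4.6, 5.3, 4.9],
}

# Average within-scenario variance: run-to-run noise under fixed conditions.
within_var = statistics.mean(
    statistics.variance(runs) for runs in runs_by_scenario.values()
)

# Between-scenario variance: how much the scenario means differ.
scenario_means = [statistics.mean(runs) for runs in runs_by_scenario.values()]
between_var = statistics.variance(scenario_means)

print(f"within-scenario variance:  {within_var:.3f}")
print(f"between-scenario variance: {between_var:.3f}")
```

When, as here, between-scenario variation dominates, covering more scenarios typically buys more information than adding replications within a scenario.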
Establishing Standardized, Consistent Data Recording Procedures Operational tests are, in part, unscripted activities for which data collection is clearly complicated. To use test results from a given system for test design or evaluation of another (related) system and to properly evaluate whether a test result was due to unusual circumstances, test data need to be recorded in a form that is accessible and useful across the military services. Such data are also extremely valuable in improving operational test practice, and in the validation of modeling and simulation when used in conjunction with data on real use, by comparing observed performance with test performance. The standardization of data recording should be done in a manner analogous to that of industry, using ideas from industrial standards (ISO 9000).
Using Preliminary Tests for Test Planning Information, and Running Operational Tests in Stages Effective decisions about operational test design and test size require information that is often system specific, especially with respect to operational characteristics. For example, data needed to determine an operational test's sample size is sometimes not available from other systems or developmental test results. Moreover, issues often arise in the test of a complicated system that are difficult to anticipate.
For these reasons, some form of inexpensive and operationally realistic preliminary testing would be extremely valuable. Such tests would help ensure that operational tests will be as informative, effective, and efficient as possible. Preliminary testing is broader than the Army concept of a force development test and evaluation; it includes the collection of other information concerning test design and test conduct. Additional benefits of preliminary testing are discussed in Chapter 3 (as well as in Part II), in terms of a more continuous assessment of
operational system performance. Although there are aspects of operational testing that do not easily scale down to small tests (e.g., the number of users and systems needed for the test of a radar system), we are convinced that the use of preliminary testing has not been adequately explored.
The recommended continuous process of information gathering with respect to the operational performance of a weapon system under development could be accomplished in many ways. Operationally realistic, small, focused tests could be conducted earlier in system development. Developmental testing could incorporate aspects of operational realism. Training exercises could be organized to make objective performance assessments. Although some or all of these may not be feasible for individual systems, some of these methods will be feasible and should be used for most systems.
Additional Considerations Five other issues need to be addressed before designing a test:
the proper experimental range for the controllable variables;
the statistical relationships between performance (effectiveness) measures and test factors;
the desired degree of precision of the estimates;
the existence of previous benchmarks of performance; and
the statistical techniques that will be required to analyze the data.
Comprehensive test planning, including the elements described above, should be a routine, early step in the design of operational tests. (For more details on test planning, see Chapter 5.) Those in charge of an operational test must work with appropriate statistical experts to understand how the above information can be used in the operational test's experimental design. One useful technique, suggested by J. Stuart Hunter (personal communication) as a means to address many of the issues of test planning, is to try to guess the test results a priori. This is a quick way to communicate objectives of the test, test factors, and the expected between- and within-scenario variability. It will also be useful in estimating the statistical power of the test. Modeling and simulation could be very useful in organizing this information (see Chapter 9). The above components of test planning would be extremely helpful for the hypothetical major of Chapter 1 as a checklist when designing a complicated operational test and would assist the major in communicating with an experimental design expert. Templates for this purpose exist in the statistical literature and could be modified to be more specific to the operational testing of defense systems.
Recommendation 4.1: Comprehensive operational test planning should be required of all ACAT I operational tests, and the results should be appropriately summarized in the Test and Evaluation Master Plan. The
following information should be included: (1) the purpose and scope of the test, (2) explicit identification of the test factors and methods for handling them, (3) definition of the test environment and specification of constraints, (4) comparison of variation within and across test scenarios, and (5) specified, consistent data recording procedures. All of these steps should be documented in a set of standardized formats that are consistent across the military services. The elements of each set of formats would be designed for a specific type of system. The feasibility of preliminary testing should be fully explored by service test agencies as part of operational test planning.
Test Analysis and Reporting
Reporting Estimates of Uncertainty
In order to fully use the information collected in an operational test, it is important that all reported test results—typically averages and percentages—be accompanied by an assessment of their uncertainty. This information would alert decision makers to the variability of performance estimates, which in turn can help to determine the risks and benefits of proceeding to full-rate production on the basis of the results of the operational test. Such reporting should be in the form of confidence intervals, at a typical level of 90 or 95 percent, for each measure of performance or effectiveness. For example, instead of reporting only that a missile system obtained an estimated average hit rate of 0.85, the report would also state that the 95 percent confidence interval for the hit rate extends from 0.65 to 0.92. Presenting the test results in this way helps to raise important questions, such as: If the true hit rate could be as low as 0.65, would one decide to go ahead with procurement, or to perform more testing in the hope of ruling out a hit rate that low?
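Such an interval is straightforward to compute. The sketch below uses the Wilson score interval for a binomial proportion, applied to a hypothetical test of 17 hits in 20 firings (a 0.85 estimated hit rate); the resulting interval is illustrative and not the one quoted above.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives an approximate 95 percent interval)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical example: 17 hits in 20 missile firings.
lo, hi = wilson_interval(17, 20)
print(f"estimated hit rate 0.85, 95% CI ({lo:.2f}, {hi:.2f})")
```

Note how wide the interval is at a test size of 20: precisely the information a bare point estimate of 0.85 conceals.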
If significance testing for major measures of performance or effectiveness is used to decide whether to proceed to full-rate production (a method criticized in Chapter 6), the probability of "passing" and "failing" an operational test given that the true system performance is at various levels (the "operating characteristics" of the test) should also be provided to the decision makers. This should be done for several hypothesized performance levels. Four levels that would be particularly informative would be (1) at a level higher than the requirement, (2) at a level equal to the requirement, (3) at a minimally acceptable level, and (4) at a clearly unacceptable level. Along with information on the costs of further testing and the consequences of incorrect decisions about further system development, these probabilities would provide valuable information to decision makers about the risks involved in deciding to pass a system to full-rate production.
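The operating characteristics described above can be computed directly when the pass rule is a simple count. The sketch below assumes a hypothetical rule (pass if at least 17 hits are observed in 20 trials) and four assumed performance levels; both the rule and the levels are invented for illustration.

```python
from math import comb

def prob_pass(true_rate: float, n: int = 20, required_hits: int = 17) -> float:
    """Probability the system 'passes' (at least required_hits successes
    in n trials) when its true hit rate is true_rate."""
    return sum(
        comb(n, k) * true_rate**k * (1 - true_rate) ** (n - k)
        for k in range(required_hits, n + 1)
    )

# Hypothetical levels: above the requirement, at the requirement,
# minimally acceptable, and clearly unacceptable.
for level in (0.95, 0.85, 0.75, 0.65):
    print(f"true hit rate {level:.2f}: P(pass) = {prob_pass(level):.3f}")
```

A table of this kind shows decision makers, for each plausible true performance level, the risk that the test verdict will be misleading.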
The evaluation report should also include the uncertainty of results for estimates for each important individual scenario. For example, if a test includes two
replications at a scenario of great interest, the confidence interval of the test results that are specific to that scenario should be reported. If an evaluation uses modeling and simulation, results from an analysis of the variability due to model misspecification, and its effect on the simulation results, should also be reported, providing important information about the reliability of input from the simulation model. (For more details on test evaluation, see Chapter 6.)
Recommendation 4.2: All estimates of the performance of a system from operational test should be accompanied by statements of uncertainty through use of confidence intervals. If significance testing is used, the operating characteristics of the test (along with the costs of additional testing and the consequences of incorrect decisions) should also be reported. Estimates of uncertainty for performance in important individual scenarios should also be provided, as should information about variability due to model misspecification and its effect on simulation results.
Combining Information from All Appropriate Sources for Test Design and Evaluation
Sources of information available on the performance of a defense system under development, before operational testing, include the test results and field use of the system that is intended to be replaced, the performance of similar components on other systems currently in use, the results of developmental tests, data from possibly less controlled situations such as training exercises or contractor test results, and early operational assessments or the preliminary testing suggested above.
Test designers need to make use of all of the available information about system performance in order to make an operational test as effective and efficient as possible. The above sources can be used to provide information on: the variability of system performance across replications within a single scenario; the sensitivity of the performance of the system to particular changes in the environment, tactics, countermeasures, etc.; the variability of system performance across prototypes; and components of the system or system design issues that might need focused attention due to perceived problems. This information can also help identify whether modeling or simulation is likely to be effective in various aspects of operational testing, by indicating whether it was effective in testing the baseline or related systems. Information about these characteristics of system performance is not now uniformly collected and used for test design. For example, for the Longbow Apache, the test design described in Appendix B calls for only a modest increase in the number of replications in nighttime scenarios over the number in daytime scenarios (320 versus 256). However, it is likely that the variability of the difference in performance between the Longbow Apache
and the baseline Apache would be much greater in nighttime than in daytime scenarios, suggesting that a more efficient test design would have been to use relatively more replications of the Longbow during nighttime. Given the importance of test design, especially for ACAT I systems, this information needs to be collected and used.
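The allocation argument can be made concrete with a small sketch. Assuming purely hypothetical standard deviations for the performance difference in daytime and nighttime scenarios, a Neyman-style allocation assigns replications in proportion to each scenario's standard deviation rather than splitting by round numbers:

```python
# Sketch of variance-based (Neyman-style) allocation.  The standard
# deviations below are assumed for illustration, not taken from any test.
scenario_sd = {"daytime": 4.0, "nighttime": 9.0}
total_replications = 576  # e.g., the 256 + 320 of the design discussed above

total_sd = sum(scenario_sd.values())
allocation = {
    name: round(total_replications * sd / total_sd)
    for name, sd in scenario_sd.items()
}
print(allocation)  # the noisier nighttime scenarios receive the larger share
```

Under these assumed standard deviations, nighttime scenarios would receive roughly twice the replications of daytime scenarios, rather than the modest increase described above.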
Test evaluators can also make important, productive use of this auxiliary information, particularly when measuring system suitability. There is an understandable concern about combining operational test information with information from tests either on different systems or from nonoperational settings (such as developmental tests). Combining such information for test evaluation will sometimes be inappropriate. However, there are statistical methods to identify when various forms of pooling or weighting data are appropriate, and there are statistical methods that can account for the difference in test circumstances that might be directly relevant. The applicability of these methods should be investigated.
Most methods for using auxiliary information, either for test design or for test evaluation, are beyond the technical expertise of our hypothetical major, who will need access to statistical expertise in order to use this auxiliary information effectively. The assumptions that underlie these methods require extensive understanding of the systems and tests in question; therefore, collaboration between those with statistical and system expertise is vital.
Recommendation 4.3: Information from tests and field use of related systems, developmental tests, early operational tests, and training and contractor testing should be examined for possible use in appropriate combination, when defensible, and with operational test results to achieve a more comprehensive assessment of system effectiveness and suitability.
It is important to stress the use of the term appropriate. For example, simple pooling and other use of developmental test data may be inappropriate since the test circumstances can be so different from operational use. (A good example concerning the Javelin is given in U.S. General Accounting Office, 1997:12; see also Table 9.1.)
Clearly, the ability to combine data on system performance from these various sources, including developmental test, is strongly dependent on the establishment of standardized methods of documenting test circumstances. Therefore, the archive put forward in Recommendation 3.3 (and the ideas in the supporting text of Chapter 3) is crucial to support combination of information. Otherwise, it is extremely difficult to understand the operational relevance of the information.
FOCUSED USES OF STATISTICAL METHODS
The operational test of a defense system is an experiment with a key objective of determining whether its performance satisfies stated requirements or exceeds those of a control or benchmark system. Since operational testing, particularly of ACAT I systems, is important and costly, operational tests must be designed so that they are as informative and efficient as possible. They must produce results that permit the best decisions to be made with respect to proceeding to full-rate production.
The statistical field of experimental design has made enormous advances in addressing precisely this broad issue, and specific techniques have been developed to design a test to either maximize the information gained given fixed costs or (equivalently) to minimize costs while providing information that permits a decision with acceptably small risk. Methods relevant to this problem include the use of randomization and controls, various specific designs (including fractional factorial and Plackett-Burman designs), and response surface methods. In addition, several principles have been discovered that broaden the applicability of these specific techniques (when not directly applicable) to test situations. Two general principles that can be applied to a wide variety of testing problems are: (1) test relatively more where variation of what is being measured is greatest, and (2) choose (some) values for test factors that are close to the limits of typical use. (For more details on test design, see Chapter 5.)
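As a small illustration of one such technique, the following sketch constructs a 2^(3-1) fractional factorial design: three two-level factors, coded -1/+1, covered in half the runs of a full factorial by using the generator C = AB. The factor names are placeholders for whatever environments or tactics a real test would vary.

```python
from itertools import product

# Three two-level factors, e.g., day/night, terrain type, countermeasures.
factors = ("A", "B", "C")

# Full factorial: all 8 combinations of three two-level factors.
full_factorial = list(product((-1, 1), repeat=3))

# Half fraction defined by the generator C = A*B: 4 runs instead of 8,
# at the cost of confounding C with the A-B interaction.
half_fraction = [run for run in full_factorial if run[2] == run[0] * run[1]]

for run in half_fraction:
    print(dict(zip(factors, run)))
```

Each factor still appears at both levels equally often in the half fraction, so main effects remain estimable from half the test runs.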
Although some of these techniques and principles are finding their way into DoD's standard operational test design, current practice is still substantially distant from the state of the art. This has resulted in inefficient test designs, wasted resources, and less effective acquisition decision making.
Recommendation 4.4: The service test agencies should examine the applicability of state-of-the-art experimental design techniques and principles and, as appropriate, make greater use of them in the design of operational tests.
Appropriate Models for the Distribution of Failure Times
The distribution of times to first failure or of times between failures of defense systems can depend on the cause of the failure, the age of the system, whether the system had experienced previous failures and then been repaired, the amount of time the system has been in continuous operation, the users, the stress of the environment, and the specific prototype used. It is common (though some exceptions exist) for testers in the defense community to assume that the time to first failure or between failures follows a common exponential distribution, probably because of the simplicity of the methods that result. But the use of this distribution amounts to an implicit assumption that reliability does not depend on the age of the system or the amount of time on test. The assumption of exponentiality is used for test design (to decide how much time on test is sufficient), and it is used for test evaluation (e.g., to provide estimates of the variance of estimates of mean time to failure). When this assumption is completely accepted, it does not make any difference how many different prototypes are used or whether the systems being tested were new systems or systems that had been repaired. (For more details on test design and evaluation of system reliability, see Chapter 7.)
There are systems and components for which the assumption of exponential failure times is appropriate, such as some types of electronic systems. When it is not appropriate, however, there are many alternative failure-time distribution models that will serve better. Use of these alternatives (when valid) can have the following benefits:
smaller test size when assessing reliability for systems whose reliability decreases with use (frequently the case), since the data are more informative than when reliability is not a function of time in use;
test size that is relevant to the process being tested;
test design that explicitly addresses the treatment of repaired and unrepaired systems; and
significance tests that are more valid, since they are based on a more accurate model of the failure-time distribution.
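One simple diagnostic for the exponential assumption rests on the fact that the exponential distribution has a coefficient of variation (standard deviation divided by mean) of exactly 1; a sample coefficient of variation well below 1 hints at wear-out behavior better described by an alternative such as a Weibull model with increasing hazard. The simulation below is only a sketch of this idea, not a recommended test procedure.

```python
import math
import random

random.seed(12345)

# Simulated failure times from a Weibull distribution with shape 2
# (increasing hazard: reliability decreases with use).  The scale of
# 100 hours is arbitrary.
times = [random.weibullvariate(100.0, 2.0) for _ in range(2000)]

mean = sum(times) / len(times)
sd = math.sqrt(sum((t - mean) ** 2 for t in times) / (len(times) - 1))
cv = sd / mean
print(f"sample CV = {cv:.2f}  (an exponential sample would give CV near 1)")
```

An analyst who assumed exponentiality here would badly overstate the dispersion of failure times and, consequently, misjudge how much time on test is sufficient.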
Recommendation 4.5: Service test agencies should examine the use of alternative models to that of the exponential distribution for their applicability to model failure-time distributions in the operational tests of defense systems.
Since ACAT I defense systems now essentially all have a software component and since software reliability is a common trouble spot in recently developed defense systems, the proper testing of software in defense systems has a high priority. At least one service uses the resources available for operational testing of a system with a software component to check individual lines of code. Although this is a useful activity in developmental testing, it is very resource intensive and does not serve the purposes of operational testing.
The proper method for the operational testing of software is usage-based testing, which tests whether the system is fit for its intended use. This testing involves using specific user profiles to develop a statistical distribution of how
the software is going to be used, that is, how commands or inputs are chosen. Inputs are then drawn from these distributions and applied to the software to determine whether it performs correctly. When testing schedules and budgets are tightly constrained, usage-based testing yields the highest practical reliability, because any failures that occur will tend to be high-frequency failures. (The existence of rarer, but particularly crucial, failures can be tested separately.) Companies such as AT&T, IBM, and Texas Instruments are increasingly using such strategies, and there has also been some use in the Army, Navy, and Air Force. The panel is convinced that this is an effective strategy with many benefits and urges DoD to expand its use. (For more details on test design for software-intensive systems, see Chapter 8.)
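A minimal sketch of usage-based test generation follows, assuming a purely hypothetical operational profile; the command names and frequencies are invented for illustration.

```python
import random

# Hypothetical operational profile: each command's expected relative
# frequency in field use.
operational_profile = {
    "track_target":   0.50,
    "update_display": 0.30,
    "change_mode":    0.15,
    "run_diagnostic": 0.05,
}

random.seed(7)
commands = list(operational_profile)
weights = list(operational_profile.values())

# Draw test inputs in proportion to expected usage, so the operations
# users exercise most are the operations tested most.
test_inputs = random.choices(commands, weights=weights, k=1000)

for cmd in commands:
    print(f"{cmd}: {test_inputs.count(cmd)} of 1000")
```

Because the test sequence mirrors the usage distribution, any defects found first are, by construction, the defects users would encounter most often.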
Recommendation 4.6: The Department of Defense should expand its use of a usage-based perspective in the operational testing of software-intensive systems.
Use of Modeling
Modeling and simulation are now being widely considered and occasionally used by DoD to augment operational testing. Given the benefits of decreased cost, enhanced safety, and avoidance of environmental and other constraints, it is natural to explore the extent to which such use can contribute to operational testing. (Note that the use of modeling and simulation cannot, by itself, replace operational tests.) One method for using modeling and simulation to augment operational tests is referred to as model-test-model. A model is developed, an operational exercise is carried out for modeled scenarios, and differences between the observed and the modeled performance are used to modify the model so that it is more in conformance with the observations.
This procedure will likely (but not necessarily) result in a revised model that is valid for situations very close to or the same as the tested scenarios. However, if the model is used to extrapolate performance to untested scenarios that are relatively distinct from those tested, there is a great risk that the predictions will be of very poor quality, possibly even worse than those from the unmodified model, because a small number of operational experiences are used to refit what is typically a very large, complicated model. Such "overfitting" can skew the model to represent only the particular scenarios tested.
The size of the differences between the predictions and the observations is an indication of the quality of the original and the modified model. A recommended method for measuring the degree to which the model has been overfit in
the model-test-model approach is to test the system again, but on new, possibly randomly selected scenarios and then, without refitting the model, measure the average distance between the model's predictions and the observed performance of the system. This approach could be given the name model-test-model-test, and it may be appropriate to use in augmenting some operational tests. (For more details on use of modeling and simulation for test design and evaluation, see Chapter 9.)
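The model-test-model-test idea can be sketched with a deliberately toy "model": an ordinary least-squares line refit to the tested scenarios and then scored on fresh holdout scenarios without refitting. All numbers below are hypothetical; a real simulation model would of course be far larger and more complex.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

# Model-test-model: refit the model to the tested scenarios
# (x = a hypothetical scenario condition, y = observed performance).
tested_x, tested_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
a, b = fit_line(tested_x, tested_y)

# The final "-test": score the refit model on new scenarios,
# deliberately WITHOUT refitting, to measure overfitting.
holdout_x, holdout_y = [5.0, 6.0], [9.7, 12.4]
avg_error = sum(
    abs((a + b * x) - y) for x, y in zip(holdout_x, holdout_y)
) / len(holdout_x)
print(f"average holdout prediction error: {avg_error:.2f}")
```

A refit model that looks accurate on the scenarios used to adjust it but shows a large average error on the holdout scenarios has been overfit, which is exactly what the model-test-model-test step is designed to reveal.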