Statistical Issues in Defense Analysis and Testing: Summary of a Workshop

Pitfalls of Hypothesis Testing

The acquisition process must certify systems as having satisfied certain specifications or performance requirements. While there are no mandated methods for doing this, the approach typically has been a classical hypothesis test. For example, a device may be required to have an expected lifetime of 100 hours. With standard assumptions (e.g., that device lifetimes are well modeled by an exponential distribution), one can determine, for a given sample of units, how long the sample average lifetime must be in order to conclude, at some significance level, that the device's expected lifetime exceeds 100 hours. (In statistical terms, we are thinking of rejecting the null hypothesis that the mean lifetime is less than or equal to 100 hours against the one-sided alternative that the mean lifetime is greater than 100 hours.)
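
To make the 100-hour example concrete, the following sketch computes the threshold that the sample mean must exceed under the exponential assumption. It relies on the fact that the sum of n independent exponential lifetimes follows a gamma (Erlang) distribution. The function names, the bisection routine, and the sample size and 5 percent significance level in the usage lines are illustrative assumptions, not values from the workshop.

```python
import math

def erlang_cdf(x, n):
    """P(S <= x), where S is the sum of n independent unit-mean exponentials."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0  # k = 0 term of sum_{k < n} x**k / k!
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total

def erlang_quantile(p, n):
    """Invert erlang_cdf by bisection (adequate for a sketch)."""
    lo, hi = 0.0, float(n)
    while erlang_cdf(hi, n) < p:  # widen the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if erlang_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def critical_mean(theta0, n, alpha):
    """Smallest sample mean that rejects H0: mean <= theta0 at level alpha."""
    return theta0 * erlang_quantile(1.0 - alpha, n) / n

# Hypothetical test plan: 10 units, 5 percent significance level.
threshold = critical_mean(theta0=100.0, n=10, alpha=0.05)
print(f"Reject H0 if the mean lifetime of 10 units exceeds {threshold:.1f} hours")
```

Under these assumptions the threshold for ten units is roughly 157 hours, which already hints at how far above the requirement a small sample must perform.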

This basic approach has a number of shortcomings. First, for many weapon systems, tests may be (1) costly, (2) damaging to the environment, and (3) dangerous. These considerations often make it impossible to collect samples of even moderate size. At the same time, system performance must usually be assessed under a variety of conditions (scenarios). Thus, minimizing the expected sample size needed to achieve a given level of significance is highly desirable, but it frequently leads to tests that yield little additional information about system performance.

A second shortcoming is that the small sample sizes often result in test designs that require the system to actually perform at levels well above the specified level to ensure that the power of the test approaches reasonable values. Conversely, if the null hypothesis is that the system is performing at the required level, the resulting hypothesis test will be much too forgiving, failing to detect systems that perform at levels well below that specified. There were some revealing exchanges at the workshop about the role of the null hypothesis in determining whether a test result would lead to acceptance or rejection of a system's performance with respect to an established standard. Advocates of the system wanted the null hypothesis to be that the system is performing at the required level; skeptics took the opposite view.

Furthermore, it is not clear what are appropriate levels of confidence or power. Is 80 percent reasonable, or 90 percent? On what basis should one decide?

Also, the tests are, at least implicitly, often sequential (especially in developmental testing), because test results are examined before deciding whether more testing is required. In this situation, the sequential nature of the tests usually is not recognized and hence the nominal significance level is not adjusted, resulting in tests with actual significance levels that are different from the designed levels. Alternatively, a system may be tested until the results of the test certify the system with respect to some standard of performance. In this case, the resulting estimate of system performance will be biased because of the nature of the stopping rule.

Finally, because of the significant costs associated with defense testing, questions about how much testing to do would be better addressed by statistical decision theory than by strict hypothesis testing.
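
The point made above about power, namely that small samples force a system to perform well above the specification before the test is likely to pass it, can be checked numerically. The sketch below works in the exponential-lifetime setting of the opening example. The function names and the particular values (5 percent significance, 80 percent power, sample sizes of 5, 10, and 30) are hypothetical choices made for illustration.

```python
import math

def erlang_cdf(x, n):
    """P(S <= x), where S is the sum of n independent unit-mean exponentials."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total

def erlang_quantile(p, n):
    """Invert erlang_cdf by bisection."""
    lo, hi = 0.0, float(n)
    while erlang_cdf(hi, n) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if erlang_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power(theta, theta0, n, alpha):
    """Probability of rejecting H0: mean <= theta0 when the true mean is theta."""
    # Reject when the total observed lifetime exceeds this value.
    critical_sum = theta0 * erlang_quantile(1.0 - alpha, n)
    return 1.0 - erlang_cdf(critical_sum / theta, n)

def required_ratio(n, alpha, target_power):
    """True mean divided by required mean needed to reach target_power."""
    return erlang_quantile(1.0 - alpha, n) / erlang_quantile(1.0 - target_power, n)

for n in (5, 10, 30):
    ratio = required_ratio(n, alpha=0.05, target_power=0.80)
    print(f"n = {n:2d}: true mean must be about {ratio:.2f} times the requirement")
```

For ten units the ratio comes out a little above two, so under these assumptions a device would need a true mean lifetime of more than 200 hours to pass a 100-hour requirement four times out of five.
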
Cost considerations are especially important for complex single-shot systems (e.g., missiles) with high unit costs and highly reliable electronic equipment that might require testing over long periods of time (Meth and Read, Appendix B). Voting a system up or down against some standard of performance at a given decision point does not consider the potential for further improvements to the system. A better objective is to purchase the maximum possible military value/utility given the constraints of national security requirements and the budget. This broader perspective fits naturally into a decision analysis framework. Concerns about efficient use of testing resources have also stimulated work on reliability growth modeling (see the preceding section).

POSSIBLE IMPROVEMENTS ON STATISTICAL SIGNIFICANCE TESTS

As indicated in the section on communicating uncertainty, significance tests have a constraining structure, and it is more informative to present point estimates with uncertainty error measures simply as interval estimates. This approach is a by-product of the more structured modeling approach taken, for example, in hierarchical or empirical Bayes analysis. Formal concepts in decision analysis, such as loss functions, can be helpful in this regard. Other decision problems can provide helpful case studies (e.g., Citro and Cohen, 1985, on census methodology). Workshop participants urged that the department move beyond the hypothesis testing paradigm to consider these more general approaches. However, participants also gave some specific suggestions that moved less far from significance tests.

Confidence Intervals

A simple alternative that avoids the necessity of power calculations is confidence intervals. Confidence intervals give a range of performance levels of a system that are consistent with the test results without the artificial aspect of a significance test's rejection regions. (Confidence intervals can also be compared with the maximum acceptable error, sometimes provided in the standards of performance, to determine whether the system is satisfactory. But this use is implicitly a hypothesis test procedure.) A related idea that can include the results of developmental tests is to report the Bayesian analog of a confidence interval, that is, a highest posterior probability interval.

Sequential Tests

Another improvement on standard hypothesis testing is sequential analysis, which minimizes the expected number of tests needed to establish significance at a given level. Siegmund (1985) is a good general reference. Sequential probability ratio tests (described, for example, in DeGroot, 1970: Ch. 12) were the first formal sequential methods and actually were developed from applications to military production. Sequential tests make best use of the modest number of available tests. (However, with sequential tests there is a small probability of having to perform a very large number of trials.)
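
As a rough illustration of a sequential probability ratio test, the sketch below applies Wald's procedure to a series of pass/fail trials. The two success probabilities p0 and p1, the error rates, the trial outcomes, and the function name are all assumptions introduced here for illustration. The stopping boundaries are Wald's standard approximations, so the realized error rates only approximately match alpha and beta.

```python
import math

def sprt_bernoulli(outcomes, p0, p1, alpha, beta):
    """Wald's SPRT of H0: p = p0 against H1: p = p1 (p1 > p0) on 0/1 outcomes.

    Returns (decision, trials_used), where decision is "accept_h1",
    "accept_h0", or "continue" if the data run out before a boundary is hit.
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing above: decide for H1
    lower = math.log(beta / (1.0 - alpha))   # crossing below: decide for H0
    step_success = math.log(p1 / p0)
    step_failure = math.log((1.0 - p1) / (1.0 - p0))

    llr = 0.0   # running log-likelihood ratio
    used = 0
    for used, outcome in enumerate(outcomes, start=1):
        llr += step_success if outcome else step_failure
        if llr >= upper:
            return "accept_h1", used
        if llr <= lower:
            return "accept_h0", used
    return "continue", used

# Hypothetical sequence of trial results (1 = success, 0 = failure).
trials = [1, 1, 0, 1, 1, 1, 1, 1]
decision, used = sprt_bernoulli(trials, p0=0.5, p1=0.8, alpha=0.1, beta=0.1)
print(decision, used)
```

The function stops as soon as the accumulated evidence crosses either boundary, which is what allows a sequential test to finish with fewer trials on average than a fixed-sample design. It also reports "continue" when the available trials are exhausted, the case corresponding to the small chance of needing many more trials.
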
Sequential tests also avoid the complication posed by the multiple looks that investigators have had on a sequence of test results and the impact of that on nominal significance levels. Unfortunately, sequential methods may be difficult to use in OT&E, because there are times when the results of previous operational tests will not be known before the next test is ready to begin. Methods for group sequential testing and other approaches to sequential monitoring of experimental situations, originally developed for clinical trials in medicine, may be helpful for these types of problems. (Jennison and Turnbull, 1990, provides a good review and further references.) In addition to sequential methods, designs using repeated measures are applicable when a particular system is tested a number of times under the same or varying conditions. Sequential tests may still have low power, however, and they do not enable one to directly address the cost-benefit aspect of testing for system performance.

Nonparametric Methods

Standard parametric analyses are based on certain distributional assumptions, for example, requiring observations that are normally or exponentially distributed. These assumptions cannot always be verified, and nonparametric methods may be more appropriate for these testing applications. In reliability theory, nonparametric inferences typically involve a qualitative assumption about how systems age (i.e., the system failure rate) or a judgment about the relative susceptibility to failure of two or more systems. Recent and ongoing research in this area might be effectively used in defense testing.

Decision Analysis

Tests for military systems are expensive and often destructive. For example, every test of a system that delivers a projectile results in one fewer projectile for the war-fighting inventory. The natural approach to determine the amount of testing is decision analytic, wherein the added information provided by a test and the benefit of that information is compared with the cost of that test. One modeling approach when using significance tests is to minimize the expected cost of a test procedure:

Expected Cost = α (Cost of rejecting H0 if H0 is true) + β (Cost of failing to reject H0 if Ha is true) + (Cost of test itself),

where H0 is the null hypothesis, Ha is the alternative hypothesis, and α and 1 − β are, respectively, the size and the power of a standard hypothesis test. A decision-theoretic approach is most useful for testing problems that destroy valuable material. A central problem with this approach is that the above costs are usually difficult to estimate.
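
The expected-cost criterion can be turned into a sample-size calculation once a model for β is chosen. The sketch below assumes, purely for illustration, a one-sided z-test on a normally distributed performance measure with known standard deviation, so that β depends on the number of trials n. Every cost figure, the effect size, and all parameter names are hypothetical.

```python
from statistics import NormalDist

def expected_cost(n, alpha, effect, sigma,
                  cost_false_reject, cost_miss, cost_per_trial):
    """alpha * C(reject true H0) + beta(n) * C(miss true Ha) + n * C(one trial)."""
    std_normal = NormalDist()
    # Type II error of a one-sided z-test when the true shift from H0 is `effect`.
    beta = std_normal.cdf(std_normal.inv_cdf(1.0 - alpha) - effect * n ** 0.5 / sigma)
    return alpha * cost_false_reject + beta * cost_miss + n * cost_per_trial

def best_sample_size(max_n, **params):
    """Number of trials in 1..max_n with the smallest expected cost."""
    return min(range(1, max_n + 1), key=lambda n: expected_cost(n, **params))

# Hypothetical inputs: costs are in arbitrary units.
params = dict(alpha=0.05, effect=0.5, sigma=1.0,
              cost_false_reject=200.0, cost_miss=1000.0, cost_per_trial=10.0)
n_star = best_sample_size(60, **params)
print(n_star, round(expected_cost(n_star, **params), 1))
```

The trade-off is visible in the two terms that depend on n: each added trial lowers β and therefore the expected cost of a missed detection, but adds its own testing cost. The minimum shifts as the assumed costs change, which is why the difficulty of estimating those costs matters so much in practice.
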
One element of expected cost may be the probability of injury or loss of life due to a lower-performing system compared with the expected cost of a more expensive but higher-performing system.

FINAL COMPLICATIONS AND REMARKS

An additional difficulty that we have ignored is that real weapon systems typically have several measures of performance. Therefore, the success of a system must be a combination of the measures of success of each individual assessment. Also, to implement several of the above techniques, some methods for combining measures of effectiveness are needed.

Finally, weapon system testing is very complicated, and ideally every decision should make use of information in a creative and informative way. To this end it may be useful to produce graphic displays of the results of the various tests. There are now available very effective and informative graphic displays that do not require statistical sophistication to understand; these may aid in making decisions as to whether a system is worth developing. Packages such as Lisp-Stat (Tierney, 1990) and S-Plus (Chambers and Hastie, 1992) include dynamic graphics. Tufte (1983) and Morgan and Henrion (1990) discuss methods for displaying information and accounting for uncertainty when making decisions. Such techniques can allow human judgment to be combined with formal test procedures.