specified level to ensure that the power of the test approaches reasonable values. Conversely, if the null hypothesis is that the system is performing at the required level, the resulting hypothesis test will be much too forgiving, failing to detect systems that perform at levels well below that specified. There were some revealing exchanges at the workshop about the role of the null hypothesis in determining whether a test result would lead to acceptance or rejection of a system's performance with respect to an established standard. Advocates of the system wanted the null hypothesis to be that the system is performing at the required level; skeptics took the opposite view. Furthermore, it is not clear what levels of confidence or power are appropriate. Is 80 percent reasonable, or 90 percent? On what basis should one decide?
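The consequence of the choice of null hypothesis can be made concrete with a small numerical sketch (in Python; the sample size, specification, and significance level below are hypothetical illustrations, not values from the workshop). With the advocate's framing, in which the null hypothesis is that the system meets the specification, a short test has little power to detect a system performing well below it:

    # Sketch with hypothetical numbers: requirement is a success
    # probability of at least 0.90, and 20 test firings are available.
    from scipy.stats import binom

    n, p_spec, alpha = 20, 0.90, 0.05

    # Advocate's framing: H0 is "system meets the spec" (p = 0.90);
    # reject only if successes fall at or below the critical value c.
    c = max(k for k in range(n + 1) if binom.cdf(k, n, p_spec) <= alpha)
    print(f"Reject H0 only if successes <= {c} of {n}")

    # Power against a system that is actually well below spec (p = 0.80):
    power = binom.cdf(c, n, 0.80)
    print(f"Chance of detecting a p = 0.80 system: {power:.2f}")
    # With n = 20 this is only about 0.37: the test usually fails to
    # detect the shortfall, i.e., it is "much too forgiving."

Reversing the framing, so that the null hypothesis is that the system falls short, shifts the burden of proof to the system and makes the test correspondingly harder to pass with the same data.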

Also, the tests are often, at least implicitly, sequential (especially in developmental testing), because test results are examined before deciding whether more testing is required. In this situation the sequential nature of the testing usually goes unrecognized, the nominal significance level is not adjusted, and the actual significance level therefore differs from the one the test was designed to have. Alternatively, a system may be tested until the results certify it with respect to some standard of performance. In that case the resulting estimate of system performance will be biased because of the nature of the stopping rule.
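A minimal simulation (again with assumed, illustrative numbers) shows the effect of unadjusted sequential looks: examining the data after every batch and applying a nominal 0.05-level test at each look produces an overall false-alarm rate noticeably above 0.05.

    # Assumed setup: a reliability test examined after every batch of
    # 10 trials, stopping early if a nominal alpha = 0.05 one-sided
    # binomial test rejects. Under H0 the true success probability
    # equals the spec, so every rejection here is a false alarm.
    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(0)
    p_spec, alpha, batch, looks, reps = 0.90, 0.05, 10, 5, 20_000

    false_alarms = 0
    for _ in range(reps):
        successes = trials = 0
        for _ in range(looks):
            successes += rng.binomial(batch, p_spec)
            trials += batch
            # One-sided test of H0: p = p_spec at each look.
            if binom.cdf(successes, trials, p_spec) <= alpha:
                false_alarms += 1
                break

    print(f"Actual significance level: {false_alarms / reps:.3f}")
    # Noticeably above the nominal 0.05 because of the repeated looks.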

Finally, because of the significant costs associated with defense testing, questions about how much testing to do would be better addressed by statistical decision theory than by strict hypothesis testing. Cost considerations are especially important for complex single-shot systems (e.g., missiles) with high unit costs and for highly reliable electronic equipment that might require testing over long periods of time (Meth and Read, Appendix B). Voting a system up or down against some standard of performance at a given decision point ignores the potential for further improvements to the system. A better objective is to purchase the maximum possible military value (utility) within the constraints of national security requirements and the budget. This broader perspective fits naturally into a decision analysis framework. Concerns about efficient use of testing resources have also stimulated work on reliability growth modeling (see the preceding section).
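The following toy sketch illustrates the decision-analysis framing; every quantity in it (the prior, the value scale, the per-test cost, the requirement) is a hypothetical placeholder, not a recommendation. It compares the expected utility of fielding the system now against buying one more batch of tests, using a standard preposterior calculation.

    # Toy decision-analysis sketch; all numbers are hypothetical.
    import numpy as np
    from scipy.stats import betabinom

    a, b = 8.0, 2.0                           # Beta prior on reliability p
    p_req, V, C, m = 0.85, 100.0, 0.05, 10    # spec, value scale, cost/test, batch

    def field_utility(post_mean):
        # Field only if the system looks better than the spec;
        # otherwise cancel, with utility normalized to 0.
        return max(V * (post_mean - p_req), 0.0)

    u_now = field_utility(a / (a + b))

    # Preposterior analysis: average the best decision over the
    # prior-predictive distribution of the next m test outcomes.
    ks = np.arange(m + 1)
    pk = betabinom.pmf(ks, m, a, b)
    u_test = sum(p * field_utility((a + k) / (a + b + m))
                 for k, p in zip(ks, pk)) - C * m

    print(f"Expected utility: field now = {u_now:.2f}, test first = {u_test:.2f}")
    # Testing is worth buying only when its expected value of
    # information exceeds its cost, a question that a pass/fail
    # hypothesis test by itself cannot answer.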

POSSIBLE IMPROVEMENTS ON STATISTICAL SIGNIFICANCE TESTS

As indicated in the section on communicating uncertainty, significance tests have a constraining structure, and it is more informative to present point estimates together with measures of their uncertainty, for example as interval estimates. This approach is a by-product of the more structured modeling approach
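As a concrete illustration of the interval-estimate point above, the sketch below (with hypothetical test counts) reports a point estimate of success probability together with an exact Clopper-Pearson confidence interval, rather than a binary accept/reject verdict against the specification.

    # Hedged sketch: report an estimate with an interval instead of a
    # bare pass/fail significance test. The counts are hypothetical.
    from scipy.stats import beta

    k, n, alpha = 17, 20, 0.05   # successes, trials, 1 - confidence

    p_hat = k / n
    # Exact (Clopper-Pearson) two-sided confidence limits.
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0

    print(f"Estimated success probability {p_hat:.2f}, "
          f"{100 * (1 - alpha):.0f}% CI [{lo:.2f}, {hi:.2f}]")
    # The interval conveys both the estimate and its uncertainty,
    # which a binary accept/reject verdict does not.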


