during operational testing—sometimes with dramatic results. For example, comparative tests involving the Sergeant York and baseline air defense systems were unbalanced in that the Sergeant York system tended to participate in easier force-on-force trials and appeared to demonstrate superior performance. When controlled for individual trial conditions, however, the tests yielded no consistent performance rankings (Arthur Fries, Appendix B). Although quite useful, baseline testing is often challenged by the perception within DoD that such tests divert resources from testing of the prospective system.
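The effect of unbalanced trial conditions can be illustrated with a minimal sketch. The counts below are hypothetical, invented purely for illustration, and are not data from the Sergeant York tests. The labels `new`, `baseline`, `easy`, and `hard` are likewise assumptions. The sketch shows how a pooled comparison can favor the system that drew the easier trials even when the two systems perform identically within each condition.

```python
# Hypothetical (hits, trials) counts, invented for illustration only.
# The "new" system is assigned mostly easy trials; the baseline mostly hard ones.
trials = {
    "new":      {"easy": (60, 80), "hard": (6, 20)},
    "baseline": {"easy": (15, 20), "hard": (24, 80)},
}


def pooled_rate(by_condition):
    """Hit rate ignoring trial condition (the unbalanced comparison)."""
    hits = sum(h for h, _ in by_condition.values())
    total = sum(n for _, n in by_condition.values())
    return hits / total


def rates_by_condition(by_condition):
    """Hit rate computed separately within each trial condition."""
    return {cond: h / n for cond, (h, n) in by_condition.items()}


pooled = {system: pooled_rate(c) for system, c in trials.items()}
stratified = {system: rates_by_condition(c) for system, c in trials.items()}

for system in trials:
    print(system, "pooled:", round(pooled[system], 2),
          "by condition:", stratified[system])
```

In this constructed example the pooled rates differ sharply, while the within-condition rates are the same for both systems, which is the pattern that controlling for trial conditions is meant to expose.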
Nontesting considerations, such as training, sometimes impose hidden constraints on test plans. For example, one participant described attempts to reduce sample size requirements in the testing of an already deployed system. The original sample size—unjustifiably large in purely statistical terms—was required because the testing was performed for training purposes. To be useful, any mathematical modeling aimed at producing an optimal test design must account for multiple objectives that the test might serve. Circumstances of this type may have implications (and may raise internal budgeting issues) for the testing office; control measurements from current systems may be available at small marginal cost if testing is needed for purposes unrelated to the assessment of system reliability.
Another common variance reduction technique is blocking. Experimental units are grouped into relatively homogeneous blocks, so that units within a block typically vary less than units in randomly constructed groups. Blocking of this kind lets the experimenter focus on the experimental factors of primary interest by controlling the effects of extraneous variables. In particular, blocking can be used to control effects due to player learning during the course of testing (i.e., time-order effects) and to initial differences in skill levels among test crews. Block designs often yield more sensitive measurements and can be conducted more efficiently than a completely randomized study.
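The variance reduction from blocking on test crews can be sketched with a small simulation. Everything in it is an assumption made for illustration: the additive model (score = crew skill + system effect + noise), the parameter values, and the function names `simulate_once` and `compare_designs`. Both designs use the same number of trials. In the blocked design each crew tests both systems, so crew skill cancels in the within-crew difference. In the completely randomized design each trial uses a separate crew.

```python
import random
import statistics


def simulate_once(rng, n_crews=8, true_diff=1.0, crew_sd=3.0, noise_sd=1.0):
    """Return (blocked, randomized) estimates of the system difference."""
    # Blocked design: each crew is a block and runs both systems,
    # so the crew's skill level cancels in the within-crew difference.
    blocked_diffs = []
    for _ in range(n_crews):
        skill = rng.gauss(0.0, crew_sd)
        new_score = skill + true_diff + rng.gauss(0.0, noise_sd)
        base_score = skill + rng.gauss(0.0, noise_sd)
        blocked_diffs.append(new_score - base_score)
    blocked = sum(blocked_diffs) / n_crews

    # Completely randomized design: 2 * n_crews independent crews, one trial
    # each, half on each system. Crew skill does not cancel here.
    new_scores = [rng.gauss(0.0, crew_sd) + true_diff + rng.gauss(0.0, noise_sd)
                  for _ in range(n_crews)]
    base_scores = [rng.gauss(0.0, crew_sd) + rng.gauss(0.0, noise_sd)
                   for _ in range(n_crews)]
    randomized = sum(new_scores) / n_crews - sum(base_scores) / n_crews
    return blocked, randomized


def compare_designs(n_reps=2000, seed=0):
    """Variance of each design's estimate across simulated test programs."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(n_reps)]
    var_blocked = statistics.variance([b for b, _ in results])
    var_randomized = statistics.variance([r for _, r in results])
    return var_blocked, var_randomized


if __name__ == "__main__":
    vb, vr = compare_designs()
    print("variance of estimate, blocked design:   ", round(vb, 3))
    print("variance of estimate, randomized design:", round(vr, 3))
```

Under these assumed parameters the theoretical variances are 2(1)/8 = 0.25 for the blocked design and 2(9 + 1)/8 = 2.5 for the randomized design. The gap shrinks as crew-to-crew variation becomes small relative to trial noise.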
Introducing formal sequential methods would produce more realistic estimates of error probabilities. Sequential methods, in the form of the test-fix-retest sequence of developmental tests, are already used informally, but the sequential aspect is typically not taken into account, so the calculated significance probabilities are biased. Explicit statistical modeling of the sequence would yield more realistic assessments in such tests. Problems with hidden sequential tests and possible remedial measures are discussed further in the section below on the pitfalls of hypothesis testing.
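The bias from ignoring a hidden sequence of tests can be illustrated with a deliberately simplified simulation. It is a sketch under stated assumptions, not a formal sequential design. It assumes that the system never truly meets its requirement and that the "fixes" have no real effect. It also assumes that each retest is an independent one-sided z-test with known unit variance at nominal level `alpha`, and that testing stops at the first pass. The function names and parameter values are hypothetical.

```python
import random
import statistics


def passes_eventually(rng, max_attempts, n_per_attempt, z_crit):
    """True if any attempt in a test-fix-retest sequence is 'significant'."""
    for _ in range(max_attempts):
        # Under the assumed null, observations are N(0, 1): no real improvement.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_attempt)]
        z = (sum(sample) / n_per_attempt) * n_per_attempt ** 0.5
        if z > z_crit:
            return True  # stop at the first passing test
    return False


def overall_false_pass_rate(max_attempts=5, n_per_attempt=10, alpha=0.05,
                            n_reps=10000, seed=0):
    """Estimate the chance that a deficient system passes at least once."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1.0 - alpha)
    passes = sum(
        passes_eventually(rng, max_attempts, n_per_attempt, z_crit)
        for _ in range(n_reps)
    )
    return passes / n_reps


if __name__ == "__main__":
    print("nominal per-test error rate:", 0.05)
    print("simulated overall rate:     ", overall_false_pass_rate())
    print("closed form 1 - (1 - a)^k:  ", 1 - (1 - 0.05) ** 5)
```

With five independent attempts at a nominal 5 percent level, the closed-form overall error probability is 1 - 0.95^5, roughly 0.23, well above the nominal level reported when each test is treated in isolation.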
Another important characteristic of defense testing is that, for any particular weapon system, the number of units that DoD might choose to acquire over a defined period of time will have an upper limit. Cost estimates for the SADARM missile system (see case study #2) are based on producing approximately 40,000 155mm projectiles (Horton, Appendix B); Ernest Seglie (Appendix B) cited a hypothetical example in which 1,000 missiles