Statistical Issues in Defense Analysis and Testing: Summary of a Workshop

Experimental Design

Because all testing is expensive, it is crucial that the results of such experiments be as informative as possible. The need for efficient design is particularly keen in DoD applications involving destructive testing, for example, when costly items such as missiles can be tested only by observing their performance in an assigned mission. The concern with maximizing information typically surfaces as a recurring question about sample size: “How big does n need to be?” The methods of experimental design, however, concentrate on the overall goal (efficiently obtaining information with direct bearing on the effectiveness of the tested system) rather than on a single proxy measure, sample size.

A number of observations related to experimental design were made at the workshop. First, introducing control units from a current system with known performance may produce more efficient tests of a proposed (i.e., treatment) system. The differences of measurements from matched treatment-control pairs are often substantially less variable than individual measurements from treatment units or control units as a group, in which case fewer observations on test systems are needed to achieve a given level of accuracy. For example, the Javelin (AAWS-M) system (discussed in case study #1), if acquired, would replace the existing Dragon system. Measurements taken under identical test conditions would permit direct comparison of the two systems, even though the new system might employ different tactics and operational procedures.
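
The variance reduction from pairing can be illustrated with a small simulation. The sketch below is not from the workshop; it assumes a hypothetical scoring model in which each trial condition adds a shared effect to both the control and the treatment measurement, and every name and parameter value (`simulate_paired_trials`, `condition_sd`, `noise_sd`, `true_gain`) is an illustrative assumption. Under those assumptions the shared condition effect cancels in each paired difference, so the differences vary far less than a comparison of unmatched units would.

```python
import random
import statistics


def simulate_paired_trials(n_pairs, condition_sd=3.0, noise_sd=1.0,
                           true_gain=0.5, seed=0):
    """Simulate matched control/treatment scores (hypothetical model).

    Each pair shares one trial-condition effect; only the measurement
    noise differs between the two members of a pair.
    """
    rng = random.Random(seed)
    control, treatment = [], []
    for _ in range(n_pairs):
        condition = rng.gauss(0.0, condition_sd)  # shared by both units
        control.append(10.0 + condition + rng.gauss(0.0, noise_sd))
        treatment.append(10.0 + true_gain + condition + rng.gauss(0.0, noise_sd))
    return control, treatment


control, treatment = simulate_paired_trials(n_pairs=200)

# Paired analysis: the shared condition effect cancels in each difference.
differences = [t - c for t, c in zip(treatment, control)]
var_paired = statistics.variance(differences)

# Unmatched analysis: variance of a difference between one treatment unit
# and one control unit observed under unrelated conditions.
var_unpaired = statistics.variance(treatment) + statistics.variance(control)

print(f"variance of paired differences:   {var_paired:.2f}")
print(f"variance of unmatched difference: {var_unpaired:.2f}")
```

When trial conditions contribute little to the overall variability (a small `condition_sd` in this sketch), pairing offers correspondingly less gain.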

Currently, comparisons against baseline systems are sometimes made
during operational testing, sometimes with dramatic results. For example, comparative tests involving the Sergeant York and baseline air defense systems were unbalanced in that the Sergeant York system tended to participate in easier force-on-force trials and appeared to demonstrate superior performance. When controlled for individual trial conditions, however, the tests yielded no consistent performance rankings (Arthur Fries, Appendix B). Although quite useful, baseline testing is often challenged by the perception within DoD that such tests divert resources from testing of the prospective system.

Nontesting considerations, such as training, sometimes impose hidden constraints on test plans. For example, one participant described attempts to reduce sample size requirements in the testing of an already deployed system. The original sample size, unjustifiably large in purely statistical terms, was required because the testing was performed for training purposes. To be useful, any mathematical modeling aimed at producing an optimal test design must account for multiple objectives that the test might serve. Circumstances of this type may have implications (and may raise internal budgeting issues) for the testing office; control measurements from current systems may be available at small marginal cost if testing is needed for purposes unrelated to the assessment of system reliability.

Another common variance reduction technique involves blocking. Relatively homogeneous experimental units are arranged into blocks that are typically less variable than randomly constructed blocks. This type of blocking allows the experimenter to focus on experimental factors of primary interest by controlling the effects of extraneous variables.
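
As a rough illustration of blocking, the sketch below builds a randomized complete block plan: every block receives every treatment once, and only the order within a block is randomized. It is a minimal example under stated assumptions, not a procedure described at the workshop; the choice of test crews as blocks, the labels `baseline` and `candidate`, and the helper `randomized_block_design` are all hypothetical.

```python
import random


def randomized_block_design(blocks, treatments, seed=0):
    """Return a run plan: each block gets every treatment once, in random order."""
    rng = random.Random(seed)
    plan = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # randomize run order within the block only
        plan[block] = order
    return plan


# Hypothetical blocks (crews) and treatments (systems under comparison).
crews = ["crew_1", "crew_2", "crew_3", "crew_4"]
systems = ["baseline", "candidate"]

plan = randomized_block_design(crews, systems)
for crew, order in plan.items():
    print(crew, "->", order)
```

Because each crew operates both systems, differences between crews do not enter the comparison between systems, and randomizing the order within each crew guards against a systematic order effect.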
In particular, blocking can be used to control effects due to player learning during the course of testing (i.e., time-order effects) and to initial differences in skill levels among test crews. Block designs often yield more sensitive measurements and can be conducted more efficiently than a completely randomized study.

Introducing formal sequential methods would produce more realistic estimates of error probabilities. Sequential methods, in the form of the test-fix-retest sequence of developmental tests, are already used informally, but the sequential aspect is typically not taken into account, which biases the significance probability calculations. Explicit statistical modeling would yield more realistic assessments in such tests. Problems with hidden sequential tests and possible remedial measures are discussed further in the section below on the pitfalls of hypothesis testing.

Another important characteristic of defense testing is that, for any particular weapon system, the number of units that DoD might choose to acquire over a defined period of time will have an upper limit. Cost estimates for the SADARM missile system (see case study #2) are based on producing approximately 40,000 155mm projectiles (Horton, Appendix B); Ernest Seglie (Appendix B) cited a hypothetical example in which 1,000 missiles
were to be produced. (Defense analysts presumably would want to ascertain the basis for such production numbers and their relationship to mission needs.) Sample size calculations based on a conceptually infinite population of units may be biased, especially for expensive systems that would be purchased in small quantities. Such calculations may also overstate the required number of tests. In such cases, the application of finite population sampling methods may produce a better allocation of testing resources. Operational tests might be designed with the goal of maximizing the number of reliable missiles in the stockpile; sample sizes during operational testing should be sufficient to provide reasonable assurance of the desired reliability.

One potentially applicable area of experimental design is currently attracting substantial research interest: the designed exercise of large computer codes, such as those used in preparing the COEA and the Operational Requirements Document. Various designs for systematically distributing a sample of computer runs over the high-dimensional space of possible model runs could also provide information to those decision makers ultimately responsible for formulating the Operational Requirements Document. A cautionary note, however, was raised at the workshop: the complexity of many of the actual models that were discussed poses problems that are not yet amenable to standard analyses.

Use of experimental design methods both during COEA and during operational testing also may force a closer study of the relation of measures of performance to measures of effectiveness, and of the functional relation between measures of performance and design variables. This is important in making operational decisions about what design points to use for tests in a resource-scarce setting.
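
The idea of systematically distributing a limited number of computer runs over a high-dimensional input space can be illustrated with a Latin hypercube sample. The workshop text does not name a specific design, so the choice of Latin hypercube sampling here is an assumption, as are the function name `latin_hypercube` and the scaling of every input to the unit interval. In this sketch each input's range is cut into as many equal strata as there are runs, and each stratum is used exactly once per input.

```python
import random


def latin_hypercube(n_runs, n_factors, seed=0):
    """Return n_runs points in [0, 1)^n_factors as a Latin hypercube sample.

    For every factor, each of the n_runs equal-width strata of [0, 1)
    contains exactly one point.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_factors):
        # One random point inside each stratum, then a random stratum order.
        points = [(stratum + rng.random()) / n_runs for stratum in range(n_runs)]
        rng.shuffle(points)
        columns.append(points)
    # Transpose the per-factor columns into one tuple of inputs per run.
    return [tuple(column[i] for column in columns) for i in range(n_runs)]


design = latin_hypercube(n_runs=8, n_factors=3)
for run in design:
    print(tuple(round(x, 3) for x in run))
```

Each row would then be rescaled to the actual input ranges of the model and used as one model run; eight runs cover every eighth of each input's range, which a purely random sample of the same size would not guarantee.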
Techniques such as response surface analysis could also contribute by forcing an explicit consideration of interactions among design options and, hence, more efficient use of the limited number of design points. Response surface experimental designs are useful for choosing design points that maximize or minimize an objective function and can be used to choose scenarios that span the range of performance. Box et al. (1978:Ch. 15) provide a good introduction to response surface methodology. If nothing else, such considerations demonstrate the inherently unsatisfactory nature of informal designs in which only a single factor at a time is varied.

The use of complex experimental designs is not completely unknown in the operational testing of military systems. For example, Graeco-Latin square designs were used in the 1954 Project Stalk test of alternative tank and sighting system combinations (Fries, Appendix B). Some intricate designs may not be suitable for most defense applications because of constraints on test conditions and implementation. However, tools such as fractional factorial designs and incomplete block designs might be underutilized as methods for capturing information with an efficient number of experimental runs.

As an illustration, consider the recent test of a ship-launched electronic decoy, described by James Duff (Appendix B). The original design included three types of threat (A, B, C), two altitude levels (high/low), four separation combinations (A/high, A/low, B/low, none), and two timing separations (long, short). With two replications per cell, a complete factorial design would have required 3 × 2 × 4 × 2 × 2 = 96 experimental units. Because only 40 decoys were available for operational testing, the complete factorial design was shrunk to reduce the required number of decoys. Timing separation was eliminated as a factor, and one of the four separation combinations (none) was dropped as a treatment, so that the revised design needed only 3 × 2 × 3 × 2 = 36 experimental units. An obvious question is: Why could only 40 decoys be made available for testing? Nevertheless, with this stated resource constraint, use of an incomplete block design for the operational test might have provided more useful information by retaining the original dimensions of the problem. Also, in this problem, the operational test director exercised judgment in pruning the original design. Timing separation was eliminated, because it was assumed to be the least contributing factor to operational effectiveness (Duff, Appendix B). Bayesian approaches to experimental design allow more formal use of expert opinion and other prior information at the design stage.

Yet another related subdiscipline of statistics, sample design, might also be useful in providing quantitative information to satisfy DoD requirements, such as that characterizing system effectiveness in terms of use “by representative personnel” (U.S. Department of Defense, 1991:13).
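
The run counts quoted for the decoy test described above can be checked by enumerating the design cells directly. The sketch below is only an arithmetic illustration: the factor levels are taken from the text, while the variable names and the availability figure `decoys_available` are labels introduced here.

```python
from itertools import product

# Factor levels as listed in the decoy example.
threats = ["A", "B", "C"]
altitudes = ["high", "low"]
separations = ["A/high", "A/low", "B/low", "none"]
timings = ["long", "short"]
replications = 2
decoys_available = 40

# Complete factorial: every combination of all four factors.
full_cells = list(product(threats, altitudes, separations, timings))
full_units = len(full_cells) * replications

# Revised design: timing dropped as a factor, "none" dropped as a treatment.
reduced_separations = [s for s in separations if s != "none"]
reduced_cells = list(product(threats, altitudes, reduced_separations))
reduced_units = len(reduced_cells) * replications

print(full_units, reduced_units)  # 96 36
print(reduced_units <= decoys_available)  # True
```

The enumeration makes the cost of each factor visible: dropping the two-level timing factor halves the number of cells, which is why removing a factor outright is the bluntest way to fit a budget and why fractional or incomplete block designs are worth considering first.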
One relevant criticism of some OT&E programs concerns potential biases resulting from using golden crews (that is, operational testers who have higher skill levels than regular users) when testing a system's performance. Designing sampling plans to get representative samples, using methods closely related to experimental design, is a practical way to address some of the perceived problems in test design (Fienberg and Tanur, 1987).

The key role of human operators is formally recognized even at the earlier COEA stage, as evidenced by the introduction of man-in-the-loop simulators and their integration into training and systems specification. Highly qualified personnel are employed in developmental testing and combined OT/DT, when attention is focused on how well a system meets technical specifications. Human factors can thus be incorporated into simulation modeling during COEA, as well as into the statistical design of test plans during OT&E. Modeling human factors was particularly important, for example, in assessing the Inter-Vehicular Information System. Distributed
simulation methods were employed in the COEA for this system, as described by Staniec (Appendix B).

In summary, just as earlier statistical quality control methods were eventually adapted to form the basis of the military standard sampling plans, so today's quality movement, with its emphasis on early use of experimental design and related methods to collect information optimally in the pursuit of ultimate quality, has the potential for contributing to current military procurement processes.