APPENDIX B

A Short History of Experimental Design, with Commentary for Operational Testing
Some of the most important contributions to the theory and practice of statistical inference in the twentieth century have been those in experimental design. Most of the early development was stimulated by applications in agriculture. The statistical principles underlying design of experiments were largely developed by R. A. Fisher during his pioneering work at Rothamsted Experimental Station in the 1920s and 1930s.
The use of experimental design methods in the chemical industry was promoted in the 1950s by the extensive work of Box and his collaborators on response surface designs (Box and Draper, 1987). Over the past 15 years, there has been a tremendous increase in the application of experimental design techniques in industry. This is due largely to the increased emphasis on quality improvement and the important role played by statistical methods in general, and design of experiments in particular, in Japanese industry. The work of the Japanese quality consultant G. Taguchi on robust design for variation reduction has shown the power of experimental design techniques for quality improvement.
Experimental design techniques are also becoming popular in the area of computer-aided design and engineering using computer/simulation models, including applications in manufacturing (the automobile and semiconductor industries), as well as in the nuclear industry (Conover and Iman, 1980). Statistical issues in the design and analysis of computer/simulation experiments are discussed in Sacks et al. (1989).
Robust design uses designed experiments to study the response surfaces associated with both mean and variation, and to choose the factor settings judiciously so that both variability and bias are made simultaneously small. Variability is studied by identifying important “noise” variables and varying them systematically in offline experiments. Robust design ideas have been used extensively in industry in recent years (see Taguchi, 1986; Nair, 1992).
Some basic insights of experimental design have had revolutionary impact, but many of these insights are not well known among scientists without specialized training in statistics, partly because elementary texts and first courses seldom treat this topic at all, let alone in any depth. For example, the role of randomization and the inefficiency of the practice of varying one factor at a time are not well appreciated. To the extent that this is true of the operational testing community, it should be surprising, since many of the applications of, and much of the support for research in, experimental design derived from problems faced by DoD during and shortly after World War II. The reason may be that practical considerations in carrying out operational testing often impose such complex restrictions on the nature of the experimental design that one cannot rely on standard formulae to optimize the design. Here, as in many other applications of statistical theory to practice, it seems likely that the limited standard textbook rules and dogmas are inadequate for dealing intelligently with the problem. What is required is the kind of expertise that can adapt underlying basic principles to the situation at hand, an expertise rarely found outside the ranks of well-trained statisticians who understand the relation of standard rules to underlying principles.
Both to serve as a reference point for later discussion and to help summarize the progress made in this field, we describe a few of the basic principles and tools of experimental design in barest outline. It is our hope that appreciation of the basic principles will thus be enhanced, and the potential for more sophisticated applications developed.
THE VALUE OF CONTROLS, BLOCKING, AND RANDOMIZATION
Several basic principles of design of experiments are widely understood. One is the need for a control. In comparing two systems, a new one and a standard one whose behavior is relatively well known, there used to be a natural tendency to test and evaluate the new system separately. The result of such an evaluation tends to be biased by a “halo” factor because the new system is being evaluated under conditions somewhat different from the everyday conditions under which the old system has been used. To avoid this bias, it is commonplace to test both systems simultaneously under similar circumstances. With complicated weapon systems, satisfactory control may require careful consideration of the training of the personnel handling the system.
The use of controls has an additional advantage besides that of eliminating a potential inadvertent bias. This advantage stems from the factors that contribute to the variability in the outcomes of individual tests. Ordinarily, the outcome of an experiment depends not only on the overall quality of the system, but also on more or less random variations, some of which are due to the general environment. To the extent that the two systems are tested in the same environment, which is likely to have a similar effect on both systems, the difference in performance is less likely to be affected by the environment, and the experiment yields a more precise estimate of the overall difference in performance of the two systems. If natural variations in the environment have a relatively large effect on the variability in performance, the ability to match pairs has a correspondingly large effect on increasing the precision of conclusions.
When this principle of matching is generalized to more than two systems, it is referred to as blocking, a term derived from agricultural experiments in which several treatments are applied in each of many blocks of land. In the context of operational testing, a series of prototypes and controls are tested simultaneously under a variety of conditions defined by such factors as terrain, weather, degree of training of troops, and type of attack. Here one expects considerable homogeneity within blocks and nontrivial variation from block to block.
The process of blocking raises another issue: how should the various treatments be distributed within a block? In an agricultural experiment, if position within the block has no effect on yield, the allocation of treatments to positions will not matter. But if there is a systematic gradient in soil fertility in one direction, a systematic allocation might introduce a bias. One way to deal with this possibility is to anticipate the bias and allocate within the various blocks in a clever fashion designed to cancel out the extraneous gradient effect. This is tricky, and the history of such attempts is full of misguided failures.
Another approach to reducing the bias is to select the allocation within the block by randomization. Often in operational testing applications with a small number of test articles, randomization may not be necessary, and small systematic designs can be used safely. Or one can select a design at random from a restricted class of “reasonably safe” designs. However, in larger and more complicated experiments where there are many blocks, the possible biasing effect due to “unfortunate” randomizations is very likely to be minimal. Moreover, one byproduct of randomization is that it permits the statistician to ignore the complications due to many poorly understood potential biasing phenomena in constructing the probabilistic model on which to base the analysis.
VARYING MORE THAN ONE FACTOR AT A TIME
Perhaps one of the most important insights of experimental design is that the traditional policy of varying one factor at a time is inefficient; that is, the resulting estimates have higher variance than estimates derived from experiments with the same number of replications in which several factors are varied simultaneously. We illustrate with two examples. One example, due to Hotelling and based on work by Yates, involves the weighing of eight objects whose weights are w_{i}, 1 ≤ i ≤ 8. A chemist's scale is used, which provides a reading equal to the weight in one pan minus the weight in the other pan, plus a random error with mean 0 and variance σ^{2}. Hotelling proposes the design represented by the equations:
X_{1} = w_{1} + w_{2} + w_{3} + w_{4} + w_{5} + w_{6} + w_{7} + w_{8} + u_{1}
X_{2} = w_{1} + w_{2} + w_{3} − w_{4} − w_{5} − w_{6} − w_{7} + w_{8} + u_{2}
X_{3} = w_{1} − w_{2} − w_{3} + w_{4} + w_{5} − w_{6} − w_{7} + w_{8} + u_{3}
X_{4} = w_{1} − w_{2} − w_{3} − w_{4} − w_{5} + w_{6} + w_{7} + w_{8} + u_{4}
X_{5} = w_{1} + w_{2} − w_{3} + w_{4} − w_{5} + w_{6} − w_{7} − w_{8} + u_{5}
X_{6} = w_{1} + w_{2} − w_{3} − w_{4} + w_{5} − w_{6} + w_{7} − w_{8} + u_{6}
X_{7} = w_{1} − w_{2} + w_{3} + w_{4} − w_{5} − w_{6} + w_{7} − w_{8} + u_{7}
X_{8} = w_{1} − w_{2} + w_{3} − w_{4} + w_{5} + w_{6} − w_{7} − w_{8} + u_{8}
where X_{i} is the observed outcome of the ith weighing, every +1 before a w_{j} means that the jth object is in the first pan, a −1 means that it is in the other pan, and u_{i} is the random error for the ith weighing and is not observed directly. We estimate the w_{j} by solving the equations derived by assuming all u_{i} = 0. This gives, for example, the estimate ŵ_{1} of w_{1}, where:
ŵ_{1} = (X_{1} + X_{2} + X_{3} + X_{4} + X_{5} + X_{6} + X_{7} + X_{8})/8
or
ŵ_{1} = w_{1} + (u_{1} + u_{2} + u_{3} + u_{4} + u_{5} + u_{6} + u_{7} + u_{8})/8
Since the u_{i} are the errors resulting from independent weighings, we assume that they are independent
with mean 0 and variance σ^{2}. Then a straightforward computation yields the result that the ŵ_{j} have mean w_{j} and variance σ^{2}/8, and are uncorrelated.
If one had applied all 8 weighings to the first object alone, no better result would have been obtained for w_{1}. Thus a design in which each object is weighed separately would require 64 weighings to achieve the precision obtained here with 8.
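As a check on this reasoning, the design can be verified numerically. The sketch below (ours, not part of the original discussion) encodes the eight weighings as a sign matrix and confirms the orthogonality that makes the estimates uncorrelated, each with variance σ^{2}/8.

```python
# Numeric check (illustrative) of Hotelling's weighing design.  Row i holds
# the signs of w_1 ... w_8 in the i-th weighing X_i, copied from the equations.
H = [
    [+1, +1, +1, +1, +1, +1, +1, +1],
    [+1, +1, +1, -1, -1, -1, -1, +1],
    [+1, -1, -1, +1, +1, -1, -1, +1],
    [+1, -1, -1, -1, -1, +1, +1, +1],
    [+1, +1, -1, +1, -1, +1, -1, -1],
    [+1, +1, -1, -1, +1, -1, +1, -1],
    [+1, -1, +1, +1, -1, -1, +1, -1],
    [+1, -1, +1, -1, +1, +1, -1, -1],
]

# H is a Hadamard matrix: distinct rows have inner product 0, and each row
# has squared norm 8, so H'H = HH' = 8I.
for i in range(8):
    for j in range(8):
        dot = sum(H[i][k] * H[j][k] for k in range(8))
        assert dot == (8 if i == j else 0)

# Because H'H = 8I, the least-squares estimates are w_hat = H'X / 8, and each
# w_hat_j has variance (sum of squared coefficients) * sigma^2 = sigma^2 / 8.
# Column 0 is all +1, matching w_hat_1 = (X_1 + ... + X_8) / 8 above.
coef_sq = sum((H[i][0] / 8) ** 2 for i in range(8))
print(coef_sq)  # 0.125, i.e. 1/8
```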
Another example, from Mead (1988), confronts the practice of varying one factor at a time more directly. Suppose that the outcome of a treatment is affected by three factors, p, q, and r, each of which can be controlled at two levels, p_{0} or p_{1}, q_{0} or q_{1}, and r_{0} or r_{1}. We are allowed 24 observations. In one experiment, varying one factor at a time, we use:

    p_{0}q_{0}r_{0} and p_{1}q_{0}r_{0}, four times each;
    p_{0}q_{0}r_{0} and p_{0}q_{1}r_{0}, four times each;
    p_{0}q_{0}r_{0} and p_{0}q_{0}r_{1}, four times each.
An alternative second experiment uses each of the eight combinations p_{0}q_{0}r_{0}, p_{0}q_{0}r_{1}, p_{0}q_{1}r_{0}, p_{1}q_{0}r_{0}, p_{1}q_{1}r_{0}, p_{1}q_{0}r_{1}, p_{0}q_{1}r_{1}, and p_{1}q_{1}r_{1} three times. We are interested in estimating the difference in average effect due to the use of p_{1} rather than p_{0}. Assume that effects of the factors are additive, and the observations have a common variance σ^{2} about their expectation. Then the variance of the estimate of the difference due to p_{1}q_{k}r_{m} rather than p_{0}q_{k}r_{m} is σ^{2}/2 in the first experiment and σ^{2}/6 in the second. The same holds for the differences due to the second and third factors. A threefold reduction in variance can be achieved by a design that varies several factors at once.
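The two variances can be checked exactly. In the sketch below (an illustration we add, using exact rational arithmetic), each estimate of the p-effect is written as a linear combination c′y of the 24 observations; under independent errors with common variance σ^{2}, its variance is σ^{2} times the sum of squared coefficients.

```python
# Exact variance comparison (illustrative) for the two 24-run experiments.
from fractions import Fraction

# Experiment 1 (one factor at a time): mean of the 4 runs at p1q0r0 minus
# the mean of the 4 runs at p0q0r0.
coeffs_one_at_a_time = [Fraction(1, 4)] * 4 + [Fraction(-1, 4)] * 4
var1 = sum(c * c for c in coeffs_one_at_a_time)   # in units of sigma^2

# Experiment 2 (full factorial, replicated): mean of the 12 runs at level p1
# minus the mean of the 12 runs at level p0.
coeffs_factorial = [Fraction(1, 12)] * 12 + [Fraction(-1, 12)] * 12
var2 = sum(c * c for c in coeffs_factorial)

print(var1, var2)   # 1/2 1/6: the threefold reduction in variance
assert var1 == Fraction(1, 2) and var2 == Fraction(1, 6)
```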
The more efficient design consisted of replicating the eight-case block three times. This design also has the advantage of allowing the designer to select quite distinct environments for each block without worrying much about the contribution of the environmental factors to the overall effect being studied. When variations in environment have a large effect on the result, the blocking aspect of the design is useful in increasing the efficiency of the estimation of the contrasting effects of p, q, and r over a design that ignores blocking. Moreover, the design is well balanced in a technical sense, permitting simple analyses of the resulting data as well as efficient estimates. The simplicity of the analysis, even in this day of cheap and fast computing, retains the advantage of permitting the analyst to present the results convincingly to those without a background in statistics.
An experiment in which each combination of controllable factors is considered at several levels is called a factorial experiment. (Factorial designs were developed by Fisher and Yates at Rothamsted.) So, for example, if one has four factors involving five levels each, a factorial experiment would require 5^{4} = 625 distinct observations. Such a large number could be impractical. For such cases, an elegant mathematical theory of incomplete block designs was developed, supplemented by a theory dealing with fractional factorial designs, Latin squares, and Graeco-Latin squares for studying the main effects and low-order interactions in a small number of runs. These designs tend to achieve efficiency and balance while reducing potential biases, leading to relatively simple analysis. Fractional factorial designs were introduced by Finney (1945). Orthogonal arrays, recently popularized by Taguchi, include the fractional factorial designs developed by Finney, the designs developed by Plackett and Burman (1946), and the orthogonal arrays developed by Rao (1946, 1947), Bose and Bush (1952), and others.
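As a small illustration of the idea (our example, not from the text), the sketch below constructs a 2^{4−1} fractional factorial: eight runs, chosen by the defining relation I = ABCD, that keep all four main effects estimable and mutually orthogonal, at half the cost of the full 16-run factorial.

```python
# Sketch (hypothetical example): a half fraction of a 2^4 factorial with
# defining relation I = ABCD.  Levels are coded -1/+1; the fourth factor is
# set to the product of the first three, so D = ABC on every run.
from itertools import product

runs = []
for a, b, c in product((-1, +1), repeat=3):
    d = a * b * c                      # defining relation D = ABC
    runs.append((a, b, c, d))

# Balance: every factor appears four times at each level ...
for j in range(4):
    assert sum(run[j] for run in runs) == 0
# ... and every pair of factor columns is orthogonal, so the four main
# effects can be estimated without correlation from only eight runs.
for i in range(4):
    for j in range(i + 1, 4):
        assert sum(run[i] * run[j] for run in runs) == 0

for run in runs:
    print(run)
```

The price of the fraction is that each main effect is aliased with a three-factor interaction, which is acceptable when high-order interactions are assumed negligible.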
OPTIMAL EXPERIMENTAL DESIGNS
A major advance in the theory of experimental design was the introduction of optimal experimental design. This theory provides asymptotically optimal or efficient designs for estimating a single unknown parameter in problems for which the relationship between the outcome Y and the independent variables x_{1}, x_{2}, etc., is well understood and easily modeled as a function of a few unknown parameters. While this theory has some limitations in applied settings, its results can serve as benchmarks of efficiency, indicating where one should aim in order to get reasonably good designs. There are several such limitations.
First, since the theory is, except in the case of regression models, a large-sample theory, it may yield poor approximations to good designs in situations in which only limited sample sizes are available.
Second, the optimal designs often depend on the value of the unknown parameter. For example, if the reliability r of a device under stress x is given by r(x) = exp(−θx), then the optimal design for estimating the unknown parameter θ consists of stressing a sample of devices with the stress x = 1.6/θ. In these cases, one must rely on some prior knowledge about the unknown parameter or carry out preliminary, less efficient experiments to “get the lay of the land.” The latter is often good policy if it is feasible and not inconvenient.
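The constant 1.6 can be recovered numerically. For a device that survives stress x with probability exp(−θx), the Fisher information about θ per trial is x^{2}e^{−θx}/(1 − e^{−θx}); writing u = θx shows that the optimum is x = u*/θ for a universal constant u*. The sketch below (ours, not from the text) locates u* by a simple grid search.

```python
# Sketch (illustrative): find the stress maximizing the Fisher information
# for theta in the model r(x) = exp(-theta * x), working in units u = theta*x.
import math

def info(u):
    # Fisher information per trial at u = theta * x (theta set to 1).
    return u * u * math.exp(-u) / (1.0 - math.exp(-u))

# Grid search over u; the maximizer is the universal constant u*.
us = [0.001 * k for k in range(1, 10000)]
u_star = max(us, key=info)
print(round(u_star, 2))   # about 1.59, matching the rule x = 1.6 / theta
```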
Third, the optimality may depend on an assumed model that is incorrect, causing the resulting design to be suboptimal and possibly even noninformative. For example, consider a linear regression in which the probability of hit Y is a linear function of distance x for x in the range 3 to 4; i.e., Y = α + βx + u, where it is desired to estimate the slope β. (Of course, this model makes sense only for a relatively short range of x, since there is the danger of predicting probabilities that are less than 0 or greater than 1.) For each value of x between 3 and 4, one may observe the corresponding value of Y, which depends not only on x but also on the random noise u, which is assumed to have mean 0 and constant variance (independent of x) and is not observed. Then an optimal experiment would consist of selecting half of the x values at 3 and the other half at 4.
However, if this model were wrong and a more suitable model for Y as a function of distance were, instead, Y = α + βx + γx^{2} + u, adding a quadratic term, then an optimal design for estimating β would require the use of three values of x, and the above design, concentrated on two values of x, could not be used to estimate this three-parameter model. Note that for the quadratic model the slope is no longer constant, and β represents the slope at x = 0. This raises the additional question of whether β is the parameter we wish to estimate if the regression is not linear in x. More likely one would want to estimate β + 7γ, the slope at the halfway point x = 3.5.
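Both points can be verified with a short computation (our illustration, not from the text). For the linear model, Var(β̂) = σ^{2}/Σ(x_i − x̄)^{2}, so the endpoint design maximizes the spread; for the quadratic model, a design on only two support points makes the columns of the design matrix linearly dependent, so γ cannot be estimated.

```python
# Sketch (illustrative): compare candidate 8-run designs on [3, 4] for the
# regression example.  Under Y = alpha + beta*x + u with independent errors,
# Var(beta_hat) = sigma^2 / Sxx, where Sxx = sum_i (x_i - x_bar)^2.
def sxx(xs):
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

endpoints = [3.0] * 4 + [4.0] * 4                 # half at 3, half at 4
equally_spaced = [3.0 + i / 7.0 for i in range(8)]

print(sxx(endpoints), sxx(equally_spaced))  # 2.0 versus about 0.857
assert sxx(endpoints) > 2.3 * sxx(equally_spaced)

# But the two-point design cannot support the quadratic model: on {3, 4}
# the column x**2 equals 7*x - 12 exactly, so the columns (1, x, x**2) are
# linearly dependent and the quadratic coefficient is not estimable.
assert all(x * x == 7 * x - 12 for x in endpoints)
```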
On the other hand, suppose one were fairly certain that the linear model was an adequate approximation but somewhat concerned that γ might be substantial, and so wanted a design highly efficient for the linear model with some recourse in case the quadratic model was appropriate. Then minor variations from the optimal design for the linear model could be used to reveal deviations from the model without greatly sacrificing efficiency should the linear model be appropriate.
Finally, in many cases the object of the experiment involves the estimation of more than one unknown parameter. It is rarely possible to design an experiment that is simultaneously maximally efficient for estimating each of these parameters. In such cases, it is necessary to establish an appropriate criterion for measuring how well an experimental design does. Several criteria have been advanced. One possibility is to convert estimates of the parameters to estimates of performance of the equipment for each of several environments likely to be encountered. For each such environment, the estimate would have a variance. One could then determine a design that would minimize an average, over the range of environments, of the variances of estimated performance.
RESPONSE SURFACE DESIGNS
The choice of control settings is typically the subject of response surface design and analysis.
Response surfaces are simple linear or quadratic functions of independent variables that are used to approximate a more complex relationship between a response and those independent variables. Two types of optimality are studied. In the first case, the response is measured by a simple output that is to be maximized as a function of several control variables. This type of study requires the estimation of a surface that is typically quadratic in the relevant neighborhood. The usual 2^{n} factorial design, in which each variable is examined at two levels, is inadequate for estimating the quadratic effects needed to locate the optimal setting. However, a 3^{n} factorial design may involve too many settings. Composite designs, which supplement the 2^{n} factorial designs with additional points, contribute useful information about quadratic effects. In particular, there is a useful class of rotatable designs that are efficient and easy to analyze and comprehend. In the second kind of problem, there may be several output variables to deal with, with good performance requiring that each of these lie within certain acceptable bounds. In many cases, each of these output variables behaves in a roughly linear fashion as a function of the control variables in the region under discussion. Then a 2^{n} factorial design may be appropriate for estimating the linear trends necessary to determine control settings that will yield satisfactory results.
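A minimal sketch of a composite design (our example, for two factors) augments the 2^{2} factorial corners with axial ("star") points and replicated center runs; choosing the axial distance √2 makes the design rotatable, with every non-center point equidistant from the origin.

```python
# Sketch (hypothetical illustration): a central composite design for two
# control variables in coded units.  The factorial corners estimate linear
# and interaction effects; the axial and center points supply the pure
# quadratic information a two-level factorial alone cannot give.
import math
from itertools import product

alpha = math.sqrt(2.0)   # axial distance chosen to make the design rotatable

corners = list(product((-1.0, +1.0), repeat=2))            # 4 factorial runs
axial = [(+alpha, 0.0), (-alpha, 0.0), (0.0, +alpha), (0.0, -alpha)]
center = [(0.0, 0.0)] * 3                                  # replicated center

design = corners + axial + center
print(len(design))   # 11 runs; for k factors the count is 2**k + 2*k + centers,
                     # which grows far more slowly than 3**k as k increases
assert len(design) == 11

# Rotatable: all non-center points lie at the same distance from the origin.
radii = {round(math.hypot(x1, x2), 10)
         for (x1, x2) in design if (x1, x2) != (0.0, 0.0)}
assert radii == {round(alpha, 10)}
```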
The goal in many industrial experiments is to identify, from among a large set of factors, the important factors that affect one or more responses. Highly fractionated, typically main-effect plans are used as screening designs to identify the important factors. The high cost of industrial experimentation limits the number of runs; hence fractional designs with factors typically at two levels are used in these experiments. Once a smaller set of important factors has been identified, the response surface can be studied more thoroughly using designs with more than two levels, and process/product performance can be optimized. This is the rationale behind the response surface methodology developed by Box and others (see Box and Draper, 1987). It should be pointed out that most of the industrial applications along the lines of Taguchi focus on product or process development, and so are closer to the setting of developmental testing.
In selecting the settings of the controls in a factorial design, an experimenter must use some background information on what to expect. It would be useless to carry out an experiment in which all the values of a factor were too extreme or too similar. Thus the operational tester must depend on information accumulated from previous experience, for example, from developmental testing, at least to establish what constitutes a useful design from which an analyst, who may be willing to discard that previous history, can draw useful conclusions. To the extent that an experimenter depends on an educated intuition about likely outcomes of an experiment or appropriate models, he or she tends to be subjective. This subjective element can never be fully removed from the design of an experiment, and in the minds of many, not even from the analysis of the resulting outcomes.
BAYESIAN AND SEQUENTIAL EXPERIMENTAL DESIGNS
The theory of Bayesian inference deals with the formalism of subjective beliefs by assigning prior probabilities to such beliefs. With care, this theory can be used productively in both the design and analysis of experiments. One advantage of such a theory is that it forces users to lay out their assumptions explicitly. It also provides a convenient way of expressing the effect of the experiment in the user's posterior probability. Care is required, for priors that seemingly express general ignorance about a parameter sometimes assume much more information than the user thinks.
During World War II, a theory of sequential analysis was developed in connection with weapons testing. According to this theory, there is no point in proceeding further with expensive tests if the results of the first few trials are overwhelming. For example, if a fixed-sample-size test would reject a weapon that failed in 3 out of 10 trials, it would make sense to stop testing and reject as soon as a third failure occurs. This theory led to tests that were as effective as previous fixed-sample procedures, with considerable savings in the cost of experimentation.
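The flavor of the savings can be seen in a "curtailed" version of the fixed-sample rule, sketched below (our illustration, not the wartime procedure itself): stop at the third failure and reject, or at the eighth success and accept. Exhaustive enumeration confirms that the decision always matches the fixed 10-trial rule while the expected number of trials falls below 10.

```python
# Sketch (illustrative): curtailed sampling for the rule "reject iff 3 or
# more failures occur in 10 trials".  Outcomes are coded 1 = failure.
from itertools import product

def fixed_decision(seq):
    return "reject" if sum(seq) >= 3 else "accept"

def curtailed(seq):
    failures = successes = 0
    for n, outcome in enumerate(seq, start=1):
        failures += outcome
        successes += 1 - outcome
        if failures == 3:        # outcome determined: must reject
            return "reject", n
        if successes == 8:       # at most 2 failures possible: must accept
            return "accept", n
    # unreachable for length-10 sequences: one branch always triggers

q = 0.3                                    # assumed per-trial failure probability
expected_trials = 0.0
for seq in product((0, 1), repeat=10):     # enumerate all 2**10 outcome paths
    decision, n = curtailed(seq)
    assert decision == fixed_decision(seq) # same verdict as the fixed test
    prob = q ** sum(seq) * (1 - q) ** (10 - sum(seq))
    expected_trials += prob * n

print(round(expected_trials, 2))   # strictly fewer than 10 trials on average
assert expected_trials < 10.0
```

The failure probability q = 0.3 is an arbitrary value for the demonstration; the identity of decisions holds for every outcome sequence, regardless of q.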
Although the initial theory confined attention to experiments in which identical trials were repeated, the concept extends naturally to sequential experimentation. Here, after each trial or experiment, the analyst-designer can decide whether to stop experimentation and make a terminal decision, or to continue experimenting. If the decision is to continue, the analyst-designer can then elect which of the alternative trials or experiments to carry out next. Two-stage experiments, in which a preliminary experiment is devoted to gaining information useful for the design of a final stage, are special cases of sequential experimentation.
Finally, two active areas of research in experimental design (not specifically Bayesian or sequential) are the use of screening experiments, in which one wishes to discover which of many possible factors has a substantial effect on some response, and designs for testing computer software. The panel is interested in pursuing the application of these two new areas of research as they relate to operational testing.