Statistical Issues in Defense Analysis and Testing: Summary of a Workshop
Sources of Variability
The results from complex simulation models (such as those used in COEAs) and from tests conducted in a dynamic, high-dimensional environment (as performed in OT&E) have substantial variability. Sampling error (i.e., uncertainty that exists because only a sample of simulation runs or operational tests is executed) is often only a minor component of the overall variability. Many nonsampling sources of variability are often ignored because they are not well understood, they are elusive to characterize, or there is little or no information available about them. But only a thorough investigation and presentation of all the uncertainties associated with COEAs and with developmental and operational tests will allow decision makers to interpret the results in an informed way (see also the section below on communicating uncertainty to decision makers).
The assessment of operational effectiveness depends on model assumptions and extrapolations, some of which cannot be tested empirically. Under such circumstances, it is important to examine the degree to which model-based results are sensitive to changes in underlying assumptions. For problems in defense analysis and testing, it is desirable to have models that are (1) robust to plausible changes in assumptions, (2) accurate characterizations of real-world behavior, and (3) adequate for decision-making purposes. The first of these desiderata, robustness, is discussed in relation to sensitivity analysis (see the next section). The latter two give rise to concepts of model validation and accreditation that are discussed in a later section.

SENSITIVITY ANALYSIS AND ROBUSTNESS
Workshop participants noted the importance of sensitivity analysis in examining the effects of model assumptions on final results. Seglie (Appendix B) posed two key questions: How does one find the most sensitive assumptions? How does one define a robust measure of effectiveness? The second task might be reduced to the choice of a suitable parameterization. For example, should exchange ratios be considered only within a particular scenario instead of across a range of scenarios? These questions should ultimately be answered by military users who are informed by statistical thinking.
Staniec (Appendix B) cited the need for careful analysis of COEA sensitivity to changes in weapon system characteristics. The analysis of the Javelin system (see case study #1) involved many different scenarios and different combat simulation models, each of which is based on a series of assumptions. Introducing human beings into the operation leads to further complications. Closed-form simulations of combat typically do not model human decision processes well. However, with the recent development of “realistic” and real-time distributed simulation, it may be possible to conduct experiments using man-in-the-loop systems to quantify the benefits of various command-and-control system alternatives. Advanced distributed simulation methods will require more elaborate computer experiments, and sensitivity analysis will probably have to be built into the initial experimental design.
In addition to the variability in assessments of operational effectiveness, the uncertainties associated with cost estimates may be large but are typically not expressed in the COEAs performed for prospective defense systems. Defense analysts can provide useful information to decision makers by identifying key factors driving program costs and assessing the sensitivity of cost estimates to plausible changes in these key factors.
Kathryn Laskey observed that statistical methods may play an important part in examining the sensitivity of conclusions to assumptions made in the modeling process. Most combat simulations use nonlinear, deterministic models, and relatively small changes in inputs might lead to large changes in results. Also, output parameter values, such as weapon ranges or kill probabilities, will vary greatly under different conditions, not all of which are included in the model. In general, the variation of results from combat models is due to aggregation over factors included in the study but not in the model, to random error, and to factors not included in the study.
Standard sensitivity analysis involves varying input parameters one at a time and monitoring the corresponding changes in model outputs. Alternatively, more sophisticated multivariate methods are available. One alternative approach to sensitivity analysis is to use a variance components or a hierarchical Bayesian model in which input parameters and/or scenario variations are sampled from probability distributions that reflect a reasonable range of uncertainty about their values (see also the section below on linking information across the acquisition process).
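The contrast between one-at-a-time variation and sampling inputs from distributions can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the toy model `exchange_ratio`, its parameter names, the nominal values, and the input distributions are hypothetical and do not come from the workshop. The sampling step is plain Monte Carlo propagation, a simplification that stands in for, but is not, a full variance components or hierarchical Bayesian analysis.

```python
import random
import statistics

def exchange_ratio(p_kill, detect_range_km, threat_density):
    """Hypothetical toy response (illustrative only, not a real combat model)."""
    return (p_kill * detect_range_km) / (0.5 + threat_density)

# Assumed nominal operating point.
NOMINAL = {"p_kill": 0.6, "detect_range_km": 2.0, "threat_density": 1.0}

def one_at_a_time(model, nominal, rel_step=0.10):
    """Perturb each input by +/- rel_step while holding the others at nominal."""
    base = model(**nominal)
    effects = {}
    for name, value in nominal.items():
        high = model(**{**nominal, name: value * (1 + rel_step)})
        low = model(**{**nominal, name: value * (1 - rel_step)})
        effects[name] = (high - low) / base  # relative swing in the output
    return effects

def sample_outputs(model, n_runs=5000, seed=0):
    """Draw all inputs jointly from assumed distributions and collect outputs."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        outputs.append(model(
            p_kill=rng.uniform(0.4, 0.8),          # assumed range of uncertainty
            detect_range_km=rng.gauss(2.0, 0.3),   # assumed mean and spread
            threat_density=rng.uniform(0.5, 1.5),  # assumed scenario variation
        ))
    return outputs

oat_effects = one_at_a_time(exchange_ratio, NOMINAL)
outputs = sample_outputs(exchange_ratio)
output_mean = statistics.mean(outputs)
output_sd = statistics.stdev(outputs)
```

The one-at-a-time effects describe local behavior around the nominal point only, whereas the sampled outputs reflect the joint effect of all inputs varying together, including any interactions among them.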
Vijayan Nair suggested that sensitivity of response variables to small deviations in factors being modeled could be designed into the simulation experiment by either computing or simulating the relevant derivatives. The Defense Department presumably should be interested in studying sensitivity primarily in order to develop weapon system designs that are robust to deviations from nominal values of operating characteristics. Nair observed that Taguchi's approach to robust design, used primarily to reduce variation in industrial processes, may also be applicable to COEA. Robustness may be desirable because specifications for weapon systems are often not as rigid in actual operations as they appear to be in the acquisition process. Recent work on the design and analysis of complex numerical experiments (e.g., see Currin et al., 1991, and Sacks et al., 1989) is also relevant here.
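The idea of computing the relevant derivatives can be sketched with central finite differences around a nominal operating point. The function `response`, its two inputs, and the nominal values are hypothetical and chosen only so that the result is easy to check by hand; this is one common numerical technique, not necessarily the method Nair had in mind.

```python
def response(x):
    """Hypothetical deterministic simulation output for x = [speed, range_km]."""
    speed, range_km = x
    return speed ** 2 + 3.0 * speed * range_km

def central_gradient(f, x, rel_step=1e-4):
    """Approximate each partial derivative of f at x by a central difference."""
    grad = []
    for i, xi in enumerate(x):
        h = rel_step * max(abs(xi), 1.0)  # step scaled to the input's magnitude
        up = list(x)
        up[i] = xi + h
        down = list(x)
        down[i] = xi - h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

nominal = [2.0, 1.0]  # assumed nominal operating characteristics
gradient = central_gradient(response, nominal)
```

For this toy response the analytic partial derivatives at the nominal point are 7 and 6, which the numerical gradient reproduces closely. For a stochastic simulation, differences between runs would be noisy, so the derivatives would have to be estimated from replicated runs rather than read off a single pair.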
MODEL VALIDATION AND ACCREDITATION
As used by defense analysts, to validate is to determine the degree of conformance to reality (no model is ever completely valid), and to accredit is to determine that the model is appropriate for answering a particular question. Although these two concepts are clearly related, several authors (e.g., Hodges and Dewar, 1992) have pointed out that a model can fail a test of validity but still provide results that are adequate for decision-making purposes.
According to Patricia Sanders (Appendix B), validation is defined as the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model, i.e., just how good is the model? Model validation is frequently carried out by conducting “sanity checks” to confirm that the model is reasonable. Laskey observed that one can validate a whole model or its pieces; similarly, one can validate intermediate or final results.
Stephen Pollock termed the relationship between the model and the real world the property of veridicality. He argued that assumptions about the stochastic nature of scenarios (i.e., characteristics such as visibility and threat level may change over the course of a mission) and measures of effectiveness are perhaps the most crucial factors determining the veridicality of mathematical models of combat. Pollock expressed concern that inherent randomness is often eliminated by the use of expected values of parameters (rather than their distributions) as inputs and the reporting of expected values of measures of effectiveness (rather than their distributions) as outputs.
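Pollock's concern can be made concrete with a small sketch: when a measure of effectiveness is a nonlinear function of a random input, evaluating it at the input's expected value generally gives a different answer from averaging it over the input's distribution, and both single numbers hide the spread of outcomes. The detection function `p_detect` and the two-valued visibility distribution are hypothetical, chosen only to make the effect visible.

```python
import math
import random
import statistics

def p_detect(visibility_km):
    """Hypothetical nonlinear measure of effectiveness: detection probability."""
    return 1.0 - math.exp(-visibility_km / 2.0)

rng = random.Random(1)
# Assumed scenario variability: visibility is either poor or good, equally likely.
visibility = [rng.choice([0.5, 7.5]) for _ in range(10000)]

# Approach 1: plug the expected input into the model.
moe_at_mean_input = p_detect(statistics.mean(visibility))

# Approach 2: propagate the whole input distribution through the model.
moe_samples = [p_detect(v) for v in visibility]
mean_of_moe = statistics.mean(moe_samples)
moe_low, moe_high = min(moe_samples), max(moe_samples)
```

Because `p_detect` is concave, the value computed from the mean input overstates the average detection probability, and neither summary reveals that the simulated missions split into a low-detection group and a high-detection group.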
Among statisticians, George Box has written extensively on the process of model-building (see, e.g., Box et al., 1978). Box (1980) describes scientific learning as advancing by a practice-theory iteration. Statistical inference plays a formal role through model criticism (similar to Sanders's use of the term validation) and estimation. Model criticism is carried out using the predictive distribution to determine, under an assumed model, the plausibility of observed experimental results. Observed results that are implausible, in turn, cast doubt on the assumed model.
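A rough sketch of model criticism in this spirit: simulate the predictive distribution of a test outcome under the assumed model, then ask how plausible the observed outcome is. The Beta prior, the shot counts, and the function name `predictive_tail_prob` are hypothetical, and the lower-tail probability used here is a simplification of Box's approach rather than a reproduction of it.

```python
import random

def predictive_tail_prob(observed_hits, n_shots, prior_a, prior_b,
                         n_sims=20000, seed=2):
    """Estimate P(hits <= observed_hits) under the assumed model's predictive distribution."""
    rng = random.Random(seed)
    at_or_below = 0
    for _ in range(n_sims):
        # Draw a hit probability from the assumed Beta prior.
        p_hit = rng.betavariate(prior_a, prior_b)
        # Draw a test outcome (number of hits) given that hit probability.
        hits = sum(rng.random() < p_hit for _ in range(n_shots))
        if hits <= observed_hits:
            at_or_below += 1
    return at_or_below / n_sims

# Hypothetical numbers: the model asserts a hit probability near 0.8,
# encoded as Beta(16, 4), but the test produced only 4 hits in 20 shots.
tail = predictive_tail_prob(observed_hits=4, n_shots=20, prior_a=16, prior_b=4)
```

A very small tail probability indicates that the observed result would be implausible if the assumed model were correct, which is the signal that casts doubt on the model.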
Sanders (Appendix B) also introduced the term model verification to describe an additional (earlier) stage in the development and use of computer models. Verification in this context refers to the process of determining that a model implementation accurately represents the developer's conceptual description and specifications (i.e., does the model do what you think it does?). Based on this description, verification would seem to include debugging exercises to ensure that the computer software is faithfully implementing model algorithms. In discussing simulation modeling of defense systems, it may prove useful to regard verification, as defined above, as a concept closely related to, but distinct from, model validation.
Sanders described the process of model accreditation as an assessment of the risks associated with the model's employment. Such a risk assessment requires consideration both of the probabilities of model errors leading to incorrect decisions and of the consequences of incorrect decisions. The ultimate goal is the mutual agreement by the analyst and the decision maker on the extent to which the model can be the basis of decision (Sanders, Appendix B). She identified four steps in the accreditation process of a simulation model:
(1) Clear identification of what decision is to be made, including specification of the measures of effectiveness or other response variables involved in the simulation.
(2) Determination of the model dimensions (e.g., input parameters, assumptions) that are relevant for each measure of effectiveness.
(3) Assessment of the sensitivity of the measure of effectiveness to each relevant model dimension.
(4) Judgment about the credibility of each relevant model dimension, given the importance and sensitivity of the measure of effectiveness to that dimension.
As defined above, accreditation is a metric-free construct, lacking a commonly accepted measurement scale along which models might be ordered or acceptability thresholds established. The importance and level of effort applied to accreditation will be driven by the analyst's and decision maker's perceived importance of the model's use, and the associated risk will depend on internal model characteristics and on the decision to be informed by the model.

DISCUSSANTS
Commenting on Sanders's paper, Arthur Dempster supported the objectives of model accreditation, although he stressed the importance of distinguishing between standards (or meta-standards) for the processes of scientific modeling and standards applied to individual analyses. He conjectured that improving the decision-making process will require project managers to understand models and the data representations and arguments that support them. He believes that accreditation is needed in the processes of statistical sciences, including expositions of modeling, experimental design, and risk assessment. Dempster also suggested that a second type of accreditation be developed with regard to the training and certification of personnel, both analysts and consumers of analyses.
James Hodges observed that the paper by Sanders suggests a more ambitious agenda of representing all uncertainty about model outputs in terms of probability distributions. Accreditation, then, would be a determination that the uncertainty associated with a model's output is small enough for the purpose at hand. He applauded movement in this direction, but noted that progress will require solutions to difficult technical problems. See Hodges and Dewar (1992) for a discussion of the relationship between model accreditation and model validation. Depending on the context, it is possible for a model to score low on validation but still pass the accreditation test.
Hodges pointed out that, in modeling combat (a complex phenomenon about which few reliable data exist), the largest contributors to uncertainty are omitted factors, the effects of which are difficult or impossible to account for in an objective manner. This observation underlines the importance of considering both sampling and nonsampling sources of error. In the context of OT&E, nonsampling sources of error include the possible biases associated with testing that employs selected golden crews, that imposes too little stress on equipment and personnel, or that is potentially unrealistic because the testing is independent of related systems.
The assessment and propagation of model uncertainty have received attention in fields outside the defense industry. Dempster alluded to similar efforts in assessing reactor safety within the commercial nuclear energy industry. Early references in that field include Parry and Winter (1981) and Cox and Baybutt (1981). Hodges (1987) contains a general discussion of model uncertainty and refers to applications in law (capital punishment) and medicine (dose response).
As a matter of good practice, various statistical approaches to a particular problem should be pursued. For example, many workshop participants expressed the belief that increased use of hierarchical Bayesian methods (discussed in a later section) would prove valuable in portraying uncertainty and in characterizing the state of knowledge in defense analysis and testing problems. The structure of hierarchical models would permit the merging of expert opinion with sparse data and the pooling of information from multiple studies. Bayesian methods can also be applied in a straightforward manner to analyze accumulating bodies of data.
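How expert opinion and sparse data might be merged can be sketched with a simple normal-normal model, in which each study's estimate is pulled toward a prior mean in proportion to how imprecise that study is. The estimates, standard errors, and prior values below are hypothetical, and fixing the prior mean and standard deviation is a simplification: a full hierarchical Bayesian analysis would also place distributions on those quantities and estimate them from the data.

```python
def shrink_toward_prior(estimates, std_errors, prior_mean, prior_sd):
    """Posterior mean and sd of each study effect under a normal-normal model.

    Assumes y_i ~ Normal(theta_i, se_i^2) and theta_i ~ Normal(prior_mean, prior_sd^2).
    """
    results = []
    prior_precision = 1.0 / prior_sd ** 2
    for y, se in zip(estimates, std_errors):
        data_precision = 1.0 / se ** 2
        post_var = 1.0 / (data_precision + prior_precision)
        # Precision-weighted average of the study estimate and the prior mean.
        post_mean = post_var * (data_precision * y + prior_precision * prior_mean)
        results.append((post_mean, post_var ** 0.5))
    return results

# Hypothetical hit-rate estimates from three small tests, with standard errors.
estimates = [0.90, 0.55, 0.70]
std_errors = [0.20, 0.05, 0.10]
# Assumed expert opinion: hit rate about 0.70, give or take 0.10.
posterior = shrink_toward_prior(estimates, std_errors, prior_mean=0.70, prior_sd=0.10)
```

The imprecise first estimate moves most of the way toward the prior mean, while the precise second estimate moves only slightly, which is the sense in which such models let sparse data borrow strength from other information.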