Statistical Methods for Testing and Evaluating Defense Systems: Interim Report

Executive Summary

The Panel on Statistical Methods for Testing and Evaluating Defense Systems was formed to assess current practice related to operational testing in the Department of Defense (DoD) and to consider how the use of statistical techniques can improve that practice. This interim report has two purposes: (1) to provide the sponsor and the defense acquisition community with feedback on the panel's work to date and (2) to present our current approaches and plans so that interested parties can provide input—for example, for additional literature or expert testimony—for our final report. Since this report represents work in progress, it includes relatively few conclusions and no recommendations.

Chapters of this report describe our progress to date in five major areas being addressed by working groups of the panel: use of experimental design; testing of software-intensive systems; system reliability, availability, and maintainability; use of modeling and simulation in operational testing; and efforts to develop a taxonomic structure for operational testing of new defense systems. Also, as discussed below, we have been led to take a broader view of how operational testing fits into the acquisition process, with the possibility that we may identify areas in the larger acquisition process in which changes could make operational testing more informative.

The rest of this summary presents our key interim findings and outlines topics that the panel intends to consider further in the remainder of the study.

KEY ISSUES

Experimental Design

The goal of an operational test is to measure the performance, under various conditions, of diverse aspects of a newly developed DoD system to determine whether the system satisfies a number of criteria. Some of these conditions can be completely controlled, and some are not subject to control. Since the size, scope, and duration of operational testing are constrained by budgetary, legal, and time considerations, the sample size available for a test is typically small. Thus, there is a benefit in designing a test as efficiently as possible so it can produce useful information about the performance of
the system under the various conditions of interest. In its work to date, the panel has found much to commend about current practice in DoD operational testing. However, we do have several concerns related to experimental design:

Uninformative scenarios in comparative testing. The choice of test scenarios does not always reflect consideration of the relative strengths of a new system compared with an existing control in those scenarios (when a control is relevant). It is important to use a priori assessments of which scenarios will discriminate most in identifying situations in which a new system might dominate an existing system, or vice versa, in terms of performance.

Testing inside the envelope. A related concern is that operational testing tends to focus on the environments that are most typical in the field. Although this approach has the advantage that one directly estimates system performance in the more common environments, the disadvantage is that little is learned about performance when the system is strongly stressed.

Subjective scoring rules. Scoring rules with respect to which events are considered unusable are vaguely defined, as is precisely what constitutes a trial. Further, the definition of an outlier in an operational test is not always made as objectively as possible. Testers need to be more precise about the objective of each operational test. Sometimes, understanding the performance of the system in the most demanding of several environments is paramount, so that the objective is to estimate a lower bound on system performance; at other times, a measurement of the average performance of the system across environments is needed.

Measurement inefficiencies.
Data that measure system effectiveness, especially with respect to hit rates, are often treated as binary (zero-one) data, but such reduced data usually contain much less information than the original data on a continuous scale. For example, information on target miss distance can be used in modeling the variability of a shot about its mean, which in turn can be used to improve estimation of the hit rate.

Testing of Software-Intensive Systems

Defense systems are becoming increasingly complex and software-intensive. Early in the panel's work, it became clear that software is a critical path through which systems achieve their performance objectives. We therefore recognized the need for special attention to software-intensive systems and have sought to understand how operational testing of such systems is conducted across the military services. On the basis of our work to date, we note several concerns about current practice:

Barriers to effective software testing. Several barriers limit effective software test and evaluation. One important barrier is that DoD has not acknowledged or addressed the criticality of software to systems' operational requirements early enough in the acquisition process. There is a perception that software is secondary to hardware and can be fixed later. Also, we concur with the findings of others who have identified three related barriers: (1) DoD has not developed, implemented, or standardized decision-making tools and processes for measuring or projecting software system cost, schedule, and performance risks; (2) DoD has not developed a test and evaluation policy that provides consistent guidance regarding software maturity; and (3) DoD has not adequately defined and managed software requirements (U.S. General Accounting Office, 1993).

Evolutionary acquisition of software. In evolutionary acquisition, the software code that is evaluated in operational testing is being continuously changed, so that what is operationally tested is not
necessarily what is deployed. Thus, evolutionary acquisition may compromise the utility of operational testing. We plan to study this issue and its implications further.

System Reliability, Availability, and Maintainability

Considerations of operational suitability—including reliability, availability, and maintainability—are likely to have different implications for the design and analysis of operational tests than considerations of effectiveness, and consequently merit distinct attention by the panel in its work. Our overall goal in this area is to contribute to the improved use of statistics in reliability, availability, and maintainability assessment by reviewing best current practices within DoD, in other parts of the federal government, and in private industry, with respect to both technical aspects of statistical methodology and policy aspects of reliability, availability, and maintainability testing and evaluation processes. At this time, we make the following observations:

Variability in reliability, availability, and maintainability policy and practice. Considerable differences in organization and methodology exist among the services, as well as within the testing community in each service. Such differences may be partly attributable to variability in the training and expertise of developmental and operational testing personnel. A related concern is the reliance on standard modeling assumptions (e.g., exponentiality) in circumstances in which they may not be tenable; we are currently assessing the possible consequences for test design and evaluation.

No accepted set of best reliability, availability, and maintainability practices.
Efforts to achieve more efficient (i.e., less expensive) decision making by pooling data from various sources require documentation of the data sources and of the conditions under which the data were collected, as well as clear and consistent definitions of various terms. Such efforts underscore the potential value of standardizing reliability, availability, and maintainability testing and evaluation across the services and of encouraging the use of best current practices. Industry models include the International Organization for Standardization (ISO 9000) series and existing documents on reliability, availability, and maintainability practices in the automobile and telephone industries.

Use of Modeling and Simulation in Operational Testing

The panel's work in this area is intended to address how statistical methods might be used to assess the use of and to validate simulations for developmental or, especially, operational testing.1 It seems clear that few if any of the current collection of simulations were designed for use in developmental or operational testing; the original purpose was typically to assist in training and doctrine. Therefore, the primary question concerns the extent to which simulations, possibly with some adjustments and enhancements, can be used for testing purposes, with the important objectives of saving limited test funds, enhancing safety, effectively increasing the sample size of a test, and possibly permitting the extrapolation of test results to untested scenarios. Although we applaud efforts made throughout DoD to make operational testing more cost-effective through the use of simulation, we have identified several concerns:

1 We use the term “simulation” to mean both modeling and simulation.
Infrequent and sometimes incorrectly applied attempts at rigorous validation. Rigorous validation of simulations, although difficult or expensive, is absent or incorrectly applied in many operational testing applications. When applied, external validation can sometimes be used to overfit a model to field experience. In such cases, close correspondence between a “tuned” simulation and operational results does not necessarily imply that the simulation will predict performance well in any new scenario. The considerable literature on statistical validation of complex computer models apparently has not been effectively disseminated in the defense modeling community.

Little use of statistical methods in designing simulations and interpreting results. Statistical methods can be used in characterizing relationships between inputs and outputs, planning efficient simulation runs, interpolating results for cases that were not run, detecting and analyzing unusual cases (outliers), and estimating uncertainties in simulation results. We have seen little evidence of awareness and use of statistical methods for these purposes, and the DoD simulation policy and directives literature is generally deficient in its statistical content.

Impossibility of identifying the “unknown unknowns.” Although appropriately validated simulations can supplement the knowledge gained from operational testing, no simulation can discover a system problem that arises from factors that were not included in the models on which the simulation is built. Often, unanticipated problems become apparent only during field testing of the system in an operational environment (or, in some cases, after deployment).

Lack of treatment of reliability, availability, and maintainability. Many models and simulations used in the acquisition process apparently assume perfect availability and reliability of the system.
Also, this observation seems to hold more generally for other aspects of suitability, such as logistics support, interoperability, transportability, operator fatigue, and state-of-training realities. Despite their inherent limitations, simulations that purport to assess a system's operational value should incorporate, to the extent possible, estimates of the reliability, availability, and maintainability of that system.

TOPICS FOR FURTHER STUDY

Optimal allocation of test resources. An important general problem concerns how to allocate a small sample of test objects optimally to several environments so that the test sample is maximally informative about overall system performance. It may be possible to improve on the common practice of alternately varying one factor of a central test scenario. The panel is considering several alternative statistical approaches to this problem involving such techniques as multidimensional scaling, fractional factorial designs, and Bayesian methods (see Appendix C).

Training effects. Operational test results can be affected significantly by the degree of training that soldiers receive prior to testing and by player learning that occurs during the test. Therefore, we are examining the potential use of statistical methods, including trend-free and other experimental designs, to address and correct for these confounding effects.

Test sample size. With respect to evaluation, a real problem is how to decide what sample size is adequate for making decisions with a stated confidence. The panel will examine this question—sometimes referred to as “How much testing is enough?”—for discussion in our final report. This effort may involve notions of Bayesian inference, statistical decision theory, and graphical presentation of uncertainty.

Alternatives to hypothesis testing.
We firmly believe that applying the standard approach to hypothesis testing in the operational test context is not appropriate for several reasons, including the asymmetry of the approach and its limited value in problems involving small sample sizes. The primary objective of operational test and evaluation is to provide information to the decision maker in the form
that is most valuable in deciding on the next course of action. Therefore, in analyzing operational test results, one should concentrate on estimating quantities of interest and assessing the risks associated with possible decisions. We will continue to explore how decision-theoretic formulations might be used in evaluating operational test results.

Statistical applications in software testing. The panel believes that statistical methods can and should play a role in the testing and evaluation of software-intensive systems, particularly because not every user scenario can be tested. Consequently, in order to understand current practice in operational testing of software-intensive systems and how statistical methods might be applied, we continue to seek answers in this specific context to a set of general questions related to the statistical design and analysis of experiments: How does one characterize the population of scenarios to test and the environments of use? How does one select scenarios to test from the population? How does one know when to stop testing, and what are the stopping criteria? How does one generalize from the information gained during testing to the population of scenarios not tested? How does one plan for optimal use of test resources and adjust the test plan as the testing unfolds?

Implications of the “intended use” concept for software testing. A shift to a new paradigm is taking place, driven by the concept of “intended use” articulated in the definition of operational testing. To implement this paradigm, it would be necessary to prescribe certain criteria that, if met, would support a decision that the software (or system containing the software) is fit for field use.
These criteria might involve experiments, observational studies, or other means of evaluation, and they would have to be prescribed in technical detail—including specification of costs, schedules, and methods—thus establishing requirements and constraints on the design and development process.

Opportunities and methods for combining reliability, availability, and maintainability data. We have concluded that operational tests of combat equipment are not, as a rule, designed primarily with reliability, availability, and maintainability issues in mind. Addressing these issues typically involves experiments of longer duration than is feasible in operational testing. Consequently, “certification” of operational suitability may be better accomplished through other means of testing. For example, data collected during training exercises, developmental testing, component testing, bench testing, and operational testing, along with historical data on systems with similar suitability characteristics, might be appropriately combined in an inference scheme that would be much more powerful than schemes in current use. In future work, we will seek to clarify the role hierarchical modeling might play in reliability, availability, and maintainability inference from such a broadened perspective.

Prescriptions for use of modeling and simulation. In formulating a position on the use of simulation in operational testing, the panel will continue to seek answers to several specific questions: Can simulations built for other purposes be used in their present state in either developmental or operational testing? What modifications might generally improve their utility for this purpose? How can their results, either in original form or suitably modified, be used to help plan developmental or operational tests? How can the results from simulations and either developmental or operational tests be combined to obtain a defensible assessment of effectiveness or suitability?
These questions are all related to the degree to which simulations can approximate laboratory or field tests, and their answers involve verification, validation, and accreditation.

A taxonomic structure for defense systems. Various attributes of military systems require distinctive approaches to operational testing and to the application of statistical techniques. Because of the many different factors that must be considered in any particular test, the panel decided to undertake development of a scheme for classifying weapon systems and associated testing issues. Such a taxonomic structure, if developed, should satisfy three general objectives: (1) reflect the prevalence of various types of systems; (2) highlight attributes that might call for different statistical approaches,
affect decision tradeoffs, or involve qualitatively different consequences; and (3) facilitate the integration of commercial approaches by helping to align military and commercial contexts. Preliminary work on this topic suggests that producing a taxonomic structure will be difficult and that its appropriate scope, depth, and nature will depend strongly on its intended uses.

Broader issues concerning operational testing and defense acquisition. In assessing how to make optimum use of best statistical practices in operational testing, we have repeatedly been led to consider various aspects of the larger acquisition process of which operational testing is a part. For example, starting operational testing earlier in the acquisition process—an idea that has won support in the DoD community and among members of our panel—has implications for how statistical methods would be applied. Similarly, the operational test design or the evaluation of system performance might conceivably make use either of optimal experimental design methods that depend on parameters that must be estimated or of statistical techniques that “borrow strength” from data gathered earlier in the process. These approaches might use information from developmental testing, but concern about preserving the independence of operational and developmental testing could make such ideas controversial. Organizational constraints and competing incentives complicate the application of sequential testing methods, as well as some statistical ideas about how to allocate operational testing resources as a function of the total system budget. Further, ideas of quality management that have gained wide acceptance in industry seem relevant to the task of developing military systems, despite the obvious contextual differences, and the implementation of such ideas would require a complete understanding of the DoD acquisition process.
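To make the idea of “borrowing strength” concrete, one simple form it can take is a beta-binomial calculation in which discounted developmental-test results serve as a prior for a small operational-test sample. The sketch below is purely illustrative: the hit/miss counts, the discount weight, and the model itself are hypothetical assumptions for exposition, not methods endorsed by the panel.

```python
# Illustrative beta-binomial "borrowing strength" calculation.
# All counts and the discount weight below are hypothetical.

# Hypothetical developmental-test (DT) record: larger sample, collected
# under less operationally realistic conditions.
dt_hits, dt_trials = 42, 60

# Hypothetical operational-test (OT) record: small sample, realistic conditions.
ot_hits, ot_trials = 7, 10

# Discount the DT evidence to reflect doubts about its operational realism;
# w = 1 would pool the two data sets outright, w = 0 would ignore DT entirely.
w = 0.5

# Treat the discounted DT counts, added to a uniform Beta(1, 1) prior, as the
# prior for the OT hit probability.
alpha = 1 + w * dt_hits
beta = 1 + w * (dt_trials - dt_hits)

# Posterior mean of the hit probability after observing the OT data.
combined = (alpha + ot_hits) / (alpha + beta + ot_trials)

# OT-only estimate for comparison; it ignores the DT information entirely.
ot_only = ot_hits / ot_trials

print(f"OT-only estimate:  {ot_only:.3f}")
print(f"Combined estimate: {combined:.3f}")
```

Because the combined estimate shrinks the small operational-test proportion toward the discounted developmental-test rate, it has lower sampling variability, which is both the appeal of borrowing strength across test phases and the source of the independence concerns noted above.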
In future work, we hope to articulate general principles and a philosophy of managing information and quality, drawn from broad experience in industry and government.

The panel will continue to gather information from all military services, conduct additional site visits, examine industrial practice, study federal agencies that engage in product development and testing, and explore international experience in military testing and acquisition. We will seek to learn about the degree of statistical training in the test agencies and the expertise that agency personnel can draw on for difficult statistical issues. From our colleagues in the defense and statistical communities and from readers of this report, we welcome suggestions, especially advice concerning additional sources of information that would be useful in advancing our work.