The Panel on Statistical Methods for Testing and Evaluating Defense Systems was formed to assess current practice related to operational testing in the Department of Defense (DoD) and to consider how the use of statistical techniques can improve that practice. This chapter describes the scope of the panel's work. It begins with a hypothetical example of the complex task of operational test design, implementation, and evaluation. The panel then discusses changes in the DoD acquisition process that guided its expanded scope of work. It concludes by detailing how statistics can effectively contribute to acquiring high-quality defense systems.
A HYPOTHETICAL SCENARIO IN THE TESTING AND EVALUATION OF A COMPLICATED SYSTEM
Consider the following scenario. A new major has just rotated into a service test agency and has been assigned to oversee the operational test design and subsequent evaluation of the test results of QZ3, a major military system. The results of the test are to inform the assessment of the system's operational readiness. Operational readiness is determined by the system's performance in an operational setting, which itself can only be measured in the military contexts of training, doctrine, and the scale in which the system is to be used. Of necessity, operational testing will be limited in time, the level of realism that can be attained, and the quality or quantity of measurement data obtainable from what at best is an adequate approximation of the chaotic and uncontrollable environment in which the system will actually be used. In addition, the size of operational
testing is limited by the level of funding set by the system's program manager. The test report will help decide (along with other considerations, especially cost) whether the system should enter full-rate production, continue to be developed, or be terminated. If it is ready for full-rate production, the costs of production are likely to exceed $5 billion.
QZ3 is a substantial modification of an existing system, QZ2. One modification incorporates a substantial amount of new software, requiring more than 1 million new lines of code. QZ3 is designed for four different threats, three different types of engagement scenarios, and, possibly, three different environments (e.g., snow background, verdant temperate, and desert). There are 15 different measures of performance or effectiveness that will be used to evaluate QZ3, linked only loosely to the higher-level mission outcomes. For each of the 15 measures, and for each combination of threat, engagement scenario, and environment, the major must evaluate and compare the relative performance of QZ3 and QZ2, with the ultimate goal of answering: "Does the test confirm that a unit equipped with QZ3 accomplishes its mission better than one equipped with QZ2?"
To be approved for full-rate production, QZ3 must outperform QZ2 in several of the 15 measures, and it cannot perform poorly in others. Unfortunately, many of the performance requirements were vaguely worded in the Operational Requirements Document (ORD) and are essentially untestable in their original form. For example, the ORD states that the new system should be "easier to use," and a crucial component should have a reliability of 1 failure for every 2,000 hours, yet the total time on test is limited to 600 hours. Because such requirements are not testable, the major has to negotiate with various parties to specify new requirements that are testable. Some of the 15 measures relate to how effective the system is in carrying out its function (system effectiveness), and some relate to how long or often the system is in a usable state (system suitability or reliability, maintainability, and availability). Moreover, a test designed to measure effectiveness by itself may be substantially different from one designed to measure only suitability.
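A back-of-the-envelope calculation shows why the 2,000-hour reliability requirement is effectively untestable in 600 hours. The sketch below assumes exponentially distributed failure times (an assumption ours, not the ORD's), under which the number of failures observed is Poisson distributed:

```python
import math

# Figures from the scenario: required mean time between failures (MTBF)
# of 2,000 hours, but only 600 hours of total test time available.
mtbf_required = 2000.0
test_hours = 600.0

# Assuming exponentially distributed failure times, the number of
# failures seen in 600 hours is Poisson with mean 600/2000 = 0.3.
expected_failures = test_hours / mtbf_required
p_zero_failures = math.exp(-expected_failures)

print(f"Expected failures on test: {expected_failures:.2f}")
print(f"P(no failures | requirement exactly met): {p_zero_failures:.2f}")

# A system with only half the required reliability (MTBF = 1,000 hours)
# still shows no failures more than half the time:
p_zero_if_bad = math.exp(-test_hours / 1000.0)
print(f"P(no failures | MTBF only 1,000 hours): {p_zero_if_bad:.2f}")
```

Under these assumptions, a fully compliant system most likely produces no failures at all during the test, and so does a system that misses the requirement by a factor of two; 600 hours simply cannot discriminate between them.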
To further complicate his task, the major is faced with numerous difficulties when trying to implement the QZ3 tests, including: scheduling of test facilities and soldiers; the availability and use of threats; the availability of environments; and safety, noise, and other environmental restrictions. In addition, the test runs are very complicated to coordinate, and uncontrolled events in the field often cause atypical outcomes that must be identified and analyzed separately. The entire test and evaluation process must be carried out in less than 2 years, including 2-3 months to evaluate the test data and write the summary test report. The report will focus on several key results from various statistical significance (or hypothesis) tests for differences of means or differences of percentages for individual measures of effectiveness or performance.
The program manager had originally budgeted for 15 test events, with 5 test prototypes each. This permits a maximum of 75 replications to be distributed across the 36 possible test conditions (4 threats × 3 engagement scenarios × 3 environments). Thus, only an incomplete assessment of variability across conditions and prototypes (potentially substantially incomplete) is possible. Given unanticipated system development costs, there is pressure to reduce the number of test replications. Various earlier developmental tests of this system, not conducted under operationally realistic conditions, have given results that differ substantially among themselves, so there is a general reluctance to use these earlier test data to supplement the information from operational testing. Simple pooling of the earlier data with operational test data would require unreasonable assumptions (e.g., that operational conditions have no effect on performance), so combining them would require sophisticated statistical models that can accommodate the effects of different test circumstances. Furthermore, it is difficult to use information from developmental and operational testing and from field performance of similar systems, either to help design QZ3's operational test or to supplement its evaluation, because neither the conditions under which the developmental data were collected nor the developmental and operational test results themselves are archived.
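The arithmetic of spreading 75 replications over the full factorial of test conditions can be made concrete with a short sketch (the condition labels are illustrative):

```python
from itertools import product

threats = [f"threat_{i}" for i in range(1, 5)]          # 4 threats
engagements = [f"engagement_{i}" for i in range(1, 4)]  # 3 engagement scenarios
environments = ["snow", "verdant", "desert"]            # 3 environments

# Full factorial: every combination of threat, engagement, environment.
cells = list(product(threats, engagements, environments))
print(f"Full factorial: {len(cells)} conditions")

# Spreading 75 replications as evenly as possible across 36 conditions:
replications = 75
base, extra = divmod(replications, len(cells))
print(f"{base} replications per condition, with {extra} conditions "
      f"receiving one extra")
```

At roughly two runs per condition, within-condition variance estimates are extremely imprecise, which is why fractional designs and models that borrow strength across conditions become attractive.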
Given these difficulties in augmenting operational test information, it is hard to see how confirmatory testing with such small sample sizes, in the form of significance testing even at the relatively lax levels of statistical confidence typically used for this purpose (80 percent), could be accomplished. Statistical theory does offer experimental designs that use relatively small sample sizes efficiently. However, constraints and uncontrolled events complicate the design (and subsequent analysis) and often require non-standard techniques that demand substantial technical training to understand or develop. At the analysis stage, test events that resulted from uncontrolled circumstances may need to be removed, and the vast amount of collected data must be organized and summarized before statistical significance tests are carried out. Even if the major were comfortable using various exploratory data analytic and multivariate techniques for the QZ3 analysis, there is little time to determine whether a method can be used and, if not, to develop an alternative approach.
The decision rule for deciding whether QZ3 is ready for full-rate production generally does not accommodate information about conditional satisfaction of requirements, such as satisfaction conditional on threat, type of engagement, environment, or other covariates of interest. Nor does it indicate how to deal with partial satisfaction of requirements. For example, the evaluator is given no guidance on what to do if only various subsets of the 15 requirements for QZ3 are met in various environments or against various threats, nor was the contractor told how such trade-offs would be made. The production decision must, of course, be made, so the evaluator must commit.
As part of QZ3's suitability study, it is also necessary to test the new software. In less than 2 years, a small staff must check a "representative" sample of 1.5 percent of the lines of code (approximately 15,000 lines) for errors. There are no precise rules about structuring the code in a way that would facilitate this operation.
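The limits of inspecting a 1.5 percent sample can be quantified with the "rule of three," which gives an approximate 95 percent upper confidence bound of 3/n on the defect rate when zero defects are found in n sampled items. The rule assumes the lines are sampled at random and defects occur independently, assumptions that rarely hold exactly for code, so the sketch below is only indicative:

```python
# Figures from the scenario: 1 million lines of new code,
# 1.5 percent (15,000 lines) inspected.
total_lines = 1_000_000
sampled_lines = 15_000

# Rule of three: zero defects in n sampled lines gives an approximate
# one-sided 95% upper confidence bound of 3/n on the per-line defect rate.
upper_rate = 3 / sampled_lines
implied_defects = upper_rate * total_lines

print(f"95% upper bound on defect rate: {upper_rate:.5f} per line")
print(f"Consistent with up to ~{implied_defects:.0f} defective lines overall")
```

Even a completely clean inspection of 15,000 lines is statistically consistent with some 200 defective lines remaining in the full code base.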
Given the complexities and restrictions that are built into the current process (e.g., time, sample size, combination of relevant information, etc.), it is hard to even conceive of a series of tests with a rigorous statistical basis that could confirm that QZ3 was superior to QZ2. Yet with respect to the decision, the stakes are extremely high, immediately in terms of money and perhaps in the long run in terms of the best defense of the United States. The result is a problem that would require sophisticated statistical reasoning, coupled with insightful judgment of the real needs of the military user, as well as intimate knowledge of the system, its track record, and related systems being developed now or in the past. A tall order for our major.
The major, because of particular interest in statistical problems, took one course beyond the required 1-year undergraduate sequence in probability and statistics. However, even 1-1/2 years of undergraduate training is not likely to provide an understanding of the distinctions among assuming an exponential distribution for the waiting times to system failures, using a Weibull distribution to model those waiting times, or modeling failure times as a non-homogeneous Poisson process, distinctions that might have important implications for how long to test the system for suitability (see Chapter 7). The major also had an undergraduate course in systems engineering, but it is not likely to have provided a full understanding of how to test sophisticated software for errors.
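These distinctions matter because models that agree on the mean time to failure can disagree sharply about behavior over a test of fixed length. A sketch, using the scenario's 2,000-hour figure and an illustrative Weibull shape parameter:

```python
import math

def reliability_exponential(t, mean):
    """P(no failure by time t) under an exponential failure-time model."""
    return math.exp(-t / mean)

def reliability_weibull(t, shape, scale):
    """P(no failure by time t) under a Weibull failure-time model."""
    return math.exp(-((t / scale) ** shape))

mean_life = 2000.0  # hours, as in the scenario's requirement

# Choose the Weibull scale so its mean matches the exponential's:
# mean = scale * Gamma(1 + 1/shape); for shape 0.5, Gamma(3) = 2.
shape = 0.5
scale = mean_life / math.gamma(1 + 1 / shape)

t = 600.0  # total test hours
print(f"P(survive {t:.0f} h), exponential model: "
      f"{reliability_exponential(t, mean_life):.2f}")
print(f"P(survive {t:.0f} h), Weibull model, same mean: "
      f"{reliability_weibull(t, shape, scale):.2f}")
```

Two models with identical 2,000-hour mean lives predict very different probabilities of surviving a 600-hour test, so the choice of model directly affects how a fixed test length should be interpreted.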
The above scenario describes many difficulties often encountered by those responsible for designing, carrying out, evaluating, and presenting results from operational tests on defense systems. Several, but not all, of the problems that are (implicitly) raised here are addressed in this report, but the key message is that the individual in charge of an operational test has an extremely difficult job, which is made even more difficult by the fact that defense system testing and evaluation has an essential statistical component.
Yet the hypothetical major need not be a statistician (or systems engineer), since the person responsible for test design and evaluation has to be knowledgeable about the system being tested, the military application, the management of a large enterprise, and the various DoD rules and regulations governing testing and acquisition. Indeed, the first priority of operational testing should be to conduct a comprehensive series of tests under realistic and relevant conditions and environments; statisticians are generally not qualified to implement such tests. Without a well-designed, realistic test, any post-test statistical analysis, whether using standard or sophisticated techniques, can easily lead to incorrect decisions. It is important to emphasize, though, that most of the statistical methods the major would need are not immediate applications of standard formulas and procedures. Many of the problems that would occur—such as trend-free designs to accommodate player learning, analysis of non-orthogonal experimental designs, analysis
of variance with missing values, censored observations, combination of information across different types of experiments, analysis of failures with non-exponential lifetime distributions and small sample sizes—are non-standard statistical problems. It would not be reasonable to expect the major to possess this level of analytical expertise. It would be reasonable to expect the major to have such expertise available, whether through civilian employees, consultants, or ties to academia. Expertise of this kind exists in the national laboratories, federally funded research and development centers, and support contractors, but it is not used as often as it should be: the supply of expertise is limited, the major may not know how to obtain help, and the statisticians who are available may be unfamiliar with, or uninterested in, the system under test. These issues are discussed in Chapter 10.
Would more statistical expertise throughout the acquisition process result in superior decision making? If so, how should it be accomplished? What aspects of operational testing are likely to be improved through the application of state-of-the-art statistical techniques? These are some of the questions that we examine in this report. To complete the picture of why the answers to these questions are so important to defense system acquisition, we first examine the changes that have been occurring in the acquisition community that led to our study.
DYNAMICS IN THE ACQUISITION OF MILITARY SYSTEMS
The military systems under development and the development and testing process for military acquisition are continuing to change. Because of those changes, DoD will have to rethink the way that tests are designed, systems are evaluated, and, possibly, how the acquisition process is structured.1 These changes highlight the importance of statistical expertise within DoD.
Decreased Testing Budgets The current DoD budget is much smaller than it was 10 years ago, and there have been substantial budget cuts for testing (see U.S. Department of Defense, 1997). (Decreased testing budgets have also reduced government developmental testing.) Consequently, tests are often smaller,
shorter, and must be executed with fewer prototypes. The Director, Operational Test and Evaluation (DOT&E) must by law determine the number of low-rate production items needed for test, but this determination is made without clear, published guidelines. More sophisticated statistical methods, however, can help make the most effective use of whatever resources are available. When appropriate, methods for combining test data with information from other sources—including developmental test results, knowledge about similar systems, and modeling and simulation results—can provide additional information for decision making. Statistical decision theory can also help determine cost-effective test budgets.
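As a minimal illustration of combining information, the sketch below pools a developmental-test estimate with an operational-test estimate by inverse-variance weighting, after inflating the developmental variance to reflect its less realistic conditions. All the numbers, and the amount of variance inflation, are purely illustrative; the statistical models the panel has in mind are considerably more sophisticated:

```python
def combine_estimates(est_a, var_a, est_b, var_b):
    """Inverse-variance (precision-weighted) combination of two
    independent estimates of the same quantity."""
    w_a, w_b = 1 / var_a, 1 / var_b
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_var = 1 / (w_a + w_b)
    return combined, combined_var

# Illustrative numbers: a developmental-test hit rate with its variance
# inflated to discount its less realistic conditions, and a small but
# operationally realistic sample.
dev_est, dev_var = 0.70, 0.02   # developmental test, variance inflated
op_est, op_var = 0.60, 0.01     # operational test, fewer trials

pooled, pooled_var = combine_estimates(dev_est, dev_var, op_est, op_var)
print(f"Pooled estimate: {pooled:.3f} (variance {pooled_var:.4f})")
```

The pooled variance is smaller than either input variance, which is exactly why combining information is attractive when operational sample sizes are small; the hard part, addressed later in the report, is justifying how much to discount the less realistic data.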
More Complicated Systems Today's military systems are growing more complex, as modern warfare requires greater effectiveness and flexibility in defense systems. Complicated systems mean more measures of performance and effectiveness, which increases the complexity of test design and test evaluation, which in turn requires sophisticated statistical analysis.
More Software-Intensive Systems The automation of system control and the greater use of information management, fault checking, and other software-supported features have substantially increased the number of software-intensive systems and the size of the software code embedded in them. Nearly all ACAT I systems that have gone to full production in the last 3 years (systems estimated to cost more than $2.135 billion in fiscal 1996 constant dollars) have had a substantial software component. More software-intensive systems therefore require the latest techniques in software engineering, which in turn depend on up-to-date statistical techniques.
More Upgrades to Systems, "Evolutionary Procurement" As the development process, from concept formulation to full production, becomes more complicated and costly, more development programs have become upgrades of existing systems. For example, the Longbow Apache helicopter is a substantial upgrade to the original Apache helicopter, primarily through the addition of a radar system.
Evolutionary procurement is an acquisition concept developed to cope with difficulties inherent in procuring very complex systems, especially systems with a significant software component. In evolutionary procurement, a system is procured in stages, with additional functionality and features provided at each stage. An evolutionary procurement cycle can thus be regarded as a series of rapid upgrades. More system upgrades increase the need for archived information, both from previous stages in a system's development and from related systems, for use in test design, together with the ability to combine that information appropriately.
Greater Interest in System Reliability, Availability, and Maintainability Many of the changes and trends just described (especially decreased defense budgets), together with the potential for retaining systems in the military forces for more years, have created greater interest in developing systems that have higher reliability and require less maintenance. This increased emphasis on reliability, availability, and maintainability underscores the importance of test design and evaluation that places greater weight on reliability considerations.
These changes create major challenges for operational testing to meet its goals to ensure the effectiveness and suitability of defense systems. This report identifies (see Chapters 5-8) a number of ways that statistical analysis and statistical methods can contribute to more cost-effective defense testing. As a result of this examination and related implications, the report also considers changes beyond the constraints of the existing acquisition process.
PANEL'S CHARGE AND SCOPE OF WORK
The panel was charged with examining the applicability of statistical methods to defense system testing and evaluation, particularly operational test design and evaluation. The statement of work reads:
The panel would explore (1) developing measures of effectiveness, (2) designing operational tests and experiments with guidelines for determining the extent and types of testing required, (3) developing tests and models that incorporate information from previous analyses in the acquisition process, and (4) representing and characterizing uncertainties in presenting results. In addition to making recommendations on how to improve operational testing under current requirements and criteria, the panel would also consider whether and to what extent technical criteria and organizational and legal requirements for testing and evaluation constrain optimal decision making.
The panel focused on statistical techniques in four primary areas: (1) assessment of reliability, availability, and maintainability; (2) use of modeling and simulation; (3) methods for software testing; and (4) use of experimental design techniques. We believe that our recommendations for improvements in these areas could be implemented immediately.
In the course of its work, however, the panel discovered that it was necessary to expand its charge. To provide the best advice to DoD, the panel examined the acquisition process as a whole and considered changes that might better support information gathering and, therefore, the development of more effective and suitable systems. The panel also examined the application of statistical science to operational testing, specifically, the process that leads to the decision whether to let a system proceed to full-rate production. The small sample sizes used in operational tests led the panel to consider whether and how developmental test activities (and operational tests for earlier systems) could be used to augment
operational testing by combining information across experiments. Operational test design and evaluation is more efficient when based on information from earlier tests, including developmental tests. Developmental and earlier test results provide information on the problems a system might have, estimates of variances for determining the statistical power a planned operational test would have, and indications of which distributional assumptions about performance measures are likely to be valid.
Questions about the appropriate size of operational tests raise further issues: How are operational test budgets determined? What is the purpose of operational testing in defense system acquisition: that is, how is it established that a system passes or fails an operational test, and what happens when a system fails? Recent trends in industry point to a procurement philosophy in which testing is viewed as an integral part of an acquisition process focused on designing and building quality into a product. In this view, end-of-the-line inspection is de-emphasized in favor of continuous information gathering to find and fix problems before a product is complete. The panel decided that it would be valuable in meeting its charge to consider whether these industry trends can be adapted to the DoD environment and, if so, whether they would yield improvements in quality and cost-effectiveness similar to those observed in industry. As a result, our primary task naturally generalized to asking whether statistical principles could be used more effectively as an integral part of the broader defense acquisition process.
Statistics is more than a collection of individual techniques; it is a science concerned with the efficient use of information for effective decision making. The panel therefore examined whether the broader acquisition process is consistent with statistical principles that have proved useful in system development in other contexts. For example, when attempting to take uncertainty in operational tests into consideration, the acquisition process relies on significance testing as a key decision-making tool.2 We detail the limitations of this approach and suggest additional methods to enhance decision making. The panel also examined the constraints in the acquisition process that limit the use of statistical methods or broad statistical principles in operational testing and in associated operations (including developmental testing and the development of requirements). The panel acknowledges that its recommendations in this area involve organizational change, whose implementation requires careful thought and planning and will take considerable time and effort. In the panel's view, the potential benefit is well worth that effort.
Many defense systems are extraordinarily complex and have idiosyncratic characteristics, at times using novel technologies in unique situations. It could be
premature to draw conclusions on the basis of the panel's in-depth examination of only a few systems. However, we believe that the systems we have studied are illustrative of best current practice. We are confident that our recommendations have applicability beyond the systems that we examined.
The defense testing community views statistics mainly as a matter of estimating means or percentages and testing hypotheses. These are important and valuable tools, but they are often too simple for the important and complex issues, involving enormous costs, that arise in defense testing. Even when these tools are used, the expected costs and benefits of each competing strategy must be evaluated and brought to bear in deciding what experimentation to do and how much to spend on it.
The panel has been told many times that the "statistical answer" is prohibitively expensive and so cannot be used. This often reflects severely reduced budgets determined without proper consideration of costs and benefits. The response should not be to disregard the statistical approach, but to use it either to evaluate what is the best that can be done under the given financial constraints, or to reset the constraints.
The complex problems arising in operational testing require sophisticated, well-trained statisticians who understand how to develop methods appropriate to the underlying problems.
The report outlines the statistical methods that, in both the short and the long term, will have a dramatic effect on the quality of the defense systems DoD acquires and on the costs of acquiring them. These include new approaches to software testing; new methods of test design for reliability, availability, and maintainability; methods for validating modeling and simulation; and methods for obtaining more information from operational and other testing. They also include a new view of the role of testing in military system development. In addition, the panel believes that regular interaction between the statistical and defense testing communities must be established so that new statistical techniques and applications can be developed specifically to meet the challenges unique to defense testing and evaluation.