
## PART I: APPLICATIONS OF STATISTICAL PRINCIPLES TO DEFENSE ACQUISITION

The National Academies of Sciences, Engineering, and Medicine
500 Fifth St. N.W. | Washington, D.C. 20001


## 1 Introduction

The Panel on Statistical Methods for Testing and Evaluating Defense Systems was formed to assess current practice related to operational testing in the Department of Defense (DoD) and to consider how the use of statistical techniques can improve that practice. This chapter describes the scope of the panel's work. It begins with a hypothetical example of the complex task of operational test design, implementation, and evaluation. The panel then discusses changes in the DoD acquisition process that guided its expanded scope of work. It concludes by detailing how statistics can effectively contribute to acquiring high-quality defense systems.

### A HYPOTHETICAL SCENARIO IN THE TESTING AND EVALUATION OF A COMPLICATED SYSTEM

Consider the following scenario. A new major has just rotated into a service test agency and has been assigned to oversee the operational test design and subsequent evaluation of the test results of QZ3, a major military system. The results of the test are to inform the assessment of the system's operational readiness. Operational readiness is determined by the system's performance in an operational setting, which itself can only be measured in the military contexts of training, doctrine, and the scale in which the system is to be used. Of necessity, operational testing will be limited in time, the level of realism that can be attained, and the quality or quantity of measurement data obtainable from what at best is an adequate approximation of the chaotic and uncontrollable environment in which the system will actually be used. In addition, the size of operational
testing is limited by the level of funding set by the system's program manager. The test report will help decide (along with other considerations, especially cost) whether the system should enter full-rate production, continue to be developed, or be terminated. If it is ready for full-rate production, the costs of production are likely to exceed \$5 billion.

QZ3 is a substantial modification of an existing system, QZ2. One modification incorporates a substantial amount of new software, requiring more than 1 million new lines of code. QZ3 is designed for four different threats, three different types of engagement scenarios, and, possibly, three different environments (e.g., snow background, verdant temperate, and desert). There are 15 different measures of performance or effectiveness that will be used to evaluate QZ3, linked only loosely to the higher-level mission outcomes. For each of the 15 measures, and for each combination of threat, engagement scenario, and environment, the major must evaluate and compare the relative performance of QZ3 and QZ2, with the ultimate goal of answering: "Does the test confirm that a unit equipped with QZ3 accomplishes its mission better than one equipped with QZ2?" To be approved for full-rate production, QZ3 must outperform QZ2 on several of the 15 measures, and it cannot perform poorly on the others.

Unfortunately, many of the performance requirements were vaguely worded in the Operational Requirements Document (ORD) and are essentially untestable in their original form. For example, the ORD states that the new system should be "easier to use," and that a crucial component should have a reliability of 1 failure for every 2,000 hours, yet the total time on test is limited to 600 hours. Because such requirements are not testable as written, the major has to negotiate with the various parties to specify new requirements that are testable.
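The reliability requirement illustrates the problem concretely. Under the common (and, as the scenario later suggests, debatable) assumption of exponentially distributed failure times, a quick calculation shows that even a component exactly meeting the 1-failure-per-2,000-hours requirement would most likely survive the entire 600-hour test without failing. A minimal sketch, using only the figures from the scenario plus that labeled assumption:

```python
import math

# Figures from the scenario: required MTBF and total available test time.
MTBF_REQUIRED = 2000.0   # hours, i.e., 1 failure per 2,000 hours
TEST_HOURS = 600.0       # total time on test

# Assumption (not from the source): failure times are exponential, so the
# number of failures in a fixed window is Poisson with mean t / MTBF.
expected_failures = TEST_HOURS / MTBF_REQUIRED    # 0.3 expected failures
p_zero_failures = math.exp(-expected_failures)    # P(0 failures observed)

print(f"Expected failures in {TEST_HOURS:.0f} h: {expected_failures:.2f}")
print(f"P(zero failures even if requirement is exactly met): {p_zero_failures:.2f}")
```

With roughly a 74 percent chance of observing no failures at all, a 600-hour test cannot distinguish a compliant component from one several times worse, which is why such a requirement must be renegotiated into testable form.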
Some of the 15 measures relate to how effective the system is in carrying out its function (system effectiveness), and some relate to how long or how often the system is in a usable state (system suitability, i.e., reliability, maintainability, and availability). Moreover, a test designed to measure effectiveness by itself may be substantially different from one designed to measure only suitability.

To further complicate his task, the major faces numerous difficulties in implementing the QZ3 tests, including: scheduling of test facilities and soldiers; the availability and use of threats; the availability of environments; and safety, noise, and other environmental restrictions. In addition, the test runs are very complicated to coordinate, and uncontrolled events in the field often cause atypical outcomes that must be identified and analyzed separately. The entire test and evaluation process must be carried out in less than 2 years, including 2-3 months to evaluate the test data and write the summary test report. The report will focus on several key results from statistical significance (or hypothesis) tests for differences of means or differences of percentages on individual measures of effectiveness or performance. The program manager had originally budgeted for 15 test scenarios, with 5
test prototypes each. This permits a maximum of 75 replications to be distributed among the 36 possible test conditions (4 threats × 3 engagement scenarios × 3 environments). Thus, only an incomplete assessment of variability across scenarios and prototypes (potentially substantially incomplete) is possible. Given unanticipated system development costs, there is pressure to reduce the number of test replications.

Various earlier developmental tests of this system, not conducted under operationally realistic conditions, have given results that differ substantially among themselves, so there is a general reluctance to use these earlier test data to supplement the information from the operational test. Simple pooling of the earlier data with operational test data would require unreasonable assumptions (e.g., that operational conditions have no effect on performance), so combining them would require sophisticated statistical models that can accommodate the effects of different test circumstances. Furthermore, it is difficult to use information from developmental and operational testing and from field performance of similar systems, either to help design QZ3's operational test or to supplement its evaluation, because the conditions under which the developmental data were collected, and the developmental and operational test results themselves, are not archived.

Given these difficulties in augmenting operational test information, it is hard to see how confirmatory testing with such small sample sizes could succeed, even as significance testing at the relatively low confidence levels typically used for this purpose (80 percent). Statistical theory does offer experimental designs that use relatively small sample sizes efficiently. However, constraints and uncontrolled events complicate the design (and the subsequent analysis) and often require non-standard techniques that demand substantial technical training to understand or develop.
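The arithmetic of the design can be checked directly. The sketch below (the factor labels are invented placeholders, not from the source) enumerates the 36 test conditions and divides the 75 budgeted runs among them:

```python
from itertools import product

# Hypothetical factor labels standing in for the scenario's factors.
threats = ["T1", "T2", "T3", "T4"]              # 4 threat types
engagements = ["E1", "E2", "E3"]                # 3 engagement scenarios
environments = ["snow", "temperate", "desert"]  # 3 environments

conditions = list(product(threats, engagements, environments))
budget = 15 * 5   # 15 budgeted test scenarios x 5 prototypes = 75 runs

per_condition, leftover = divmod(budget, len(conditions))
print(f"{len(conditions)} conditions, {budget} runs")
print(f"{per_condition} runs per condition; {leftover} conditions can get one extra")
```

Two runs per condition leave almost no basis for estimating within-condition variability, which is exactly the gap that efficient small-sample designs, and the non-standard analyses they entail, are meant to address.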
At the analysis stage, test events that resulted from uncontrolled circumstances may need to be removed, and the vast amount of collected data must be organized and summarized before statistical significance tests are carried out. Even if the major were comfortable using various exploratory data-analytic and multivariate techniques for the QZ3 analysis, there is little time to determine whether a method can be used and, if not, to develop an alternative approach. The decision rule for deciding whether QZ3 is ready for full-rate production generally does not accommodate conditional satisfaction of requirements, conditional, for example, on threat, type of engagement, environment, or other covariates of interest. Nor does it indicate how to deal with partial satisfaction of requirements: the evaluator is given no guidance on what to do if only various subsets of the 15 requirements for QZ3 are met in various environments or against various threats, nor was the contractor told how such trade-offs would be made. The production decision must, of course, be made, so the evaluator must commit.

As part of QZ3's suitability study, it is also necessary to test the new software. In less than 2 years, a small staff must check a "representative" set of 1.5 percent of the lines of code (approximately 15,000 lines) for errors. There are no
precise rules about structuring the code in a way that would facilitate this operation.

Given the complexities and restrictions built into the current process (e.g., time, sample size, and the combination of relevant information), it is hard even to conceive of a series of tests with a rigorous statistical basis that could confirm that QZ3 is superior to QZ2. Yet the stakes of the decision are extremely high, immediately in terms of money and perhaps in the long run in terms of the best defense of the United States. The result is a problem that would require sophisticated statistical reasoning, coupled with insightful judgment of the real needs of the military user, as well as intimate knowledge of the system, its track record, and related systems developed now or in the past. A tall order for our major.

The major, because of a particular interest in statistical problems, took one course beyond the required 1-year undergraduate sequence in probability and statistics. However, even 1-1/2 years of undergraduate training is not likely to provide an understanding of the distinctions among assuming an exponential distribution for waiting times to system failure, using a Weibull distribution to model those waiting times, or modeling failure times as a non-homogeneous Poisson process, distinctions that might have important implications for how long to test the system for suitability (see Chapter 7). The major also took an undergraduate course on systems engineering, but it is not likely to have provided a full understanding of how to test sophisticated software for errors.

The above scenario describes many difficulties often encountered by those responsible for designing, carrying out, evaluating, and presenting results from operational tests on defense systems.
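The modeling distinction mentioned above matters in practice. The sketch below (an illustration constructed for this text, not an analysis from the panel) calibrates an exponential model and a decreasing-hazard Weibull model to the same 2,000-hour mean life and compares the probability that a 600-hour test sees no failures:

```python
import math

MTBF = 2000.0   # common mean life for both models (hours)
TEST = 600.0    # test length (hours)

# Exponential model: constant hazard, survival S(t) = exp(-t / MTBF).
p_exponential = math.exp(-TEST / MTBF)

# Weibull model with shape k < 1 (decreasing hazard, early-life failures),
# with the scale chosen so the mean is also MTBF: mean = scale * Gamma(1 + 1/k).
k = 0.5
scale = MTBF / math.gamma(1 + 1 / k)    # Gamma(3) = 2, so scale = 1000 hours
p_weibull = math.exp(-((TEST / scale) ** k))

print(f"P(no failure in {TEST:.0f} h): exponential {p_exponential:.2f}, "
      f"Weibull(k={k}) {p_weibull:.2f}")
```

Both models have the same mean life, yet the decreasing-hazard Weibull puts the chance of a failure-free 600 hours near 0.46 versus about 0.74 for the exponential, so the distributional assumption directly affects how much suitability test time is needed.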
Several, but not all, of the problems (implicitly) raised here are addressed in this report, but the key message is that the individual in charge of an operational test has an extremely difficult job, made even more difficult by the fact that defense system testing and evaluation has an essential statistical component. Yet the hypothetical major should not be a statistician (or a systems engineer), since the person responsible for test design and evaluation has to be knowledgeable about the system being tested, the military application, the management of a large enterprise, and the various DoD rules and regulations governing testing and acquisition. Indeed, the first priority of operational testing should be to conduct a comprehensive series of tests under realistic and relevant conditions and environments, and statisticians are generally not qualified to implement such tests. Without a well-designed, realistic test, any post-test statistical analysis, whether using standard or sophisticated techniques, can easily lead to incorrect decisions.

It is important to emphasize that most of the statistical methods the major would need are not immediate applications of standard formulas and procedures. Many of the problems that would occur—such as trend-free designs to accommodate player learning, analysis of non-orthogonal experimental designs, analysis
of variance with missing values, censored observations, combination of information across different types of experiments, analysis of failures with non-exponential lifetime distributions and small sample sizes—are non-standard statistical problems. It would not be reasonable to expect the major to possess this level of analytical expertise. It would be reasonable, however, to expect the major to have this expertise available, whether through civilian employees, consultants, or ties to academia. This level of expertise is available to the major through the national laboratories, federally funded research and development centers, and support contractors, but it is not used as frequently as would be desired: the supply of expertise is limited, the major may not know how to obtain it, and the statisticians who are available are often not knowledgeable about, or not interested in, the system under test. These issues are discussed in Chapter 10.

Would more statistical expertise throughout the acquisition process result in superior decision making? If so, how should it be accomplished? What aspects of operational testing are likely to be improved through the application of state-of-the-art statistical techniques? These are some of the questions that we examine in this report. To complete the picture of why the answers to these questions are so important to defense system acquisition, we first examine the changes that have been occurring in the acquisition community that led to our study.

### DYNAMICS IN THE ACQUISITION OF MILITARY SYSTEMS

The military systems under development and the development and testing process for military acquisition are continuing to change. Because of those changes, DoD will have to rethink the way that tests are designed, systems are evaluated, and, possibly, how the acquisition process is structured.[1] These changes highlight the importance of statistical expertise within DoD.
#### Decreased Testing Budgets

The current DoD budget is much smaller than it was 10 years ago, and there have been substantial budget cuts for testing (see U.S. Department of Defense, 1997). (Decreased testing budgets have also reduced government developmental testing.) Consequently, tests are often smaller,

[1] Some of this is already happening. The 5000 series was updated in 1996 as part of DoD's acquisition reform effort (U.S. Department of Defense, 1996). These revised directives emphasized several major themes, including the importance of cross-functional teams, tailoring the acquisition process to better fit each system, empowering program managers and other acquisition professionals, and incorporating best commercial practices when appropriate. In addition, the Army has developed and implemented the Army Performance Improvement Criteria (APIC), based on the Malcolm Baldrige National Quality Award Criteria for Performance Excellence and the Presidential Quality Award Criteria (U.S. Department of Defense, 1998a). The goals for APIC include: (1) serving as a resource to guide planning, assessing, and training; (2) helping to raise performance expectations and standards; and (3) implementing state-of-the-art business practices.
