The Panel on Statistical Methods for Testing and Evaluating Defense Systems was charged with examining the statistical techniques currently used in the design and evaluation of operational tests in the U.S. Department of Defense (DoD) and making recommendations for improvement. Operational testing and evaluation is an independent assessment of whether a system is effective and suitable for its intended use; it is a key part of military system development. The DoD's Office of the Director, Operational Test and Evaluation (DOT&E) is responsible for providing advice to the Secretary of Defense on whether military systems are ready for full-rate production and for prescribing policies and procedures to the military services regarding operational test and evaluation. Given the importance of the nation's defense, the high cost of many weapon systems, and the substantial cost of testing them, even modest improvements in operational testing by using the most appropriate statistical methods can lead to more efficient use of public funds and considerable improvements in the reliability and effectiveness of the systems deployed.
The panel's examination comes at a time of substantial change in the development and testing process and, more broadly, in military acquisition. There are five major components of that change: decreased testing budgets; more complicated systems; more software-intensive systems; more upgrades to existing systems ("evolutionary procurement"); and greater interest in system reliability, availability, and maintainability.
Early in its work the panel realized that a narrow focus only on the use of sophisticated statistical techniques in test design and evaluation would not provide the best advice to DoD. Rather, the panel decided it must also consider the aspects of the acquisition process as a whole that affected the application of statistical techniques. It became clear to us that adopting effective statistical practices that command wide and consistent support within the DoD acquisition community would yield substantial gains.
CONCLUSIONS ABOUT THE CURRENT PROCESS
The panel's main conclusions concerning the current use of operational testing as part of system development cover broad aspects of DoD operational testing and evaluation.
Currently, operational testing is a collective final event in system development. Since many design flaws in both industrial and defense systems become apparent only in operationally relevant use, some testing with operational realism needs to be carried out as early as possible. Inexpensive, small, focused preliminary tests with operational aspects could help to identify problems before a design is final. Such tests could also identify scenarios in which the current design performs poorly and which system characteristics should be the focus of subsequent tests.
In addition, it is currently uncommon to use developmental test data, or test data for related systems, to augment data from operational testing except for direct pooling of reliability test data. This omission derives in part from understandable concerns about the relevance of developmental test data or test data from related systems for the evaluation of a system's operational performance, but it also originates in part from a lack of statistical expertise about how to use the information and a lack of access to the information.
Conclusion 2.1: For many defense systems, the current operational testing paradigm restricts the application of statistical techniques and thereby reduces their potential benefits by preventing the integration of all available and relevant information for use in planning and carrying out tests and in making production decisions.
Also, the incentive structure in military procurement provides each major participant, including the program manager for a system, the test director, the contractor, various activities in the Office of the Secretary of Defense (OSD), and Congress, with strong, often differing and even competing perspectives. This set of complicated dynamics affects a variety of aspects of the test and evaluation process: budgets, schedules, test requirements, test size, which test events should be excluded because of possible test abnormalities, and even the rules for scoring a test event a "failure." It is critical that the perspectives of the participants be understood and taken into account in decision making concerning test design and test evaluation.
Further, for operational tests of most complicated systems, the required sample sizes that would support significance tests at the usual levels of statistical confidence typically are not affordable. As a result, there is a common perception in the DoD acquisition community, including middle- and high-level management, that the design and conduct of "statistically valid" tests are unaffordable. This impression stems in part from a lack of communication among the test, acquisition, and statistical communities about how statistical theory and methods can be applied to complex problems. For example, effective use of statistical methods is not limited to a determination of the appropriate sample size so that a test yields interval estimates at the required level of statistical confidence. It would often be preferable, given a fixed test budget, to design a test aimed at maximizing the amount of information from the resulting test data in order to reach supportable statements about system performance. Failure to use such a design is tantamount to using more test cases than needed or to throwing away information from test cases after conducting the test.
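The affordability problem can be made concrete with a minimal sketch. It computes the number of trials needed for a zero-failure demonstration test under a simple binomial (pass/fail) model with independent trials. The function name and the reliability and confidence values are illustrative assumptions, not figures from the panel's work.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest n such that n successes in n independent pass/fail trials
    demonstrate `reliability` at one-sided `confidence`.

    Assumes independent, identically distributed trials (binomial model).
    With zero failures, the lower confidence bound on the success
    probability is (1 - confidence) ** (1 / n), so the requirement is
    n >= ln(1 - confidence) / ln(reliability).
    """
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Hypothetical requirements, chosen only to show how quickly n grows.
for required in (0.90, 0.95, 0.99):
    print(required, zero_failure_sample_size(required, confidence=0.90))
```

Under these assumptions, demonstrating 0.99 reliability at 90 percent confidence takes 230 failure-free trials, which suggests why tests sized purely by classical confidence requirements are often out of reach for expensive systems.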
Conclusion 2.2: The operational test and evaluation requirement, stated in law, that the Director, Operational Test and Evaluation certify that a system is operationally effective and suitable often cannot be supported solely by the use of standard statistical measures of confidence for complex defense systems with reasonable amounts of testing resources.
However, operational test and evaluation is a crucial element of assessing whether a system is ready for full-rate production.
Conclusion 2.3: Operational testing performs a unique and valuable function by providing information on the integration of user, user support (e.g., training and doctrine), and equipment in a quasirealistic operational environment.
Moreover, we believe that:
Conclusion 3.1: Major advances can be realized by applying selected industrial principles and practices in restructuring the paradigm for operational testing and the associated information gathering and evaluation process in the development of military systems.
A NEW PARADIGM
These conclusions support a new approach to more effectively employing testing as part of military system development. We therefore propose a new paradigm in which testing is more fully integrated as a part of military system development and recommend that Congress change the law to reflect the shift from a paradigm of testing primarily to confirm to one of testing to learn. Specifically, we recommend:
Recommendation 3.1: Congress and the Department of Defense should broaden the objective of operational testing to improve its contribution to the defense acquisition process. The primary mandate of the Director, Operational Test and Evaluation should be to integrate operational testing into the overall system development process to provide as much information as possible as soon as possible on operational effectiveness and suitability. In this way, improvements to the system and decisions about continuing system development or passing to full-rate production can be made in a timely and cost-efficient manner.
With this redirection of focus, the new DoD paradigm for test and evaluation as part of acquiring defense systems would have the following four essential characteristics:
It is a continuous process of information gathering and decision making in which operational testing and evaluation play an integral role. This orientation is consistent with contemporary trends in industrial practice de-emphasizing "inspecting defects out" in favor of integrated development processes and "building quality into the product."
There is an environment of systematic data collection, analysis, and documentation. All sources of data and information should be clearly documented, and the process made consistent across the military services so that development teams are able to identify and learn from other relevant studies and findings. To create this environment, DoD could institute multi-service developmental and operational test and evaluation standards similar to those contained in ISO 9000.
Efficient statistical methods are used for decision making based on all available, relevant data. An environment of continuous assessment and systematic data collection should enable the use of efficient statistical methods, including decision-theoretic methods, yet would still allow the decision maker to be held accountable for judgments on system procurement.
The life-cycle costs of acquiring and maintaining new military capabilities are reduced. Integrating operational test and evaluation in a process of continuous assessment (and improvement) means that problems of consequence will be discovered earlier in the development process when they are easier and cheaper to solve. A system would therefore not be likely to enter the final, confirmatory stages of operational test until it clearly is ready for such a stage, and therefore also likely ready for full-rate production. The production of higher quality and more reliable systems will reduce the amount of logistic support required to maintain the system in the field, a major contributor to total cost over the lifetime of the system.
We note that these characteristics have not been totally missing from all aspects of military system development. They are consistent with several recent initiatives in DoD. They are found in various places, for particular items, at some times. But what has not existed is a consistent, department-wide, and institutionalized application of these features to the acquisition process. Adopting the new paradigm will enable DoD to obtain the best possible information at the lowest possible costs from operational testing and evaluation.
In the new paradigm, the role of DOT&E should be expanded to include providing assistance to the test and evaluation service agencies in the application of statistical and quality management principles to the overall decision-making process. This direction and oversight role involves the following:
Careful definition and documentation of the key steps in the overall test and evaluation process, and what each step involves, to ensure that there is less variability in the process (from program to program while permitting tailoring for different types of programs) and that best current practices are used widely and consistently.
Ensuring, through the use of uniform terminology, protocols, and standards for data collection, archiving, and documentation, that information from different sources can be compared and combined when necessary for both test design and evaluation. Information from both developmental and operational tests on the same and related systems can best be used for improved test design and test evaluation if it is accessible and usable in a test archive. It is most useful if data are collected and archived in standard ways using standard definitions and if essential information on every test exercise and condition is included. Adding information on field use would support the operation of feedback loops that would help improve the development of individual systems as well as the process of test design and evaluation. The use of standard terminology and practices will also facilitate the work of DOT&E in approving test plans, verifying test results, and assessing the validity of the results.
Ensuring the full use of the information that is collected, archived, and documented through the use of state-of-the-art statistical methods and models. These methods and models permit use of data from different sources to design tests more efficiently and to evaluate tests by incorporating the specific information from each test data set that is relevant for the assessment of operational performance. In addition, statistical models can be constructed to provide such specific information as which test factors have the greatest effect on system performance and which test results seem most anomalous. The costs of operational testing and the importance of correctly deciding whether to proceed to full-rate production make it extremely important to base decisions on all of the available, relevant information.
Requiring careful documentation of each test evaluation and inclusion of confidence statements, along with sensitivity analyses and external validation of modeling and simulation used to augment operational tests. Careful but appropriately condensed documentation of the entire testing and evaluation process permits all the participants in defense acquisition to see how each evaluation was made, where there is still uncertainty, and whether and what further testing would be useful. It also allows current and future users to better understand causes of variation and unanticipated problems and how they were addressed, to better manage the process.
Consistent with the new paradigm and the expanded and revised role for DOT&E, the panel recommends:
Recommendation 2.1: The Department of Defense and the military services should provide a role for operational test personnel in the process of establishing verifiable, quantifiable, and meaningful operational requirements. Although the military operators have the final responsibility for establishing operational requirements, the Operational Requirements Document would benefit from consultation with and input from test personnel, the Director, Operational Test and Evaluation, and the operational test agency in the originating service. This consultation will ensure that requirements are stated in ways that promote their assessment.
Recommendation 2.2: The Director, Operational Test and Evaluation, subject to the approval of the Secretary of Defense on a case-by-case basis, should have access to a portion of the military department's acquisition funding reserves (being set up as a result of the first quadrennial defense review) to augment operational tests for selected weapon systems.
Recommendation 3.2: Operational evaluations should address the overall performance of a defense system and its ability to meet its broadly stated mission goals. When a system is tested against detailed requirements, these requirements should be shown to contribute to necessary performance factors of the system as a whole.
Recommendation 3.3: The Department of Defense and the military services, using common financial resources, should develop a centralized testing and operational evaluation data archive for use in test design and test evaluation.
Recommendation 3.4: All services should explore the adoption of the use of small-scale testing similar to the Army concept of force development test and experimentation.
ADVANCES IN STATISTICAL METHODS
At a more technical level, the panel also concludes that current practices in defense testing and evaluation do not take full advantage of the benefits available from the use of state-of-the-art statistical methodology. We present here our general conclusions and recommendations. Additional, detailed recommendations are in Chapters 5-10, including a recommendation in Chapter 10 that DoD form a Statistical Analysis Improvement Group, using the best statisticians in OSD and the military services on a part-time basis to advise senior decision makers.
Test Planning. Comprehensive operational test planning is not uniformly conducted. Test planning collects and uses information on the purpose of the test, the test factors and how they need to be treated in the test design, the test environment and associated constraints, and preliminary data on system performance (e.g., comparison of system variation within and between scenarios) to help determine how large a test should be and which test scenarios (in which order) should be used to achieve the test goals. It is also used to resolve such questions as how and what test data are to be recorded. Proper test planning can avoid serious mistakes in test design and can help identify designs that produce substantially more information at less cost.
Estimates of Uncertainty. All estimates of system performance from operational test should be accompanied by statements of uncertainty. Such statements would identify which levels of system performance were consistent with the test results, so that the extent of the information provided by the test is clear to decision makers. Estimates of uncertainty make it possible to use all information about the performance of a system in combination for the purpose of deciding on full-rate production. If more testing is needed, statements of uncertainty will usually make that clear. Estimates of uncertainty for performance in important individual scenarios should also be provided, as should approximate estimates of uncertainty of the information from use of modeling and simulation.
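As one illustration of a statement of uncertainty, the sketch below computes a Wilson score interval for a success probability estimated from pass/fail test outcomes. The choice of the Wilson interval, the function name, and the example counts are assumptions made for illustration; other interval methods may be more appropriate for a given test.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Two-sided Wilson score interval for a binomial success probability.

    Assumes independent pass/fail trials with a common success probability.
    """
    if trials <= 0 or not (0 <= successes <= trials):
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # normal quantile
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical test result: 8 successes in 10 trials.
low, high = wilson_interval(8, 10)
print(f"observed 0.80, 95% interval roughly {low:.2f} to {high:.2f}")
```

With these hypothetical counts the interval runs from roughly 0.49 to 0.94, which shows a decision maker how little a 10-trial test pins down on its own.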
Use of All Relevant Information. All relevant information from tests and field use of related systems, developmental tests, early operational tests, and training and contractor testing should be examined for possible use in both the design and evaluation of operational tests. Given the importance of the decision on whether to proceed to full-rate production, state-of-the-art statistical methods for combining information should be used, when appropriate, to make tests and their associated evaluations as cost-efficient as possible.
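One simple way to combine information, sketched here only as an illustration, is a beta-binomial model in which developmental test outcomes are down-weighted before being added to operational test outcomes. The discount factor, the uniform prior, the function names, and the counts are all assumptions; the relevance weight in particular is a judgment the analyst would need to defend, and this is not presented as the panel's prescribed method.

```python
import math


def pooled_beta_posterior(op_successes, op_trials, dev_successes, dev_trials,
                          discount, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for a success probability.

    Operational data count fully; developmental data are multiplied by
    `discount` in [0, 1] to reflect doubts about their operational relevance.
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie in [0, 1]")
    a = prior_a + op_successes + discount * dev_successes
    b = prior_b + (op_trials - op_successes) + discount * (dev_trials - dev_successes)
    return a, b


def beta_mean(a, b):
    return a / (a + b)


def beta_sd(a, b):
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))


# Hypothetical data: 9 of 10 operational successes, 45 of 50 developmental.
for w in (0.0, 0.2):
    a, b = pooled_beta_posterior(9, 10, 45, 50, discount=w)
    print(f"discount={w}: mean={beta_mean(a, b):.3f}, sd={beta_sd(a, b):.3f}")
```

Setting the discount to zero recovers an analysis of the operational data alone, so the sensitivity of the conclusion to the assumed relevance of the developmental data can be reported alongside the estimate.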
Experimental Design Methods. State-of-the-art methods for experimental design should be routinely used in designing operational tests. A comprehensive literature is available to use in producing test designs that accommodate a broad array of complexities, especially various constraints, that arise in testing associated with system development. Routine reliance on this literature will produce more effective tests and help make efficient use of operational test funds.
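As a small example of what the design literature offers, the sketch below builds a two-level half-fraction factorial design, which covers k factors in half the runs of a full factorial by setting the last factor equal to the product of the others. The factor names are hypothetical, and a real operational test would also need to account for constraints, randomization, and which interactions can safely be assumed negligible.

```python
from itertools import product


def half_fraction_design(factors):
    """Two-level 2^(k-1) design with levels coded -1 and +1.

    Runs a full factorial in the first k-1 factors and sets the last
    factor to the product of the others (the design generator).
    """
    if len(factors) < 3:
        raise ValueError("need at least three factors for a useful half fraction")
    *base, last = factors
    runs = []
    for levels in product((-1, +1), repeat=len(base)):
        generated = 1
        for level in levels:
            generated *= level
        run = dict(zip(base, levels))
        run[last] = generated
        runs.append(run)
    return runs


# Hypothetical test factors, each at two levels.
factors = ["terrain", "time_of_day", "threat_type", "crew_experience"]
design = half_fraction_design(factors)
for run in design:
    print(run)
```

For four factors this yields 8 runs instead of 16 while keeping every factor balanced and the main-effect columns mutually orthogonal.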
Appropriate Statistical Models. Statistical models that are based on empirically supported assumptions should be used in test design for and evaluation of system reliability (as well as for system effectiveness). It is typical for the service test agencies to make use of a specific model (the exponential model) of the time to first failure in reliability design and evaluation. Use of this model, when inappropriate, can often result in unnecessarily large test sizes and inappropriate results for significance tests. There are many other models that are often more appropriate for this kind of application.
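To show how much the choice of model can matter, the sketch below compares mission reliability under the exponential model with reliability under a Weibull model that has the same mean life. The Weibull is used here only as one familiar alternative (the text does not name a specific replacement), and the mean life, mission time, and shape values are hypothetical.

```python
import math


def exponential_reliability(t, mean_life):
    """P(no failure by time t) under a constant failure rate."""
    return math.exp(-t / mean_life)


def weibull_reliability(t, mean_life, shape):
    """P(no failure by time t) under a Weibull model with the given mean.

    shape < 1 means a decreasing failure rate (early failures dominate),
    shape > 1 means an increasing failure rate (wear-out), and
    shape == 1 reduces to the exponential model.
    """
    scale = mean_life / math.gamma(1.0 + 1.0 / shape)
    return math.exp(-((t / scale) ** shape))


# Hypothetical system: 100-hour mean life, 10-hour mission.
print("exponential:", round(exponential_reliability(10, 100), 4))
for shape in (0.5, 2.0):
    print(f"weibull shape={shape}:", round(weibull_reliability(10, 100, shape), 4))
```

All three models share the same mean life, yet the mission reliabilities differ substantially, so a test sized or scored under the wrong model can mislead in either direction.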
Software-Intensive Systems. Operational tests of software-intensive systems should employ a usage-based perspective, in order to demonstrate that the system is fit for its intended use. When testing schedules and budgets are tightly constrained, usage-based testing yields the highest practical reliability because if failures are identified, they are likely to be the high-frequency failures.
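A usage-based test can be sketched as random selection of test cases in proportion to an operational profile, so that the operations users invoke most often are exercised most often. The profile below, its operation names, and its probabilities are hypothetical; in practice the profile would have to be estimated from expected field use.

```python
import random
from collections import Counter


def draw_usage_based_tests(operational_profile, n_tests, seed=0):
    """Sample operations to test with probability proportional to expected use.

    `operational_profile` maps operation name -> relative frequency of use.
    A fixed seed makes the selection reproducible for documentation.
    """
    rng = random.Random(seed)
    operations = list(operational_profile)
    weights = [operational_profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n_tests)


# Hypothetical operational profile for a software-intensive system.
profile = {"track_target": 0.70, "update_map": 0.20, "reload_config": 0.10}
selected = draw_usage_based_tests(profile, n_tests=1000)
print(Counter(selected))
```

Because selection follows expected usage, a failure found this way is more likely to be one that users would encounter frequently, which is the rationale given above.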
In summary, by making testing a more useful and integrated component of military system development, by adopting up-to-date statistical practices, and by changing the paradigm in which test and evaluation is used in defense system development, the acquisition of military systems will become even more effective and efficient, and benefit from considerable savings in system life-cycle costs.