A New Paradigm for Testing and Evaluation in Defense Acquisition
The message of the preceding chapter, simply stated, is that the current paradigm for how operational testing is used as part of defense system development is not coherent. In an increasingly complex environment, the current paradigm does not address the goals originally specified for operational testing and evaluation. The incoherence of the current paradigm stems from three fundamental difficulties. First, the use of operational testing and evaluation as a final test event prevents the identification of operational problems early in system development. Second, while this testing event is often portrayed in the milestone system as supporting a decision to pass or fail a defense system in development, systems are almost never terminated as a result of tests (nor should they be). When doing destructive testing or testing complex systems with high unit costs, it is often infeasible to have a big enough operational test to make clear pass/fail decisions with any reasonable level of statistical confidence. Third, current limitations on using all sources of relevant information also reduce the efficiency of testing and evaluation; policies restricting such use or limitations in expertise increase test costs.
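The force of this last point can be illustrated with a simple calculation. The sketch below (in Python, with purely hypothetical numbers) computes the probability that an exact binomial pass/fail rule correctly fails a deficient system; with a 90 percent success requirement and a true success rate of only 75 percent, tests of 5 to 20 trials detect the shortfall only about one-third to a little over half the time.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def pass_fail_power(n, p_req, p_true, alpha=0.10):
    """Power of a pass/fail rule: fail the system when the number of
    successes falls at or below the largest cutoff c such that a system
    truly meeting the requirement p_req is failed with probability
    at most alpha (the producer's risk)."""
    c = max((k for k in range(n + 1) if binom_cdf(k, n, p_req) <= alpha),
            default=-1)
    # Probability that a truly deficient system (p_true) is failed.
    return binom_cdf(c, n, p_true)

# Hypothetical: requirement is 90% mission success, true rate is only 75%.
for n in (5, 10, 20, 50):
    print(n, round(pass_fail_power(n, 0.90, 0.75), 2))
```

The cutoff is chosen so that a system that truly meets the requirement is failed no more than 10 percent of the time; it is this constraint that keeps the power against real deficiencies so low at the small sample sizes typical of destructive or high-unit-cost testing.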
These conclusions have led us to propose a new paradigm for operational testing and its role in system development. We are not the first to come to this conclusion. As discussed below, we have seen encouraging signs at the highest levels of the testing and acquisition community of an emerging new view of the role of operational testing and evaluation in the acquisition process. The panel has focused its deliberations on the question: How should operational testing and evaluation work to fully realize potential gains in efficiency and effectiveness through the improved use of statistical methods? This chapter presents a
new view of the process of system development in which the role of operational testing and evaluation as a late series of tests for go/no go decisions is broadened to a role of continuous information gathering to support the development of effective and suitable systems. We do not presume to offer a blueprint for such a process. Instead, we discuss key ideas and principles for such a paradigm, and draw conclusions with respect to improvements that can be made within that paradigm, trusting details of their adoption and implementation to the defense acquisition community.
ESSENTIAL FEATURES OF A NEW PARADIGM
A new and more effective paradigm for the use of testing and evaluation as part of defense system development should possess four essential characteristics:
A continuous process of information gathering and decision making in which operational testing and evaluation plays an integral role. This orientation is consistent with contemporary trends in industrial practice, which deemphasize "inspecting defects out" in favor of integrated development processes that "build quality into" a product.
An environment of more systematic data collection, archival, and documentation. All sources of data and information should be archived and clearly documented and the process made consistent across the military services so that development teams are able to identify and learn from all relevant studies and findings. To create this environment, DoD could institute multiservice operational test and evaluation standards similar to those contained in ISO 9000 (see Appendix D).
The use of efficient statistical methods for decision making. An environment of continuous assessment and systematic data collection would facilitate the use of efficient statistical methods, including decision-theoretic methods, yet would still allow the decision maker to be held accountable for judgments on system procurement.
Reductions in the life-cycle costs of acquiring and maintaining new military capabilities. Integrating operational testing and evaluation in a process involving more systematic data collection, analysis, and documentation plus continuous assessment (and improvement) means that problems of consequence will be discovered earlier in the development process, when they are easier and cheaper to solve. A system would therefore not be likely to enter the final, confirmatory stages of operational testing until it is clearly ready for that stage, and, therefore, also more likely to be ready for full-rate production. The production of higher quality and more reliable systems will reduce the amount of logistic support required to maintain the system in the field, a major contributor to total cost over the lifetime of the system.
We acknowledge that the ideas in our "new paradigm" are not completely new with respect to DoD test and acquisition. These ideas have often individually been suggested, and they are applied in various specific, narrow instances throughout DoD test and acquisition. What is new and important is the consistent, widespread, institutionalized use of the processes discussed here, with full recognition of their importance and value.
Our proposed new paradigm of testing as part of system development should not change the fundamental independent advisory and assessment role of DOT&E. There is still need for an independent assessment of the test and evaluation process as part of defense procurement. In fact, the many opportunities in this new paradigm for collective decision making—including establishing requirements, setting test budgets, evaluation of test results, and setting up the data archive—create increased responsibilities for DOT&E. DOT&E officials would continue to provide an independent evaluation of all operational test results. A paradigm shift to focus the test community on providing as much information as possible (as early as possible) on operational effectiveness and suitability does not preclude the independence and accountability of DOT&E. The importance of this independent advisory and assessment role was strongly supported in a recent report (U.S. General Accounting Office, 1997) that considered this independence key to the effectiveness of DOT&E.
Such a shift is also consistent with changes that are evolving in the role of government in defense system development. Changes in technology, including DoD's greater use of technology developed by industry, congressional pressure to reduce the size of the DoD acquisition workforce, DoD's use of integrated product teams that include industry members, and increasingly popular views that industry should have a greater role in such areas as DoD systems logistics support imply a larger role for the industrial sector in defense acquisition in general. In many systems developmental and support areas, DoD's prime contractors, subcontractors, and support contractors will become the sources for much of the data needed to design test and evaluation programs. In addition, contractors could be specifically asked for information, explanations, or analysis during the course of an operational test. To preserve the desired and desirable independence, contractors should not be involved in the testing itself, but might be able to provide worthwhile assistance both in planning tests and in the interpretation of test results. Congress should consider lifting the constraint on contractors' participation in operational test and evaluation activities and permit or encourage DoD to propose new guidelines for the limited use of contractor data and personnel in the operational test process. The independent evaluation role can and should remain in DoD and DOT&E.
In the sections that follow we describe a new paradigm for operational testing and evaluation as an integral part of defense acquisition. We begin by noting that discrepancies between operational testing on paper and in practice suggest
how the role of operational testing might be constructively redefined. Next, we discuss the consonance of such a redefinition with successful ideas from private industrial practice. Finally, we lay out the implications for specific components of the new paradigm for DoD acquisition and testing, as well as some recent and promising developments in test and evaluation.
INSIGHTS FROM CURRENT PRACTICE
The goal of operational testing as stated in the authorizing legislation is "to confirm the effectiveness and suitability of the system." This has been interpreted by some members of the test and acquisition community, as well as its critics, to mean that operational testing is intended to provide the basis for deciding between procurement and cancellation of a prospective defense system. In actual use, however, the third milestone decision regarding operational test and readiness for production is almost never a simple dichotomy between either canceling a program or moving into deployment with no further development or evolution of the system design.
The manner in which operational test results are used in practice suggests that the most valuable functions of operational testing are to identify deficiencies in the performance of a system, to characterize the conditions under which deficiencies are likely to occur, and to isolate and remove the causes of such deficiencies. Thus, a testing program can most usefully focus on such questions as: "What needs to be fixed?" "How likely is it that the needed fixes will be technologically feasible and affordable?" "What sequence of tests can be performed that will show the weaknesses, as well as the strengths, of a system?" These activities can and should be undertaken throughout system development. If operational testing is performed to determine whether a system needs further development, then it is extremely wasteful to conduct a very expensive test after the system design is final. It would be much more efficient to create test policies and procedures that identify operational deficiencies earlier in the development process.
It is well recognized in DoD that the implied objective of operational testing in DoD acquisition, to provide statistical certification of whether a system is operationally effective and suitable, is generally unrealistic; some more specific alternative objective is therefore implicit. We believe that the existing implicit objective and the new paradigm we propose have much in common. Operational testing should be viewed less as a final exam used before deciding whether to approve full-rate procurement. Instead, its overriding objective, using testing and evaluation against operational criteria, should be to provide decision makers with timely information in the form most useful for making effective decisions regarding the need for further development.
LEARNING FROM INDUSTRIAL PRACTICES
It is now generally acknowledged that quality cannot be "inspected" into a product. Product testing is often performed too late in the process to correct deficiencies in a timely and cost-efficient manner. Rather, quality must be designed into any complex product at virtually every stage of its development. This is accomplished by creating acquisition processes that make use of continuous monitoring and testing.
Applicability to Defense Acquisition
We acknowledge that there is a danger in drawing analogies between developing new products in industrial and other private- and public-sector areas and the development of new defense systems. There is no question that certain aspects of defense acquisition are unique. For example, a new military technology might be used in an emergency before it is completely ready for routine operation. A defense system may also serve deterrent purposes even if it is only partly successful. On the other hand, commercial concerns often must commit to a product's development to stay in business, so a pass/fail decision may be less relevant in commercial applications. The development of defense systems involves security and classification issues (though industry also values secrecy), and requires evaluation that is independent of the contractor. Finally, many military systems, especially ACAT I systems, and their associated tests are much more complicated and costly than their non-military counterparts. So in examining industrial practice for developing new products, one must be cautious about transferring lessons learned to the unique situation of defense acquisition.
However, there are also substantial similarities. A recent report (U.S. General Accounting Office, 1996) pointed out that, in an industrial manufacturing process for a product that had both military and nonmilitary customers, the production line for the non-military customer used automation and process control throughout production. This reduced the need—as far as the non-military customers were concerned—for end product testing. However, the military customer required a potentially wasteful 100 percent end product testing.
In spite of conventional wisdom, not all of the decision makers in an industrial corporation share a common profit objective. Just as in the military defense system development process, the incentives and performance measures of design engineers, testers, manufacturing and production personnel, and marketers are often disparate. Some of the procedures developed by industry to account for and overcome these disparate interests are relevant to DoD.
Quality Improvement and the Role of Statistics
U.S. industry, especially the manufacturing sector, has been forced to undergo major shifts in the process of system development and reengineering over the last two decades in order to become more competitive in the international marketplace. Managers have recognized that achieving long-term improvements in quality and productivity is one way to regain their competitive edge. Statistical thinking, methods, and practices have played a critical role in these developments. Much of this renewed emphasis on quality and productivity was in response to the competitive success of Japanese industry, sparking an associated interest in studying world-class quality practices and in adapting them to fit the U.S. environment.
The result has been a reexamination of old, accepted notions about quality. End-of-production inspection and the use of "military standards," developed in response to the needs of war-time activities, were popular in the 1960s and 1970s; such inspection was based on a philosophy that defined quality as "conformance to specifications." The first of these techniques was aimed at preventing bad products from being shipped to customers, but it did not contribute to the more important task of improving (where needed) the processes that led to the production of bad products.
The other technique, the "specification limit" definition of quality, ignores several critical realities. First, products closer to a design target generally perform better. Moreover, they are less likely to drift out of the specification limits over time than those that are further from the target but still within design specifications. Second, the cost of quality is not necessarily a "0-1" function of the amount by which a product deviates from its design specifications. Third, quality is defined more by the needs of the customer than a designer with some predetermined view of manufacturing capabilities.
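The contrast between the "specification limit" view of quality and a loss that grows with distance from the design target can be sketched in a few lines of Python. The dimension, tolerance, and scrap cost below are hypothetical; the quadratic form follows the widely used Taguchi loss function.

```python
def conformance_loss(x, target, tol, cost=1.0):
    """The 0-1 "specification limit" view: any product inside
    target +/- tol is treated as costless, anything outside as a
    fixed scrap cost."""
    return 0.0 if abs(x - target) <= tol else cost

def quadratic_loss(x, target, tol, cost=1.0):
    """Taguchi-style view: loss grows with the squared distance from
    the design target, scaled to equal the scrap cost at the
    tolerance limit."""
    return cost * ((x - target) / tol) ** 2

# Hypothetical part dimension: target 10.0 mm, tolerance +/- 0.5 mm.
for x in (10.0, 10.3, 10.49, 10.51):
    print(x, conformance_loss(x, 10.0, 0.5),
          round(quadratic_loss(x, 10.0, 0.5), 2))
```

Under the 0-1 view a part at 10.49 mm is costless and a part at 10.51 mm is scrap, even though the two are nearly identical; the quadratic loss assigns them nearly equal costs and, unlike the 0-1 view, rewards production close to the target.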
The concepts of inspection and 0-1 cost structures have been replaced by a concentration on "quality by design" and on making the product right the first time, philosophies that emphasize that quality should be built into the product at the design stage. The focus of quality improvement has shifted to the design and development phases of both products and processes. There has been a concordant emphasis on developmental testing, often with operational aspects, continuous product improvement, and increased emphasis on the use of sophisticated statistical methods. Proactive techniques such as extensive use of statistical design of experiments and accelerated reliability testing are used in the design and development phase to optimize the product and process design. Feedback control, process monitoring, and failure and process diagnostics are used to control and reduce variation. In particular, statistical methods are now used extensively during the production realization process in order to understand and manage the different sources of variation: they range from simple techniques, such as the "basic seven tools" (Ishikawa, 1985), the "new seven tools" (Mizuno, 1988), and quality function deployment, to computer-implemented advanced methods of real-time and sequential decision making, Bayesian methods, and the interactive and adaptive experimental designs mentioned earlier. There has also
been an emphasis on reducing the length of the product development cycle in order to be more competitive. Reducing product development cycle time in DoD is important as well both to reduce or control costs and to provide soldiers with new technology quickly.
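As a small illustration of the designed-experiment techniques mentioned above, the following Python sketch generates a 2^3 full factorial design and estimates each factor's main effect by contrasting the average response at its high and low levels. The factor names, levels, and response model are invented for illustration only.

```python
from itertools import product

# A 2^3 full factorial design: every combination of three two-level
# factors, coded -1 (low) and +1 (high).
factors = ["terrain", "crew_experience", "night"]
design = list(product([-1, +1], repeat=3))

def response(t, c, n):
    """Hypothetical test outcome (e.g., a detection-range score) with a
    strong crew-experience effect and a terrain-by-night interaction."""
    return 50 + 2 * t + 8 * c - 3 * n + 4 * t * n

runs = [(x, response(*x)) for x in design]

# Main effect of each factor: mean response at +1 minus mean at -1.
for i, name in enumerate(factors):
    hi = sum(y for x, y in runs if x[i] == +1) / 4
    lo = sum(y for x, y in runs if x[i] == -1) / 4
    print(name, hi - lo)
```

Because the design is balanced, all eight runs contribute to every effect estimate, which is what makes factorial designs far more informative per test article than varying one factor at a time.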
Quality Management and the Role of Statistics
Although the use of technical development is important in its own right, what has made these tools more effective are corresponding improvements in quality management. The success of Japanese industry has clearly demonstrated that technological superiority alone is not enough to be competitive in the long run. Although advantages of technological innovation can be compensated for by the ability to perform product reengineering, competitive success is based on an ability to understand and manage processes and the commitment to long-term quality improvement.
Statistical thinking has been at the heart of much of this development of quality and process management methods over the last two decades. In particular, total quality management is one of several modern management tools. It consists of an overall philosophy as well as methods for leading and managing an organization efficiently and effectively in order to meet the customers' needs. Some fundamental principles of total quality management relevant to our discussion here include:
Each decision in the systems development and acquisition process is the result of a series of interconnected subprocesses;
All of these processes have random variations that underlie outcomes;
Understanding (and reducing) these variations is a key to success; and
Effective decisions must be based on accurate and agreed-upon data.
Since statistical concepts and statistical thinking play a major role in studying, managing, and reducing variation, they are at the heart of process and quality management. This is, in fact, one of the basic messages in Deming's management theory (Deming, 1986).
Reduction and management of process variation is also the underlying theme in internationally accepted efforts at standardization, such as ISO 9000 (see Appendix D). These standards are based on several fundamental operating principles:
Processes affecting quality must be documented: documenting often identifies important ad hoc methods that are inconsistently applied. Adhering to consistent applications of the process reduces the variability in the product. When processes are well documented, customer requirements are more fully understood, the staff is more capable of satisfying the customer because of an improved
understanding of how products and services are designed and produced, and consistent means are achieved for investigating root causes and resolving customer complaints.
Records of important decisions and valuable data must be retained. Decisions must be periodically reviewed during audits and scrutinized for authorization. Data describing information about the quality of the product or service must be retained for a defined time period. These data form a valuable source of information used for improvement and development of new products or services.
Processes must be in control: effective documentation and record retention result in bringing processes under control. This is a necessary step before a process of continuing improvements can be implemented.
Examining a process and documenting it leads to better understanding of the causes of variation, variability reduction, and better process management. The resulting continuous improvement leads to identification and root-cause analysis of systemic problems and suggests corrective action.
The panel believes that some of the lessons learned in industry are indeed applicable to defense acquisition.
IMPLICATIONS FOR DEFENSE ACQUISITION AND OPERATIONAL TESTING AND EVALUATION
There are several implications of adopting the proposed new paradigm as the foundation for DoD testing and evaluation as part of the system development process. To avoid unintended consequences, DoD should attempt to understand how each change will affect aspects of the acquisition process before adoption.
Evaluating a System Against Broadly Stated Mission Needs
All parties from the operational requirements community should avoid using unnecessarily simplistic metrics and detailed specifications as key measures of merit or success. Specifically, all parties—including program managers and the test community—should accept the goal of measuring progress against broadly stated mission needs. This might involve evaluating and making trade-offs (using modeling and simulations, for example) as well as judgment.
For example, consider the early operational testing that showed flaws in the Intervehicular Information System (IVIS) of the M1A2 battle tank. This system was an integral part of the upgrade that defined the new equipment as the M1A2 tank. The unsatisfactory performance could have been attributable to one or more of three possible causes: inadequate training of the tank crews, hardware immaturity, or software errors. Although that performance could have been attributable to one or more of these causes, it was appropriate to use overall system performance as the measure of the problem, rather than, say, adequacy of
training. (In this case, the key issue was whether to cancel the entire upgrade, cancel only the IVIS subsystem, or attempt to fix the problems with IVIS while accepting the upgrade.)
The recent operational test and evaluation of the Longbow Apache helicopter seems representative of a healthy change in this direction. According to presentations to the panel by the test manager, this helicopter was evaluated against broad statements of operational needs rather than against detailed measures of performance that did not translate into higher level mission outcomes. He noted that a few individual performance characteristics failed the operational test, but that they were considered unimportant for operational needs.
If DOT&E is to function more as an information-gathering agency, then absolute requirements are less meaningful than measures, evaluated continuously, linking performance to some explicit notion of military value or utility. Decisions would not be made on individual acquisition programs without simultaneous consideration of other related programs and their expected joint contribution to military capabilities.
When a new system or an improvement to an existing system is under development, responsibility is assigned to a single service, even though the system may be useful to other services. Many observers believe that DoD should give more attention to joint testing, not only for systems with joint use, but also for systems that will be used by a single service in joint operations. There are notable examples of successful joint tests—successful in the sense of cooperation between or among the services. An obvious one is the Air Force C-17: although the Air Force had the lead, the Army had a full-time presence at Edwards Air Force Base during the test. In contrast, testing of the High Mobility Multipurpose Wheeled Vehicle, which is used by all the services, was essentially an Army-only effort. DoD should consider establishing a point of contact at the Joint Chiefs of Staff level for operational testing, with representation on all ACAT I operational tests.
There are many views that need to be considered in the design and testing of defense systems. Each can offer valuable information on appropriate and objective measures of performance and effectiveness and should have input into the definition of these measures. In particular, the setting of requirements should involve representatives from the program manager, the service test agency, and DOT&E (see Recommendation 2.1).
Archiving and Using Performance Data
DoD could benefit in a variety of ways from standardized test data archival practices. These include:
providing the information needed to validate models and simulations, which in turn could be used to plan for (or reduce the amount of) experimentation needed to reach specified operational test and evaluation goals;
facilitating the "borrowing" of information from past studies (if they are clearly documented and there is consistent usage of terminology and data) to inform the assessment of a system's performance, by means of statistical methods;
making data from developmental testing widely available for efficient operational test design;
facilitating learning from best current practices across the services; and
leading to an organized accumulation of knowledge within the Department of Defense.
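One simple form that such "borrowing" could take is a conjugate beta-binomial analysis, in which archived developmental-test results are discounted into a prior distribution and combined with a small operational test. The counts and the 50 percent discount below are hypothetical; the point is that archived data can sharpen the estimate that ten operational trials alone would support.

```python
def beta_binomial_update(a, b, successes, failures):
    """Conjugate update: a Beta(a, b) prior combined with binomial test
    data yields a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Posterior mean success probability."""
    return a / (a + b)

# Hypothetical archived developmental-test record: 45 successes in 50
# trials, discounted by half to reflect the less operationally
# realistic test setting.
discount = 0.5
a0, b0 = 1 + discount * 45, 1 + discount * 5

# Hypothetical small operational test: 8 successes in 10 trials.
a1, b1 = beta_binomial_update(a0, b0, 8, 2)

print(round(beta_mean(1 + 8, 1 + 2), 3))  # operational test alone
print(round(beta_mean(a1, b1), 3))        # with borrowed data
```

Here the ten operational trials alone estimate a 75 percent success rate with wide uncertainty, while borrowing the discounted developmental record raises the estimate to about 85 percent with roughly three times the effective sample size. Choosing the discount factor is, of course, a substantive judgment about the relevance of the archived data, not a purely statistical one.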
To accomplish all of this, the test data archive should include both developmental and operational test data, and, possibly, training data; use of uniform terminology in data collection across services; and careful documentation of development and test plans, development and test budgets, test evaluation processes, and the justification for all test-related decisions, including decisions concerning resource allocation. In addition, the critical circumstances that produced all the data must be clearly documented. While this is extremely important for data from developmental tests, it is also important for training data, which could be used to alert DoD about the necessity of a post-production review. Finally, it is important that the in-use performance of systems, when available, should also be included in the archive.
We point out that the trade-off between the utility of the information in operational testing and that from real use is not completely clear. Data from field use can at times be less useful than those from operational testing since the circumstances surrounding field use are less controlled than in operational testing. Furthermore, careful data collection is often extremely difficult in real use.
A key benefit of documentation and archival of test planning, test evaluation, and in-use system performance is the creation of feedback loops to identify system flaws for system redesign and to identify when tests or models have missed important deficiencies in system performance. The performance of systems in the field can then be compared with developmental and operational test results and with modeling results, to help improve future system development and test design, to improve modeling and simulation techniques, and to better understand the limitations of various approaches to testing systems. More specifically, the archive could provide information about: (1) operational test successes and limitations—by comparing results from operational tests with observations on actual use; (2) sources of system problems—by comparing observed system (especially suitability) problems with problems observed in test history; and (3) the validity of modeling and simulation—by comparing the results from modeling and simulation to those from actual use. Such comparisons would be especially relevant for reliability, availability, and maintainability issues (see Chapter 7). Of course, the relevance of feedback loops requires that differences between the systems being compared (e.g., related systems, systems in development, systems once fielded)
and the test circumstances are either not important or are taken into consideration in the analysis. To help ensure valid comparisons, the archive should include a description of any substantial system modifications and when they were instituted.
Access to such an archive would necessarily have to be restricted because of the sensitive nature of a system's performance while it is in development. The question of access can be addressed by arranging different levels of information access, including access to abstracts, study summaries, and data. Operational performance information might be available only to DOT&E and the service operational test agencies while the system is still in development. However, once a system has entered full-rate production, access to its data should be broadened to those with a legitimate "need to know."
Standardization, documentation, and data archiving are important because they facilitate use of all available information for efficient evaluation and decision making. The service test agencies should investigate the use of industrial models as examples of ways of collecting and archiving test data. This includes adherence to modifications of ISO 9000. Although data archives may have been of limited value in the past, they can be made much more useful with modern technology. Since individual programs would not be able to support such an undertaking, DoD should support investments in test infrastructure by centrally funding this common warehouse for test and performance data.
In developing such an archive, it will be important to explicitly acknowledge the costs and benefits of data collection and to develop an incentive structure that ensures the effective participation of all involved parties. A recent RAND study (Galway and Hanks, 1996) discusses the difficulties that can arise when data are collected by one unit of an organization for use by another unit. The production and flow of useful information across organizational units can be hampered by poor communication and understanding of multiple purposes of data collection and archiving, even when such activities serve the collective good of the organization. If Recommendations 3.3 and 10.2 are adopted, the Statistical Analysis Improvement Group (see Chapter 10) should be tasked to recommend the party or parties to be responsible for managing the test data archive and to address the complex issue of who should and who should not have access to it.
Establishing a Continuum of Experimentation, Testing, Evaluation, and Reporting
Testing should not serve solely as a "final examination" of a system. Testing is more effective when it identifies system problems as they occur so that design changes can be instituted before substantial resources are committed to a flawed design. To identify earlier those system problems that are unique to operational experience, continuous testing must mimic aspects of operational use. This means using smaller scale testing with typical users, interactions with enemy systems,
and use of less scripted activities. This approach would be similar to the Army's force development test and experimentation, which provides insights into possible new operational concepts and doctrine for equipment before it is fully developed and recognizes system limitations and their causes related to operational performance as soon as feasible. These operationally oriented early tests could provide important and timely information about operational deficiencies that could help in operational test design and evaluation, and in system redesign. These early tests could also be used to identify the key factors that would limit system performance beyond the defined operating conditions. Along these lines, even if such tests were not feasible, some real benefit would often result from modifying existing developmental testing to include whatever operational aspects were practicable. Operational test personnel might assist developmental testers in incorporating operational-type scenarios in their tests, possibly in part through providing them with operational test strategies earlier in program planning.
In this approach, operational evaluation reports would be prepared in recognition of the need for multiple assessments of a system's operational performance during development, and they would be issued continuously throughout all operational test activities. Such reports would help provide the feedback needed to inform system development and would reduce the number of system faults first discovered in late-stage operational testing. In extreme circumstances, these reports could also support early termination of an unsuccessful program.
POSITIVE DEVELOPMENTS IN RECENT PRACTICE
The fiscal 1996 annual report of the Director, Operational Test and Evaluation (U.S. Department of Defense, 1997) discusses in some detail efforts to implement former Secretary of Defense Perry's five themes of operational testing and evaluation in the "new world" of DoD acquisition:
Earlier involvement of operational testing in the acquisition process: "to take advantage of operational test insights early in systems acquisition programs, to identify problems and fixes early, and to avoid the program disruption and costs which can come when problems are found later."
More effective use of modeling and simulation: "modeling and simulation can help us determine . . . when the probability of test success is high enough to warrant actually beginning the test or whether some additional work needs to be done before valuable test resources are employed."
Combining tests, including developmental tests and operational tests: "This theme includes combining developmental tests (DT) with operational tests (OT) and sometimes combining tests in different programs. It also includes making all feasible use of data gathered in DT, in EOAs [early operational assessments], and in any other way that makes sense during OT."
Combining testing and training: "Training exercises are often quite realistic and can help provide the kind of test environment needed for operational tests. Similarly, operational tests can add a richness and complexity that can be valuable in a training environment. Together, testing and training employ many of the same resources, often at the same range."
Advanced concept technology demonstrations (ACTD) (testing for "insight and understanding, not a 'final exam' grade"): "ACTD programs are wide-ranging in size and scope and require tailored technical and managerial approaches and strong conceptual and operational links to the warfighter."
These themes are consistent with the ideas discussed above, and we encourage efforts within DoD to attain these objectives. Statistics and statisticians can and should play a key role in the identification and development of appropriate methodologies for each of these themes, but we also argue that organizational changes are needed to fully realize potential gains. We applaud the expression of commitment from the highest levels of policy within DoD.
Establishing the general goals of a new approach to operational testing and evaluation is important, but it must be followed by implementation and tangible success in practice. The panel has had little direct exposure to recently introduced approaches, but some are cited in the DOT&E report (U.S. Department of Defense, 1997), and we have seen some evidence that lessons learned in private industry are now, at times, being put into practice by the military services in defense acquisition. One example is the development of electronic warfare systems. Historically, most electronic warfare systems have not met all of their requirements, so new electronic warfare programs are now being exposed to operational test and evaluation planning, and operational contexts are being addressed early as part of developmental testing. We encourage wider application of these initiatives.
A NEW SYSTEM: TWO STORIES
To help develop a picture of the advantages that stem from adopting a new paradigm for defense acquisition, we compare what might happen to a fictitious system, QZ5, under the current paradigm to what might happen under a new paradigm containing the features we recommend. While QZ5 is fictitious, each of the problems described below has occurred with real systems.
The Current Paradigm
QZ5 is a high priority ship-based missile system representing a technological advance that gives it a substantial advantage over the system it is designed to replace. The program manager's staff and the relevant service test agency know that the technology in QZ5 has been used successfully in an existing land-based
system, QZ4, and so are enthusiastic and confident about the possibility of the system's success both in operational testing and when fielded.
The design of QZ5 is fundamentally sound, but there are several reliability problems with the new technology that could be solved with relatively minor design alterations. These reliability problems—involving certain components when they are exposed to cold, wet conditions—are much more likely to be demonstrated in an operational setting than in a laboratory. The primary measure of effectiveness for QZ5 is p, the probability of a hit against a specific enemy threat; the value of p for QZ5 must be greater than 0.8 to justify its acquisition. Developmental testing has shown that QZ5 has a value of p much greater than 0.8 at the commonly encountered, unstressful levels of a particular kind of obscurant, and therefore the system should ultimately be acquired once the design changes to enhance reliability are implemented. However, developmental testing has also shown that the value of p is only 0.3 at extreme levels of the obscurant.
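A simple binomial calculation makes concrete how little a small operational test can say about whether p exceeds 0.8. The sketch below is purely illustrative: the sample size, pass criterion, and candidate values of the true hit probability are hypothetical numbers chosen for exposition, not figures from any actual test.

```python
from math import comb

def pass_probability(n: int, k_min: int, p: float) -> float:
    """Probability of at least k_min hits in n independent shots
    when the true hit probability is p (binomial upper tail)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# Hypothetical test: 10 firings, "pass" means at least 8 hits
# (an observed hit rate of 0.8 or better).
n, k_min = 10, 8
for true_p in (0.70, 0.80, 0.90):
    print(f"true p = {true_p:.2f}: "
          f"chance of passing = {pass_probability(n, k_min, true_p):.2f}")
```

With so few shots, a system whose true p is a clearly deficient 0.70 still passes more than a third of the time, while one with p = 0.90 fails about 7 percent of the time; no clean pass/fail line exists at this sample size.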
QZ5 has undergone developmental testing by expert users, but it has not been subjected to any operational exercise before beginning operational testing. The program manager is aware of both the possible reliability problems and the problem with extreme levels of obscurants, but he believes neither that the former is serious nor that the latter is relevant, and he hopes that the system will be certified for full-rate production. At worst, a few minor design modifications may ultimately be needed; since the system is a distinct improvement over the currently fielded system, the program manager argues strongly that it be brought into service quickly, accepting the risk of small retrofitting costs later on.
System QZ5 then enters operational testing. Because there is no test archive, the commander responsible for the design and evaluation of QZ5's operational test is unaware of the test results for QZ4, and so designs a general-purpose operational test that omits certain test scenarios—including testing under simultaneously cold and wet conditions—that were troublesome for this technology in the previous system. In addition, even though the results of the developmental test are known, the levels of obscurant used in the operational test scenarios are modest compared with the extreme levels used in the developmental test, since extreme levels would only rarely be observed in practice.
When the operational testing of QZ5 concluded, the statistical significance tests for system suitability were all acceptable: that is, nothing indicated the suitability problems. This "false negative" occurred because (1) the test design failed to use the information from QZ4 about scenarios related to reliability problems, (2) the results of QZ4's operational test were not used in conjunction with the operational test results for QZ5, and (3) the small sample size of the operational test (due to the expense of the expendables) gave the test only a 60 percent chance of identifying the existing problems. This 60 percent chance of identifying the problem was not communicated to the decision
makers, since only the results of the statistical significance tests were presented in the evaluation report. Furthermore, the test had 80 percent power for testing p, the primary measure of effectiveness, at the required level, so the low power with respect to suitability was not challenged during test design. Finally, the modest levels of obscurant tested provided no information about the system's ability to handle more intense levels of obscurants.
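The arithmetic behind such a detection probability follows directly from the binomial model. The sketch below is illustrative only: the per-trial fault probability and trial counts are hypothetical values chosen to mirror the story, not data from any real test.

```python
from math import ceil, log

def detection_probability(n: int, q: float) -> float:
    """Probability that at least one of n independent trials exposes
    a fault that occurs with per-trial probability q."""
    return 1 - (1 - q) ** n

def trials_needed(q: float, target: float) -> int:
    """Smallest number of trials whose detection probability
    reaches the target."""
    return ceil(log(1 - target) / log(1 - q))

# Hypothetical fault: shows up in 15% of trials under cold, wet conditions.
q = 0.15
print(f"6 trials: detection probability = {detection_probability(6, q):.2f}")
print(f"trials for 95% detection: {trials_needed(q, 0.95)}")
```

Six trials give roughly the 60 percent chance described above; driving the detection probability to 95 percent would require about three times as many trials. This is exactly the kind of trade-off that an evaluation report should make visible to decision makers.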
System QZ5 was approved for full-rate production. Unfortunately, the reliability problems necessitated an expensive retrofit, which took 12 months to implement, during which time the missile system was out of service. By the time retrofitting was complete, a potential enemy had developed a new obscurant that could be used at levels higher than those tested. There was interest in using the QZ5 system against this threat, but it was not clear whether the system would be effective, given the low levels of obscurant used in the operational test. The best guess (an erroneous conclusion) was that it might be effective, since it had easily passed the operational test.
The New Paradigm
Under the new paradigm, the problems that QZ4 experienced in cold, wet conditions were well known because they were documented in a test and performance archive. Throughout QZ5's development, small-scale tests with operational aspects were used, revealing the causes of the reliability problems, and the system design was modified early to overcome these faults. These tests deliberately used scenarios with low temperatures and high precipitation, similar to those in which the QZ4 system had previously experienced reliability problems.
When the system was subjected to its final operational test, it was anticipated that no problems remained; the operational test was therefore an opportunity to learn as much as possible about the characteristics of the system. Because of the design changes, the system easily passed the reliability-related statistical significance tests. The operational tests were also much shorter and less expensive than they would otherwise have been, since data from QZ4's operational test supplemented the information from the tests of QZ5, and it was known which scenarios to focus on. Rather than being limited to passing statistical significance tests on scenarios well within the definition of ordinary use, QZ5 was also placed in test scenarios that explored how its performance varied with typical and atypical levels of stress. It was discovered that the system was sensitive to high levels of obscurant and, in particular, would not be appropriate for use in an operational setting in which this countermeasure was likely to be used.
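One way to formalize the pooling of QZ4 and QZ5 data is a Bayesian beta-binomial update, in which suitably discounted QZ4 results serve as the prior for QZ5's hit probability. The sketch below is a hypothetical illustration of that idea, not a method prescribed by this report; all counts and the discount factor are invented for exposition.

```python
def beta_posterior(hits: int, misses: int, a: float, b: float):
    """Conjugate update: a Beta(a, b) prior combined with binomial
    data yields a Beta(a + hits, b + misses) posterior."""
    return a + hits, b + misses

# Invented counts: QZ4's operational test, discounted by half because
# QZ4 is a related but different system, forms the prior ...
qz4_hits, qz4_misses, discount = 40, 8, 0.5
a0, b0 = 1 + discount * qz4_hits, 1 + discount * qz4_misses

# ... which a much shorter QZ5 operational test then updates.
a1, b1 = beta_posterior(hits=9, misses=1, a=a0, b=b0)
print(f"posterior mean hit probability = {a1 / (a1 + b1):.2f}")
```

Borrowing strength from the archived QZ4 data in this way is what allows the QZ5 test to be shorter without widening the uncertainty about p; the discount factor encodes a judgment about how transferable the QZ4 evidence is.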
System QZ5 passed its operational test, as fully expected, and needed no post-production retrofitting, saving considerably more funds than were spent both to test it continuously throughout development and to contribute to the test data
archive. The results of the testing of the QZ5 system were archived to assist in the test design and evaluation of any related systems in the future.
CONCLUSION AND RECOMMENDATIONS
Conclusion 3.1: Major advances can be realized by applying selected industrial principles and practices in restructuring the paradigm for operational testing and the associated information gathering and evaluation process in the development of military systems.
Recommendation 3.1: Congress and the Department of Defense should broaden the objective of operational testing to improve its contribution to the defense acquisition process. The primary mandate of the Director, Operational Test and Evaluation should be to integrate operational testing into the overall system development process to provide as much information as possible as soon as possible on operational effectiveness and suitability. In this way, improvements to the system and decisions about continuing system development or passing to full-rate production can be made in a timely and cost-efficient manner.
Recommendation 3.2: Operational evaluations should address the overall performance of a defense system and its ability to meet its broadly stated mission goals. When a system is tested against detailed requirements, these requirements should be shown to contribute to necessary performance factors of the system as a whole.
Recommendation 3.3: The Department of Defense and the military services, using common financial resources, should develop a centralized testing and operational evaluation data archive for use in test design and test evaluation.
Recommendation 3.4: All services should explore the adoption of the use of small-scale testing similar to the Army concept of force development test and experimentation.