Testing and Evaluation in an Evolutionary Acquisition Environment
CONTINUOUS PROCESS OF TESTING, LEARNING, AND IMPROVING SYSTEMS
Operational testing and evaluation, as portrayed in the current milestone system, supports a decision to pass or fail a defense system before it goes to full-scale procurement. However, the U.S. Department of Defense (DoD) and the Services have not been following this consistently, as evidenced by the fact that critical systems have rarely been terminated solely on the basis of testing.
A 1998 report by a National Research Council panel considered many of the themes that are of interest in this report. Statistics, Testing, and Defense Acquisition: New Approaches and Methodological Improvements (National Research Council, 1998:35) proposed a new paradigm in which testing should be viewed as a “continuous process of information gathering and decision making in which operational testing and evaluation plays an integral role.” This new paradigm, suggested in the context of the traditional single-stage development, is even more important in the evolutionary acquisition environment.
With staged development, if a system has gone into full-scale production in Stage I, then a goal in subsequent stages typically will be to identify ways to efficiently incorporate additional capabilities. Furthermore, earlier versions of a fielded system should provide very useful data on strengths and weaknesses prior to the next and future stages of system development.
Thus, it is even more critical to view testing as a process of experimentation: one that involves continuous data collection and assessment, learning about the strengths and weaknesses of newly added capabilities or (sub)systems, and using all of this information to determine how to improve the overall performance of the system. This should not be viewed as an activity to be carried out solely by contractors near the initiation of system development or by DoD near the end; instead, it should become an intrinsic part of system development with facilitated communication and interaction between the contractor and government testers throughout the developmental process.
In the evolutionary acquisition context, experimentation in early stages can be used to identify system flaws and to understand the limitations of system design. The focus in later stages should be on problems identified in the field and/or unresolved from earlier testing, evaluating the most recent modifications to the system, and assessing the maturity of a new component or subsystem design. This experimentation can be at the component level, at the subsystem level, or at the system level, with varying degrees of operational realism, depending on the goals.
Operational testing and evaluation (or testing for verification) will still have a major role to play, since it is the only way to verify that the systems are in fact operationally effective and suitable. In fact, it is critical that there is adequate oversight and accountability in this flexible environment. However, it is not realistic to undertake comprehensive operational tests at each stage of the development process. These should be undertaken only at stages encompassing major upgrades, the introduction of new, major capabilities, or new major (sub)systems. At other stages, a combination of data and insights from component or subsystem testing and developmental tests (reflecting operational realism as feasible) can be used instead, along with engineering and operational user judgment.
Conclusion 1: In evolutionary acquisition, the entire spectrum of testing activities should be viewed as a continuous process of gathering and analyzing data and combining information in order to make effective decisions. The primary goal of test programs should be to experiment, learn about the strengths and weaknesses of newly added capabilities or (sub)systems, and use the results to improve overall system performance. Furthermore, data from previous stages of development, including field data, should be used in design, development, and testing at future stages. Operational testing (testing for verification) of
systems still has an important role to play in the evolutionary environment, although it may not be realistic to carry out operational testing comprehensively at each stage of the development process.
The use of the term “testing” to describe this continuous process is unfortunate, and we do not expect that it will be changed. Nevertheless, it is important for DoD to recognize explicitly that the primary goal of test and evaluation programs in evolutionary acquisition will be to experiment, learn, and use the results to improve system performance.
Recommendation 1: The under secretary of defense (acquisition, technology and logistics) and the director of operational test and evaluation should revise DoD documents and processes (e.g., DoD Directive 5000.1 and DoD Instruction 5000.2) to explicitly recognize and accommodate a framework in which the primary goal of all acquisition testing and evaluation programs is to experiment, learn about the strengths and weaknesses of system components, and to incorporate these results into system enhancement initiatives.
TESTING OUTSIDE THE ENVELOPE
In a learning environment, testing should go beyond estimating typical performance of a new or modified system to also emphasize the early discovery of failure modes,1 deficiencies, and weaknesses in the system design. This will require adding more operational realism to developmental tests as well as testing with demanding and possibly even accelerated stress environments. This is consistent with the practices of some companies (for example, Ford Motor Company), in which reliability improvement is based primarily on failure-mode avoidance, that is, finding and eliminating failure modes early in development rather than merely demonstrating compliance with prescribed reliability metrics.
The General Accounting Office, on the basis of a study of industrial best practices, advocates a knowledge-based approach to system development (see U.S. General Accounting Office, 2004). Katherine Schinasi described this approach in a presentation at the workshop. Failures in early-stage testing are viewed as rich sources of information about design inadequacies. The objective is to “break it big early,” rather than to utilize test events that concentrate on performance under typical stresses. This is consistent with suggestions in Statistics, Testing, and Defense Acquisition (National Research Council, 1998) to test outside the envelope and to include characteristics of operational realism early in developmental testing. This will assist in discovering both reliability failure modes and deficiencies in operational use as early as possible—when it is least costly to explore redesign and improvement options, and to implement specific modifications and enhancements (these are sometimes referred to as “countermeasures” in industry, although the term has a different meaning in DoD).2
This early stage experimentation is particularly important for the development of reliable defense systems. Tom Christie’s introduction to the FY 2004 Director of Operational Test and Evaluation’s annual report states:
The Defense Science Board in 2000 pointed out that 80 percent of recent U.S. Army defense systems brought to operational test failed to achieve even half of their reliability requirement. This was followed later by data showing that with all the streamlining of the acquisition process, the number of systems failing to meet the reliability requirement had increased. As stated earlier, this trend is evident in the reports DOT&E sends to Congress.
The experience of the acquisition community is that system reliability in operational tests is generally substantially worse than that in developmental tests. For example, operational test mean times to first failure are typically shorter than developmental test values by a factor of roughly two to four. Clearly, developmental testing gives an incomplete assessment of operational system quality. In addition, if failure modes identified in operational tests (or when fielded) had been discovered earlier in the development process, they would have been much less expensive to fix and would have contributed to more rapid system maturation. This emphasizes the need to test defense systems more stressfully and as early as possible.
We think that for test purposes, “edge of the envelope” can be defined fairly rigorously. The space of conceivable military scenarios for operational testing includes a number of uncontrollable dimensions (e.g., environmental characteristics, potential missions, threat objectives and characteristics, etc.), and these dimensions can be usefully parameterized to identify the edge of the envelope. For example, in Bonder (2000), with the parametric operational situation (POS) space formulation,3 each point in this space represents an operational situation that U.S. forces might have to be deployed to and operate in. Some of these situations are more stressful than others. Operational testing should be performed in the most stressful situations in which U.S. forces and/or system performance can barely be successful (i.e., the edge of the envelope). This will provide information about system performance/force capability under very stressful conditions and facilitate estimating performance under less stressful conditions (i.e., inside the envelope) as interpolations rather than extrapolations.
The use of a POS space to design a testing strategy is another reason to include modeling and simulation as an integral part of the operational testing process. The POS space is developed by identifying: (1) likely geographic areas around the world that U.S. forces might have to be deployed to and operate in, (2) the set of uncontrollable “operational parameters” that affect U.S. military/system performance in those areas (terrain, weather, threat objectives, levels, equipment, tactics, countermeasures, etc.), and (3) the likely range of each parameter in the POS space. Using this space, parametric simulation-based analyses should be used to identify feasible stressful situations for operational testing. As noted elsewhere in this report, modeling and simulation-based analyses should also be used to identify issues and hypotheses for testing, to guide the development of testing strategies and designs, and to analyze and extend test results.
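To make the idea concrete, the following Python sketch enumerates a small, entirely hypothetical POS space, scores each operational situation with a made-up stress function, and ranks the feasible situations to approximate the edge of the envelope. The parameter names and stress weights are invented for illustration; they are not drawn from Bonder (2000) or any DoD source, and a real analysis would derive stress from validated simulation models.

```python
from itertools import product

# Hypothetical parametric operational situation (POS) space.
# Each parameter maps its levels to an illustrative stress weight in [0, 1].
POS_PARAMETERS = {
    "terrain":      {"open": 0.2, "urban": 0.8, "mountain": 0.9},
    "visibility":   {"day": 0.1, "night": 0.7},
    "threat_level": {"low": 0.2, "medium": 0.5, "high": 0.9},
}

def stress(situation):
    """Aggregate stress score for one operational situation (mean of weights)."""
    weights = [POS_PARAMETERS[p][v] for p, v in situation.items()]
    return sum(weights) / len(weights)

def edge_of_envelope(max_feasible_stress, top_n=3):
    """Rank feasible situations by stress; the most stressful feasible
    situations approximate the 'edge of the envelope' for test planning."""
    names = list(POS_PARAMETERS)
    situations = [dict(zip(names, combo))
                  for combo in product(*(POS_PARAMETERS[n] for n in names))]
    feasible = [s for s in situations if stress(s) <= max_feasible_stress]
    return sorted(feasible, key=stress, reverse=True)[:top_n]

if __name__ == "__main__":
    for s in edge_of_envelope(max_feasible_stress=0.8):
        print(s, round(stress(s), 2))
```

Testing at these high-stress points, as the text notes, lets estimates of performance in less stressful situations be made by interpolation rather than extrapolation.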
Conclusion 2: Testing early in the development stage should emphasize the detection of design inadequacies and failure modes. This will require testing in more extreme conditions than those typically required by either developmental or operational testing, such as highly accelerated stress environments.
The current incentive structure in DoD may discourage testing outside the envelope or identifying component or system limitations and failure modes early in the process. Test strategies by both contractors and DoD apparently are designed to emphasize situations in which a defense system can “do well.” This may be one of the reasons that reliabilities assessed during developmental tests are much higher than those assessed during operational tests. For example, the ineffectiveness of early identification of failure modes may stem from the fact that the current acquisition system uses funding mechanisms to punish programs that reveal failures early, relative to programs that do not appear to have initial problems. Pressures on program officials to meet budgets and deadlines, due to congressional and other oversight, result in test strategies geared toward demonstrating “successful” performance. Thus, testing is often carried out under benign or typical stresses and operating conditions, rather than striving to determine failure modes and system limitations by testing under more extreme circumstances. Comprehensive testing targeted toward identifying system failure modes and system limitations—a cornerstone of commercial system development—does not appear to have a high priority in DoD.
Conclusion 3: To have a reasonable likelihood of fully implementing the paradigm of testing to learn about and to improve systems prior to production and deployment, the roles of DoD and congressional oversight in the incentive system in defense acquisition and testing must be modified. In particular, incentives need to be put in place to support the process of learning and discovery of design inadequacies and failure modes early and throughout system development.
DEVELOPMENTAL AND OPERATIONAL TESTING
As noted already, there is a need for more operational realism in early test events to help identify failure modes and system performance shortcomings that will appear under operational circumstances. In evolutionary acquisition, it will be practical to conduct full-scale operational tests only at stages with major upgrades or substantive new capabilities. At other stages, only developmental tests of components and subsystems and some limited tests on functionality, interoperability, etc., will be feasible. These tests, while not full-scale operational tests, should use operational realism to the extent needed to assess the performance of these components and subsystems from an operational perspective.
Thus, the current distinction between operational and developmental testing will have to be reconsidered. A new paradigm, better coordinated between the two approaches, would allow system development to benefit from a more continuous, strategic approach to testing—efficiently supporting the evaluation of competing designs at the beginning of system development, and then the evaluations of new components, subsystems, and complementary systems prior to integration with the existing overall system of interest. Testing at the component or subsystem levels would also emphasize the identification of operationally relevant failure modes and design limitations.
Recommendation 2: The under secretary of defense (acquisition, technology and logistics) and the director of operational test and evaluation should revise DoD testing procedures to explicitly require that developmental tests have an operational perspective (i.e., are representative of real-world usage conditions) in order to increase the likelihood of early identification of operational failure modes and system deficiencies, so that appropriate actions can be developed and deployed in a timely fashion.
DoD’S ROLE IN SYSTEM DESIGN AND DEVELOPMENT AND INTERACTION WITH CONTRACTORS
In commercial enterprises, techniques for experimentation are most effectively applied at the product design and development stages. In DoD, however, most of the testing efforts (including elements of developmental testing) are performed too late in the process to detect deficiencies, improve
the design, or enhance system performance. To quote Statistics, Testing, and Defense Acquisition (National Research Council, 1998:38) again, “It is now generally acknowledged [in industry] that quality cannot be ‘inspected’ into a product…. Rather, quality must be designed into any complex new product at virtually every stage of its development. This is accomplished by creating acquisition processes that make use of continuous monitoring and testing.”
Testing in the design and development stage will aid critical decisions on the choice of materials and design layout, the selection of parameter values for optimizing system performance, and the exploration of robustness to the impact of various uncontrollable “noise” factors in manufacturing and field-use conditions. In the commercial sector, a considerable part of the design and development takes place within the company, with direct control over these decisions. In DoD, in contrast, there appears to be no mechanism for the relevant DoD test and acquisition officials to be closely involved with the contractors in the design and development phase. Under current contractual procedures, the contractor’s experiments, processes, simulation models, test results, and decisions often are proprietary.
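The robustness idea above can be sketched computationally. The following Python fragment evaluates each candidate control-parameter setting across a small "noise array" and prefers settings whose performance is both high on average and insensitive to the noise factors. Everything here is hypothetical: the control parameters, noise factors, response function, and the simple mean-minus-dispersion score are stand-ins for a real designed experiment or validated simulation model.

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical control parameters (what the designer can set) and
# noise factors (uncontrollable manufacturing/field-use conditions).
CONTROL = {"gain": [1.0, 2.0, 3.0], "damping": [0.1, 0.5]}
NOISE = [{"temp": t, "vibration": v} for t in (-10, 25, 50) for v in (0.0, 1.0)]

def response(ctrl, noise):
    # Made-up performance model (higher is better); a real study would
    # use physical tests or a validated simulation here.
    return (ctrl["gain"] * 2.0
            - ctrl["damping"] * abs(noise["temp"] - 25) * 0.05
            - noise["vibration"] * (3.0 - ctrl["damping"]))

def robust_setting():
    """Pick the control setting with the best mean-minus-spread score
    across the noise array (penalizing sensitivity to noise)."""
    names = list(CONTROL)
    best, best_score = None, float("-inf")
    for combo in product(*(CONTROL[n] for n in names)):
        ctrl = dict(zip(names, combo))
        ys = [response(ctrl, nz) for nz in NOISE]
        score = mean(ys) - pstdev(ys)
        if score > best_score:
            best, best_score = ctrl, score
    return best
```

The crossed control-by-noise evaluation is the essence of robust parameter design: a setting that wins only under benign conditions loses to one that holds up across the noise space.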
Admittedly, even in private industry, some manufacturers rely on specifications, testing, and inspection to enforce the quality and reliability of suppliers’ products. But leading companies like Toyota maintain very close relationships with suppliers and work hand-in-hand with them to improve their processes and products (see Box et al., 1988). This has allowed them to implement leading-edge quality practices, such as just-in-time and lean manufacturing, with cooperation and information sharing throughout the value chain. Current rules and regulations do not allow DoD to take advantage of such practices.
Effective implementation of evolutionary acquisition requires closer interaction and a high degree of coordination and communication among system developers, testers, and system users. A critical component of such interaction is the need for DoD testers to have full and early access to all contractor data sources in order to support a mutually flexible and iterative approach for design, development, and testing.
Conclusion 4: In the evolutionary acquisition environment, effective system development and optimization will require a high degree of coordination and communication among system developers, government testers, and system users. In particular, government testers should have early access to all contractor data sources, including test plans,
results of early stage testing and experimentation, and the results of all pertinent modeling and simulation products.
Recommendation 3: The under secretary of defense (acquisition, technology and logistics) should develop and implement policies, procedures, and rules that require contractors to share all relevant data on system performance and the results of modeling and simulation developed under government contracts, including information on their validity, to assist in system evaluation and development.
The committee does not underestimate the difficulty of getting contractors to share their test data and of separating information that is proprietary from data that should properly be shared with the program officials and DoD test community. Clearly, this will require considerable discussion and effort. Nevertheless, the committee thinks that these data-sharing provisions should be carefully negotiated up front as part of the system development contract.
Cooperation between DoD and contractors may have been discouraged in the past due to concerns over the possible impact on the independence of government testers and on the objectivity of assessments of system performance. While such independence is needed, it is also important to learn from industrial best practices in order to work closely with contractors and suppliers. DoD must recognize the value of information on upstream design problems in guiding test design and in supporting the continuous assessment and improvement of system performance.
ENSURING MATURITY OF NEW TECHNOLOGY
In addition to overly optimistic operational requirements (discussed in Chapter 4), incorrect assessment of the maturity of new technology and complications in converting technological advances into producible and reliable products are major causes of slippage in an acquisition schedule and associated cost growth. For example, the U.S. General Accounting Office (1992:51) found that “successful programs have tended to pursue reasonable performance objectives and avoid the cascading effects of design instability.” In 2001, the Center for Naval Analyses (CNA) addressed the instability of requirements and performance characteristics for major DoD acquisition programs. On the basis of interviews, it reported that senior
DoD officials believed the Army’s acquisition process had too many sources of requirements and too little coordination or corporate direction. As a result, “too many requirements get ‘approved’ without adequate consideration of resource availability or near-term feasibility.” CNA also reviewed the acquisition program baselines (APBs) for 70 DoD acquisition category I (ACAT I) programs to assess changes in performance characteristics (APBs are the “contracts” between DoD/Service acquisition executives and their program managers containing the performance goals and thresholds for system performance). For the 70 Army, Navy, and Air Force programs, CNA found that there were 20 to 30 performance characteristics per program and that 10 to 30 performance characteristics changed for each program (i.e., about 50 to 100 percent changed). They also noted that the frequency of APB changes was roughly one every 2.5 to 2.8 years.
In the traditional single-stage acquisition environment, it may take a decade or more before a new but possibly risky technology can be incorporated into an existing system. This may have encouraged the incorporation of risky, immature technology into acquisition programs. There are also currently no management penalties for using such immature technologies.
Evolutionary acquisition provides increased opportunities to better discipline the management of acquisition programs. Decision makers can weed out unwarranted optimism in draft “requirements” and choose only mature technologies to include in the early stages of acquisition programs. They can delay the introduction of risky, immature technology to future stages. Engineers can demonstrate a satisfactory level of technological maturity in separate advanced technology demonstrations (ATDs) or advanced concept technology demonstrations (ACTDs). Demonstrating technological maturity before including the technology in a formal acquisition program will eliminate the need to delay the entire acquisition program because of one risky technology area or risk using technology that may not be sufficiently effective or reliable. (This separation of the roles of technology development and product development is supported in U.S. General Accounting Office, 2004.)
The under secretary of defense (acquisition, technology and logistics) or the relevant service acquisition executive could request the director of defense research and engineering to certify (or refute) sufficient technological maturity for critical components to be included in any particular stage of the system’s development. The relevant acquisition authority could also provide for special reviews of the analysis of alternatives (AoA) for assessment of technology risk and maturity called for in Section 3.5.3 of DoD Instruction 5000.2 or the materiel developer’s assessment, during developmental test and evaluation, of technical progress and maturity against critical technical parameters (as documented in the test and evaluation master plan) called for in Section E5.5.4 of DoD Instruction 5000.2. By taking such steps to achieve early confirmation of maturity or to delay introduction of an immature technology to a later stage of development, the acquisition executive can avoid putting the entire acquisition program in an unacceptably high-risk position.
Recommendation 4: The under secretary of defense (acquisition, technology and logistics) should require that all technologies to be included in a formal acquisition program have demonstrated sufficient technological maturity before the acquisition program is approved or before the technology is inserted in a later stage of development. The decision about the sufficiency of technological maturity should be based on an independent assessment from the director of defense research and engineering or special reviews by the director of operational test and evaluation (or other designated individuals) of the technological maturity assessments made during the analysis of alternatives and during developmental testing and evaluation.
COMPARISON WITH BASELINE SYSTEMS IN EVOLUTIONARY ACQUISITION
In traditional acquisition programs, operational tests and evaluations often test both the system under development and the baseline system (i.e., the control) that is scheduled to be replaced (or the system or family of systems that currently performs comparable missions). Baseline testing provides direct contrasts—both relative to specific required performance characteristics and on a broader mission performance scale—that can establish the degree to which the new system is an improvement over the old one. The ability to make such comparisons is especially important when the integrity of prescribed performance requirements for the new system can be questioned (e.g., lacking a solid basis, sensitive to test scenario and test execution specifics, subject to data measurement uncertainties), when those requirements are limited in scope (e.g., focused solely on technical performance characteristics rather than operational mission accomplishment), or when the new system’s performance in the operational test and evaluation
appears to come up short relative to one or more specific established requirements.4
In an evolutionary acquisition framework, baseline testing should retain its essential role. The current system provides the baseline in Stage I, while the new system from the previous stage will serve that role in subsequent stages. When there are major changes to the system, the additional system(s) or capabilities (or both) that distinguish the new stage should be subjected to extensive technical testing and operational assessments that explore functionality, interoperability, safety, etc., over the spectrum of relevant environmental conditions.
In a staged development process, the system at the initial stage will have advanced through its own test and evaluation. In addition, there will be field data on actual field performance. These sources of data will provide very useful comparative information on the performance of the baseline system, much more so than in the current acquisition process, in which an existing system is compared with a completely new prototype.
OPERATIONAL TESTING OF VERY COMPLEX DEFENSE SYSTEMS
There is another consideration that is independent of the issue of evolutionary acquisition, and that is the increasing complexity of major defense systems. The complexity includes: (1) components, subsystems, and systems of systems and their interactions, (2) network centricity, (3) leading-edge materials technology, (4) leading-edge guidance and control technology, (5) unknown human factors, (6) unknown vulnerability to countermeasures, and (7) software demands on the order of tens of millions of lines of code. As a result, DoD is fast approaching a period in which a single all-encompassing large-scale operational test, as currently practiced, will cease to be feasible.5
The sheer number of subsystems, components, materials, software, and resulting system interactions may become overwhelming. Operational testing that is typically limited to at most a few dozen test scenarios cannot hope to exercise the systems in enough ways to discover all of the important design deficiencies with a reasonable degree of confidence. As discussed above, extensive operational testing will be reserved for stages that have the most substantive changes involving the most components and subsystems, and will be limited to a relatively small number of test scenarios, even across stages of development. For other stages, testing will be limited to component, subsystem, and software testing.
Thus, it is unlikely that the operational evaluation of complex systems can be based primarily on an operational test representing the full spectrum of applicable (simulated) combat conditions in the field at an affordable cost, within a reasonable time, and with meaningful conclusions. The number of all relevant combinations of scenarios for testing the system will simply be too large for any traditional operational testing.
This concern about the ability to operationally test very complex systems is not new. A good example is the Peacekeeper missile, which was tested in the early 1980s. It could not be comprehensively tested using standard methods, since, for example, only 20 operational replications were feasible. As a result, assessments of performance over a wide range of operationally realistic trajectories could not be definitive. Given these circumstances, the operational evaluation needed to be based largely on a model of the system, with the model’s validity established as well as possible given the test replications. Various “tricks” were used so that each test replication provided as much information as possible. For example, a flight might have several accelerometers and instrumentation might be very detailed. The model depended heavily on all of the information that went into the original design, developmental testing, and other sources. This example demonstrates that even with very limited testing, one could ultimately be confident in one’s assessment of the system’s performance capabilities.
were the case, the general discussion of the proper size of the test budget needs to be considered in the broader context of estimates of life-cycle costs and the effectiveness of the fielded systems given more continuous testing. It is the general experience of industrial systems development that identification of failure modes early enough to identify superior system designs not only reduces the frequency of retrofitting and redesign work, but also improves the performance of the system in the field. Therefore, greater use of testing that identifies failure modes and system limitations early can pay for itself in reduced maintenance, repair, and replacement costs, as well as in protocols for how the systems are to be effectively used.
An alternative approach to testing is to get experts knowledgeable in the system together with experts in statistical experimental design to (a) model and analyze the system design at important levels of abstraction for important scenarios of use; (b) identify the most likely critical or vulnerable components, interfaces, or interactions and design and conduct relatively cheap, focused tests of these components or interfaces; (c) present the results and case analysis to key people in the acquisition processes; and (d) refocus the development to the critical or vulnerable components.
Traditional operational test and evaluation can and should still be conducted for simpler systems to demonstrate that these systems are likely to work as intended in the field. But as the percentage of defense systems that are extremely complex appears to be increasing, the operational test and evaluation strategy must focus on the essential components, subsystems, and interfaces. Although such an approach will not provide a guarantee that the comprehensive and complex system will work, it will provide useful evidence that certain scenarios and zones of use will work reliably. Analogous to mapping a mine field, testing can show “safe paths” of important use. For example, there are techniques for identifying small subsets of all possible combinations of components, subsystems, and environments, so that critical pairwise or higher-order combinations are tested; Cohen et al. (1994) discuss an application to software testing. Also, aspects of the “exploratory analysis under uncertainty” paradigm (e.g., Davis et al., 1999) may be applicable.
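The pairwise idea can be illustrated with a short Python sketch. It greedily selects full test cases until every pair of factor-level values appears in at least one case, in the spirit of the combinatorial designs discussed by Cohen et al. (1994). The factor names and levels below are hypothetical; real test planning would use validated combinatorial design tools.

```python
from itertools import combinations, product

# Hypothetical test factors and levels for a complex system.
FACTORS = {
    "comm_link":  ["satcom", "line_of_sight"],
    "weather":    ["clear", "rain", "dust"],
    "threat_ecm": ["off", "on"],
}

def pairwise_suite(factors):
    """Greedily pick test cases until every pair of factor-level
    values is covered by at least one selected case."""
    names = list(factors)
    uncovered = set()
    for f1, f2 in combinations(names, 2):
        for v1 in factors[f1]:
            for v2 in factors[f2]:
                uncovered.add(((f1, v1), (f2, v2)))
    all_cases = [dict(zip(names, combo))
                 for combo in product(*(factors[n] for n in names))]
    suite = []
    while uncovered:
        # choose the candidate case covering the most uncovered pairs
        def gain(case):
            return sum(1 for pair in uncovered
                       if all(case[f] == v for f, v in pair))
        best = max(all_cases, key=gain)
        if gain(best) == 0:
            break
        suite.append(best)
        uncovered -= {pair for pair in uncovered
                      if all(best[f] == v for f, v in pair)}
    return suite

if __name__ == "__main__":
    suite = pairwise_suite(FACTORS)
    print(f"{len(suite)} test cases cover all pairs "
          f"(vs {2 * 3 * 2} exhaustive combinations)")
```

Even in this tiny example the pairwise suite is half the size of the exhaustive one; for systems with dozens of factors the savings are dramatic, which is what makes this family of techniques attractive when exhaustive operational testing is infeasible.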
In summary, with unlimited time and test budgets, the preferred approach would be to test the system or its components in every operationally relevant scenario. However, given the large number of all possible combinations of the factors and test scenarios, testing will have to make use of clever strategies. These include testing only in selected scenarios to examine the performance of those components or interfaces that seem to be most problematic, testing only a subset of all possible interactions or interoperability features, testing those scenarios that correspond to the most frequent types of use, or some combinations of these strategies. Clearly, there is a need to investigate various alternatives and develop specific proposals.
Conclusion 5: The DoD testing community should investigate alternate strategies for testing complex defense systems to gain, early in the development process, an understanding of their potential operational failure modes, limitations, and level of performance.