Suggested Citation:"2 Operational Testing and System Acquisition." National Research Council. 1998. Statistics, Testing, and Defense Acquisition: New Approaches and Methodological Improvements. Washington, DC: The National Academies Press. doi: 10.17226/6037.

2
Operational Testing and System Acquisition

The main objective of defense acquisition is to obtain quality weapon systems, in a cost-effective and timely manner, that meet an operational need. Two broad types of testing are used to assist in this goal. Developmental testing covers a wide range that includes component testing, modeling and simulation, and engineering systems testing. Developmental testing presents the first opportunity to measure the performance and effectiveness of the system against the criteria developed in the analysis of alternatives (formerly cost and operational effectiveness analysis). Operational testing and evaluation examines the performance of a fully integrated set of systems, including the subject system, under realistic operating environments, perhaps for the first and only time before implementation. It is the process that the U.S. Department of Defense (DoD) uses to assess whether a weapon system actually meets its planned capability before deciding whether to begin full-rate production. Operational testing and evaluation is an independent assessment of whether a system is effective and suitable for its specified use; this independent assessment is vitally important to acquisition.

The first two sections of this chapter briefly describe operational testing and evaluation and discuss the differing perspectives of the parties involved in defense acquisition. The panel then considers how well the current operational testing and evaluation structure supports the effective acquisition of defense systems and how modern statistical practices, combined with fundamental changes in the current paradigm (the topic of Chapter 3), could result in substantial improvements.

A SHORT HISTORY OF OPERATIONAL TESTING

The Office of the Director, Operational Test and Evaluation (DOT&E) in DoD was established in 1983. Before then, there were several organizations in the Office of the Secretary of Defense (OSD) involved in one or more aspects of operational testing; none had the responsibility of managing or monitoring operational testing and evaluation as a whole. In addition, each service was generally responsible for planning, conducting, evaluating, and reporting its own operational tests. With limited DoD oversight, each service developed unique procedures and regulations for operational testing and evaluation. They differed in the flexibility of operational test guidelines, in the extent to which test results were used iteratively to improve weapon systems, and in the degree to which operational test agencies were subordinate to organizations responsible for system development.

Without any unified structure or policy for operational testing and evaluation, DoD began to be strongly criticized in the late 1960s. Testing agencies often suffered from high turnover and were often subordinate to organizations responsible for new system development. There were excessive layers of bureaucracy separating the test agencies from the chiefs of staff (direct reporting of test results to decision makers was all but non-existent), and operational commands generally set aside insufficient funding, personnel, or facilities to accomplish adequate testing and evaluation. (For detailed historical information prior to 1970, see Blue Ribbon Defense Panel, 1970.)

In 1971 Congress enacted Public Law 92-156, which, among other things, required DoD to begin reporting operational test results to Congress. The Deputy Secretary of Defense directed the military services to designate field commands, independent of the system developers and the eventual users, to be responsible for planning, conducting, and evaluating operational tests (U.S. General Accounting Office, 1986). These agencies were instructed to report directly to the appropriate chief of staff.

The next major change in DoD came in 1983, with the congressionally established DOT&E. DOT&E is headed by a director, who is the principal department adviser to the Secretary of Defense on operational testing and evaluation and is responsible for prescribing policies and procedures for its conduct. By law, a major defense acquisition program may not proceed beyond low-rate initial production until initial operational test and evaluation is completed. The law requires that the director shall analyze the results and prepare a report stating the opinion of the director as to:

whether the test and evaluation performed were adequate; and
whether the results of such test and evaluation confirm that the items or components actually tested are effective and suitable for combat.

This provision has led many officials to view operational testing and evaluation as a final series of test events; the system either passes and is produced or fails and is either returned to development or terminated.

A 1987 General Accounting Office evaluation of the effectiveness of DOT&E found several important deficiencies (U.S. General Accounting Office, 1987b). First, DOT&E was criticized for failing to sufficiently monitor the services: on-site observations of operational tests were inadequate to ensure compliance with DoD testing policy. Second, DOT&E's independent assessments of operational test results relied too heavily on the services' own test reports. DOT&E is required to conduct independent analyses using the actual test data and not the services' reports; however, in many cases DOT&E reports were copied verbatim from service documents. Third, DOT&E failed to maintain accurate records of its principal activities. DOT&E officials concurred with all of the General Accounting Office findings, citing understaffing as the major problem.

DOT&E had attempted to upgrade the conduct of operational testing and evaluation by concentrating its early efforts on improving test planning. The General Accounting Office reported that these efforts were largely successful (U.S. General Accounting Office, 1987b). Traditionally, the services had not given much emphasis to preparing test plans; with increased DOT&E oversight, GAO evaluations support the assertion that the quality of test design was improved. A recent GAO report (U.S. General Accounting Office, 1997) has strongly endorsed the value of the different functions of independent review, approval, and assessment of the various stages of operational test design and evaluation provided by DOT&E.

Since the establishment of DOT&E, the conduct of operational testing has undergone several minor changes. With the assistance of DOT&E, two successive Secretaries of Defense rewrote the directives that govern the acquisition of major systems (including Directive 5000.1, Defense Acquisition, and Directive 5000.2, Major System Acquisition Policies and Procedures). These documents acknowledge and emphasize the importance of testing and evaluation in the acquisition cycle. In the past 10 years, DOT&E has eliminated many of the deficiencies noted in the 1987 evaluation: for example, DOT&E assessments now rely less heavily on service test reports.

Congress has imposed some additional changes on the acquisition process. Operational testing and evaluation of major defense acquisition programs may not be conducted without formal approval from DOT&E. In addition, vulnerability and lethality testing must be completed under the oversight of DOT&E.

Currently, DOT&E has identified several service initiatives as priorities, including: (1) earlier involvement of operational testers in the acquisition process through the use of early operational assessments and integrated product teams; (2) more effective use of modeling and simulation before operational testing and evaluation and as part of early operational assessments; (3) combining developmental and operational tests and using information from all viable sources to make operational testing and evaluation more effective; (4) operational command partnerships between testers and users; and (5) combining testing and training (U.S. Department of Defense, 1997). The panel supports these initiatives and encourages the services, along with DOT&E, to vigorously pursue them.

THE DECISION CONTEXT: ROLE OF INCENTIVES

To understand the use of testing as part of system development, it is important to have a general understanding of the acquisition process. We very briefly describe the different milestones below; for a more detailed description, see Statistical Methods for Testing and Evaluating Defense Systems (National Research Council, 1995:Appendix A).

The Acquisition Process

The procurement of a major weapon system follows a series of event-based decisions called milestones. At each milestone, a set of criteria must be met if the next phase of acquisition is to proceed; see Figure 2-1. For all acquisition programs, test planning is supposed to begin very early in the process and involve both the developmental and operational testers. Both should prepare the Test and Evaluation Master Plan, which is a requirement for all acquisition programs. For ACAT I programs, this plan must be approved by DOT&E and the Director, Test, System Engineering, and Evaluation (the OSD office with oversight of developmental testing, which reports to the Defense Acquisition Executive).

Milestone II follows the demonstration and validation phase. At this point the milestone decision authority, in consultation with DOT&E, determines the low-rate initial production quantity to be procured before initial operational testing is completed. The number of prototypes required for operational testing is also specified by the service test agencies and by DOT&E for ACAT I programs.

After obtaining milestone II approval, the system enters the engineering and manufacturing development phase. One of the testing objectives during this phase is to demonstrate that the system satisfies the mission need and meets contract specifications and minimum operational performance requirements. Based on what has been outlined in the Test and Evaluation Master Plan, the testers and evaluators prepare detailed planning documents that are subjected to OSD review. However, resource constraints may prevent certain system characteristics from being evaluated. In such cases, the testers identify what they can accomplish given the constraints.

The operational test results are interpreted by many separate agencies, including the relevant service's operational test agency and DOT&E. DOT&E prepares independent operational test and evaluation reports for the Defense Acquisition Board, the Secretary of Defense, and Congress.

FIGURE 2-1 DoD Acquisition Process. SOURCE: U.S. General Accounting Office (1997:35).

In making the milestone III recommendation to initiate full-rate production for ACAT I systems, the Defense Acquisition Board considers among other things (e.g., costs or changes in a threat) the developmental test results and the reports from DOT&E and the service test organizations. If the Undersecretary of Defense for Acquisition and Technology approves full-rate production, the full-rate production contracts are then awarded, consistent with any DoD or congressional restrictions. Follow-on operational testing is performed during the early stages of the production phase to monitor system performance and quality.

Incentives in the Acquisition Process

Everyone directly or indirectly involved in defense testing and evaluation faces constraints and rules that give each a unique perspective and result in different incentives. While neither good nor bad, these different incentives have a profound impact on the test and evaluation decision-making process (e.g., budgets, schedules, access to information). Therefore, when examining how these decisions are made, it is important to understand and take into consideration each participant's role in that process.

For example, a scoring conference is a group of about six to eight people (including testers, evaluators, and representatives of the program manager) who review information about possible test failures and determine whether to count such events as failures or exclude them from the regular part of the test evaluation. Each member brings a different perspective to the table and may have an interest in the test results that is unrelated to the value of the system under consideration. These different perspectives and motivations must be examined and considered; if they are not, improving the statistical methods used in testing and evaluation is likely to have little or no effect.

As studies of organizational behavior have long demonstrated, the goals of a large organization need not be exactly concordant with those of the smaller organizational units that compose it. It is entirely possible to confront a situation in which individually attractive behavior can lead to collectively undesirable consequences. In the rest of this section we describe the various actors in defense testing and acquisition in a stylized and somewhat oversimplified manner by drawing a slightly exaggerated representation of conscientious individuals' best efforts on behalf of their country and their positions. In describing the generation and flow of information and how the incentives of those involved in the process affect decisions, our goal is clarity and not offense. We realize that there are oversimplifications in this description, but we are convinced that the points made are fully relevant to the current process of DoD acquisition. At a minimum, the arguments offered demonstrate that incentives, if ignored or not balanced or otherwise taken into consideration, can have a substantial impact on the acquisition process.

The primary organizations involved in operational testing and evaluation are DoD and the military services, contracting firms, legislators, and the news media; many of them have diverse parts, such as test personnel, the program managers and their staffs and immediate supervisors, the DOT&E staff and managers, the project managers for the contractors, members of Congress and their staffs, General Accounting Office personnel, and finally, the U.S. public. For the purpose of this stylized exercise, we limit our scope to OSD management, the program staff, test personnel, DOT&E, the contractor, the public, legislators (which broadly includes Congressional members, their staffs, and General Accounting Office auditors), and the news media. This and other such simplifications allow us to provide a succinct and simplified description that gives some insight into the effects of each participant's objectives and incentives.

Ultimately, national security is being provided for the U.S. population. Using game-theoretic terminology, its members are the "principals" while the legislators are an "agent." Citizens delegate DoD oversight to Congress; therefore, in the congressional-DoD relationship, the legislators are the principal and DoD the agent. It is important to note that this vernacular should not be interpreted to mean that legislators have "better" or "purer" objectives than DoD. The only group with presumably pure objectives is the U.S. population.

Contracts determine the nature of relationships between principals and agents. No agent will ever act exactly as the principal desires unless (1) the agent can be perfectly monitored or (2) the agent acquires the entire stake or interest of the principal, thereby becoming the principal. In the context of DoD acquisition, neither one of these occurs. So the contract under which a "player" functions will almost certainly not lead the player to act exactly as the principal would want.

Legislators are concerned with issues that can affect their reelection, such as employment in their districts. Program managers receive promotions when they can take their program to the next milestone or full-rate production. Consequently, they have a strong incentive to meet set schedules. System deficiencies are typically viewed as problems that will be fixed in the future.¹ The tester is, ideally, an independent and objective professional who is not affected by the pressures built into the process. Such independence may be compromised, however, if a tester is influenced in some way by the program staff. When this happens, the tester attempts to craft a test that (partially) reflects the program staff's perspective.

A rather dramatic tension exists between the program staff and their superiors and legislators. The program manager must comply with regulatory and reporting requirements, but the goal is to have the program approved for full-rate production, and so program staff have an incentive not to convey all known data about the system to legislators if negative information could delay or perhaps even terminate the program. (This incentive to conceal potentially unfavorable data was recognized by Congress in the early 1980s and was a primary motivation for the creation of DOT&E.) At the same time, the desire to avoid negative publicity creates an incentive for additional regulation and oversight of the acquisition process.

There are legitimate reasons, however, for a program staff to be concerned with complete disclosure to legislators. For example, a system may have a problem that the program staff knows can be fixed without jeopardizing the overall program. Legislators, however, may find these assurances to be unverifiable and not believable. For legislators, the risk of supporting a program with a major flaw is greater than the program staff's risk: if a program is flawed but funding continues, it is Congress and OSD management that will be held responsible by the public. Yet if the program becomes a great success, DoD will enjoy the accolades.

¹ A recent report (U.S. General Accounting Office, 1997) provides another perspective on the interaction between acquisition and test officials: "In reviews of individual weapon systems, we have consistently found that testing and evaluation is generally viewed by the acquisition community as a requirement imposed by outsiders rather than a management tool to identify, evaluate, and reduce risks, and therefore a means to more successful programs. Developers are frustrated by the delays and expense imposed on their programs by what they perceive as overzealous testers."

Another source of tension lies in the fact that participants in the acquisition process have different amounts of, access to, and abilities to process information. A reasonable ranking from "least informed/least capable to process" to "most informed/most capable to process" would be: public, news media, legislators, DOT&E, DoD management, contractor, program staff. Testers have specific information that makes it difficult to assess where they fit in the ranking. The news media often report only on weapon system deficiencies. Although these reports may be part of a bigger story that cannot be told without compromising national security, people are naturally upset at this apparent misuse of public funds. Members of Congress know that only a part of the information has been made public; nevertheless, they still tend to be responsive to a public outcry because of reelection concerns.

Operational testing and evaluation provides the program staff with valuable information and is the means by which Congress determines (through DOT&E certification) that it is getting the product that was promised. Even though operational testing and evaluation has a demonstrated value, the program staff commits limited resources to it because of overall program budget constraints, choosing instead to reserve the majority of available funds for program development. In fact, those programs with serious problems may have the smallest testing budgets since more funds will likely be allocated for development, leaving less for operational testing and evaluation.

How does this system of conflicting incentives operate? For one example, consider the potential go/no go² decision that is associated with the current milestone system and operational testing and evaluation, particularly the negotiations between the program staff and the contractor. If there were no such decision point, we conjecture that the contractor would be less focused on research and development, more willing to let problems go unresolved, and generally less concerned about job performance. If so, then the go/no go decision plays a role in advancing systems to production (see next section). This incentive structure also explains why it is common for system requirements to be aggressive or optimistic; the goal is to get the system approved for development. One can also use the above considerations to show that additional oversight could result in less testing, thereby making the acquisition process worse (Gaier and Marshall, 1998). Therefore, while this game-theoretic framework is oversimplified, it might, for example, be able to assist Congress in examining potential legislation to see if it might have unintended consequences. Further efforts to expand this approach to represent more of the complexity of decision making in acquisition could have real benefits.

² In this report, we use the phrase "pass/fail" to mean "produce/terminate": a decision is made to either proceed to full-rate production or terminate the program. We use the phrase "go/no go" to mean "produce/delay decision/terminate": a decision is made to proceed to full-rate production, conduct further testing or development, or terminate the program.

Recommendations

The above discussion suggests how incongruent and conflicting incentives may have a negative effect on defense acquisition, and specifically on testing and evaluation. Although simple models of portions of the acquisition process are instructive to develop and examine and should be further pursued (see, e.g., Gaier and Marshall, 1998), the number of participants and the complexity of their interaction pose too difficult a challenge to permit elaborate modeling. Therefore, recommendations related to individual participants or steps in the process that seem sensible when that person or step is examined in isolation run the real risk of ignoring important interactions. This consideration led the panel to be cautious in its recommendations related to incentives and information flow. Overall, DoD should consider the complex, real situation when developing and assessing proposals for acquisition reform and should strive for structures in which the incentives of the participants are as congruent as possible.

There are two areas affected by this interplay of incentives that are easy to conceptualize and are in need of improvement: determining operational requirements and allocating resources. As noted above, there is a strong incentive to be optimistic with system requirements to ensure program initiation approval. The result can be parameters that are untestable or requirements that cannot be met. If operational testers reviewed the Operational Requirements Document, it would help build integrity and quality into the process from the start. It might also discipline the test community to focus on the operational needs as expressed in the Operational Requirements Document and not expand "test requirements" with measures of performance that are included in other documents.

Recommendation 2.1: The Department of Defense and the military services should provide a role for operational test personnel in the process of establishing verifiable, quantifiable, and meaningful operational requirements. Although the military operators have the final responsibility for establishing operational requirements, the Operational Requirements Document would benefit from consultation with and input from test personnel, the Director, Operational Test and Evaluation, and the operational test agency in the originating service. This consultation will ensure that requirements are stated in ways that promote their assessment.

The federal government is often criticized for mandating work to the states but not providing the necessary funds. As part of its oversight function, DOT&E can order that more tests be performed, but with budget constraints affecting all of DoD, the services cannot be expected to have the additional resources needed to cover the cost. A reserve fund would ensure that an appropriate number of tests are performed when necessary. Additional tests may show serious system flaws that require a return to development, something a program staff clearly wants to avoid. If program managers know that DOT&E will not only be able to mandate additional tests but will also have access to a source of funds to make sure they are actually implemented, they might be encouraged to undertake more complete operational testing at an earlier stage in the process. Finally, just the use of these funds should be seen as an indication of a potentially substantial problem, either with how the system is being tested or with the worthiness of the system itself.

Recommendation 2.2: The Director, Operational Test and Evaluation, subject to the approval of the Secretary of Defense on a case-by-case basis, should have access to a portion of the military department's acquisition funding reserves (being set up as a result of the first quadrennial defense review) to augment operational tests for selected weapon systems.

PROBLEMATIC CHARACTERISTICS OF THE CURRENT SYSTEM

In addition to the problems posed by the incentives in operational testing and evaluation, there are several other characteristics of the acquisition system that must ultimately be changed if operational testing and evaluation and acquisition are to meet their greatest potential.

Operational Testing as a Go/No Go Decision

The results of operational testing are critical inputs for the decision makers who will determine the future of the system under test. Significance testing is sometimes used to give a rigorous statistical grounding to this decision process: in those instances, if a system fails a significance test for a key operational requirement or for a large collection of other requirements, the milestone decision can be portrayed as resulting in the termination of the system. (For a detailed discussion of the potential problems in using significance tests, see Chapter 6.) However, the actual decision process is much more continuous than the idealized milestone system would indicate, as it should be in light of the large investment in system development.
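
As a minimal sketch of the kind of significance test described above, suppose a requirement is stated as a minimum success probability per trial and the test consists of a fixed number of independent trials. The function names, the 0.05 significance level, and the numbers in the usage note are illustrative assumptions, not values taken from any program or from this report. The system "fails" the test when the observed number of successes would be improbably low if the requirement were exactly met.

```python
from math import comb


def binomial_lower_tail(successes: int, trials: int, p_required: float) -> float:
    """Return P(X <= successes) for X ~ Binomial(trials, p_required).

    This is the one-sided p-value for the null hypothesis that the true
    success probability is at least p_required.
    """
    return sum(
        comb(trials, k) * p_required**k * (1 - p_required) ** (trials - k)
        for k in range(successes + 1)
    )


def fails_significance_test(
    successes: int, trials: int, p_required: float, alpha: float = 0.05
) -> bool:
    """True when the observed result rejects the requirement at level alpha.

    alpha = 0.05 is an assumed, conventional choice, not a mandated value.
    """
    return binomial_lower_tail(successes, trials, p_required) < alpha


if __name__ == "__main__":
    # Hypothetical test: requirement of 0.80, 20 trials.
    for observed in (12, 13):
        p_value = binomial_lower_tail(observed, 20, 0.80)
        verdict = fails_significance_test(observed, 20, 0.80)
        print(f"{observed}/20 successes: p-value {p_value:.4f}, fails test: {verdict}")
```

Under these assumed numbers, a difference of a single trial outcome (12 versus 13 successes out of 20) moves the result across the threshold, which illustrates why a strict pass/fail reading of a small test can misrepresent what the data actually show.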

It is extremely unusual for a defense system to be terminated. Because of the momentum built up during the many years of developmental activity (and because perfection is a process and not a state), a system must convincingly fail one or more operational tests or have other problems to be terminated. The panel is aware of only four ACAT I systems in the last 12 years whose failure during operational testing has been believed by many (though not all) observers to have been a primary element in their subsequent termination (DIVAD [Sergeant York], ASPJ, AQUILA, and ADATS). As a result, systems typically must undergo corrective action when they fail operational test and are later retested.

Clearly, a system redesign should be done when it is relatively straightforward and cost-effective (i.e., modest in scope and expense), or when there are no alternatives for meeting the operational requirements. However, when this is not the case and termination should be considered, there are strong disincentives to do so in the current system. There are very few people with decision authority in the acquisition community who would advocate terminating a marginal system in their own areas of responsibility. Therefore, if the choice is between termination and redevelopment, a deficient system most likely will be returned to development.

A serious related problem is the cost of retrofitting a system whose deficiencies were found after full-rate production had begun. Post-production redesign is particularly troublesome with software-intensive systems. As more systems incorporate extensive software components, it is becoming more common for software problems to arise during operational tests. These problems are often erroneously considered to be easier to fix than hardware problems; as a result, decision makers are inclined to pass a system with the expectation that any software errors will be subsequently solved in parallel with full-rate production. This assumption has caused problems because software failures are often very serious. (Software test issues are discussed in Chapter 8.)

Constraints on Statistics from the Current Application of Test to Military System Development

There are many specific problems presented by the need to effectively carry out operational testing and evaluation that can be substantially improved by state-of-the-art statistical practices. However, any improvements gained through the application of modern statistical practices (detailed in Chapters 5-9) are limited by the current structure of operational testing within military system development.

Because of both limited statistical modeling expertise and the fact that developmental test data, absent statistical modeling, are typically not relevant to evaluation of a system's operational readiness, data other than operational test data are not typically used to assess whether a system is ready for full-rate production (the exception being the pooling of data to assess reliability, availability, or maintainability). Furthermore, information from tests on related systems and from developmental tests, including information on test circumstances, is not available to operational test evaluators since there is no well-maintained repository of data on test and field performance. This lack of information restricts the use of modern statistical techniques for effective design of operational tests and for the combination of information at the evaluation stage. As a result, test efficiency is diminished and test costs are increased because statistical practices cannot be effectively used.

A lack of continuous test and evaluation of systems throughout development with respect to their operational performance—which would result from developmental testing having more operational realism or through the use of small, early operational tests—also hampers test design and limits additional possibilities for combining information. Finally, systems are at times permitted to change during operational testing, especially software systems. All of these practices restrict opportunities for full use of statistical methods. (Chapter 3 presents more detail on these information and statistical practice constraints.)

How Much Testing Is Enough?

The 1983 legislation effectively mandates that a decision to proceed beyond low-rate initial production must be based on operational testing that is adequate and that confirms the system's effectiveness and suitability for its intended uses. Assessing whether a test is adequate and has confirmatory power leads naturally to one question: How much testing is enough? We examine this question from a statistical perspective. (See Chapter 5 for some of the decision-theoretic considerations.)

As noted by Seglie (1992), the question has several dimensions. In its simplest form, "how much testing is enough" might be regarded as a relatively straightforward exercise in determining sample size requirements. However, if operational testing is viewed as a pass/fail step in the milestone process, then the costs and benefits of this decision must be considered. As Vardeman (1992) observes: "even in apparently simple situations, producing an honest answer to the question requires hard work and typically involves genuinely subtle considerations."

Significance-test-based statistical justifications result in sample sizes for operational tests that often exceed credible resource limits. For example, consider a prospective missile system for which the total projected procurement would be 1,000 missiles. Assume that the missile should come within lethal range of a primary target at least 80 percent of the time. In order to estimate the missile's ability to deliver a warhead within lethal range to within 5 percentage points with 90 percent confidence, approximately 148 missiles would have to be fired in destructive testing (174 would result from ignoring the finite population correction) (see, e.g., Cochran, 1977:75-76). A proposed operational test design in which 15 percent of the projected arsenal would be consumed in live-fire testing would be justifiably challenged as an inappropriate allocation of military acquisition resources. This problem is even more complicated for defense systems that need to be evaluated in many different operating scenarios, which effectively reduces the sample size in any given scenario.
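
The figures in the missile example can be reproduced with the standard normal-approximation sample size formula for a proportion, with and without the finite population correction. The sketch below is illustrative only. The function names are hypothetical, and it assumes a two-sided normal approximation with p = 0.80, a margin of 0.05, and 90 percent confidence; that assumption is consistent with the stated results of 174 and 148 but is not spelled out in the text.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence, population=None):
    """Sample size needed to estimate a proportion to within +/- margin.

    Uses the normal approximation n0 = z^2 * p * (1 - p) / margin^2.
    If a finite population size is given, applies the finite population
    correction n = n0 / (1 + (n0 - 1) / N). Hypothetical helper, not taken
    from the report.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def margin_of_error(p, n, confidence):
    """Half-width of the normal-approximation interval for n trials
    (no finite population correction)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


# Assumed inputs from the missile example: 80 percent success rate,
# +/- 5 percentage points, 90 percent confidence, 1,000 missiles procured.
print(sample_size_for_proportion(0.80, 0.05, 0.90))                   # 174
print(sample_size_for_proportion(0.80, 0.05, 0.90, population=1000))  # 148

# Hypothetical split of 148 firings evenly across 4 scenarios (37 each).
print(round(margin_of_error(0.80, 37, 0.90), 3))
```

Under these assumptions, dividing the 148 firings evenly among four hypothetical scenarios leaves 37 per scenario, and the margin of error within any single scenario widens to close to 11 percentage points. This is the sense in which multiple operating scenarios reduce the effective sample size.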

The resource constraints on operational tests come into direct conflict with the information demands and test sample sizes that are required for rigorous, statistically supported confirmation of system effectiveness and suitability. The panel recognizes that this places DOT&E in a difficult position when it attempts to carry out its work as mandated by Congress.

Although supportable certification of system effectiveness and suitability can be and is made on the basis of the information (and judgment) generally available at the conclusion of operational testing, operational testing can also perform a different but extremely valuable function. Operational testing produces information about system performance in settings that come closer to real use than any other testing activity, and therefore it is uniquely informative about problems in system design and limitations in system performance from an operational perspective. It has long been recognized that operational testing involves a "system" that includes not only the prototype piece of hardware, but also the human operators, the training of those operators, and the military tactics and doctrine used in the weapon's deployment. Thus, a system's failure to perform in an effective or suitable manner may be attributable to poorly trained operators or inappropriate tactics. It is important to understand the causes of ineffective or unsuitable performance in real use, which can best be accomplished with operational test and evaluation. Elements of the planned operational tests that the panel reviewed at Fort Hunter Liggett were extremely useful for understanding how the Longbow Apache helicopter would function with realistic challenges, typical users, in day and night, and under different scenarios of attack. The panel is not arguing against the use of operational testing. Rather, the panel is arguing that in a new view of operational testing and evaluation as part of defense system development, operational tests might play an expanded and more informative role in evaluating new defense systems.

Conclusions

The panel believes that neither the U.S. public, Congress, nor DoD is well served by perpetuating the fiction that, if done properly, operational testing will always provide sufficient information to make definitive statistical assessments of an individual system's effectiveness and suitability.

Conclusion 2.1: For many defense systems, the current operational testing paradigm restricts the application of statistical techniques and thereby reduces their potential benefits by preventing the integration of all available and relevant information for use in planning and carrying out tests and in making production decisions.

Conclusion 2.2: The operational test and evaluation requirement, stated in law, that the Director, Operational Test and Evaluation certify that a system is operationally effective and suitable often cannot be supported solely by the use of standard statistical measures of confidence for complex defense systems with reasonable amounts of testing resources.

Conclusion 2.3: Operational testing performs a unique and valuable function by providing information on the integration of user, user support (e.g., training and doctrine), and equipment in a quasirealistic operational environment.

In remarks to a symposium of the test and evaluation community, Paul G. Kaminski, former Under Secretary of Defense for Acquisition and Technology, stated his belief that "a cultural change is necessary . . . one that can only begin by reexamining the fundamental role of test and evaluation in the acquisition of new military capabilities" (Kaminski, 1995). Similar remarks suggest that others share the panel's view that operational testing and its role as part of system development should be reconfigured to increase its effectiveness and efficiency in producing information about prospective military systems.

The panel believes that substantial advances can be realized by modifying the current defense acquisition and operational testing paradigm by approaching operational testing as an information-gathering activity. Although we do not offer a complete blueprint for reorganizing defense testing as part of system development, we believe we can contribute to ongoing discussions about the role of test and evaluation in defense acquisition. Chapter 3 describes improvements that the panel believes could result from moving toward a new paradigm in which modern statistical methods support a quality-focused acquisition process. The approach we advocate is consonant with recent trends both in industry and within DoD.