Robust testing is an important part of effective system development. It can lead to early detection and correction of design deficiencies, and it facilitates high quality and reliability throughout system development. Testing at the U.S. Department of Defense (DOD) has been a subject of several previous National Research Council (NRC) reports. This section summarizes the conclusions and recommendations from those reports that are relevant to this panel’s charge and offers additional analysis and suggestions.
Operational testing and evaluation (OT&E) is intended to support a decision to pass or fail a defense system before it goes into full-scale production, but this practice has not been consistently followed by DOD. The National Research Council (1998) proposed a new paradigm recommending that testing be viewed as a continuous process of information gathering and decision making in which OT&E plays an integral role.
The new paradigm stressed the importance of adding operational realism to developmental testing. A key motivation for this focus, which is relevant to this report, is to discover design flaws much earlier in system development than currently occurs, when such defects are much less expensive to fix. It is well known that operational testing, because it better represents operational conditions, unearths many design problems missed in earlier developmental testing. Adding operational realism to developmental testing is therefore very likely to help discover these flaws earlier in the development process. A further benefit is that it provides a closer connection between developmental and operational testing, thereby making it easier to combine information from the two forms of testing.
We also note that operational testing, as currently conducted, is typically too short to discover many reliability deficiencies, such as fatigue-related failures. Developmental testing is also typically too short to find some of these flaws. These weaknesses in the current testing approach motivate the discussion below of accelerated testing, which, when properly implemented, can effectively expedite the discovery of design flaws.
A later report (National Research Council, 2006:15) noted that continuous testing is especially appropriate for systems that are acquired in stages, as one “learns about strengths and weaknesses of newly added capabilities or (sub) systems, and uses the results to improve overall system performance.” This report also recommended that DOD documents and processes be revised “to explicitly recognize and accommodate [this] framework” (p. 3) so that the testing community is engaged in a joint effort to learn about and improve a system’s performance. Although such formal changes have not been made, practices within DOD appear to be moving in this direction, one that is consistent with commercial industry practices.
There are a number of challenges in implementing this paradigm. Test data from various sources need to be readily available, including field data from similar systems, data from previous stages of development, contractor data, developmental data, and data from modeling and simulation. Information from these sources can then be combined and exploited for effective test planning, design, analysis, and decision making. There are, however, major obstacles to implementing this approach in DOD: the lack of data archives (see discussion below); the use of multiple databases, each with its own format and incompatibilities; lack of access to data; and, perhaps most important, the lack of an incentive structure that emphasizes early detection of faults and sharing of information. As noted in the NRC report (2006:19): “incentives need to be put in place to support the process of learning and discovery of design inadequacies and failure modes early.” In addition, the NRC recommended that DOD require contractors to share all relevant data on system performance and results of modeling and simulation developed under government contracts. Similarly, Adolph et al. (2008:219) noted: “Sharing and access to all appropriate system-level and selected component-level test and model data by government DT [developmental testing] and OT [operational testing] organizations” should be required in defense contracts. Despite these recommendations, there has been little progress in this key area.
The importance of collecting and using all available data for effective decision making has been emphasized in several NRC reports.1 Furthermore, it was the major focus of a subsequent report (National Research Council, 2004). Chapter 2 in that report deals with combining information to improve test planning and test design as well as analysis, and Chapter 3 discusses methods and examples related to reliability and suitability assessment. There is also an extensive statistical literature on this topic; in particular, an earlier NRC report (1992) is still a very useful reference.
Our contribution in this section is to provide some concrete ideas on how to parametrize the test space in order to improve test design and to combine results from different testing environments.2
A defense system is typically designed with some specific missions in mind. These missions can be characterized (at least partially) by variables that describe the environment of use (temperature, precipitation, wind speed, day/night, terrain, speed during use, weight of cargo, etc.). Other relevant factors include presence of countermeasures and enemy systems and the amount of training that the test personnel will have (which can vary widely from the so-called golden crews to the amount of training users will receive when a system is fielded). These factors may be ordered categorical variables or continuous variables. All possible combinations of these factors characterize the intended operational environment and hence the test space. These characterizations will often be incomplete in some respects since there may be some nominal (unordered) factors or some nuisance or noise factors that cannot be fully captured. The more effort that is placed in identifying and characterizing this space, the more efficient the testing program will be.
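To make the idea of a test space concrete, the sketch below enumerates one as the cross product of mission-environment factors. The factor names and levels are purely illustrative assumptions, not drawn from any particular system:

```python
# Illustrative sketch: a test space as the cross product of mission factors.
# Factor names and levels are hypothetical.
from itertools import product

factors = {
    "temperature_c": [-20, 0, 25, 45],           # continuous, discretized into test levels
    "precipitation": ["none", "rain", "snow"],   # ordered categorical
    "time_of_day": ["day", "night"],
    "terrain": ["desert", "woodland", "urban"],  # nominal (unordered)
    "crew_training": ["typical", "golden"],      # ordered categorical
}

# Every combination of factor levels is one point in the test space.
test_space = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(test_space))  # 4 * 3 * 2 * 3 * 2 = 144 candidate test points
```

Even this toy space has 144 points, which illustrates why systematic selection of test points, rather than exhaustive coverage, is necessary.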
Both operational and developmental tests can be viewed as points in this space. Operational testing will use typical scenarios in the field and so may fall in the middle region of the test space (at least for some of the factors). Often, a systematic approach, such as statistical design of experiments, is used to select the combinations of factor settings. Developmental testing is more ad hoc and will not examine the space systematically. Furthermore, it is likely to be based on more extreme scenarios, or what is often referred to as testing at the edge of the envelope.

1See Recommendation 7.8 in National Research Council (1998:120) and Recommendation 2 in National Research Council (2003:53).

2The National Research Council (2006:18) report discussed such a test space: “We think that for test purposes, ‘edge of the envelope’ can be defined fairly rigorously. The space of conceivable military scenarios for operational testing includes a number of uncontrollable dimensions (e.g., environmental characteristics, potential missions, threat objectives and characteristics, etc.), and these dimensions can be usefully parameterized to identify the edge of the envelope.” Bonder (1999) discusses parametric operational situation (POS) space formulation: “Each point in this space represents an operational situation that U.S. forces might have to be deployed to and operate in. Some of these situations are more stressful than others.”
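The statistical design of experiments mentioned above can be sketched with a classical fractional factorial: the example below builds a 2^(4-1) design, selecting half of a full two-level, four-factor design via a defining relation. This is a textbook construction offered only as an illustration of how factor-setting combinations might be chosen:

```python
# Hedged sketch: a 2^(4-1) fractional factorial design for four two-level
# factors, using the generator D = ABC (defining relation I = ABCD).
from itertools import product

LO, HI = -1, +1
runs = []
for a, b, c in product([LO, HI], repeat=3):
    d = a * b * c  # the fourth factor is confounded with the ABC interaction
    runs.append((a, b, c, d))

print(len(runs))  # 8 runs instead of the 16 in the full 2^4 factorial
```

Halving the number of runs comes at a known cost: each main effect is confounded with a three-factor interaction, a trade-off the test designer must judge acceptable.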
Most of the operational test studies that we have seen are simple analyses that do not model the behavior of the factors over the test space. There is clearly some value in such analyses, which make no assumptions and treat all the factors as nominal. But it would also be very useful to conduct additional analyses in which the effects of the factors are modeled parametrically (by fitting parametric functions). Such analyses provide a framework in which data from developmental tests (which may be isolated points in the test space) can be combined with data from operational tests to strengthen the overall inference. Of course, part of the exploratory analysis will include checking for consistency among developmental testing, operational testing, and other types of data, both empirically using extrapolations and using knowledge of the similarities and differences in the testing environments, and even for components and subsystems when data are available. If developmental testing includes scenarios at the edge of the envelope, the data can be interpolated to check for consistency with operational test data before they are combined. This framework also allows for sequential testing during developmental testing, with the aim of collecting more information in areas of the test space in which uncertainty is higher.
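A minimal sketch of this consistency check, using entirely synthetic data and a deliberately simple one-factor linear model, is shown below. The factor, performance measure, and threshold are all assumptions for illustration:

```python
# Hedged sketch: fit a parametric model (linear in one factor) to operational
# test data, then check whether a developmental test point at the edge of the
# envelope is consistent with the model before pooling the data.
# All values below are synthetic.

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
    b0 = ybar - b1 * xbar
    return b0, b1

# Operational tests at mid-range factor settings (e.g., cargo weight, tons).
ot_x = [4.0, 5.0, 6.0, 7.0]
ot_y = [92.0, 90.0, 88.0, 86.0]  # synthetic performance measure

b0, b1 = fit_line(ot_x, ot_y)

# Developmental test at the edge of the envelope.
dt_x, dt_y = 10.0, 80.0
predicted = b0 + b1 * dt_x

# If the extrapolated prediction is close to the observed DT result, the two
# sources look consistent and pooling them may be justified.
print(abs(predicted - dt_y) < 2.0)  # True for this synthetic data
```

In practice the model, the number of factors, and the consistency criterion would all require subject-matter justification; the point here is only the shape of the workflow.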
The panel recognizes that there are inherent dangers in combining data across heterogeneous sources without carefully considering the differences in the data sources and the reasons for the differences. Moreover, the ideas described here may not be applicable in all situations. For example, developmental test data may often be available only on components or subsystems. Nevertheless, it is important to examine different ideas on how to effectively combine data and effectively use test resources.
As the term suggests, accelerated testing involves conducting tests under conditions that differ substantially from normal operating conditions. Testing at the edge of the envelope, discussed above, can be viewed as one example. The discussion in this section deals mainly with reliability testing for suitability assessment.
The main goal in accelerated testing is to induce failures or degrade performance rapidly. Highly accelerated tests are commonly used by reliability engineers to identify failure modes. We focus here on the use of moderate acceleration regimens to estimate product or system reliability.
(An important caveat in these situations is that the acceleration should not induce failure modes that would not occur during normal operation.) Accelerated tests have been used extensively in industry. They are needed to estimate the reliability of highly reliable components or systems since few failures will occur during the (short) test phase of product development.
There are two common types of acceleration schemes: (1) increasing usage rate and reducing idle time and (2) using higher stress levels, such as temperature, voltage, humidity, and pressure. In the latter case, the appropriate stress factor(s) will depend on the component and failure mode of interest—corrosion, fatigue, mechanical wear, etc. There is an extensive discussion of stress factors corresponding to different types of failure mechanisms in the engineering literature.3
There is also considerable literature on the planning, design, and analysis of accelerated testing for life tests, where the outcome is lifetime data. The approach has also been used with degradation data (continuous measures of performance) although this literature is not as extensive (see Meeker and Escobar, 1998:Chs. 13, 21). Accelerated testing relies critically on the use of models to extrapolate the test results to normal use conditions. The literature emphasizes the need for using subject-matter knowledge and caution in extrapolating and suggests the use of extensive sensitivity analyses to assess the effects of using different models.
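As a concrete instance of the model-based extrapolation described above, the sketch below uses the Arrhenius relationship, a standard model for temperature-accelerated life tests. The activation energy is an assumed value chosen only for illustration:

```python
# Hedged sketch: Arrhenius model for extrapolating temperature-accelerated
# life-test results to normal use conditions.
# The activation energy below is an assumed, illustrative value.
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(use_temp_c, stress_temp_c, activation_energy_ev):
    """Acceleration factor: how much faster failures occur at the stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    return math.exp(activation_energy_ev / BOLTZMANN_EV * (1 / use_k - 1 / stress_k))

# With these assumed values, one test hour at 85 C corresponds to roughly
# `af` hours of use at 25 C.
af = arrhenius_af(use_temp_c=25, stress_temp_c=85, activation_energy_ev=0.7)
print(af > 1)
```

The sensitivity of the acceleration factor to the assumed activation energy is exactly why the literature stresses subject-matter knowledge and sensitivity analysis before trusting such extrapolations.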
Accelerated testing is well known in the reliability community, and the panel expects that it is used extensively by defense contractors. However, given the inherent assumptions involved in these studies, it would be desirable for testers from DOD to either participate in their planning and analyses or have access to the test schemes in advance. Accelerated testing can and should play a prominent role in suitability assessment by DOD.
Software systems are a major part of defense acquisition programs, either as stand-alone software systems or as critical components of hardware systems. Software problems are also ubiquitous in poorly performing defense systems.4 Although the use of processes such as agile development may lead to higher software quality, testing will remain crucially important. There is a substantial literature on software testing, so we do not provide an overview here. In particular, the NRC (2003:Ch. 3) has described techniques for software testing and related issues, including model-based testing, Markov chain usage models, and the use of combinatorial experimental designs.

3For an example, see Reliability, Life Testing and the Prediction of Service Lives: For Engineers and Scientists (Saunders, 2007).

4For example, see the report of the Defense Science Board, Task Force on Defense Software (2000).
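The combinatorial designs noted above aim to guarantee, with few tests, that every pair of factor values is exercised together. The sketch below checks that pairwise-coverage property for a candidate test suite; the configuration factors are hypothetical:

```python
# Hedged sketch: verifying pairwise (2-way) coverage of a software test suite,
# the property that combinatorial test designs aim to guarantee.
# The configuration factors are hypothetical.
from itertools import combinations, product

factors = {
    "os": ["linux", "windows"],
    "db": ["postgres", "sqlite"],
    "mode": ["online", "offline"],
}

def covers_all_pairs(suite):
    """True if every value pair of every factor pair appears in some test."""
    names = list(factors)
    for f1, f2 in combinations(names, 2):
        needed = set(product(factors[f1], factors[f2]))
        seen = {(t[f1], t[f2]) for t in suite}
        if needed - seen:
            return False
    return True

# The full factorial (8 tests) trivially covers all pairs...
full = [dict(zip(factors, vals)) for vals in product(*factors.values())]
print(covers_all_pairs(full))  # True

# ...but so does this 4-test suite, half the size.
small = [
    {"os": "linux", "db": "postgres", "mode": "online"},
    {"os": "linux", "db": "sqlite", "mode": "offline"},
    {"os": "windows", "db": "postgres", "mode": "offline"},
    {"os": "windows", "db": "sqlite", "mode": "online"},
]
print(covers_all_pairs(small))  # True
```

The savings grow quickly with the number of factors, which is why pairwise and higher-strength covering designs are attractive for configuration-heavy software.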
There are some unique challenges with embedded systems, in which the software is embedded in hardware and has limited functionality (e.g., a GPS receiver) or is intended to react to a wide range of stimuli, such as the avionics for a jet fighter. These and other factors will determine if the software should be considered as simply a component of the full system during either developmental or operational testing or if the software needs to be tested separately from the remainder of the system. There is only a limited literature on the testing of embedded systems (but see Bringmann and Kramer, 2008, for some possibilities).