This chapter describes the role that developmental testing plays in assessing system reliability. The requirements for a system specify the functions it is expected to carry out and the operational situations in which it is expected to do so. The goal of testing—particularly in situations where theory or prior experience do not predict how well a system will function in specific environments—is to show whether the system will function satisfactorily over the specified operational conditions. Thus, developmental testing should be designed so that one can evaluate whether a system does so; in our context, the goal is to assess whether the system will be reliable when deployed.
Developmental testing, like all forms of testing, is not a cost-effective substitute for thorough system and reliability engineering, as a system is developed from concept to reality. However, developmental testing is an essential supplement. Testing can provide the hard, empirical evidence that the system works as designed or that there are reliability problems that the system designers did not anticipate.
For complex systems intended for use in multidimensional environments, designing an efficient, effective, and affordable testing program is difficult. It requires a mix of system engineering, subject-matter knowledge, and statistical expertise. For the U.S. Department of Defense (DoD), there are a wide variety of defense systems, and there is no “one-size-fits-all” menu or checklist that will assure a satisfactory developmental test program. The knowledge, experience, and attitudes of the people involved will be as important as the particular methods that are used. Those methods will have to be tailored to particular situations: hence, there is a need
for the people involved to have the requisite knowledge in reliability engineering and statistics to adapt methods to the specific systems.
In this chapter, the first two sections look briefly at the role of contractors in developmental testing and basic elements of developmental testing both for contractors and for DoD. We then describe in more detail three aspects of developmental testing: experimental design to best identify reliability defects and failure modes and for reliability assessment; data analysis of reliability test results; and reliability growth monitoring, which has as its goal identification of when a system is ready for operational testing.
We recommend (see Recommendation 12 in Chapter 10) that contractors provide DoD with detailed information on all tests they perform and the resulting data from those tests. If this recommendation is adopted, then early developmental testing becomes a continuation and extension of contractor testing, which focuses primarily on identifying defects and failure modes, including design problems and material weaknesses. However, as the time for developmental testing approaches, contractors likely carry out some full-system testing in more operationally relevant environments: this testing is similar to the full-system testing that will be carried out at the end of developmental testing by DoD. Making use of more operationally relevant full-system testing has the important benefit of reducing surprises and assuring readiness for promotion to developmental testing. Given the potential similarity in the structure of the tests, this contractor testing would also increase the opportunities for combining results from contractor and later developmental (and operational) DoD tests.
While early developmental testing emphasizes the identification of failure modes and other design defects, later developmental testing gives greater emphasis to evaluating whether and when a system is ready for operational testing.
Several elements are important to the design and evaluation of effective developmental tests: statistical design of experiments, accelerated tests, reliability tests, testing at various levels of aggregation, and data analysis:
- Statistical design of experiments involves the careful selection of a suite of test events to efficiently evaluate the effects of design and operational variables on component, subsystem, and system reliability.
- Accelerated tests include accelerated life tests and accelerated degradation tests, as well as highly accelerated life tests, which are used to expose design flaws (see Chapter 6).
- Reliability tests, which are often carried out at the subsystem level, include tests to estimate failure frequencies for one-shot devices, mean time to failure for continuously operating, nonrepairable devices, mean time between failures for repairable systems, and probabilities of mission success as a function of reliability performance for all systems.
- Testing at various levels of aggregation involves testing at the component, subsystem, and system levels. Generally, more lower-tiered testing will be done by the contractor, while at least some full-system testing is best done by both the contractor and DoD.
- Analysis of developmental test data has two goals: (1) tracking the current level of reliability performance and (2) forecasting reliability, including when the reliability requirement will be met. If contractor and developmental test environments and scenarios are sufficiently similar, then combining information models, as described in National Research Council (2004), should be possible. However, merging data collected under different environmental or stress conditions is very complicated, and attempts to do so need to explicitly account for the different test conditions in the model.1
To make the best use of developmental testing, coordination between the contractor and government testers is important. Effective coordination requires a shared view that testing should be mutually beneficial for the contractor, DoD, and taxpayers in achieving reliability growth. Testing from an adversarial perspective does not serve any of the parties or the ultimate system users well.
If, as we recommend, contractor test data are shared with the DoD program personnel, then those data can provide a sound basis for subsequent collaboration as further developmental testing is done in order to improve system reliability, if needed, and to demonstrate readiness for operational testing by DoD. There are also technical aspects to the recommended collaboration, such as having DoD developmental test design reflect subject-matter knowledge of both the developer and the user. This collaboration includes having a test provide information for what are believed to be the most important design and operational variables and the most relevant levels of those variables, agreement on the reliability metrics, and, more broadly, defining a successful test.

1 This is what is done when developing models to link accelerated test results to inference for typical use, but the idea is much more general. It includes accounting for the developmental test/operational test (DT/OT) gap in reliability, as in Steffey et al. (2000), and providing estimates for untested environments that are interpolations between environments in which tests were carried out.
One example of collaboration stems from the need for reliability performance to be checked and documented across the complete spectrum of potential operating conditions of a system. If an important subset of the space of operational environments was left unexplored during contractor testing—such as testing in cold, windy environments—it would be important to give priority during developmental testing to the inclusion of test replications in those environments (see Recommendation 11 in Chapter 10).
Developmental tests for reliability are experiments. Their purpose is to evaluate the impact of operational conditions on system reliability. To be efficient and informative, it is critical to use the principles of statistical experimental design to get the most information out of each test event. A recent initiative of the Director of Operational Test and Evaluation (DOT&E, October 2010, p. 2) is summarized in a memorandum titled “Guidance on the Use of Design of Experiments (DOE) in Operational Test and Evaluation,” and very similar guidance applies to developmental testing:
- A clear statement of the goal of the experiment, which is either how the test will contribute to an evaluation of end-to-end mission effectiveness in an operationally realistic environment, or how the test is likely to find various kinds of system defects.
- The mission-oriented response variables for (effectiveness and) suitability that will be collected during the test (here, the reliability metrics).
- Factors that (may) affect these response variables. Test plans should provide good breadth of coverage of these factors across their applicable levels. Further, test plans should select test events that focus on combinations of factors of greatest interest, and test plans should select levels of these factors that are consistent with the test goals.
- The appropriate test matrix (suite of tests) should be laid out for evaluating the effects of the controlled factors in the experiment on the response variables.2
- Determination of the experimental units and blocking.
- Procedures that dictate control of experimental equipment, instrumentation, treatment assignments, etc., should be documented for review.
- Sufficient test scenario replications should be planned to be able to detect important effects and estimate reliability parameters.

2 We note that specific designs that might prove valuable in this context, given the large number of possible factors affecting reliability metrics, include fractional factorial designs, which maximize the number of factors that can be simultaneously examined, and Plackett-Burman designs, which screen factors for their importance: for details, see Box et al. (2005).
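The breadth-of-coverage and test-matrix points above can be illustrated with a small sketch. The code below constructs a hypothetical 2^(5-2) fractional factorial in five two-level factors, of the kind noted in footnote 2, using the standard generators D = AB and E = AC; the factor names are invented for illustration, and a real test matrix would be tailored to the system under test.

```python
from itertools import product

# Hypothetical two-level factors for a reliability test (names invented):
# A = temperature, B = humidity, C = vibration, D = terrain, E = mission length.
# A 2^(5-2) fractional factorial: a full factorial in A, B, C, with the
# generator columns D = A*B and E = A*C, giving 8 runs instead of 32.
design = []
for a, b, c in product([-1, 1], repeat=3):
    design.append((a, b, c, a * b, a * c))

for run in design:
    print(run)

# Each column is balanced (equal numbers of low and high settings), and main
# effects are aliased only with interactions, not with one another.
```

Screening designs like this one trade resolution for economy: main effects of five factors are examined in eight test events, at the cost of confounding some interactions.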
The panel strongly supports this guidance. The main additional issue that we raise here is the degree of operational realism that is used in the non-accelerated developmental tests. Using operationally relevant environments and mission lengths is important both for identifying defects and for evaluation. It is well known that system reliability is often assessed to be much higher in developmental tests than in operational tests (see Chapter 8), which is referred to as the DT/OT gap. Clearly, some failure modes appear more frequently under operationally relevant testing. Therefore, to reduce the number of failure modes left to be discovered during operational testing, and at the same time have a better estimate of system reliability in operationally relevant environments, non-accelerated developmental tests should, to the extent possible, subject components, subsystems, and the full system to the same stresses that would be experienced in the field under typical use environments and conditions. This approach will narrow the potential DT/OT gap in reliability assessments and provide an evaluation of system reliability that is more operationally relevant.
A primary goal of reliability testing is to find out as much as possible about what conditions of use contribute to the system being more or less reliable. This goal then supports a root-cause analysis to determine why those conditions caused those reductions in reliability. Therefore, the object of most developmental test data analysis is to measure system reliability as a function of the factors underlying the test environments and the missions. To do so, it is necessary to distinguish between actual increases and decreases in system reliability and natural (within and between) system variation.3
Given the limited number of replications of reliability tests and therefore the limited ability to identify differences in reliability between scenarios of use, as well as the high priority of determining when requirements have been met and a system can be approved for operational testing, it is not surprising that learning about differences in performance between scenarios of use is sometimes ignored. The reliability requirement to be assessed is typically an average reliability taken over operationally relevant environments of use, with the mix of environments often taken from the operational mode summary/mission profile (OMS/MP). The estimated average is then compared with the requirement, which is defined as the same type of average. Although interest in this average is understandable, it is also important to have assessments for the individual environments and missions, to the extent that it is feasible given test budgets.

3 We define within-system variation as the variability in the performance of a given system in a given environment over replications and between-system variation as the variability in performance in a given environment between different prototypes produced using the same design and manufacturing process.
Another important type of disaggregation is distinguishing between two different ways of treating failure data from reliability tests. One could look at any failure as a system failure and make evaluations and assessments based on the total number of failures that occur. However, the process that generates some types of failures may be quite different from the processes generating others, and there may be advantages (statistical and operational) to analyzing different types of failure modes separately. For example, reliability growth modeling may produce better estimates if failure modes are categorized into hardware and software failures for analytic purposes and then such estimates are aggregated over failure type to assess system performance.
Systems themselves can be usefully grouped into three basic types with respect to reliability performance (see Chapter 3), and the preferred analysis depends on which type of system is under test. The basic types are one-shot devices, continuously operating systems that are not repairable, and continuously operating systems that are repairable.
For one-shot systems, the primary goal of developmental test data analysis is to estimate the average failure probability. However, it is also important to communicate to decision makers the imprecision of the average estimated failure probabilities, which can be done through the use of confidence intervals. As mentioned above, to the extent possible, given the number of replications, it would also be useful to provide estimated probabilities and confidence intervals disaggregated by variables defining mission type or test environment.
In doing so, it is important not to pool nonhomogeneous results. For example, if the test results indicate high failure probabilities at high temperatures but low failure probabilities at ambient temperatures (based on enough data to detect important differences in underlying failure probabilities), then one should report separate estimates for these two types of experimental conditions, rather than pooling them into a combined estimate. (Of course, given that the requirement is likely to be stated in terms of such a pooled estimate, that estimate must also be provided.)
The common practice of reporting only pooled failure data across multiple mission profiles or environments (e.g., high and low temperature test results) does not serve decision makers well. Discovering and understanding differences in failure probabilities as a function of mission variables or test environments is important for correcting defects. If such defects cannot be corrected, then it might be necessary to redefine the types of missions for which the system can be used.
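As a sketch of the disaggregated reporting described above, the following computes Clopper-Pearson confidence intervals for each test condition and a pooled estimate, using invented one-shot test counts; scipy is assumed to be available.

```python
from scipy.stats import beta

def clopper_pearson(failures, trials, conf=0.90):
    """Two-sided Clopper-Pearson interval for a failure probability."""
    alpha = 1 - conf
    lo = beta.ppf(alpha / 2, failures, trials - failures + 1) if failures > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, failures + 1, trials - failures) if failures < trials else 1.0
    return lo, hi

# Invented test counts: (failures, trials) by test condition.
results = {"high temperature": (8, 20), "ambient temperature": (1, 20)}

for condition, (f, n) in results.items():
    lo, hi = clopper_pearson(f, n)
    print(f"{condition}: estimate = {f / n:.2f}, 90% CI = ({lo:.2f}, {hi:.2f})")

# The pooled estimate, against which the requirement is typically stated,
# masks the difference between the two conditions.
f_tot = sum(f for f, _ in results.values())
n_tot = sum(n for _, n in results.values())
print(f"pooled estimate = {f_tot / n_tot:.3f}")
```

In this invented example, the pooled estimate of 0.225 conceals a fourfold difference in condition-specific failure probabilities, which is precisely the information a decision maker would need.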
Nonrepairable Continuously Operating Systems
For nonrepairable continuously operating systems, the goal of the developmental test data analysis is to estimate the lifetime distribution for the system, to the extent possible, as a function of mission design factors. Such an estimate would be computed from lifetime test data. In planning such a test, it would be best to run the test at least long enough to cover mission times of interest and with enough test units to provide sufficient precision (as might be quantified by the width of a confidence interval).
To understand the dependence on design factors, one would develop a statistical model of the lifetime distribution using the design factors as predictors and using the test data to estimate the parameters of such a model. Such a model may need to include estimates of the degree to which the reliability of the system was degraded by storage, transport, and similar factors. The fitted model can then be used to estimate the probability of the system’s working for a period of time greater than or equal to the mission requirement, for the various types of missions the system will be used for. For these tests, too, it is important to provide information on the uncertainty of the estimates, usually expressed through use of confidence intervals. In some cases, resampling techniques may be needed to produce such confidence intervals.
It is common for DoD to use mean time to failure as a summary metric to define system reliability requirements for continuously or intermittently operating systems. Although mean time to failure is a suitable metric for certain components that are expected to wear out and be replaced, such as a battery, it is inappropriate for high-reliability components, such as integrated circuit chips, for which the probability of failure is small over the technological life of a system. In the latter case, a failure probability or quantile in the lower tail of the distribution would be better. In addition, given missions of a specific duration, it is important to measure the distribution of time to failure, from which one can estimate the probability of mission success, not necessarily under the assumption of an exponential distribution of time to failure.
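A minimal sketch of this kind of analysis, using simulated lifetimes in place of real test data and assuming scipy is available: fit a Weibull model, then report a mission reliability and a lower-tail quantile rather than only a mean time to failure.

```python
from scipy.stats import weibull_min

# Simulated lifetimes (hours) standing in for developmental test data.
lifetimes = weibull_min.rvs(2.0, loc=0, scale=1000, size=30, random_state=1)

# Fit a two-parameter Weibull model (location fixed at zero).
shape, loc, scale = weibull_min.fit(lifetimes, floc=0)

mission_hours = 200
# Probability the system survives the mission duration.
mission_reliability = weibull_min.sf(mission_hours, shape, loc, scale)
# A lower-tail quantile: the time by which 1 percent of units are expected
# to fail, often more informative than the mean for high-reliability items.
t_01 = weibull_min.ppf(0.01, shape, loc, scale)

print(f"fitted shape = {shape:.2f}, scale = {scale:.0f} hours")
print(f"P(lifetime > {mission_hours} h) = {mission_reliability:.3f}")
print(f"1% failure quantile = {t_01:.0f} hours")
```

Note that the mission reliability here comes from the fitted distribution itself, not from an exponential assumption, which is the point made in the paragraph above; uncertainty in these estimates would in practice be reported via confidence intervals, possibly via resampling.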
Repairable Continuously Operating Systems
For repairable systems, the mean time between failures is a reasonable metric when failures in time can be assumed to be described by a Poisson process. However, if the underlying failure mechanism is governed by a nonhomogeneous Poisson process (such as the AMSAA-CROW model4) that has a nonconstant rate of occurrence of failures, mean time between failures would be a misleading metric. In such cases, one should instead study the average cumulative number of failures up to a given time. Ideally, a parametric formulation of the nonconstant rate of occurrence of failures is used, and reliability is assessed through the parameter estimates. A step-intensity or piecewise-exponential model can be used for reliability growth data that are collected from a developmental test in order to emphasize the effect of the design changes.
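For a time-truncated test, the power-law (Crow-AMSAA) nonhomogeneous Poisson process has simple closed-form maximum likelihood estimates; the sketch below applies them to invented failure times. A fitted shape parameter below 1 indicates a decreasing rate of occurrence of failures, that is, reliability growth.

```python
import math

# Invented cumulative failure times (hours) from a time-truncated
# developmental test of total length T.
failure_times = [25, 60, 110, 200, 340, 520, 760]
T = 1000.0
n = len(failure_times)

# Closed-form MLEs for the power-law intensity lambda * beta * t**(beta - 1).
beta_hat = n / sum(math.log(T / t) for t in failure_times)
lambda_hat = n / T**beta_hat

# Instantaneous failure intensity and MTBF at the end of the test.
intensity_T = lambda_hat * beta_hat * T ** (beta_hat - 1)
mtbf_T = 1.0 / intensity_T

print(f"beta = {beta_hat:.2f} (beta < 1 indicates reliability growth)")
print(f"instantaneous MTBF at end of test = {mtbf_T:.0f} hours")
```

Because the fitted intensity is nonconstant, a single mean time between failures computed as T/n would understate the reliability actually achieved by the end of the test, illustrating the point about misleading metrics.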
Because the time on test for any individual prototype and for any design configuration is often insufficient to provide high-quality estimates of system reliability, methods that attempt to use data external to the tests to augment developmental test data are worth considering. Several kinds of data merging are possible: (1) combining test results across tests of the system for different levels of aggregation, (2) combining information from different developmental tests either for the same system or related systems, and (3) combining developmental and contractor test data, although this raises issues of independence of government assessment.
In some cases, one will have useful data from testing or other information at multiple system levels: that is, one will often have data not only on full system reliability, but also on component and subsystem reliabilities. In those cases, one may be able to use models, such as those implied by reliability block diagrams, to produce more precise estimates of system reliability through merging this information. Of course, there is always concern about combining information from disparate experimental conditions. However, if such differences can be handled by making adjustments, then one might be able to produce system reliability estimates with associated confidence limits based on an ensemble of multilevel data, for which the estimates would be preferred to the estimates using only the test data for tests on the full system. The primary work in this area is the PREDICT methodology developed at Los Alamos (see, e.g., Wilson et al., 2006; Hamada et al., 2008; Reese et al., 2011). The development of such models will be situation specific and may not provide a benefit in some circumstances.

4 This is a reliability growth model developed by Crow (1974) and first used by the U.S. Army Materiel Systems Analysis Activity: see Chapter 4.
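The sketch below illustrates the simplest version of such multilevel combination: a series reliability block diagram with independent subsystems, invented pass/fail counts, and a parametric bootstrap for an interval. This is far simpler than the PREDICT-style methodology, and the independence and series assumptions are exactly the kinds of modeling choices that must be justified case by case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented subsystem test results: (successes, trials), assumed independent
# and arranged in series (the system works only if every subsystem works).
subsystems = [(48, 50), (19, 20), (29, 30)]

p_hat = [s / n for s, n in subsystems]
r_system = float(np.prod(p_hat))

# Parametric bootstrap: resample each subsystem's data from its fitted
# binomial, recompute the series product, and take a lower percentile.
B = 5000
draws = np.empty(B)
for b in range(B):
    draws[b] = np.prod([rng.binomial(n, p) / n for (s, n), p in zip(subsystems, p_hat)])
lower_bound = float(np.percentile(draws, 5))

print(f"point estimate of system reliability: {r_system:.3f}")
print(f"approximate one-sided 95% lower bound: {lower_bound:.3f}")
```

Even this toy example shows the appeal of multilevel data: the subsystem tests together involve 100 trials, far more than a program could typically afford at the full-system level.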
In some situations there is relevant information from tests on previous versions of the same system or similar systems or on systems with similar components or subsystems. In such situations, the assumptions about a prior distribution may be clearly seen to be valid, so that Bayesian methods for combining information may be able to produce preferred estimates to those based only on the data from tests on the current system. An example would be information about the Weibull shape parameter based on knowledge of an individual failure mode in similar systems (see, e.g., Li and Meeker, 2014).
In using Bayesian techniques, it is crucial to document exactly how the prior distributions were produced, to validate the use of such priors, and to assess the sensitivity of the estimates and measures of uncertainty to the specification of the prior distribution. There are also non-Bayesian methods for combining data, such as the Maximus method for series and parallel systems and extensions of this research, although some aspects of inference may be somewhat more complicated in such a framework (see, e.g., Spencer and Easterling, 1986). In general, it is reasonable to assume that combining information models would be more useful for one-shot and nonrepairable systems than for repairable systems.
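A minimal conjugate sketch of this kind of combination for a one-shot device, with an invented prior meant to stand in for legacy-system experience (an assumed equivalent of 10 prior trials with 9 successes); the final comparison is the prior sensitivity check the text calls for.

```python
from scipy.stats import beta

# Invented current-system test data: 14 successes in 15 trials.
successes, trials = 14, 15

def posterior_summary(a_prior, b_prior):
    """Conjugate beta-binomial update; returns the posterior mean and a
    one-sided 90% lower credible bound on reliability."""
    a = a_prior + successes
    b = b_prior + trials - successes
    return a / (a + b), beta.ppf(0.10, a, b)

# Informative prior, notionally encoding similar-system experience.
mean_inf, lower_inf = posterior_summary(9, 1)
# Diffuse (uniform) prior for the sensitivity comparison.
mean_flat, lower_flat = posterior_summary(1, 1)

print(f"informative prior: mean = {mean_inf:.3f}, 90% lower bound = {lower_inf:.3f}")
print(f"uniform prior:     mean = {mean_flat:.3f}, 90% lower bound = {lower_flat:.3f}")
```

If the two rows differ materially, the analysis is prior-sensitive and the provenance of the informative prior deserves the documentation and validation described above.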
One way of combining data across tests of a system as the system undergoes changes to address discovered defects and failure modes is to use reliability growth modeling (see Chapter 4). However, such models cannot accommodate tests on different levels of the system and cannot use information on the environments or missions under test.
Finally, combining information over developmental tests is complicated by the fact that design defects and failure modes discovered during developmental testing often result in changes to the system design. Therefore, one is often trying to account not only for differences in the test environment, but also for the differences in the system under test. We recommend (Recommendation 19, in Chapter 10) that the delivery of prototypes to DoD not occur until a system’s performance is assessed as being consistent with meeting the requirement. If that recommendation is adopted, then it would limit the number of defects needing to be discovered to a small number, which would result in the systems in developmental testing undergoing less change, which would greatly facilitate the development of combining information models.
Monitoring progress toward meeting reliability requirements is now mandated by directive-type memorandum DTM 11-003 (U.S. Department of Defense, 2013b, p. 3), which states:
Reliability Growth Curves (RGC) shall reflect the reliability growth strategy and be employed to plan, illustrate, and report reliability growth. A RGC shall be included in the SEP at MS A, and updated in the TEMP beginning at MS B. The RGC will be stated in a series of intermediate goals and tracked through fully integrated, system-level test and evaluation events until the reliability threshold is achieved [emphasis added]. If a single curve is not adequate to describe overall system reliability, curves will be provided for critical subsystems with rationale for their selection.
At least three technical issues need to be faced in satisfying this mandate. First, how are such intermediate goals to be determined? Chapter 4 presents a discussion of the value of formal reliability growth modeling when used for various purposes. As argued there, under certain assumptions, formal reliability growth models could at times produce useful targets for system reliability as a function of time in order to help discriminate between systems that are or are not likely to meet their reliability requirements before operational tests are scheduled to begin. Oversimplifying, one would input the times when developmental tests were scheduled into a model of anticipated reliability growth consistent with meeting the requirement just prior to operational testing and compare the observed reliability from each test with the model prediction for that time period.
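Oversimplifying in the same spirit, a power-law planning curve in the style of Crow's planning model can turn an initial demonstrated MTBF, a goal MTBF, and a test schedule into intermediate targets; all of the numbers below are invented for illustration.

```python
import math

# Invented planning inputs.
mtbf_initial = 40.0   # hours, demonstrated at the end of the first test phase
t_initial = 500.0     # cumulative test hours at that point
mtbf_goal = 120.0     # requirement to be met entering operational test
t_final = 8000.0      # cumulative test hours available before operational test

# Power-law planning curve M(t) = M_I * (t / t_I)**alpha; solve for the
# growth rate alpha needed to reach the goal on schedule.
alpha = math.log(mtbf_goal / mtbf_initial) / math.log(t_final / t_initial)

def mtbf_target(t):
    return mtbf_initial * (t / t_initial) ** alpha

# Intermediate targets at scheduled developmental test events, against which
# observed reliability would be compared.
for t in [1000.0, 2000.0, 4000.0, 8000.0]:
    print(f"at {t:5.0f} test hours: target MTBF = {mtbf_target(t):6.1f} hours")
```

An observed MTBF falling well below the target at a scheduled event would flag the program as unlikely to meet its requirement before operational testing, subject to the model caveats discussed next.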
Unfortunately, the most commonly used reliability growth models have deficiencies (as discussed in Chapter 4). Because the families of reliability growth models commonly used in defense acquisition do not represent test circumstances, such models will often not provide useful estimates or predictions of system reliability under operationally relevant conditions. To address this deficiency, whenever possible, system-level reliability testing should be conducted under OMS/MP-like conditions, that is, under operationally realistic circumstances. Such models also assume monotonic improvement in reliability over time; if there have been major design changes, however, reliability may not increase monotonically until the changes are fully accommodated in terms of interfaces and other factors. Therefore, if testing includes a period or periods of development in which major new functionality was added to the system, the assumption of monotonic reliability growth may not hold, which could result in poor target values.
Since DTM 11-003 does not specify what models to use for this purpose, analysts are free to make use of alternative models that do take the
specific test circumstances into consideration. We encourage efforts to develop reliability growth models that represent system reliability as a function of the conditions of the test environments, along the lines of physics-of-failure models (see Chapter 5).
Second, how should one produce current estimates of system reliability? It is likely that most developmental tests will be fairly short in duration and will rely on a relatively small number of test units because of the need to budget for an unknown number of future developmental tests to evaluate future design modifications. As mentioned above, to supplement a limited developmental test in order to produce higher-quality reliability estimates, assuming the tests are relatively similar, one could smooth the results of several test events over time, or fit some kind of parametric time series model, to model the growth in reliability. However, it is much more likely that the developmental tests will differ in important ways, which would reduce the applicability of such approaches. For instance, some of the later developmental tests may use more realistic test scenarios than the earlier ones, or different test scenarios.
More fundamentally, some tests are likely to use acceleration of some type, and some will not; moreover, some will be at the component or subsystem level, and some will be at the system level. Therefore, any combining of information across such tests would be challenging and would likely require something similar to PREDICT to use the information from these various developmental tests. Of course, one could increase the duration and sample size of a developmental test so that no modeling was required to produce high-quality estimates, but this is unlikely to happen given current test resources.
Third, given that one is comparing model-based target values with current measures of system reliability, one would need to take into consideration the uncertainty of the current estimates of system reliability so that one can formulate decision rules with good type I and type II error rates. This consideration will ensure that such decision rules are formulated so that the systems that ultimately meet their reliability requirement are rarely flagged and systems that fail to meet their reliability requirement are frequently flagged. But developing confidence intervals for estimates based on merged test information may not be straightforward.
It is unclear how the uncertainty in estimated reliability is currently handled in DoD in analogous situations, such as determining whether a performance characteristic in an operational test is consistent with a requirement. In that application, one interpretation is that a confidence interval for the estimated reliability needs to lie entirely above the requirement in order for the system to be judged as satisfying the requirement. Given the substantial uncertainty in reliability assessments common in operational test evaluation, this would be an overly strict test that would
often fail systems that had in fact met the requirement. (In other words, this rule would have a very high producer’s risk.) Another possibility is that the confidence interval for the estimated reliability would only need to include the requirement to pass the operational test. This rule has a large consumer risk in small tests, because one could then have a system with a substantially lower reliability than the requirement, but if the uncertainty was large, one could not reject the hypothesis that the system had attained the requirement. In fact, with the use of such a decision rule, there would be an incentive to have small operational tests of suitability in order to have overly wide confidence intervals that would be likely to pass such systems.
A preferred approach would be for DoD to designate a reliability level that would be considered to be the lowest acceptable level, such that if the system were performing at that level, it would still represent a system worth acquiring. The determination of this lowest acceptable level would be done by the program executive office and would involve the usual considerations that make up an analysis of alternatives. Under this approach, a decision rule for proceeding to operational testing could be whether or not a lower confidence bound, chosen by considering both the costs of rejecting a suitable system and the costs of accepting an unsuitable system, was lower than this minimally acceptable level of reliability. Such a decision rule is, of course, an oversimplification, ignoring the external context, which might decide that an existing conflict required the system to be fielded even if it had not met the stated level of acceptability.
This technique could be easily adapted to monitoring reliability growth by comparing the estimated reliability levels to the targets produced using reliability growth models. As mentioned above, such a decision rule would also have to take into consideration any differences between test scenarios used to date and those used in operational testing, which is similar but somewhat more general than the problem we have been referring to as the DT/OT gap. Furthermore, this decision rule would also have to make some accommodation for model misspecification in the development of the target values, because the model used for that purpose would not be a perfect representation of how reliability would grow over time.
Finally, the last sentence quoted above from DTM 11-003 raises an additional technical issue, namely, when it would be useful to have reliability growth targets for subsystems as well as for a full system. The panel offers no comment on this issue.