There are two ways to produce a reliable system. One can design in reliability, and one can improve the initial design through testing. Chapter 5 discussed designing reliable systems; this chapter describes improving system reliability through testing. Because it is difficult to test long enough to experience a large number of failures, testing is often accelerated both to understand where reliability problems might surface and to assess system reliability. Given that, a large fraction of this chapter deals with accelerated testing and related ideas.
Reliability testing is used to identify failure modes and to assess how close a system is to the required reliability level. Reliability assessment is also important for understanding the capabilities and limitations of a system in operational use. Reliability testing (and assessment) can be divided into two separate issues. First, there is testing for the reliability of the system as produced (for instance, at the time of system acceptance). One might refer to this as out-of-the-box reliability. Second, there is testing for the reliability of the system after it has been in use for some time, that is, testing to predict long-term reliability performance.
It is important to keep in mind that reliability is always a function of the environment and nature of use. Therefore, a reliability assessment needs to be a function of both the history of environments of use and of profiles of use (e.g., speed, payload, etc.). Also, when used for prediction, reliability assessments rely on the validity of the estimation models used in conjunction with the test data collected. The types of testing and estimation procedures preferred for use depend on the stage of development of the system and the purpose of the test.
We note that the systems referred to in this chapter are generic, encompassing full systems, subsystems, and components. Some of the testing techniques are only applicable for hardware systems (such as accelerated life testing), although other techniques described are applicable to both hardware and software systems, such as demonstration testing. As discussed in Chapter 3, we remind readers that different reliability metrics are appropriate for different kinds of systems.
Data from reliability tests are used to estimate current reliability levels through use of a (properly) selected reliability metric. One can use these assessed reliability levels to track the extent to which they approach the required level as the system improves. Tracking growth in reliability over time is important for discriminating between systems that are and are not likely to achieve their reliability requirements on the basis of their current general design scheme. By identifying systems that are unlikely to achieve their required reliability early in development, increased emphasis can be placed on finding an alternate system design, which might include using higher reliability parts or materials, or allocating additional reliability testing resources to identify additional sources of reliability failures.
Because it is extremely inefficient to make substantial design changes in later stages of development or after deployment, it is important to identify any problems in design or, more generally, the likely inability of a system to meet its reliability requirements, during the design and early development stages. Therefore, careful reliability testing of systems, subsystems, and components while the design remains in flux is crucial to achieving desired reliability levels in a cost-efficient way prior to fielding. During development, the program of test, analyze, fix, and test¹ can be used to identify and eliminate design weaknesses inherent to intermediate prototypes of complex systems. Using this approach is generally referred to as “reliability growth.” Specifically, reliability growth is the improvement in the true but unknown initial reliability of a developmental program as a result of failure mode discovery, analysis, and effective correction.
In addition to testing during development, feedback from field use or from tests after fielding can also be used to improve system design and, consequently, the system’s long-term reliability. However, as discussed in previous chapters, postdevelopment redesign is very cost inefficient in comparison with finding reliability problems earlier during the design and development stages. Yet, it is still useful to point out that corrective actions can extend beyond initial operational test and evaluation. One or more focused follow-on tests and evaluations can be conducted after the initial operational test and evaluation, allowing previously observed deficiencies and newly implemented redesigns or fixes to be examined.
1 Note the difference between this approach and that of test, analyze, and fix, discussed in Chapter 4: this approach adds a second test that can be useful in providing an assessment of the success of the fix.
Several kinds of reliability tests are typically used in industry. Some of them are useful for identifying undiscovered failure modes, and some of them are useful for estimating current reliability levels. In this section we discuss three of them: highly accelerated life testing; reliability demonstration or acceptance tests; and accelerated life testing and accelerated degradation testing.
Highly Accelerated Life Testing
Highly accelerated life testing (HALT) is an upstream method of discovering failure modes and design weaknesses. HALT uses extreme stress conditions to determine the operational limits of systems, that is, the limits beyond which failure mechanisms occur that would not occur under typical operating conditions. HALT is primarily used during the design phase of a system.
In a typical HALT test, the system (or component) is subjected to increasing levels of temperature and vibration (independently and in combination) as well as rapid thermal transitions (cycles) and other stresses related to the intended environments of use of the system. In electronics, for example, HALT can be used to locate the causes of the malfunctions of an electronic board. These tests can also include extreme humidity or other moisture, but because the effect of humidity on a system’s reliability requires a long time to assess, HALT is typically conducted only under the two main stresses of temperature and vibration. The results of HALT tests enable the designer to make early decisions regarding the components to be used in the system.
The results from HALT tests are not intended for reliability assessment because of the short test periods and the extreme stress levels used. Indeed, HALT is not even a form of accelerated life testing (see below) because its focus is on testing the product to induce failures that are unlikely to occur under normal operating conditions.² One goal of HALT is to determine the root cause of potential failures (see Hobbs, 2000). The stress range and methods of its application in HALT (e.g., cyclic, constant, step increases) are dependent on the component to be tested, its requirements and failure modes, and the stresses to which it will be subject. Because such knowledge may only be held by the developer, it is important that such testing be conducted prior to delivery to the U.S. Department of Defense (DoD). Given that DoD can use such information to help design its own developmental tests, it is important that records of such testing, including the stresses applied, the failures discovered, and any design modifications taken in response, be made available to DoD to guide its testing.
2 In accelerated life testing, the linkage between accelerated use and normal use is modeled to provide reliability assessments.
Reliability Demonstration or Production Reliability Acceptance Test
When a milestone of a system in development is completed, in order to ascertain whether reliability growth of a component or subsystem is satisfactory, the contractor can carry out a reliability demonstration test. The basic idea is that, under normal operating conditions, a number of units are tested for a specified amount of time, and the resulting data are used to assess whether the observed reliability is reflective of a target value. Given the necessary number of units under test to provide such information, such tests are generally done at the component or subsystem level, rather than at the full system level.
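As a rough illustration of how the number of units and the test duration follow from the target value, the sketch below computes the total test time needed to demonstrate a target mean time between failures when no failures are observed. It assumes exponentially distributed failure times (a constant failure rate), which the text above does not specify; the function names and the numbers in the usage lines are hypothetical.

```python
import math


def zero_failure_test_hours(target_mtbf_hours: float, confidence: float) -> float:
    """Total unit-hours of failure-free testing needed to demonstrate the target MTBF.

    Assumption: failure times are exponential, so only total time on test matters.
    """
    if target_mtbf_hours <= 0:
        raise ValueError("target_mtbf_hours must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # If the true MTBF were only the target, the chance of seeing no failures
    # in T hours is exp(-T / target). Require that chance to be <= 1 - confidence.
    return -target_mtbf_hours * math.log(1.0 - confidence)


def hours_per_unit(target_mtbf_hours: float, confidence: float, n_units: int) -> float:
    """Split the required total test time evenly across n_units test articles."""
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return zero_failure_test_hours(target_mtbf_hours, confidence) / n_units


# Hypothetical plan: demonstrate 1,000 hours MTBF at 90 percent confidence with 10 units.
total = zero_failure_test_hours(1000.0, 0.90)
each = hours_per_unit(1000.0, 0.90, 10)
print(f"total unit-hours: {total:.1f}, hours per unit: {each:.1f}")
```

The required total time grows in proportion to the target, which is one reason such tests tend to be run on components or subsystems, for which many units can be put on test at once.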
The goal of a production reliability acceptance test is similar to that of the reliability demonstration test, but it is undertaken when a contractor intends to deliver a batch of products for actual use or inventory. The contractor and DoD have to design a test plan that ensures that both the probability of accepting a batch that contains defective products and the probability of rejecting a good batch are small.
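The two probabilities that an acceptance test plan must keep small can be evaluated directly for a simple pass/fail sampling plan. The sketch below is a minimal illustration, assuming a single-sample plan in which n units are tested and the batch is accepted if at most c of them fail, with failures treated as independent (a binomial model). The plan parameters, the "good" and "bad" defect rates, and the function names are hypothetical and are not taken from the text.

```python
from math import comb


def acceptance_probability(n: int, c: int, p: float) -> float:
    """P(accept) = P(at most c failures among n tested units), per-unit failure prob p."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(c + 1))


def plan_risks(n: int, c: int, good_p: float, bad_p: float) -> tuple[float, float]:
    """Return (risk of rejecting a good batch, risk of accepting a bad batch)."""
    producer_risk = 1.0 - acceptance_probability(n, c, good_p)
    consumer_risk = acceptance_probability(n, c, bad_p)
    return producer_risk, consumer_risk


# Hypothetical plan: test 50 units, accept the batch if 2 or fewer fail.
# A "good" batch is assumed to have a 1 percent failure rate, a "bad" one 10 percent.
producer, consumer = plan_risks(n=50, c=2, good_p=0.01, bad_p=0.10)
print(f"P(reject good batch) = {producer:.3f}")
print(f"P(accept bad batch)  = {consumer:.3f}")
```

Raising c lowers the risk of rejecting a good batch but raises the risk of accepting a bad one; increasing n is the way to reduce both at once, at the price of a larger test.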
Accelerated Life Testing and Accelerated Degradation Testing
In many cases, accelerated life testing (ALT) may be the only viable approach to assess whether a component or subsystem can be expected to meet a requirement for reliability over its lifetime, in contrast to the reliability prior to initial use. ALT can be conducted using three different approaches, one for testing full systems and two others that are more relevant for tests of subsystems and components. The first approach is conducted by accelerating the “use” of the system at normal operating conditions, such as in the case of systems that are used only a fraction of the time in a typical day. Examples include home appliances and auto tires, which can be tested in use for, say, 24 hours a day, rather than for the much shorter period for which they would usually be used each day.
The second approach is generally carried out for components and subsystems relatively early in system development by subjecting a sample of components or subsystems to stresses that are more severe than normal operating conditions. The third approach is referred to as accelerated degradation testing: it is used to examine systems for signs of degradation rather than outright failure. It is conducted by subjecting components or subsystems that exhibit some type of degradation such as stiffness of springs, corrosion of metals, and wear-out of mechanical components to accelerated stresses. Both accelerated life tests and accelerated degradation tests are most useful in situations in which there is a predominant failure mode that is a function of a single type of stress. Consequently, these techniques are commonly applied at the component level, rather than the full-system level.
The reliability data obtained from ALT are used to estimate the parameters of a model that predicts the reliability of the component, subsystem, or system under normal operating conditions. This model is either statistics or physics based, and it is used to link the failure time distribution for time under normal use to a failure time distribution for time under extreme use. The assessed validity of such models should affect the degree to which the resulting estimates are trusted, which in turn could affect decisions about system redesign and determination of preventive maintenance schedules. For example, if the reliability prediction based on ALT results shows that the units exhibit constant failure rates, then it is not reasonable to conduct preventive maintenance for such units, because older units are not less reliable than new units. In contrast, if the units exhibit increasing failure rates (e.g., from wear-out), then plant maintenance or condition-based maintenance strategies would be economical to implement (for details, see Elsayed, 2012).
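The distinction between constant and increasing failure rates can be made concrete with a Weibull lifetime model, used here only as a convenient illustration (the text does not name a distribution). With shape parameter 1 the failure rate is constant, and with shape greater than 1 it increases with age. The function names and parameter values below are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Failure (hazard) rate at age t > 0 for a Weibull lifetime distribution."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def age_based_replacement_can_help(shape: float) -> bool:
    """Replacing old units only helps if older units fail at a higher rate."""
    return shape > 1.0


# Hypothetical fitted parameters; scale is in hours.
for shape in (1.0, 2.5):
    young = weibull_hazard(100.0, shape, scale=1000.0)
    old = weibull_hazard(900.0, shape, scale=1000.0)
    print(f"shape={shape}: hazard at 100 h = {young:.6f}, at 900 h = {old:.6f}, "
          f"replacement useful: {age_based_replacement_can_help(shape)}")
```

In the constant-rate case the two hazard values coincide, so swapping an old unit for a new one buys nothing; in the wear-out case the older unit is markedly more likely to fail in the next hour.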
Reliability estimates from ALT depend not only on the linkage models, but also on the experimental design of the test plans. The choice of stress loading (such as constant stress, ramp stress, or cyclic stress), the allocation of test units to stress levels, the number of stress levels, the test duration, and other experimental variables all affect the accuracy of the resulting reliability estimates.
Initial guesses at a model to link extreme to normal stresses may turn out to be invalid in many ALT situations. Therefore, to better assess long-run reliability, it would be useful for DoD and contractors to work together closely to determine both good designs of accelerated life tests and acceptable reliability prediction models based on subject-matter assumptions that are agreed to be reasonable.
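To show what a linkage model looks like and why its assumptions matter, the sketch below uses the Arrhenius relation, a widely used physics-based model for temperature stress. The text does not prescribe this or any other model; it is shown purely as an example, and the activation energy and temperatures are hypothetical values of the kind that would need to be agreed on and validated.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in electronvolts per kelvin


def arrhenius_acceleration_factor(activation_energy_ev: float,
                                  use_temp_c: float,
                                  stress_temp_c: float) -> float:
    """Ratio of time-to-failure at use temperature to that at stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    if use_k <= 0 or stress_k <= 0:
        raise ValueError("temperatures must be above absolute zero")
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


# Hypothetical case: 40 C in normal use, 85 C in the test chamber.
# The assumed activation energy strongly drives the answer.
for ea in (0.5, 0.7, 0.9):
    af = arrhenius_acceleration_factor(ea, use_temp_c=40.0, stress_temp_c=85.0)
    print(f"Ea = {ea} eV -> acceleration factor = {af:.1f}")
```

The spread across the three assumed activation energies illustrates the point made above: a modest error in the linkage model translates into a large error in predicted life at normal conditions.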
Most reliability data from ALT are time-to-failure measurements obtained from testing samples of units at different stresses and noting failures. However, particularly for tests at stress levels close to normal operating conditions, instead of failing, components may suffer measurable degradation as a prelude to failure. For example, a component may start a test with an acceptable resistance value, but as test time progresses the resistance may drift so that it eventually reaches an unacceptable level that causes the component to fail.
In such cases, measurements of the degradation of the characteristics of interest (those whose degradation will ultimately result in failure of the part) are frequently taken during the test. The degradation data are then analyzed and used to predict the time to failure at normal conditions. We refer to this as accelerated degradation testing, which requires a reliability prediction model to relate degradation results of a test under accelerated conditions to failures under normal operating conditions. Proper identification of the degradation indicator is critical for the analysis of degradation data and subsequent decisions about maintenance schedules and replacements. An example of such an indicator is hardness, which is a measure of degradation of elastomers. Other indicators include loss of stiffness of springs, corrosion rate of beams and pipes, and crack growth in rotating machinery. In some cases, the degradation indicator might not be directly observed, and destruction of the unit under test is the only alternative available to assess its degradation. This type of testing is referred to as accelerated destructive degradation testing.
In some applications, it is possible to use accelerated degradation testing instead of accelerated life testing. Degradation tests often provide more information for the same number of test units, as well as information that is more directly related to the underlying failure mechanism, which often provides a sounder basis for determining a model that can be used for extrapolation to use conditions. This advantage comes at the cost of needing to validate the model linking the current degree of degradation and the distribution of remaining system lifetime. This might be done by using the degradation data to effectively predict the time when the degradation of the unit crosses a specified threshold level. Therefore, if feasible, and if there is a well-defined degradation model, an accelerated degradation test would be an effective approach for predicting system reliability after different amounts of use. It is important to note, however, that some systems do not exhibit degradation during use before the occurrence of sudden failures.
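Predicting when degradation crosses a threshold can be sketched with a straight-line fit. The example below assumes that the degradation indicator (here a resistance value, echoing the earlier example) drifts linearly with test time, which is only one of many possible degradation models; the data, the threshold, and the function names are hypothetical. The prediction applies to the stress level at which the data were collected; translating it to normal operating conditions would still require a validated linkage model.

```python
def fit_line(times: list[float], values: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of values = intercept + slope * time."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired observations")
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("times must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept


def predicted_crossing_time(times: list[float], values: list[float],
                            threshold: float) -> float:
    """Time at which the fitted degradation line reaches the failure threshold."""
    slope, intercept = fit_line(times, values)
    if slope == 0:
        raise ValueError("no degradation trend: the threshold is never reached")
    return (threshold - intercept) / slope


# Hypothetical readings: resistance (ohms) measured every 100 hours on test.
hours = [0.0, 100.0, 200.0, 300.0]
resistance = [100.0, 101.0, 102.0, 103.0]
t_fail = predicted_crossing_time(hours, resistance, threshold=110.0)
print(f"predicted threshold crossing at about {t_fail:.0f} hours")
```

Because every unit contributes a whole series of measurements rather than a single failure time, a fit of this kind can be made even when no unit has failed by the end of the test.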
The design of ALT has experienced many advances over the past decades; Elsayed (2012) provides a description of many of the recent ideas. They include designs to measure the following stress types: mechanical stresses, which are often a result of fatigue (due to elevated temperature, shock and vibration, and wear-out); electrical stresses (e.g., power cycling, electric field, current density, and electromigration); and environmental stresses (e.g., humidity, corrosion, ultraviolet light, sulfur dioxide, salt and fine particles, alpha rays, and high levels of ionizing radiation). There are different ways stress can be applied, including constant level, step-stresses (either low-to-high or high-to-low), cyclic loading, power-on power-off, ramp tests, and various combinations of these. Elsayed (2012, p. 7) notes: “[D]ue to tight budgets and time constraints, there is an increasing need to determine the best stress loading in order to shorten the test duration and reduce the total cost while achieving an accurate reliability prediction.” There is now a growing literature, cited in Elsayed (2012), on ALT designs that indicates what types of designs are better for different types of components and subsystems.
Given the important benefits from effective testing for reliability growth and for reliability assessment, the panel recommends that DoD take several steps to ensure that contractors use tests that are capable of measuring agreed-on metrics; that the designs of test plans and the validation of reliability models linking extreme to normal use and degradation to failures are examined by reliability engineers prior to application; and that contractors supply DoD with all test data relevant to reliability assessment (see Recommendation 12 in Chapter 10).³ We also recommend creation of a database that includes the results not only from contractor testing, but also from DoD developmental and operational testing and from field use. Such a database would support the model validation for accelerated life testing and accelerated degradation testing. For degradation testing, measurements of the degree of degradation should also be collected as part of this database. Such databases could also include the estimated reliability performance of fielded systems to provide better “true” values for reliability attainment. Finally, there is also a need to save sufficient detail describing the fielding environment(s), including the technology type and the specified design temperature limits.
3 All the panel’s recommendations are presented in Chapter 10.