Suggested Citation:"2. The Measurement and Management of Reliability Growth." National Research Council. 2002. Reliability Issues for DOD Systems: Report of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/10561.

2
The Measurement and Management of Reliability Growth

Reliability growth is not a new topic in either engineering or statistics. It has been the subject of intense investigation and spirited application at least since the early 1960s. While the area has been recognized as important in both industrial and military settings for some time, it is featured in this summary, as it was at the workshop, for two important reasons. One is that the area has evolved substantially over the past four decades, yet its utility and power in modern applications do not appear to be widely recognized. A more important reason is that the latest approaches to reliability growth involve a sea change in perspective—from a focus on the measurement or estimation of observed growth to an emphasis on the interdisciplinary collaborations and opportunistic interventions that combine to assist in the identification and understanding of system faults and the creation and attainment of reliability growth goals.

This chapter begins with a brief review of the history of reliability growth estimation. It then proceeds to a discussion of six presentations at the workshop specifically dedicated to the subject—four addressing tools for measuring reliability growth and two reviewing tools for managing reliability growth. The treatment here alters the order in which these presentations were made to roughly parallel the historical review.

HISTORY OF RELIABILITY GROWTH

The historical account provided here of the theory and applications of reliability growth is necessarily brief. The interested reader is directed to Crow (1984), Jewell (1984), or Ushakov (1994) for a more detailed description of the models and methods mentioned in this chapter, as well as for discussion of related topics that we chose not to feature here.

While there were both formal and informal developments in the area of reliability growth prior to 1964, the field as we know it today had its beginning that year with a highly influential paper by J. T. Duane. The subject of the paper was the observed growth in reliability of specific manufactured items related to the aerospace industry. A simple regression analysis appeared to suggest that the logarithm of the cumulative failure rate of an item at time t was linearly related to the logarithm of t, a relationship that might be expressed as

ln(λ_t) = a – b ln(t).

The coefficient b of ln(t) appeared to vary from one application to the next, depending, for example, on whether the item in question was mechanical or electronic, but the fit of the Duane model in a large collection of quite different applications appeared truly uncanny. Duane’s application called for a coefficient of b = 0.5, but as applications proliferated, it was observed that b would generally fall in the interval [0, 0.6]. One famous military application of the Duane curve was to the failure rate of the F-15A fighter when its performance was tracked for 4 years in the mid-1970s.
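
The log-log relationship can be checked on data with an ordinary least-squares fit. The sketch below is illustrative only: `fit_duane` is a hypothetical helper name, and the failure times are synthetic, chosen so that the nth failure occurs at time n squared, which gives a cumulative failure rate of exactly t^(-0.5).

```python
import math


def fit_duane(failure_times):
    """Least-squares fit of ln(cumulative failure rate) = a - b * ln(t).

    failure_times: increasing cumulative test times at which failures occurred.
    Returns the pair (a, b).
    """
    xs, ys = [], []
    for n, t in enumerate(failure_times, start=1):
        xs.append(math.log(t))
        ys.append(math.log(n / t))  # cumulative failure rate after n failures
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = y_bar - slope * x_bar
    return a, -slope  # the Duane coefficient b is minus the fitted slope


# Synthetic data (not from any real program): nth failure at t = n**2.
times = [float(n ** 2) for n in range(1, 21)]
a, b = fit_duane(times)
print(a, b)
```

On real data the points scatter around the line, and the fitted b describes only the observed trend; as the text notes, it says nothing about why the growth occurred.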

The Duane model gained substantial popularity through the 1960s and 1970s. It appeared to fit reliability growth processes well enough that attempts were initiated to predict the future reliability of an item based on its fitted Duane curve. Such a practice was a bold move indeed, given that the fitted curve offered no explanation of the concomitants of growth, providing no understanding of the growth process itself. Surely the various interventions that occurred as a prototype was developed and improved were somehow linked to the reliability improvement one would experience, but such interventions played no formal role in the Duane model. The model appeared to be saying that it matters little what one does (as long as one does something); the improvement seen will follow a Duane curve.

In the early 1980s, Larry Crow, working at the Army Materiel Systems Analysis Activity (AMSAA), developed a modification of the Duane model that proved to be substantially more flexible. The essence of the AMSAA model was that it was in reality a collection of successive models. It was
recognized that the Duane model tended to apply to the data locally, but that the parameters of the model might well change following a major intervention, giving rise to a new model that would be applicable for the period during which the newly configured prototype was in use. Another extension of note was the use of the Weibull or other well-known parametric distributions for the modeling of failure data. Models with an inherent monotone failure rate, of which the Weibull and the gamma models are the best known, are natural for modeling improvement or deterioration in an item or system of interest and are thus useful tools in modeling reliability growth. Much of the work of this era has been summarized in various military handbooks and codified as military standards. While this work was aimed at allowing for different and unpredicted changes in reliability due to a series of interventions, the focus was still on measuring growth rather than trying to understand its root causes.

The next stage of development of reliability growth modeling involved its application over longer periods, occasionally extending to an item’s entire life cycle. Another refinement of interest was the inclusion of covariates that could be used to help predict future performance—covariates that might describe the maintenance process, the level of usage, and the like. An example of the development of models with such features is found in Collas (1991).

TOOLS FOR MEASURING RELIABILITY GROWTH

Four papers were presented in the general area of reliability growth measurement, covering both classical and modern methods in the assessment of the extent to which reliability growth is realized in a developing system, and also exploring models for fault detection and removal.

In one session of the workshop, Ananda Sen was asked to provide a review of the evolution and current state of the “classical theory” of reliability growth. Sen’s presentation follows directly from the preceding historical perspective. Following a summary of Ananda Sen’s talk, we turn to the presentation made by Donald Gaver. Gaver’s assignment was to “think outside of the box,” discussing some interesting alternatives to the classical approaches to reliability growth. His presentation, which was based on his investigations into statistical modeling of fault detection and removal, provided a glimpse of a new, interesting, perhaps even radical approach to reliability growth. Gaver studied the performance of a strategy for operational testing in the context of military acquisition in which a prototype is
classified as acceptable when it experiences a suitably specified run of successes. While such methods are not yet in use, Gaver’s talk clearly demonstrated that new, different, and promising procedures for acceptance testing are feasible and under development.

Discussion of reliability growth continued in the workshop session dedicated to models, methods, and applications involving the linkage between field performance data and reliability growth. Fritz Scholz described a model applicable to the detection and removal of design flaws in a fielded system and discussed a methodology for estimating and bounding system reliability at each stage of the fault discovery process. William Meeker then presented a series of examples drawn from his experience with field data in the automotive industry—examples that motivate and strongly support the continuous tracking of performance data once an item has been fielded.

After summarizing these four presentations, we turn to the important issue of the management of reliability growth. The presentations of Jane Booker and Larry Crow were both representative of the modern global approach to reliability growth, which incorporates the best features of the classical theory yet goes well beyond it, using ideas that are integrative, inter- and cross-disciplinary, and comprehensive.

The Sen Paper

Ananda Sen provided a review of recent developments in modeling and statistical inference for reliability growth. In typical modern applications of reliability growth theory, a system’s reliability is improved through a series of test, analyze, and fix (TAAF) episodes. Reliability growth modeling is a collection of techniques that attempt to model the (hoped-for) increasing reliability of a system as a function of the time spent in development and other relevant variables. Reliability growth modeling has historically played a role in helping to determine whether a system in development is likely to meet reliability requirements in time for graduation to the next development phase, and eventually to operational testing. Sen focused his presentation on systems for which the relevant data input into reliability growth models consists of successive times to failure (that is, total test time). See Sen (1998) for more detail.

Of course, it is not clear whether the process of reliability improvement can be usefully modeled as a function of time alone, since time is an indirect measurement of the workings of the TAAF cycle. However, it is useful to attempt to do so since these models can be used to (1) monitor the improvement in system reliability throughout the developmental testing stage, which can help in judging when to proceed to subsequent acquisition phases; (2) design more effective operational tests; and (3) potentially predict future reliability and idealized reliability achievement when decisions must be made, for example, between upgrading an existing system and switching to a new one.

A problem that deserves more attention is that system reliability is dependent on the environment of use. The reliability of a system in a cold, wet environment may be substantially different from its reliability in a hot, dry environment. Storage or transport may permanently affect a prototype’s reliability. Further, and very important, the reliability of a system in typical developmental test circumstances, with expert users and without counterforces, can be dramatically different from the reliability of the same system in typical operational test circumstances with nonexpert users and deployment in more realistic combat settings. The modeling and measurement of reliability across environments of use is a complicated but important problem.

One class of models that has been used for reliability growth modeling is referred to as learning curve models, especially the power law process. The motivation for the latter model is that, plotted on a log-log scale, the empirical cumulative failure rate has in practice appeared to be linear in time. Equivalently, on the log-log scale, the count of total failures is linearly related to the total time on test. Sen pointed out that the power law process has the following advantages: (1) it is easy to work with analytically, (2) well-developed inferential procedures are associated with it, (3) it has an easily interpretable reliability growth (or decay) scenario, and (4) its validity can be tested with readily available goodness-of-fit procedures. On the other hand, it has the following disadvantages: (1) the failure rate is assumed to decrease to zero as time increases (which may not be a problem if either the produced system is extremely reliable or the development process is short relative to the time the system is predicted to be extremely reliable); (2) given the use of a continuous representation of system reliability as a function of time, the failure rate after fixes have been incorporated is assumed to be essentially identical to the failure rate beforehand; and, most important, (3) the TAAF cycle is not explicitly represented in the model.
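
A short derivation may help connect the two log-log descriptions above with the Duane form given earlier. It uses a common parametrization of the power law process with scale λ > 0 and shape β > 0; these symbols are an assumption of this sketch and do not appear in the workshop summary.

```latex
\begin{align*}
\mathrm{E}[N(t)] &= \lambda t^{\beta}
  && \text{expected cumulative failures by time } t \\
\ln \mathrm{E}[N(t)] &= \ln\lambda + \beta \ln t
  && \text{linear on the log-log scale} \\
\frac{\mathrm{E}[N(t)]}{t} &= \lambda t^{\beta-1}
  && \text{cumulative failure rate} \\
\ln\frac{\mathrm{E}[N(t)]}{t} &= \ln\lambda - (1-\beta)\ln t
  && \text{Duane form with } a = \ln\lambda,\; b = 1-\beta \\
\rho(t) = \frac{d}{dt}\,\mathrm{E}[N(t)] &= \lambda\beta t^{\beta-1}
  \;\longrightarrow\; 0 \text{ as } t \to \infty \text{ when } \beta < 1
\end{align*}
```

Reliability growth corresponds to β < 1 in this parametrization, and the last line is the first disadvantage listed above: the instantaneous rate is driven to zero as test time accumulates.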

Sen then described an alternative model form that was developed to address the above deficiencies in the power law process model. This class of alternatives, represented as a step-function approximation to the power law process, was first proposed by Benton and Crow (1989) and is referred to
as the step-intensity model. It was developed to directly represent the TAAF cycle. These alternatives assume that the time to failure is exponentially distributed, but these distributions have mean times to failure that vary following some prescribed formulation linked to the TAAF process. Sen and Bhattacharya (1993) retained the power law form but gave it an interpretation that was linked more directly to the TAAF process—referred to as exponential reliability growth.

Other alternatives to the power law process exist to handle situations such as upward trends in the failure rate, situations in which the time to first failure can be infinite with positive probability, and failure-rate functions that have a bathtub shape (failure rate initially decreasing and subsequently increasing). A second distinct class of models is derived as solutions to differential equations. The defining equations represent the relationship between cumulative expected time between failures and nonlinear functions of time. Unfortunately, these procedures are complicated to use for purposes of statistical inference. A third distinct class of models makes use of a Bayesian formulation through which the subjective inputs of experts in appropriate disciplines can be elicited, quantified, and included in the analysis. Finally, there are nonparametric approaches to modeling reliability growth that are straightforward applications and generalizations of the Kaplan-Meier estimates used in survival analysis.

Clearly there is a wide variety of reliability growth models from which to choose. No single model is best for all purposes. Parametric models permit extrapolation to areas in which few or no failures have been observed, but they are based on assumptions that need to be validated or evaluated for robustness. Fully nonparametric models are essentially always valid but, for a fixed desired precision, typically require a substantial number of replications; they can also be inefficient relative to parametric alternatives when the relevant assumptions of the latter are approximately true.

Sen argued that it is important for decision makers to be provided with a full representation of the complete evolution of the bottom-line result, instead of a simplistic presentation of a single point estimate or an elementary pass/fail pronouncement. A full representation of the results should include some physical justification or validation of the model’s assumptions (given that a parametric approach is used), together with use of a nonparametric approach for purposes of validation and comparison and, at times, results from leading parametric alternative representations. Agreement of alternative modeling approaches offers important assurance of the stability of the results presented. Disagreement is often indicative of the
failure of one or more of a model’s assumptions and necessitates examination of each assumption concerning the failure process. The degree of disagreement can be used to measure the potential for model misspecification, which in turn can aid in informing decision makers as to the quality of the reliability growth estimation.

The Gaver Paper

Donald Gaver has led a research team in developing a new methodology that explicitly represents testing as part of system development for systems consisting of separate stages linked in a series structure. (The obvious example is a one-shot system, such as a missile or torpedo.) With this methodology (see Gaver et al., 2000), the failure mode discovery process that derives from testing is directly represented (i.e., system failures are activated based on an explicit random process), and the resulting impact on reliability growth is estimated. (Testing is assumed to be carried out at specific time periods, with no explicit representation of the actual length of time between tests.) The approach assumes that the only testing is full-system testing, with components operating during the test in the sequence natural to system use. Therefore, the later stages of the system are not tested if an earlier stage fails beforehand. This relative lack of testing of later stages of the process for staged systems is often ignored by current approaches for modeling reliability growth.

The simulation framework that has been developed to represent reliability growth explicitly can be used to answer a variety of important questions concerning system and test performance. Also, some of these questions can be addressed analytically. (The solution involves the use of various recursive identities.) Some questions that can be answered analytically concern the properties of stopping rules of the form “accept a system when it runs successfully r times in a row” for various values of r. This class of stopping rules is considered easy to apply and ensures that each stage of the system passes a test r times. Requiring r successes in a row helps control the “false acceptance rate,” with the desired rate being achievable through the appropriate setting of r. Further, the system developers have an incentive to create as reliable a system as possible as early as possible to achieve a high probability of passing the test (either developmental or operational). The test design has the added advantage of focusing on success rather than failure, with poorer-performing systems being eliminated because of their inability to accumulate the requisite success run within a specified time frame
or within a fixed, predetermined budget. A specific measure of interest for these stopping rules is the expected number of test replications needed to pass a system. The following are some other questions of interest that this simulation structure can address:

After a specified number of operational test events of the system (and associated fixes), what is the probability that the system will meet its reliability requirements when fielded?
How many operational tests are likely to be required to achieve the rth successful test?
How many operational tests are likely to be needed to achieve r consecutive successful tests? Other stopping rules besides r consecutive successful tests can also be examined.

The framework also provides the ability to address a wide variety of additional what-if questions.
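
For the expected number of tests needed to reach r consecutive successes, a simple recursion gives a useful baseline. The sketch below assumes a constant per-test success probability p, that is, no reliability growth between tests; this is a deliberate simplification of the framework described above, and `expected_tests_to_run` is an illustrative name rather than anything taken from Gaver's work.

```python
def expected_tests_to_run(p, r):
    """Expected number of independent tests until r successes in a row.

    Uses E_0 = 0 and E_k = (E_{k-1} + 1) / p: after reaching a run of k - 1
    (E_{k-1} tests on average), one more test either completes the run of k
    (probability p) or resets the process (probability 1 - p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    expected = 0.0
    for _ in range(r):
        expected = (expected + 1.0) / p
    return expected


# How the required run length drives test cost for a system that
# succeeds on 90 percent of tests (an invented figure).
for r in (1, 3, 5, 10):
    print(r, round(expected_tests_to_run(0.9, r), 2))
```

When faults are removed after each failure, p rises over the campaign, so this constant-p figure is best read as a reference point rather than a prediction.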

To carry out an analysis of a test design, one selects parameters describing the number and probability of failure modes in each component of a system. One then inputs the test parameters (e.g., test size) and runs simulated test replications to estimate the operating characteristics of the proposed test design (i.e., the probability of improperly rejecting a good system and of improperly failing to reject a bad one).

Some of the mathematical details are as follows. This approach assumes that each sequential stage s of the system begins with an unknown number of design faults d_s(0), and that each undiscovered (and therefore unfixed) fault is revealed (and removed through a redesign) with some unknown probability 1-θ_s. Further, at time t (that is, after the tth test), stage s has d_s(t) remaining faults, given the discovery and treatment of faults in earlier tests.

The model makes use of some simplifying assumptions: (1) when a failure occurs, the design source of the failure is always identified and removed, and no new failures of this type are introduced; (2) the process that exhibits faults in each component follows an independent binomial distribution with parameters 1-θ_s and d_s(t); (3) all of the faults at a given stage have equal probabilities of discovery; (4) the probabilities of fault discovery are not dependent on any environmental conditions, aging, and so on; and (5) if two faults for a given stage simultaneously express themselves, only one is identified and removed. While these assumptions are acknowledged to be
an oversimplification, more realistic versions of this approach can be (and are being) developed using straightforward generalizations of the above model.

One interesting implication of this research is that reliability growth under the assumed circumstances will not necessarily have the general pattern identified by Duane (1964). Consideration of subsystems tested in series with this framework could certainly lead to other patterns of reliability growth.

One possible generalization of this model is to place a Bayesian prior on the d_s(0)’s. Doing so would (1) allow the introduction of expert judgment, (2) reduce the assumptions concerning the d_s(0)’s to a small number of hyperparameters, and (3) allow some borrowing of information across components. Another generalization that could be explored would be to assume that the θ_s’s were draws from some distribution, instead of assuming fixed parameters 1-θ_s. Doing so would (1) help account for overdispersion, (2) reduce the number of parameters to be estimated, and (3) remove the homogeneous failure rate assumption.

The Scholz Paper

Nonhomogeneous Poisson processes (Poisson processes in which the failure rate changes as a function of time) are commonly used for modeling reliability growth. As mentioned above, an extremely popular model is the Duane power law model, a particular nonhomogeneous Poisson process in which the failure rate is assumed to be a power function of time. A chief deficiency of the Duane model is that it is not based on a physical cause-and-effect connection between an observed pattern of system failures and reliability growth (as was noted in the previous section). To address this concern, Fritz Scholz proposed the following model for a system of defect detection and reliability growth. (This idea was originally developed in the context of software testing, but it can be applied to any system that satisfies the cited assumptions. The description provided here is for the continuous case; Scholz, 1986, provides more detail and also addresses the discrete case.)

Assume one wants to measure reliability for a system that is suspected of having a number of design flaws. To measure reliability, the system is subjected to a series of test events. The system is assumed to be a deterministic function of the inputs to the system. A test is the exercising of the system using a selected subset of the set of possible inputs to the system. For a software system, the inputs are user-supplied fields, such as keystrokes
and mouse movements. For a hardware system, the inputs can include the environment of use and the actions of friendly and enemy soldiers. A few of the design flaws are assumed to be easy to find in that many inputs are likely to expose them, while many of the flaws are assumed to be relatively difficult to find in that very few inputs will disclose their presence. That is, assuming that inputs are selected uniformly from the space of all possible inputs, a few design flaws will be discovered with high probability, and many more will be discovered with low probability.

Some mathematical details follow. The system is assumed to have N faults. The assumption is also made that the waiting time to discovery of fault i (i = 1, ..., N) can be modeled using independent, exponential random variables Z_1, Z_2, ..., Z_N with respective failure rates λ_i. (Here fault i means the fault with label i, not the ith fault discovered.) The results of the testing are the first k waiting times (or cycles of operation) between the discovery of successive faults (again, not faults with successive labels), which are denoted D_1, D_2, ..., D_k. Conditional on the unobservable fault labels, the distribution of the D_i’s is that of independent, exponential random variables with decreasing failure rates (decreasing since, of course, each time a fault is discovered, the system becomes more reliable). This conditional distribution is used to derive the unconditional marginal distribution of the first k D_i’s, which in turn can be used to derive useful estimates concerning system reliability.
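
The setup can be made concrete with a small simulation. This is a minimal sketch under the stated exponential assumptions, not Scholz's estimation procedure; the function name and the rates (a few large, many small, echoing the easy and hard faults described earlier) are invented for illustration.

```python
import random


def simulate_fault_discovery(rates, rng):
    """Simulate discovery of faults with independent exponential waiting times.

    rates[i] is the failure rate lambda_i of the fault labelled i.
    Returns (gaps, residual): gaps[j] is the waiting time between the jth and
    (j+1)th discoveries (the D values), and residual[j] is the sum of the
    rates of the faults still undiscovered after that discovery.
    """
    # Z_i ~ Exponential(lambda_i); sorting gives the order of discovery.
    draws = sorted((rng.expovariate(lam), i) for i, lam in enumerate(rates))
    undiscovered = set(range(len(rates)))
    gaps, residual = [], []
    previous = 0.0
    for z, i in draws:
        gaps.append(z - previous)
        previous = z
        undiscovered.discard(i)
        residual.append(sum(rates[j] for j in undiscovered))
    return gaps, residual


# A few easy-to-find faults and many hard-to-find ones (invented rates).
rates = [1.0, 0.5] + [0.01] * 20
gaps, residual = simulate_fault_discovery(rates, random.Random(7))
for k in (1, 2, 5, 10):
    print(k, round(residual[k - 1], 3))
```

In the simulation the residual rate is known because the labels are known. In practice only the gaps are observed, which is why the residual failure rate has to be estimated and bounded.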

While the mathematics underlying inference for this model are complicated given that faults identified previously have an impact on the probability of discovering future faults, Scholz has derived the maximum-likelihood estimates of the residual failure rate at each stage in the fault discovery process using tools from the field of isotonic regression. Scholz has also provided upper bounds for confidence intervals for the residual failure rate. In other words, Scholz’s method estimates and bounds system reliability at each stage of the fault discovery process.

The Meeker Paper

The Department of Defense collects considerable information on the performance of its systems while in development (especially test results) as well as when fielded. However, since test results are currently collected mainly to support decisions on whether to promote systems to the next stage in the milestone process, test data are often not saved and archived in a manner that facilitates their further use. Further, while data on field
failures and field performance are often retained, they are rarely archived in a manner that facilitates analysis of improvements in the system development process. In particular, such information could be used to improve the future design of developmental and operational tests by helping to explore how system flaws were missed during previous developmental and operational testing and how this can be remedied in the future.

In contrast, in industrial applications, test and field use data are often employed for these and other purposes and are frequently archived in a manner that facilitates analysis in support of these uses. Specifically, field use data are employed for prediction of future warranty or maintenance costs, as well as for early detection of reliability problems in fielded systems. Albeit less frequently, these data are also used to provide information on the discovery of failure modes and their frequency of occurrence—information that is in turn used to improve developmental and operational test procedures. Further, this information supports comparisons of system performance (failure modes and their frequencies) in developmental or laboratory tests, versus performance in operational tests, versus performance in the field. Understanding how system performance is related in tests with various degrees of operational realism is extremely valuable for performing reliability growth modeling and for learning how to design laboratory and operational testing with greater operational realism. Finally, field performance data are used to feed component-level reliability information back to design engineers so they can improve current or future component or system designs.

While field performance data have many potential uses in industry, they also have disadvantages. Some disadvantages stem from the primary reason for the collection of field performance data in industry, which is to support administrative action such as warranty management. Therefore, the data often are not as suitable for the analyses outlined above as would be the case for data from a structured experiment. Deficiencies include the following. First, a sizable fraction of the data is missing, and there are reporting errors and delays. Second, while collecting time of actual use would be optimal for measuring system life, what is commonly available is only calendar time. Third, the environment of use is commonly only partially known or entirely unknown. Fourth, in warranty situations, failures are reported only for units that are under warranty. As a result, data are reliable only until the warranty period is exceeded, and the status of units that are not reported is unknown (including retired units and units that were never put into service). Finally, most field performance data are collected only for repairable systems.

A further reason that test and field performance data are not fully utilized in industrial applications is that easy access to these data has an associated cost. While the collection of field performance data is effectively free since they are often required for other purposes, field performance (and test) data must be catalogued in a database structure in a way that facilitates the above uses. The construction and maintenance of such a database is time- and resource-intensive. This point can be illustrated if one considers the need to create a (living) cross-referencing system that identifies all (current and future) systems having components in common with a given system, all test event results, the conditions underlying each test event, the performance of components when fielded, and the conditions underlying field use. The benefits from the use of such data must be sufficient to offset this substantial cost. Making this argument was one goal of some of the presentations related to this topic at the workshop.

Bill Meeker provided an overview, from an industrial perspective (in particular, automobile warranty data), of the many opportunities to learn from the analysis of field performance data. He focused on features of such data that would be expected for defense systems: (1) data are collected until the system is a certain age or until it has covered a fixed number of miles, (2) there is only limited information on the exact cause of failure, (3) there is good information on the date of manufacture, (4) there is often useful information on the rate of use for each system, and (5) there are potential biases in estimation resulting from various homogeneity assumptions (e.g., high-speed drivers may have a different miles-per-failure distribution).

Field performance data have the following key applications. A primary use is to support early detection of production processes in trouble. A common approach is to graph the observed percentage of system failures by months in service alongside an upper bound for the same quantity, where the bound is a quantile of a standard distribution used to model failure counts (e.g., the Poisson distribution) with its parameters estimated from historical data. Two detection rules are used to signal the need for corrective action: (1) the observed failure rate at a point in time exceeds a particular quantile based on the historical data, or some function of the observed number of failures (usually chosen to approximate some standard distribution) exceeds the historical estimate plus a critical value times an estimate of the standard deviation of the historical estimate; and (2) the difference between some function of the observed number of failures at time t and at time t – 1 is greater than the historical estimate for the same plus a critical value times an estimate of the standard deviation of the historical estimate (of this difference). The critical values are chosen, using historical data, to balance the cost of flagging a process that is functioning fine against the cost of letting a process pass that is in need of correction. (For details, see Kalbfleisch et al., 1991, and Wu and Meeker, 2002.)
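The first detection rule can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of Kalbfleisch et al. or Wu and Meeker: it assumes failure counts in each month of service are Poisson with a mean equal to the number of units in service times a historical failure rate, and it flags any month whose observed count exceeds a chosen quantile. The function names, the 0.99 quantile, and all data values are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), computed by direct summation."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def poisson_upper_bound(mean, quantile=0.99):
    """Smallest count k such that P(X <= k) >= quantile."""
    k = 0
    while poisson_cdf(k, mean) < quantile:
        k += 1
    return k


def flag_months(units_in_service, observed_failures, historical_rate, quantile=0.99):
    """Return the months in service whose failure count exceeds the bound."""
    flagged = []
    for month, (n, failures) in enumerate(
        zip(units_in_service, observed_failures), start=1
    ):
        expected = n * historical_rate[month - 1]
        if failures > poisson_upper_bound(expected, quantile):
            flagged.append(month)
    return flagged


# Hypothetical inputs: 1,000 units tracked over four months in service.
units = [1000, 1000, 1000, 1000]
rate = [0.002, 0.003, 0.004, 0.004]   # assumed historical failure rates
failures = [2, 4, 12, 5]              # assumed observed failure counts
print(flag_months(units, failures, rate))
```

The second rule, which looks at month-to-month differences, would need an estimate of the variance of those differences and is not shown here.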

A second important application of field performance data is the prediction of future warranty or total maintenance costs (the second possibility currently being more relevant to DoD systems). Clearly, information on the rate of field failures of various types could be extremely useful for estimating field maintenance, repair, and component replacement costs.

A third use of field performance data is to establish a “transfer function” between developmental and operational tests and between operational tests and field performance. Knowledge of the ways in which developmental and operational tests are unreliable predictors of field performance has great value for reliability growth estimation, and could be useful both in linking developmental and operational test results and in providing information on how to design developmental and operational tests with greater operational realism. Meeker described the following possibility for addressing a linkage between developmental and operational testing.

Developmental tests are often accelerated, meaning that stresses are frequently increased in an effort to simulate the greater passage of time and greater use. To make accelerated testing informative for decisions concerning operational or field performance, a model (e.g., a degradation model) is used to relate accelerated test time to actual use time. This model must describe the effects of acceleration, the impact and distribution of environmental conditions, and the distribution of use rates in actual use of the system. (This type of model is often related to physics-of-failure models, discussed below.) A successful model of this form could be used to link developmental test data on system reliability to operational test and field performance. Meeker gave an example concerning the use of washing machines. Here the failure probability was expressed as a function of the number of cycles of use, and users were divided into categories based on their rate of use in cycles per unit time. Within these categories, the rate of use was assumed to be constant. Use of this assumption made it possible to translate the failure probability, initially expressed as a function of the number of cycles (which could be experimented on for high use) into a failure probability expressed as a function of time (see Meeker et al., 2002). Agreement between this mixture distribution and field performance can be used as a validation tool, validating, for example, that the percentages of people in the various categories of use rate do not change over time. Divergences from these assumptions and identification of remedies are the subject of ongoing research.
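The translation from cycles to calendar time in the washing machine example can be illustrated with a short Python sketch. It assumes a Weibull form for the failure probability in cycles, which is a stand-in chosen for illustration and not necessarily the model used in Meeker et al. (2002), and it assumes a constant use rate within each category. The category shares, use rates, and Weibull parameters are hypothetical.

```python
import math


def weibull_cdf(cycles, scale, shape):
    """Failure probability after a given number of cycles (assumed Weibull)."""
    if cycles <= 0:
        return 0.0
    return 1.0 - math.exp(-((cycles / scale) ** shape))


def failure_prob_by_time(weeks, categories, scale, shape):
    """Mixture over use-rate categories.

    Each category is (share of users, cycles per week). With a constant
    rate within a category, t weeks of service correspond to rate * t cycles.
    """
    return sum(
        share * weibull_cdf(rate * weeks, scale, shape)
        for share, rate in categories
    )


# Hypothetical user population: light, typical, and heavy users.
categories = [(0.5, 2.0), (0.3, 5.0), (0.2, 10.0)]
scale, shape = 4000.0, 1.5  # assumed Weibull parameters, in cycles

for weeks in (52, 260, 520):
    p = failure_prob_by_time(weeks, categories, scale, shape)
    print(f"{weeks:4d} weeks: failure probability {p:.4f}")
```

Comparing a curve of this kind with field returns over time is one way to check whether the assumed category shares still hold.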

Once these and other uses of field performance data have been institutionalized with the accompanying benefits, the quality of such data is likely to improve. One area in which industrial field performance data have recently been improving in quality is the use of sensors that can collect the entire history of use of, say, an automobile, including stresses, speeds of use, temperature, etc. That information can be linked to information on system reliability or performance to support much richer statistical reliability modeling. Use of such sensors could be valuable in operational testing for similar reasons.¹ Several companies are undertaking efforts to save additional data in their warranty databases so the data can be used not only for financial purposes, but also for reliability assessment and estimation. Such efforts represent a cultural change. A hurdle is that development or expansion of such a database sometimes requires innovative funding approaches.

Discussion of Gaver, Sen, Scholz, and Meeker Papers

In his discussion of the papers by Sen and Gaver, Paul Ellner addressed the complication involved in reliability growth modeling of translating reliability estimates from developmental test to predictions of reliability in the field from operational test. At present, analysts may use a reliability growth model to extrapolate from developmental test results to operational test results. This approach can be severely biased, producing overly optimistic reliability estimates since the failure modes can be substantially different in the two situations (actually three: developmental test, operational test, and field performance). Efforts to perform this translation face the following challenges: (1) determining the (approximate) relationship between failure modes that occur jointly in developmental and operational environments, and (2) identifying a function expressing the probability of failure in operational test as a function of the probability of failure in operational test due to failure modes not present in developmental test. This translation probably cannot be done at the system level; it must be carried out at the component level. Clearly, a direct way to address this issue is to use developmental testing that is more representative of operational use, to the extent possible.

¹	These sensors could be used to monitor reliability degradation during field use and to support efficient logistics management.

Ellner pointed out some assumptions that limit the applicability of the exponential reliability growth model, though he was relatively confident that these could be addressed. For Gaver’s model, the greatest challenge is that of initial input, that is, the number of faults in each stage of the system and the probability of discovering a fault during a test. Ellner suggested that the number of faults can be assumed to be quite large for complicated systems, and that giving the probabilities of discovering a fault a hierarchical structure is also a promising generalization of the model.

Ellner also strongly supported Sen’s proposal regarding the use of many alternative models that are consistent with the data to assess model misspecification. If these models agree with respect to decisions, one can be confident; if they disagree, the discrepancy will have to be analyzed.

Ellner remarked that an AMSAA website² and an Institute of Electrical and Electronics Engineers working group are both concerned with updating Military Handbook 189 on reliability growth management (U.S. Department of Defense, 1981). He suggested that efforts to update this handbook would be more successful if the responsibility were assigned to a specific organization.

In his discussion of the papers by Scholz and Meeker, James Crouch pointed out that DoD already makes considerable use of operational test and field performance data, at least in the area of reliability testing of jet engines. Performance data are used to manage and control various aspects of turbine engine reliability, specifically (1) the engine in-flight shutdown rate; (2) the rate at which the engine needs to be repaired; and (3) the line replaceable unit rate, the maintenance rate for replacement of the components that surround the engine. The use of operational test data is complicated by engine-to-engine variations (it is typical to develop only three or four prototypes for operational test), and the use of both operational test and field performance data is complicated by variations in operational use on which data are not easily collected. These problems are currently being addressed using modeling and simulation.

The Air Force uses Pareto charts (histograms of the number of failure occurrence reports by root-cause category) based on field performance data to examine different causes of engine removal. In the long term, redesigns are often based on these analyses. The Air Force also has a deficiency reporting system that initiates an engineering investigation of a problem. (At times, unfortunately, the reports are incomplete, and improper malfunction codes are entered.) In addition, the Air Force has a warranty program that reports component failures and conducts warranty investigations. As mentioned above, while the time to failure is generally known, the cycles at failure or other characteristics are sometimes unknown. Also as mentioned above, once the parts and engines have outlived the warranty, problems in collecting data can become more prevalent.

²	http://amsaa-www.arl.mil/AR/rel_growth_guide.htm
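The tabulation behind a Pareto chart of the kind described above is simple to sketch. The Python snippet below counts failure reports by root-cause category, sorts the categories from most to least frequent, and adds the cumulative share. The category names and counts are invented for illustration and are not Air Force data.

```python
from collections import Counter


def pareto_table(root_causes):
    """Return (cause, count, cumulative share) rows, most frequent first."""
    counts = Counter(root_causes)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        rows.append((cause, count, cumulative / total))
    return rows


# Hypothetical engine-removal reports, one root-cause label per report.
reports = ["blade fatigue"] * 5 + ["oil leak"] * 3 + ["sensor fault"] * 2

for cause, count, cum_share in pareto_table(reports):
    print(f"{cause:15s} {count:3d} {cum_share:6.1%}")
```

In practice the labels would come from the deficiency reports themselves, so incomplete reports and improper malfunction codes would distort the ranking.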

The Air Force has been giving reliability issues much higher priority of late. One successful program addressing high-cycle fatigue is referred to as reliability-centered maintenance. Reliability-centered maintenance is a systematic approach to preventative maintenance in which optimal maintenance processes are employed at the component or subsystem level. The Air Force is also using highly accelerated life testing and highly accelerated stress screening to identify failure modes. Analysis of the common and unique failure modes from accelerated developmental testing and operational testing may make it possible to better understand the distinctions between these two types of testing.

In the floor discussion, Dan Willard questioned the benefit of going beyond the understanding of failure modes from such activities as accelerated testing. Scholz responded that the failure modes discovered in accelerated testing may differ from those found in the field. While it is valuable to discover and correct as many faults as possible before fielding, there is additional value in comparing those faults identified through accelerated testing and those discovered after fielding.

TOOLS FOR MANAGING RELIABILITY GROWTH

Reliability growth management consists of procedures and infrastructure used during system development to track and expedite reliability growth, especially including the use of various feedback mechanisms to improve system design. (The hope is that these feedback mechanisms can also be used to improve the process of reliability growth management itself over time.) A key goal is early assessment of the likelihood of meeting the targeted operational reliability. Generally speaking, DoD has accorded subordinate priority to system reliability as compared with system effectiveness as a result of the primary conflict scenarios on which the agency has, until recently, focused its attention. In recent years, the types of military engagements faced and anticipated have changed quite dramatically. As a consequence, the importance of placing increased emphasis on the development of highly reliable systems has grown. Speakers and discussants strongly confirmed the need for improved reliability growth management through frequent and thorough testing and inspection and through the application of global, cross-disciplinary strategies for achieving and surpassing reliability growth targets.

It was pointed out by more than one speaker that at present, defense systems regularly fail to satisfy their operational suitability requirements in the field. (Suitability encompasses reliability and related measures, including maintainability and availability.) As a result, DoD is spending far too much for system redesigns late in system development, and for spares management and system maintenance (and also system redesign) after the system has been fielded. DoD systems also are frequently submitted for operational testing before they are sufficiently mature with respect to system reliability. For example, it was pointed out that 80 percent of Army systems failed to achieve even half of their requirements for mean time to failure in operational test (Defense Science Board, 2000).

Mention was made of a number of methods currently used in industrial applications of system development and reliability growth management that are not being used in defense system development, but appear to be relevant to the latter systems. First, early assessment of (operational) reliability could play an important role in improving system design, as opposed to its current primary use in supporting the milestone decision process. Second, little or no use is currently made of test or field failure data to (1) support a better understanding of system life-cycle costs; (2) help determine how failure modes escaped detection during developmental or operational testing; or (3) relate the reliability of systems and failure modes in operational test to the reliability of systems and failure modes in developmental test, which could support methods for combining information from developmental and operational testing (as discussed earlier). The estimation of system life-cycle costs was noted as a particularly important use of field performance data. (As mentioned above, the increased accessibility of such data to support this type of analysis would require the institution of a data archive.) Third, it was noted that reliability growth modeling for defense systems, which is used to predict system reliability in the future (e.g., to ascertain when it would be appropriate to enter operational test), typically makes no attempt to model the system defect discovery process. Ignoring this information will likely produce models that are much less predictive than the approaches available today that model this process. Fourth, methods exist and are currently used for developing early assessments of system reliability that make full use of the disparate information available in industrial applications (e.g., information derived from maintenance records, computer simulations, expert knowledge, historical data, and test data, and information from similar systems, or systems with similar parts, components, or processes). However, these methods are not currently applied to defense systems (with a few notable exceptions). Finally, when determining total time on test (in operational test) or other aspects governing an operational test design, requisites for testing system effectiveness typically have been the dominant consideration, while the requisites for producing assessments of operational suitability have received considerably less weight.

The above five areas are ones in which greater attention to reliability measurement, reliability modeling and data collection, and the management of defense system reliability could prove beneficial. Two specific approaches to reliability growth management were considered at the workshop, as summarized below.

The Booker Paper

It is currently typical for reliability assessment of defense systems to be used primarily as input into the DoD acquisition milestone process, for deciding whether a system in development can proceed to the next milestone. Since operational testing is carried out near the end of the second phase of system development (known as engineering and manufacturing development), there is little or no opportunity for reliability assessment of a system’s operational performance to inform system design during its early stages. In contrast, in various industrial applications of system development, reliability assessment has an earlier and more continuous influence on system design. A major benefit of this early influence is that, generally speaking, the earlier modifications are made to system design, the less costly those modifications are. Further, the better a system design is, the more likely it is that the system will pass operational test on its first attempt. Finally, and most important, the better the system design is, the more likely it is that the system ultimately approved for full-rate production will perform better and be less costly to operate in the field, since it will likely require less maintenance and repair. Changing the role of reliability assessment from one of confirming that a system or its components meet specific performance requirements to one of understanding as much as possible, as early as possible, about the (operational) strengths and weaknesses of the current system design requires planning and carrying out additional, targeted testing of the system. In that testing, previous assessments are used in determining the test timing, test size, test scenarios, and number of replications at each scenario for the various test events that are needed. Two processes were presented at the workshop that, to different degrees, (1) provide early reliability assessments; (2) assist in the design of these early, additional, operationally relevant tests; and (3) use early reliability assessments for improving system design.

The first such process was presented by Jane Booker, representing a team at Los Alamos National Laboratory that has developed the Performance and Reliability Evaluation with Diverse Information Combination and Tracking (PREDICT) system of early reliability assessment. PREDICT (now known as Information Integration Technology) is a comprehensive framework that facilitates the use of disparate sources of information (such as expert opinion, simulations, historical data, test data, and maintenance data) for the system in question, or information on similar systems or systems with similar components, to produce reliability assessments through use of a combination of information models. PREDICT also provides estimates of the uncertainty of these assessments. Both the estimates and their estimated uncertainties can be displayed graphically for easier understanding by decision makers.

PREDICT uses this information for a variety of purposes. The first is to identify which components, if improved, would most improve overall system reliability. Also, these assessments and their associated uncertainty can help in designing system tests that are more informative by targeting test events to areas of lowest reliability or of greatest uncertainty. These test results can be used to propose system design changes. Further, this framework can be used to carry out “what-if” analyses. For example, to gauge test size, one could ask what the result would be if another test run were carried out and were successful. One could also input a different maintenance schedule or a redesign of a subsystem to examine the impact on total system performance.
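A “what-if” question of the kind described above can be illustrated with a generic Bayesian update. The sketch below is not PREDICT's method, which the workshop summary does not specify in detail; it assumes a single pass/fail reliability parameter with a Beta prior standing in for expert opinion, updated by test outcomes. All numbers and function names are hypothetical.

```python
import math


def beta_summary(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def update(a, b, successes, failures):
    """Conjugate update of a Beta prior with pass/fail test results."""
    return a + successes, b + failures


# Assumed expert prior: roughly 0.9 reliability, worth about 10 trials.
prior = (9.0, 1.0)

# Assumed test record so far: 18 successes and 2 failures.
current = update(*prior, successes=18, failures=2)

# What if one more test run were carried out and were successful?
what_if = update(*current, successes=1, failures=0)

for label, params in (("current", current), ("what-if", what_if)):
    mean, sd = beta_summary(*params)
    print(f"{label:8s} reliability {mean:.4f} +/- {sd:.4f}")
```

Comparing the two lines of output shows how much one additional successful run would move the estimate and shrink its uncertainty, which is the kind of information that could help gauge test size.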

For an initial assessment of system reliability, PREDICT uses the following inputs: (1) system requirements or performance measures; (2) system structure, including information on subsystems and components and on failure modes for components that is input through use of a variety of representations, including logic models/diagrams, event/fault trees, directed graphs, Bayesian networks, process trees, and reliability block diagrams; (3) system process, using inputs from physics and chemistry, mechanical engineering, quality control tests, assembly, and testing at various levels (system, subsystem, and component); and (4) inputs concerning the reliability of components in analogous systems, and expert opinion on the reliability of the components. All of these inputs are documented in a knowledge base that provides information at customized levels for various queries. Inputs for a given system are also available to provide information concerning the performance of similar or related systems in the future. These initial assessments are updated in accordance with the receipt of new information and test results. (Updates are also based on refinements to system structure or changes to requirements or performance measures.)

PREDICT tracks performance as system development proceeds. Once a system has been fielded, PREDICT can be used to track performance in the field; that is, it can continuously update reliability assessments on the basis of new information (e.g., on the aging of the system).

PREDICT also provides a platform that facilitates consideration of various action items, such as whether one can support a system in the field or how the number of maintenance actions can be reduced once the system has been fielded. PREDICT can also support decisions involving either the development of a new system or a choice among several system designs through balancing of the costs of development and the costs of fielding to arrive at a system that minimizes life-cycle costs.

As an example, consider an air-to-air heat-seeking missile. The major subsystems are the warhead, the missile, the aircraft, command and control, and logistics and maintenance. Taking the aircraft subsystem in more detail, it has power, avionics, environmental, acquisition and fire control, flight structure, launching, flight control, and missile interface elements, as well as human intervention. There are also complex interactions that act across major subsystems. PREDICT attempts to represent all of this structure using various forms of sensitivity analysis.

PREDICT has been used successfully by Delphi Automotive Systems and in the nuclear weapons program at Los Alamos National Laboratory. PREDICT can also be implemented in dynamic environments where testing is not feasible, such as in the nuclear weapons program.

The Crow Paper

A second system that also uses early reliability assessments to improve system design and development is the Integrated Reliability Growth Strategy (IRGS), currently in use at General Dynamics Advanced Technology Systems and several other institutions. IRGS, which was outlined at the workshop by its primary developer, Larry Crow, is a process that generates early and substantial reliability growth through continuous testing and assessment to determine which of a system’s components have a mature or immature design. On this basis, IRGS directs design modifications of the immature components. The result is a reliability growth program, iterating between design modification and testing, that tends to reduce substantially the time needed to achieve reliability goals, for example, to attain a required reliability level before entering operational test.

Permitting a system to enter late-stage developmental test with a substantial number of reliability flaws places too heavy a burden on developmental and operational test to discover the remaining problems. This is also an expensive way of discovering defects since it is likely that the system will experience difficulties in operational test, and it may have to undergo design modifications and later repeat some operational test events. Today, it is not uncommon for some DoD systems to enter into late-stage developmental test when their reliability is at 30 percent of the ultimate goal, whereas the goal for industrial applications is for a system to be at 75 percent of its eventual reliability before entering into formal testing. The latter is accomplished by identifying design flaws in earlier stages of the development process, thereby producing a mature system design much earlier. Again, the overall change in strategy is based on modifying the function of reliability assessment from that of a statistic used to support promotion decisions to that of an early and continuing objective measurement (combining a wide variety of types of information) that is used to support system development by helping to identify components in need of redesign or maturation.

IRGS takes as input a system design that supports prototypes with approximately 25–30 percent of the final required reliability. The complete system undergoes a requirements review, including performance requirements and requirements involving the environment, reliability, safety, maintainability, and support. IRGS then categorizes failure modes for complex systems into type A and type B failure modes. Type A modes correspond to components that have mature designs and are unlikely to be capable of substantial improvement. These modes are typically associated with off-the-shelf components with demonstrated high reliability and a proven, inherited design. Type B modes correspond to components that are candidates for improvement. These components involve unproven new technology or a new design, or may be off-the-shelf components that require improvement before use. System maturity can be measured as the mean time to failure that is due to A components as a percentage of total system mean time to failure. (See Crow [1998] for some related estimation issues.)

The foundation of IRGS is a process that identifies and mitigates type B failure modes by converting them to type A modes. This conversion is accomplished through an iterative process of testing and analysis. Testing steps include evaluation; qualification; reliability growth modeling; and application of the Failure Reporting, Analysis, and Corrective Action System (FRACAS). Analysis steps include understanding of failure modes and fault tree analysis, analysis of reliability design trade-offs, safety, maintainability, design-stress reliability, material and supplies analysis, and analysis of manufacturing for reliability. Analysis and testing are used to identify which components are likely involved in any problems. This is accomplished by applying the broad concept of the type A–B mode approach to component reliability described above. A process that tracks the reliability of components is fed information from this analysis and testing scheme, and the sources of any problems are identified and appropriate corrective actions sought through use of design reviews and consultation with project teams. The result is a reliability growth program iterating between design modification and testing that tends to reduce the time to achieve reliability goals. As the reliability of components improves, their designation changes from type B to type A.

IRGS has been applied successfully for various purposes. For example, it has been used to demonstrate that a system in development would be highly likely to exceed its required reliability of 8 years between failures. It has also been used to show that design upgrades are improving system reliability. Finally, it has been used to monitor a wide variety of system integration tests and hardware and software upgrades and their impact on system reliability.

In all of these applications, it is important to define specific metrics that can be used for tracking performance over time so that reliability growth can be concretely documented. Metrics such as the proportion of failure modes that are classified at any point in time as type B, as well as the performance of components, subsystems, and systems before and after interventions and/or design improvements, serve not only to measure progress toward meeting a project’s reliability goals, but also to inform participants at all levels of the constructive contributions made by the reliability program. Since the success of IRGS relies on the collaboration of a wide array of scientists, engineers, and management personnel, it is imperative that improvements in system performance be documented and widely communicated.
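One of the metrics mentioned above, the proportion of failure modes classified as type B at a given point in time, is straightforward to track. The Python sketch below is illustrative only: the failure-mode names, review labels, and classifications are hypothetical, and IRGS itself may record this information quite differently.

```python
from collections import Counter


def type_b_proportion(modes):
    """Share of failure modes currently classified as type B.

    `modes` maps a failure-mode name to its classification, "A" or "B".
    """
    counts = Counter(modes.values())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no failure modes recorded")
    return counts["B"] / total


# Hypothetical classifications at two successive design reviews.
snapshots = {
    "review 1": {
        "seal leak": "B",
        "connector fatigue": "B",
        "power supply": "A",
        "firmware reset": "B",
    },
    "review 2": {
        "seal leak": "B",
        "connector fatigue": "A",  # matured after a corrective action
        "power supply": "A",
        "firmware reset": "B",
    },
}

for review, modes in snapshots.items():
    print(f"{review}: {type_b_proportion(modes):.0%} of modes are type B")
```

A falling proportion across reviews is one concrete way to document, and communicate, that type B modes are being converted to type A.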

Discussion of Booker and Crow Papers

The two systems for using reliability assessment as an input into system development described by Booker and Crow make credible the claim that comprehensive reliability improvement programs can be both feasible and effective. The cornerstone of both methodologies is their interactive character, with iterations of the traditional test-fix-test cycle leading to interventions that improve components and subsystems. Both methods seek to utilize input from experts, with the PREDICT methodology doing so in a more formal way.

The discussants for these papers, Walt Hollis and Arthur Fries, expressed their optimism that these methods could be implemented in defense system development and would provide substantial benefits for many types of systems. (In the floor discussion, Dan Willard mentioned that his agency had developed a tool that appears to have some similarities to PREDICT, and they were going to compare the two to see whether there are advantages that could be shared.) One promising idea would be to use IRGS as the process for managing reliability growth, with PREDICT being used for reliability assessment.

Hollis mentioned three Army systems for which measurement of system reliability is extremely difficult, for different reasons: the National Missile Defense System, the Theater High Altitude Area Defense System, and the Patriot Missile System. For these systems, reliability growth must necessarily rely on component testing and simulations. Since achievement of high operational reliability cannot be tested in, it must be designed into the system.

Fries pointed out that there is less and less time available for evaluation, and there is a growing need to begin operational evaluations earlier. To do so, one is obligated to use other information sources from the development process. IRGS and PREDICT are both very worthy attempts to accomplish this. Both are highly structured (fully documented) ways of breaking a system up into subsystems and subprocesses and having experts examine each carefully, suggesting additional testing when necessary to direct improvements. These tools can also help guide final operational tests.

Hollis and Fries pointed out that the adoption of these methods does require up-front investment, and DoD program managers need to be willing to expend those funds. Support for this investment will come with expected positive experiences, which IRGS has already demonstrated for defense systems.

Finally, the discussants pointed out that the key to the successful use of both of these processes is that there must be an early and constant emphasis on the performance of the system under operational conditions, as opposed to meeting a required reliability level that is based on laboratory performance. It is still the case that operational sources of reliability problems appear very late in system development. These problems are typically ones that could have been identified much earlier. Both approaches can address this problem if the necessary change in emphasis is achieved.

CONCLUDING REMARKS ON THE MEASUREMENT AND MANAGEMENT OF RELIABILITY GROWTH

The reliability growth management processes outlined above, and reliability growth management more generally, require a variety of sources of information on system reliability as key inputs. Especially important are data from developmental and operational testing and from the field performance of related systems.

Because operational testing is costly (and occurs late in the budget cycle when there is little possibility of a large reallocation of funds for operational test), only a limited amount of information is typically collected in terms of both the number of replications and the number of separate test environments and scenarios that can be examined. Given this limited information, it is typically the case that operational testing data alone cannot confirm, with the usual levels of statistical confidence, that a defense system’s suitability requirements are met. It would be generally useful, therefore, to combine operational test data with appropriate portions of developmental test data on the same system, and data from field use and developmental and operational testing of related systems to provide less variable estimates of system reliability to inform decisions about system promotion. Further, as mentioned previously, early assessment of a system’s reliability is extremely useful to help guide early design changes, and any such early assessment must be based on a combination of information from a variety of sources given the scarcity of direct assessment early in system development. Therefore, combining information, some of it possibly subjective, is likely to prove beneficial in some situations.

It is becoming increasingly common—though by no means widespread as yet—for reliability assessment for industrial systems late in development to make effective use of information on the reliability of related systems (e.g., systems with identical or similar components) and information for the same system from laboratory testing. Even earlier in system development, some industries have demonstrated the utility of information on related systems and expert judgment to help make initial assessments of system reliability that are useful for developmental test planning and for tracking of reliability growth.

In the field of statistics, models for combining information are currently being developed primarily from a Bayesian perspective. Much progress is occurring in this area as a result of the development of simulation methods that have greatly facilitated the calculation of Bayesian estimates. This rapid progress increases expectations that more and more types of applications will be addressed using these new methods.
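To make the idea of combining information concrete, the following is a minimal sketch and not a method presented at the workshop. It assumes exponentially distributed times between failures, a gamma prior on the failure rate, and a discount weight (`dev_weight`) that reduces the influence of developmental test data relative to operational test data. The function names, the weight, the prior parameters, and all numbers are illustrative assumptions. The probability of meeting a requirement is approximated by simulation using the standard library's `random.gammavariate`.

```python
import random


def combine(dev_failures, dev_hours, op_failures, op_hours,
            dev_weight=0.5, prior_shape=0.5, prior_rate=0.0):
    """Gamma posterior (shape, rate) for an exponential failure rate.

    Developmental data are discounted by `dev_weight` (0 ignores them,
    1 pools them fully with the operational data). The prior parameters
    are an assumed weak prior, not a recommended choice.
    """
    shape = prior_shape + dev_weight * dev_failures + op_failures
    rate = prior_rate + dev_weight * dev_hours + op_hours
    return shape, rate


def prob_mtbf_exceeds(shape, rate, required_mtbf, draws=20000, seed=0):
    """Monte Carlo estimate of P(MTBF > required_mtbf) under the posterior."""
    rng = random.Random(seed)
    threshold = 1.0 / required_mtbf  # MTBF > requirement <=> rate < threshold
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    hits = sum(rng.gammavariate(shape, 1.0 / rate) < threshold
               for _ in range(draws))
    return hits / draws


# Hypothetical data: 6 failures in 3000 developmental test hours,
# 2 failures in 800 operational test hours.
shape, rate = combine(6, 3000.0, 2, 800.0, dev_weight=0.5)
print("posterior mean MTBF (hours):", rate / (shape - 1))
print("P(MTBF > 300 hours):", prob_mtbf_exceeds(shape, rate, 300.0))
```

The choice of discount weight stands in for the judgment about how relevant developmental results are to operational conditions. In practice that judgment would need the kind of scrutiny and validation discussed in the next paragraph.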

Certainly such methods cannot be used without some scrutiny, and the benefits of using these models for defense systems will almost certainly vary with the specific application. The linkage between failure modes and failure frequencies across systems and across environments of testing or field use must be well understood before these models are applied. Aggressive efforts are needed to validate the assumptions made. The expectation is that over time, estimates for some measures for some types of systems will be found to benefit greatly from use of these models, whereas for other systems, these models will be much less useful. Chapter 3 provides a description of some specific methods that were suggested for use at the workshop.