Design for reliability is a collection of techniques that are used to modify the initial design of a system to improve its reliability. It appears to the panel that U.S. Department of Defense (DoD) contractors do not fully exploit these techniques. There are probably a variety of reasons for this omission, including the additional cost and time of development needed. However, such methods can dramatically increase system reliability, and DoD system reliability would benefit considerably from the use of such methods. This chapter describes techniques to improve system design to enhance system reliability.
From 1980 until the mid-1990s, the goal of DoD reliability policies was to achieve high initial reliability by focusing on reliability fundamentals during design and manufacturing. Subsequently, DoD allowed contractors to rely primarily on “testing reliability in” toward the end of development. This change was noted in the 2011 Annual Report to Congress of the Director of Operational Test and Evaluation (U.S. Department of Defense, 2011b, p. v):
[I]ndustry continues to follow the 785B methodology, which unfortunately takes a more reactive than proactive approach to achieving reliability goals. In this standard, approximately 30 percent of the system reliability comes from the design while the remaining 70 percent is to be achieved through growth implemented during the test phases.
This pattern points to the need for better design practices and better system engineering (see also Trapnell, 1984; Ellner and Trapnell, 1990).
Many developers of defense systems depend on reliability growth methods applied after the initial design stage to achieve their required levels of reliability. Reliability growth methods, primarily test-analyze-fix-test, are an important part of nearly any reliability program, but “testing reliability in” is both inefficient and ineffective in comparison with a development approach that uses design-for-reliability methods. When failure modes are discovered late in system development, modifying the system architecture and making the related changes can delay fielding and cause cost overruns. In addition, fixes incorporated late in development often cause problems in interfaces, because not all the effects of a design change are identified, with the result that the fielded system requires greater amounts of maintenance and repair.
Traditional military reliability prediction methods, including those detailed in Military Handbook: Reliability Prediction of Electronic Equipment (MIL-HDBK-217) (U.S. Department of Defense, 1991), rely on the collection of failure data and generally assume that the components of the system have failure rates (most often assumed to be constant over time) that can be modified by independent “modifiers” to account for various quality, operating, and environmental conditions. MIL-HDBK-217, for example, offers two methods for predicting reliability, the “stress” method and the “parts count” method. Both methods posit a generic average failure rate that reflects average operating conditions. The shortcoming of this approach is that it relies only on field data, without an understanding of the root causes of failure (for details, see Pecht and Kang, 1988; Wong, 1990; Pecht et al., 1992). This approach is inaccurate for predicting actual field failures and provides highly misleading predictions, which can result in poor design and logistics decisions.
An emerging approach uses physics-of-failure and design-for-reliability methods (see, e.g., Pecht and Dasgupta, 1995). Physics of failure uses knowledge of a system’s life-cycle loading and failure mechanisms to perform reliability modeling, design, and assessment. The approach is based on the identification of potential failure modes, failure mechanisms, and failure sites for the system as a function of its life-cycle loading conditions. The stress at each failure site is obtained as a function of both the loading conditions and the system geometry and material properties. Damage models are used to determine fault generation and propagation.
Many reliability engineering methods have been developed and are collectively referred to as design for reliability (a good description can be found in Pecht, 2009). Design for reliability is a set of techniques that support the product design and the design of the manufacturing process and greatly increase the likelihood that the reliability requirements are met throughout the life of the product at low overall life-cycle cost. These techniques include (1) failure modes and effects analysis, (2) robust parameter design, (3) block diagrams and fault tree analyses, (4) physics-of-failure methods, (5) simulation methods, and (6) root-cause analysis. Over the past 20 years, manufacturers of many commercial products have learned that to expedite system development and to contain costs (both development costs and life-cycle or warranty costs) while still meeting or exceeding reliability requirements, it is essential to use modern design-for-reliability tools as part of a program to achieve reliability requirements.
In particular, physics of failure is a key approach used by manufacturers of commercial products for reliability enhancement. While traditional reliability assessment techniques heavily penalize systems making use of new materials, structures, and technologies because of a lack of sufficient field failure data, the physics-of-failure approach is based on generic failure models that are as effective for new materials and structures as they are for existing designs. The approach encourages innovative designs through a more realistic reliability assessment.
The use of design-for-reliability techniques can help to identify the components that need modification early in the design stage, when it is much more cost-effective to institute such changes. In particular, physics-of-failure methods enable developers to better determine which components need testing, particularly where uncertainty remains about the reliability of critical components.
A specific approach to design for reliability was described during the panel’s workshop by Guangbin Yang of Ford Motor Company. Yang said that at Ford they start with the design for a new system, which is expressed using a system boundary diagram along with an interface analysis. Then design mistakes are discovered using computer-aided engineering, design reviews, failure-mode-and-effects analysis, and fault-tree analysis. Lack of robustness of designs is examined through use of a P-diagram, which examines how noise factors, in conjunction with control factors and the anticipated input signals, generate an output response, which can include various errors.
We emphasize throughout this report the need for assessment of full-system reliability. At this point in the development process, there would also be substantial benefits from assessing the reliability of high-cost and safety-critical subsystems, both to evaluate the current system’s reliability and to inform the reliability of future systems with similar subsystems. Such a step is almost a prerequisite for assessment of full-system reliability.
Producing a reliable system requires planning for reliability from the earliest stages of system design. Assessment of reliability as a result of design choices is often accomplished through the use of probabilistic design for reliability, which compares a component’s strength against the stresses it will face in various environments. These practices can substantially increase reliability through better system design (e.g., built-in redundancy) and through the selection of better parts and materials. In addition, there are practices that can improve reliability with respect to manufacturing, assembly, shipping and handling, operation, maintenance and repair. These practices, collectively referred to as design for reliability, improve reliability through design in several ways:
- They ensure that the supply-chain participants have the capability to produce the parts (materials) and services necessary to meet the final reliability objectives and that those participants are following through.
- They identify the potential failure modes, failure sites, and failure mechanisms.
- They design to the quality level that can be controlled in manufacturing and assembly, considering the potential failure modes, failure sites, and failure mechanisms, obtained from the physics-of-failure analysis, and the life-cycle profile.
- They verify the reliability of the system under the expected life-cycle conditions.
- They demonstrate that all manufacturing and assembly processes are capable of producing the system within the statistical process window required by the design. Because variability in material properties and manufacturing processes will affect a system’s reliability, characteristics of the process must be identified, measured, and monitored.
- They manage the life-cycle usage of the system using closed loop, root-cause monitoring procedures.
Reviewing in-house procedures (e.g., design, manufacturing process, storage and handling, quality control, maintenance) against corresponding standards can help identify factors that could cause failures. For example, misapplication of a component could arise from its use outside the operating conditions specified by the vendor (e.g., current, voltage, or temperature). Equipment misapplication can result from improper changes in the operating requirements of the machine.
After these preliminaries, once design work is initiated, the goal is to determine a design for the system that will enable it to have high initial reliability prior to any formal testing. Several techniques for design for reliability are discussed in the rest of this section: defining and characterizing life-cycle loads to improve design parameters; proper selection of parts and materials; and analysis of failure modes, mechanisms, and effects.
Defining and Characterizing Life-Cycle Loads
The life-cycle conditions of any system influence decisions concerning: (1) system design and development, (2) materials and parts selection, (3) qualification, (4) system safety, and (5) maintenance. The phases in a system’s life cycle include manufacturing and assembly, testing, rework, storage, transportation and handling, operation, and repair and maintenance (for an example of the impact on reliability of electronic components as a result of shock and random vibration life-cycle loads, see Mathew et al., 2007). During each phase of its life cycle, a system will experience various environmental and usage stresses. The life-cycle stresses can include, but are not limited to: thermal, mechanical (e.g., pressure levels and gradients, vibrations, shock loads, acoustic levels), chemical, and electrical loading conditions. The degree and rate of system degradation, and thus reliability, depend on the nature, magnitude, and duration of exposure to such stresses.
Defining and characterizing the life-cycle stresses can be difficult because systems can experience completely different application conditions, including location, the system utilization profile, and the duration of utilization and maintenance conditions. In other words, there is no precise description of the operating environment for any system.1 Consider the example of a computer, which is typically designed for a home or office environment. However, the operational profile of each computer may be completely different depending on user behavior. Some users may shut down the computer every time they log off; others may shut down only once at the end of the day; still others may keep their computers on all the time. Furthermore, one user may keep the computer by a sunny window, while another person may keep the computer near an air conditioner, so the temperature profile experienced by each system, and hence its degradation due to thermal loads, would be different.
There are three methods used to estimate system life-cycle loads relevant to defense systems: similarity analysis, field trial and service records, and in-situ monitoring:
1 This is one of the limitations of prediction that is diminishing over time, given that many systems are being outfitted with sensors and communications technology that provide comprehensive information about the factors that will affect reliability.
- Similarity analysis estimates environmental stresses when sufficient field histories for similar systems are available. Before using data on similar systems for proposed designs, the characteristic differences in design and application for the comparison systems need to be reviewed. For example, electronics inside a washing machine in a commercial laundry are expected to experience a wider distribution of loads and use conditions (because of a large number of users) and higher usage rates than a home washing machine.
- Field trial records provide estimates of the environmental profiles experienced by the system. The data are a function of the lengths and conditions of the trials and can be extrapolated to estimate actual user conditions. Service records provide information on the maintenance, replacement, or servicing performed.
- In-situ monitoring (for a good example, see Das, 2012) can track usage conditions experienced by the system over a system’s life cycle. These data are often collected using sensors. Load distributions can be developed from data obtained by monitoring systems that are used by different users. The data need to be collected over a sufficiently long period to provide an estimate of the loads and their variation over time. In-situ monitoring provides the most accurate account of load histories and is most valuable in design for reliability.
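The load-distribution idea behind in-situ monitoring can be sketched in a few lines. The unit names and temperature readings below are invented for illustration; a fielded program would draw on far larger sensor logs collected over much longer periods.

```python
# Sketch: turning in-situ monitoring logs into an empirical load
# distribution. All readings are illustrative, not from any fielded system.
import statistics

# Hypothetical temperature readings (deg C) logged by sensors on three
# fielded units with different users and environments.
unit_logs = {
    "unit_A": [21, 23, 24, 26, 31, 35, 33, 28],
    "unit_B": [18, 19, 22, 25, 27, 26, 24, 20],
    "unit_C": [30, 34, 38, 41, 39, 36, 32, 29],
}

# Pool the readings across units to characterize the load distribution.
pooled = [t for log in unit_logs.values() for t in log]
pooled.sort()

def percentile(data, p):
    """Nearest-rank percentile of an already-sorted sample."""
    k = max(0, min(len(data) - 1, round(p / 100 * (len(data) - 1))))
    return data[k]

print(f"mean load: {statistics.mean(pooled):.1f} deg C")
print(f"95th-percentile load: {percentile(pooled, 95)} deg C")
```

Summaries such as the upper percentiles of the pooled distribution are what feed design margins and accelerated-test conditions.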
Proper Selection of Parts and Materials
Almost all systems include parts (materials) produced by supply chains of companies. It is necessary to select the parts (materials) that have sufficient quality and are capable of delivering the expected performance and reliability in the application. Because of changes in technology trends, the evolution of complex supply-chain interactions and new market challenges, shifts in consumer demand, and continuing standards reorganization, a cost-effective and efficient parts selection and management process is needed to perform this assessment, which is usually carried out by a multidisciplinary team. (For a description of this process for an electronic system, see Sandborn et al., 2008.) A manufacturer’s ability to produce parts with consistent quality is evaluated; the distributor assessment evaluates the distributor’s ability to provide parts without affecting the initial quality and reliability; and the parts selection and management team defines the minimum acceptability criteria based on a system’s requirements.
In the next step, the candidate part is subjected to application-dependent assessments. The manufacturer’s quality policies are assessed with respect to five assessment categories: process control; handling, storage, and shipping controls; corrective and preventive actions; product traceability; and change
notification. If the part is not found to be acceptable after this assessment, then the assessment team must decide whether an acceptable alternative is available. If no alternative is available, then the team may choose to pursue techniques that mitigate the possible risks associated with using an unacceptable part.
Performance assessment seeks to evaluate a part’s ability to meet the performance requirements (e.g., functional, mechanical, and electrical) of the system. In order to increase performance, manufacturers may adopt features for products that make them less reliable.
In general, there are no distinct boundaries for such stressors as mechanical load, current, or temperature above which immediate failure will occur and below which a part will operate indefinitely. However, there are often a minimum and a maximum limit beyond which the part will not function properly or at which the increased complexity required to address the stress with high probability will not offer an advantage in cost-effectiveness. The ratings of the part manufacturer or the user’s procurement ratings are generally used to determine these limiting values. Equipment manufacturers who use such parts need to adapt their design so that the part does not experience conditions beyond its ratings. It is the responsibility of the parts team to establish that the electrical, mechanical, or functional performance of the part is suitable for the life-cycle conditions of the particular system.
Failure Modes, Mechanisms, and Effects Analysis
A failure mode is the manner in which a failure (at the component, subsystem, or system level) is observed to occur, that is, the specific way in which a failure is manifested, such as the breaking of a truck axle. Failures link hierarchically in terms of the system architecture, so a failure mode may, in turn, cause failures in a higher-level subsystem, may result from the failure of a lower-level component, or both. A failure cause is defined as the circumstances during design, manufacture, storage, transportation, or use that lead to a failure. For each failure mode, many potential causes may be identified.
Failure mechanisms are the processes by which specific combinations of physical, electrical, chemical, and mechanical stresses induce failure. They are categorized as either overstress or wear-out mechanisms: an overstress failure arises from a single load (stress) condition, while a wear-out failure arises from cumulative load (stress) conditions. Knowledge of the likely failure mechanisms is essential for designing reliable systems.
Failure modes, mechanisms, and effects analysis is a systematic approach to identify the failure mechanisms and models for all potential failure modes, and to set priorities among them. It supports physics-of-failure-based design for reliability. High-priority failure mechanisms determine the operational stresses and the environmental and operational parameters that need to be accounted for or controlled in the design.
Failure modes, mechanisms, and effects analysis is used as input in determining the relationships between system requirements and the physical characteristics of the product (and their variation in the production process), the interactions of system materials with loads, and their influence on the system’s susceptibility to failure under the use conditions. This process merges the design-for-reliability approach with knowledge of materials. It combines the application conditions and the duration of the application with an understanding of the likely stresses and potential failure mechanisms. The potential failure mechanisms are considered individually, and each is assessed with models that enable the design of the system for the intended application.
Failure models use appropriate stress and damage analysis methods to evaluate susceptibility of failure. Failure susceptibility is evaluated by assessing the time to failure or likelihood of a failure for a given geometry, material construction, or environmental and operational condition. Failure models of overstress mechanisms use stress analysis to estimate the likelihood of a failure as a result of a single exposure to a defined stress condition. The simplest formulation for an overstress model is the comparison of an induced stress with the strength of the material that must sustain that stress.
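The overstress formulation just described, comparing an induced stress with the strength of the material that must sustain it, can be sketched as a stress-strength interference calculation. The normal distributions and their parameters below are illustrative assumptions, not data for any real material or system.

```python
# Sketch: stress-strength interference for an overstress failure model.
# Failure occurs on any single exposure where induced stress exceeds
# material strength. Distribution parameters are illustrative only.
import random

random.seed(0)

N = 100_000
failures = 0
for _ in range(N):
    stress = random.gauss(mu=400.0, sigma=40.0)    # induced stress, MPa
    strength = random.gauss(mu=550.0, sigma=50.0)  # material strength, MPa
    if stress > strength:
        failures += 1

print(f"estimated overstress failure probability: {failures / N:.4f}")
```

With these assumed distributions the interference region is small, so the estimated failure probability is on the order of one percent; widening either distribution or narrowing the gap between their means raises it quickly.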
Wear-out mechanisms are analyzed using both stress and damage analysis to calculate the time required to induce failure as a result of a defined stress life-cycle profile. In the case of wear-out failures, damage is accumulated over a period until the item is no longer able to withstand the applied load. Therefore, an appropriate method for combining multiple conditions has to be determined for assessing the time to failure. Sometimes, the damage due to the individual loading conditions may be analyzed separately, and the failure assessment results may be combined in a cumulative manner.
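One common way to combine damage from multiple loading conditions, as described above, is a linear damage-accumulation (Miner's) rule, in which each condition contributes damage in proportion to the fraction of its cycles-to-failure consumed. The cycle counts and cycles-to-failure values below are illustrative assumptions.

```python
# Sketch: linear (Miner's rule) damage accumulation across a life-cycle
# profile with several loading conditions. All numbers are illustrative
# assumptions, not data for any real system.

# (applied cycles per year, cycles-to-failure at that stress level)
profile = [
    (2_000, 50_000),  # mild thermal cycles
    (500, 10_000),    # moderate vibration blocks
    (20, 1_000),      # severe shock events
]

# Each condition consumes a fraction n/N_f of the available life per year.
damage_per_year = sum(n / n_f for n, n_f in profile)

# Under the linear rule, failure is predicted when cumulative damage
# reaches 1.0.
years_to_failure = 1.0 / damage_per_year

print(f"damage per year: {damage_per_year:.3f}")
print(f"predicted life: {years_to_failure:.1f} years")
```

The linear rule is the simplest choice; when loading conditions interact, more elaborate cumulative-damage models may be needed, which is the point of the caveat in the text.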
Life-cycle profiles include environmental conditions such as temperature, humidity, pressure, vibration or shock, chemical environments, radiation, contaminants, and loads due to operating conditions, such as current, voltage, and power. The life-cycle environment of a system consists of assembly, storage, handling, and usage conditions of the system. Information on life-cycle conditions can be used for eliminating failure modes that may not occur under the given application conditions.
In the absence of field data, information on system use conditions can be obtained from environmental handbooks or from data collected on similar environments. Ideally, such data should be obtained and processed during actual application. Recorded data from the life-cycle stages for the same or similar products can serve as input for a failure modes, mechanisms, and effects analysis.
Ideally, all failure mechanisms and their interactions are considered for system design and analysis. In the life cycle of a system, several failure mechanisms may be activated by different environmental and operational parameters acting at various stress levels, but in general only a few operational and environmental parameters and failure mechanisms are responsible for the majority of the failures (see Mathew et al., 2012). High-priority mechanisms are those that may cause the product to fail relatively early in its intended life. These mechanisms occur during the normal operational and environmental conditions of the product’s application.
Failure susceptibility is evaluated using the previously identified failure models when they are available. For overstress mechanisms, failure susceptibility is evaluated by conducting a stress analysis under the given environmental and operating conditions. For wear-out mechanisms, failure susceptibility is evaluated by determining the time to failure under the given environmental and operating conditions. If no failure models are available, then the evaluation is based on past experience, manufacturer data, or handbooks.
After evaluation of failure susceptibility, occurrence ratings under environmental and operating conditions applicable to the system are assigned to the failure mechanisms. For the overstress failure mechanisms that precipitate failure, the highest occurrence rating, “frequent,” is assigned. If no overstress failures are precipitated, then the lowest occurrence rating, “extremely unlikely,” is assigned. For the wear-out failure mechanisms, the ratings are assigned on the basis of benchmarking the individual time to failure for a given wear-out mechanism with overall time to failure, expected product life, past experience, and engineering judgment.
The purpose of failure modes, mechanisms, and effects analysis is to identify potential failure mechanisms and models for all potential failure modes and to prioritize them. To ascertain the criticality of the failure mechanisms, a common approach is to calculate a risk priority number for each mechanism: the product of its ratings for occurrence, severity, and detection. The higher the risk priority number, the higher a failure mechanism is ranked. Detection describes the probability of detecting the failure modes associated with the failure mechanism. Severity describes the seriousness of the effect of the failure caused by a mechanism. Additional insights into the criticality of a failure mechanism can be obtained by examining past repair and maintenance actions, the reliability capabilities of suppliers, and results observed in the initial development tests.
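The risk-priority-number ranking can be written out in a few lines. The mechanism names and ratings below are invented for illustration; real programs draw ratings from defined scales and engineering review.

```python
# Sketch: ranking failure mechanisms by risk priority number (RPN).
# RPN = occurrence x severity x detection rating. All ratings below are
# illustrative; the scales (here, higher = worse) are an assumption.
mechanisms = {
    "solder-joint fatigue": {"occurrence": 6, "severity": 7, "detection": 4},
    "connector corrosion":  {"occurrence": 3, "severity": 5, "detection": 6},
    "seal wear-out":        {"occurrence": 5, "severity": 8, "detection": 3},
}

def rpn(ratings):
    """Risk priority number: product of the three ratings."""
    return ratings["occurrence"] * ratings["severity"] * ratings["detection"]

# Highest RPN first: these mechanisms get design attention first.
ranked = sorted(mechanisms.items(), key=lambda kv: rpn(kv[1]), reverse=True)
for name, ratings in ranked:
    print(f"{name}: RPN = {rpn(ratings)}")
```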
The reliability potential of a system design is the reliability the system can achieve when developed consistent with good practice, conditional on a use profile. It is estimated through various forms of simulation and component-level testing, including integrity tests, virtual qualification, and reliability testing.
Integrity is a measure of the appropriateness of the tests conducted by the manufacturer and of the part’s ability to survive those tests. Integrity test data (often available from the part manufacturer) are examined in light of the life-cycle conditions and applicable failure mechanisms and models. If the magnitude and duration of the life-cycle conditions are less severe than those of the integrity tests, and if the test sample size and results are acceptable, then the part reliability is acceptable. If the integrity test data are insufficient to validate part reliability in the application, then virtual qualification should be considered.
Virtual qualification can be used to accelerate the qualification process of a part for its life-cycle environment. Virtual qualification uses computer-aided simulation to identify and rank the dominant failure mechanisms associated with a part under life-cycle loads, determine the acceleration factor for a given set of accelerated test parameters, and determine the expected time to failure for the identified failure mechanisms (for an example, see George et al., 2009).
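For a thermally activated failure mechanism, the acceleration factor for a given set of accelerated test parameters is often obtained from an Arrhenius model. The activation energy and temperatures below are illustrative assumptions, not values for any particular mechanism.

```python
# Sketch: Arrhenius acceleration factor for a thermally activated
# failure mechanism. Activation energy and temperatures are illustrative.
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(e_a_ev, t_use_c, t_test_c):
    """Acceleration factor of an accelerated test relative to use conditions."""
    t_use = t_use_c + 273.15   # convert deg C to K
    t_test = t_test_c + 273.15
    return math.exp((e_a_ev / K_B) * (1.0 / t_use - 1.0 / t_test))

# Assumed: 0.7 eV activation energy, 40 deg C use, 100 deg C test.
af = arrhenius_af(e_a_ev=0.7, t_use_c=40.0, t_test_c=100.0)
print(f"acceleration factor: {af:.1f}")
```

Under these assumptions one test hour at 100 deg C represents tens of hours at the 40 deg C use condition, which is how virtual qualification relates accelerated-test time to expected field life for the identified mechanism.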
Each failure model is made up of a stress analysis model and a damage assessment model. The output is a ranking of different failure mechanisms, based on the time to failure. A stress model captures the product architecture, while a damage model depends on a material’s response to the applied stress. Virtual qualification can be used to optimize the product design in such a way that the minimum time to failure of any part of the product is greater than its desired life. Although the data obtained from virtual qualification cannot fully replace the data obtained from physical tests, they can increase the efficiency of physical tests by indicating the potential failure modes and mechanisms that can be expected.
Ideally, a virtual qualification process will identify quality suppliers and quality parts through use of physics-of-failure modeling and a risk assessment and mitigation program. The process allows qualification to be incorporated into the design phase of product development, because it allows design, manufacturing, and testing to be conducted promptly and cost-effectively.
The effects of manufacturing variability can be assessed by simulation as part of the virtual qualification process. But it is important to remember that the accuracy of the results using virtual qualification depends on the accuracy of the inputs to the process, that is, the system geometry and material properties, the life-cycle loads, the failure models used, the analysis domain, and the degree of discreteness used in the models (both spatial and temporal). Hence, to obtain a reliable prediction, the variability in the inputs needs to be specified using distribution functions, and the validity of the failure models needs to be tested by conducting accelerated tests (see Chapter 6 for discussion).
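The point about specifying input variability with distribution functions can be illustrated with a small Monte Carlo run through a hypothetical time-to-failure model. Both the model form and all parameters below are assumptions chosen for illustration only.

```python
# Sketch: propagating input variability (geometry, material property)
# through a hypothetical time-to-failure model via Monte Carlo.
# The model and all parameters are illustrative assumptions.
import random
import statistics

random.seed(1)

def time_to_failure(thickness_mm, fatigue_coeff):
    """Hypothetical wear-out model: life grows with the square of thickness."""
    return 1_000.0 * fatigue_coeff * thickness_mm ** 2

samples = []
for _ in range(50_000):
    thickness = random.gauss(2.0, 0.1)  # mm, manufacturing variability
    coeff = random.gauss(1.0, 0.05)     # material-property variability
    samples.append(time_to_failure(thickness, coeff))

samples.sort()
b10 = samples[len(samples) // 10]  # ~10th-percentile ("B10") life

print(f"median life: {statistics.median(samples):.0f} hours")
print(f"B10 life: {b10:.0f} hours")
```

The spread between the median and the lower percentiles is driven entirely by the input distributions, which is why those distributions, and the validity of the failure model itself, must be established before the prediction can be trusted.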
Reliability testing can be used to determine the limits of a system, to examine systems for design flaws, and to demonstrate system reliability. The tests may be conducted according to industry standards or to required customer specifications. Reliability testing procedures may be general, or the tests may be specifically designed for a given system.
The information required for designing system-specific reliability tests includes the anticipated life-cycle conditions, the reliability goals for the system, and the failure modes and mechanisms identified during reliability analysis. The different types of reliability tests that can be conducted include tests for design marginality, determination of destruct limits, design verification testing before mass production, on-going reliability testing, and accelerated testing (for examples, see Keimasi et al., 2006; Mathew et al., 2007; Osterman 2011; Alam et al., 2012; and Menon et al., 2013).
Many testing environments may need to be considered, including high temperature, low temperature, temperature cycle and thermal shock, humidity, mechanical shock, variable frequency vibration, atmospheric contaminants, electromagnetic radiation, nuclear/cosmic radiation, sand and dust, and low pressure:
- High temperature: High-temperature tests assess failure mechanisms that are thermally activated. In electromechanical and mechanical systems, high temperatures may soften insulation, jam moving parts because of thermal expansion, blister finishes, oxidize materials, reduce viscosity of fluids, evaporate lubricants, and cause structural overloads due to physical expansions. In electrical systems, high temperatures can cause variations in resistance, inductance, capacitance, power factor, and dielectric constant.
- Low temperature: In mechanical and electromechanical systems, low temperatures can cause plastics and rubber to lose flexibility and become brittle, cause ice to form, increase viscosity of lubricants and gels, and cause structural damage due to physical contraction. In electrical systems, low-temperature tests are performed primarily to accelerate threshold shifts and parametric changes due to variation in electrical material parameters.
- Temperature cycle and thermal shock: Temperature cycle and thermal shock testing are most often used to assess the effects of thermal expansion mismatch among the different elements within a system, which can result in materials’ overstressing and cracking, crazing, and delamination.
- Humidity: Excess humidity can cause leakage paths between electrical conductors, oxidation, corrosion, and swelling in materials such as gaskets; loss of humidity can cause granulation.
- Mechanical shock: Some systems must be able to withstand a sudden change in mechanical stress typically due to abrupt changes in motion from handling, transportation, or actual use. Mechanical shock can lead to overstressing of mechanical structures causing weakening, collapse, or mechanical malfunction.
- Variable frequency vibration: Some systems must be able to withstand deterioration due to vibration. Vibration may lead to the deterioration of mechanical strength from fatigue or overstress; may cause electrical signals to be erroneously modulated; and may cause materials and structure to crack, be displaced, or be shaken loose from mounts.
- Atmospheric contaminants: The atmosphere contains such contaminants as airborne acids and salts that can lower electrical and insulation resistance, oxidize materials, and accelerate corrosion. Mixed flowing gas tests are often used to assess the reliability of parts that will be subjected to these environments.
- Electromagnetic radiation: Electromagnetic radiation can cause spurious and erroneous signals from electronic components and circuitry. In some cases, it may cause complete disruption of normal electrical equipment such as communication and measuring systems.
- Nuclear/cosmic radiation: Nuclear/cosmic radiation can cause heating and thermal aging; alter the chemical, physical, and electrical properties of materials; produce gasses and secondary radiation; oxidize and discolor surfaces; and damage electronic components and circuits.
- Sand and dust: Sand and dust can scratch and abrade finished surfaces; increase friction between surfaces, contaminate lubricants, clog orifices, and wear materials.
- Low pressure: Low pressure can cause overstress of structures such as containers and tanks that can explode or fracture; cause seals to leak; cause air bubbles in materials, which may explode; lead to internal heating due to lack of cooling medium; cause arcing breakdowns in insulations; lead to the formation of ozone; and make outgassing more likely.
Reliability test data analysis can be used to provide a basis for design changes prior to mass production, to help select appropriate failure models and estimate model parameters, and to modify reliability predictions for a product. Test data can also be used to create guidelines for manufacturing tests, including screens, and to set test requirements for materials, parts, and subassemblies obtained from suppliers.
We stress that the still-used handbook MIL-HDBK-217 (U.S. Department of Defense, 1991) does not provide adequate design guidance and information regarding microelectronic failure mechanisms. In many cases, MIL-HDBK-217 methods cannot distinguish between separate failure mechanisms. This approach stands in clear contrast with physics-of-failure estimation: “an approach to design, reliability assessment, testing, screening and evaluating stress margins by employing knowledge of root-cause failure processes to prevent product failures through robust design and manufacturing practices” (Cushing et al., 1993, p. 542). A detailed critique of MIL-HDBK-217 is provided in Appendix D.
Failure tracking activities are used to collect test- and field-failed components and related failure information. Failures have to be analyzed to identify the root causes of manufacturing defects and of test and field failures. The information collected needs to include the failure point (quality testing, reliability testing, or field), the failure site, and the failure mode and mechanism. For each product category, a Pareto chart of failure causes can be created and continually updated.
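The continually updated Pareto chart described above amounts to ranking root causes by frequency with a running cumulative percentage. A minimal sketch in Python, using entirely hypothetical failure records and category names:

```python
from collections import Counter

# Hypothetical failure records: (product_category, root_cause) pairs
# collected from quality testing, reliability testing, and the field.
failures = [
    ("power_supply", "solder joint fatigue"),
    ("power_supply", "capacitor degradation"),
    ("power_supply", "solder joint fatigue"),
    ("power_supply", "connector corrosion"),
    ("power_supply", "solder joint fatigue"),
    ("power_supply", "capacitor degradation"),
]

def pareto(records, category):
    """Rank root causes for one product category by frequency,
    with the cumulative percentage of all failures in that category."""
    counts = Counter(cause for cat, cause in records if cat == category)
    total = sum(counts.values())
    cumulative = 0
    rows = []
    for cause, n in counts.most_common():
        cumulative += n
        rows.append((cause, n, 100.0 * cumulative / total))
    return rows

for cause, n, cum_pct in pareto(failures, "power_supply"):
    print(f"{cause:25s} {n:3d} {cum_pct:6.1f}%")
```

As new failure analyses close out, records are appended and the ranking is regenerated, keeping attention on the causes that dominate the failure population.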
The outputs for this key practice are a failure summary report arranged in groups of similar functional failures, actual times to failure of components based on time of specific part returns, and a documented summary of corrective actions implemented and their effectiveness. All the lessons learned from failure analysis reports can be included in a corrective actions database for future reference. Such a database can help save considerable funds in fault isolation and rework associated with future problems.
A classification system of failures, failure symptoms, and apparent causes can be a significant aid in the documentation of failures and their root causes and can help identify suitable preventive methods. By having such a classification system, it may be easier for engineers to identify and share information on vulnerable areas in the design, manufacture, assembly, storage, transportation, and operation of the system. Broad failure classifications include system damage or failure, loss in operating performance, loss in economic performance, and reduction in safety. Failures categorized as system damage can be further categorized according to the failure mode and mechanism. Different categories of failures may require different root-cause analysis approaches and tools.
The goal of failure analysis is to identify the root causes of failures. The root cause is the most basic causal factor or factors that, if corrected or removed, will prevent the recurrence of the failure. Failure analysis techniques include nondestructive and destructive techniques. Nondestructive techniques include visual observation and examination by optical microscopy, x-ray imaging, and acoustic microscopy. Destructive techniques include cross-sectioning of samples and de-capsulation. Failure analysis is used to identify the locations at which failures occur and the fundamental mechanisms by which they occurred. Failure analysis will be successful if it is approached systematically, starting with nondestructive examinations of the failed test samples and then moving on to more advanced destructive examinations; see Azarian et al. (2006) for an example.
Product reliability can be ensured by using a closed-loop process that provides feedback to design and manufacturing in each stage of the product life cycle, including after the product is shipped and fielded. Data obtained from maintenance, inspection, testing, and usage monitoring can be used to perform timely maintenance for sustaining the product and for preventing failures.
An important tool in failure analysis is the failure reporting, analysis and corrective action system (FRACAS). According to the Reliability Analysis Center:
A failure reporting, analysis and corrective action system (FRACAS) is defined, and should be implemented, as a closed-loop process for identifying and tracking root failure causes, and subsequently determining, implementing and verifying an effective corrective action to eliminate their reoccurrence. The FRACAS accumulates failure, analysis and corrective action information to assess progress in eliminating hardware, software and process-related failure modes and mechanisms. It should contain information and data to the level of detail necessary to identify design or process deficiencies that should be eliminated.
It is important for FRACAS to be applied throughout developmental and operational testing and post-deployment.
Reliability predictions are an important part of product design. They are used for a number of different purposes: (1) contractual agreements, (2) feasibility evaluations, (3) comparisons of alternative designs, (4) identification of potential reliability problems, (5) maintenance and logistics support planning, and (6) cost analyses. As a consequence, erroneous reliability predictions can result in serious problems during development and after a system is fielded. An overly optimistic prediction, estimating too few failures, can result in selection of the wrong design, budgeting for too few spare parts, expensive rework, and poor field performance. An overly pessimistic prediction can result in unnecessary additional design and test expenses to resolve the perceived low reliability. This section discusses two explicit models and similarity analyses for developing reliability predictions.
Two Explicit Models
Fault trees and reliability block diagrams are two methods for developing assessments of system reliabilities from those of component reliabilities: see Box 5-1.2 Although they can be time-consuming and complex (depending on the level of detail applied), they can accommodate model dependencies. Nonconstant failure rates can be handled by assessing the probability of failure at different times using the probability of failure for each component at each time, rather than using the component’s mean time between failures. Thus, components can be modeled to have decreasing, constant, or increasing failure rates. These methods can also accommodate time-phased missions. Unfortunately, there may be so many ways for a system to fail that an explicit model (one that identifies all the failure possibilities) can be intractable. Solving these models using the complete enumeration method is discussed in many standard reliability textbooks (see, e.g., Meeker and Escobar (1998); also see the Guide for Selecting and Using Reliability Predictions of the IEEE Standards Association [IEEE 1413.1]).
2 For additional design-for-reliability tools that have proven useful in DoD acquisition, see Section 2.1.4 of the TechAmerica Reliability Program Handbook, TA-HB-0009, available: http://www.techstreet.com/products/1855520 [August 2014].
Two Common Techniques for Design for Reliability
Reliability Block Diagrams. Reliability block diagrams model the functioning of a complex system through use of a series of “blocks,” in which each block represents the working of a system component or subsystem. Reliability block diagrams allow one to aggregate from component reliabilities to system reliability. A reliability block diagram can be used to optimize the allocation of reliability to system components by considering the possible improvement of reliability and the associated costs due to various design modifications. It is typical for very complex systems to initiate such diagrams at a relatively high level, providing more detail for subsystems and components as needed.
Fault Tree Analysis. Fault tree analysis is a systematic method for defining and analyzing system failures as a function of the failures of various combinations of components and subsystems. The basic elements of a fault tree diagram are events that correspond to improper functioning of components and subcomponents, and gates that represent and/or conditions. As is the case for reliability block diagrams, fault trees are initially built at a relatively coarse level and then expanded as needed to provide greater detail. The construction concludes with the assignment of reliabilities to the functioning of the components and subcomponents. At the design stage, these reliabilities can either come from the reliabilities of similar components for related systems, from supplier data, or from expert judgment. Once these detailed reliabilities are generated, the fault tree diagram provides a method for assessing the probabilities that higher aggregates fail, which in turn can be used to assess failure probabilities for the full system. Fault trees can clarify the dependence of a design on a given component, thereby prioritizing the need for added redundancy or some other design modification of various components, if system reliability is deficient. Fault trees can also assist with root-cause analyses.
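For statistically independent components, both techniques ultimately reduce to combining series structures (all blocks must work; the AND of successes) and active-parallel structures (the system fails only if all blocks fail; the OR gate of a fault tree). The following Python sketch shows the arithmetic with a made-up structure and component reliabilities; the independence assumption is the part that real analyses most often must relax.

```python
def series(*rel):
    """A series structure works only if every block works."""
    p = 1.0
    for r in rel:
        p *= r
    return p

def parallel(*rel):
    """An active-parallel structure fails only if every block fails
    (a fault tree OR gate, stated in terms of reliability)."""
    q = 1.0
    for r in rel:
        q *= (1.0 - r)
    return 1.0 - q

# Hypothetical system: a power unit in series with two redundant
# controllers, all in series with a sensor (independent failures).
r_system = series(0.99, parallel(0.95, 0.95), 0.98)
```

Nesting these two operations follows the block diagram or fault tree from the component level up to the system level, which is exactly the bottom-up aggregation described above.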
The two methods discussed above are “bottom-up” predictions. They use failure data at the component level to assign rates or probabilities of failure. Once the components and external events are understood, a system model is developed. An alternative method is to use a “top-down” approach using similarity analysis. Such an analysis compares two designs: a recent vintage product with proven reliability and a new design with unknown reliability. If the two products are very similar, then the new design is believed to have reliability similar to the predecessor design. Sources of reliability and failure data include supplier data, internal manufacturing test results from various phases of production, and field failure data.
There has been some research on similarity analyses, describing either
the full process or specific aspects of this technique (see, e.g., Foucher et al., 2002). Similarity analyses have been reported to have a high degree of accuracy in commercial avionics (see Boydston and Lewis, 2009). Because this is a relatively new technique for prediction, however, there is no universally accepted procedure.
The main idea in this approach is to draw as much relevant information as possible from tests and field data. As the “new” product is produced and used in the field, these data are used to update the prediction for future production of the same product (for details, see Pecht, 2009). However, changes between the older and newer products do occur, and can involve
- product function and complexity
- technology upgrades
- engineering design process
- design tools and rules
- engineering team
- de-rating concepts
- assembly suppliers
- manufacturing processes
- manufacturing tooling
- assembly personnel
- test equipment and processes
- management policies
- quality and training programs, and
- application and use environment.
In this process, every aspect of the product design, the design process, the manufacturing process, corporate management philosophy, and quality processes and environment can be a basis for comparison of differences. As the extent and degree of difference increase, so will the differences in reliability. Details on performing similarity analyses can be found in the Guide for Selecting and Using Reliability Predictions of the IEEE Standards Association (IEEE 1413.1).
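Because there is no universally accepted procedure for similarity analysis, any quantification is necessarily a judgment call. As a purely illustrative sketch (the area names, difference scores, and weights below are all invented, not taken from IEEE 1413.1), one simple approach is a weighted difference score over the comparison areas listed above:

```python
# Hypothetical similarity scoring: each comparison area gets a
# difference score in [0, 1] (0 = identical to the predecessor,
# 1 = completely different) and a weight reflecting its judged
# influence on reliability.
areas = {
    #  area                     (difference, weight)
    "product function":         (0.1, 3.0),
    "technology upgrades":      (0.4, 2.0),
    "manufacturing processes":  (0.2, 2.5),
    "application environment":  (0.0, 3.0),
}

def similarity_score(areas):
    """Weighted average difference, inverted so that 1.0 means the
    new design is effectively the same as the proven predecessor."""
    total_w = sum(w for _, w in areas.values())
    weighted = sum(d * w for d, w in areas.values())
    return 1.0 - weighted / total_w

score = similarity_score(areas)
```

A low score flags that the predecessor's field reliability is a weak basis for predicting the new design, signaling where additional testing or analysis is warranted.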
Redundancy exists when one or more of the parts of a system can fail and the system can still function with the parts that remain operational. Two common types of redundancy are active and standby.
In active redundancy, all of a system’s parts are energized during the operation of the system, so each part consumes its useful life at its full rate. An active redundant system is a standard “parallel” system, which fails only when all of its components have failed.
In standby redundancy, some parts are not energized during the operation of the system; they get switched on only when there are failures in the active parts. In a system with standby redundancy, ideally the parts will last longer than the parts in a system with active redundancy. A standby system consists of an active unit or subsystem and one or more inactive units, which become active in the event of a failure of the functioning unit. The failures of active units are signaled by a sensing subsystem, and the standby unit is brought to action by a switching subsystem.
There are three conceptual types of standby redundancy: cold, warm, and hot. In cold standby, the secondary part(s) is completely shut down until needed. This lowers the number of hours the part is active, so it consumes no useful life while on standby, but the transient stresses on the part(s) during switching may be high, which can accelerate the consumption of life during switching. In warm standby, the secondary part(s) is usually active but is idling or unloaded. In hot standby, the secondary part(s) forms an active parallel system. The life of the hot standby part(s) is consumed at the same rate as that of active parts. Redundancy can often be addressed at various levels of the system architecture.
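The reliability advantage of standby over active redundancy can be made concrete under simplifying assumptions. The sketch below (function names and numerical values are illustrative) compares two identical units with exponentially distributed lifetimes: in active parallel, both consume life from the start; in cold standby with an ideal, failure-free sensing-and-switching subsystem, the spare consumes no life until switched in, giving the classical Erlang survival function.

```python
import math

def r_active_parallel(lam, t, n=2):
    """Active redundancy: n identical exponential units in parallel;
    the system fails only when all n units have failed."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n

def r_cold_standby(lam, t, n=2):
    """Cold standby with ideal switching: the system survives while
    fewer than n units have failed, which for exponential units is
    an Erlang (gamma) survival function."""
    return math.exp(-lam * t) * sum((lam * t) ** k / math.factorial(k)
                                    for k in range(n))

lam, t = 1e-3, 1000.0  # hypothetical failure rate (per hour), mission time
# Cold standby outperforms active parallel here because the spare
# consumes no useful life until it is needed.
```

In practice the comparison is less one-sided: real switching subsystems can fail, and the transient stress at switch-over noted above can offset part of the cold-standby advantage.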
“Risk” is defined as a measure of the priority assessed for the occurrence of an unfavorable event. General methodologies for risk assessment (both quantitative and qualitative) have been developed and are widely available. Assessing the risks associated with accepting a part for use in a specific application involves multiple steps:
- Start with a risk pool, which is the list of all known risks, along with knowledge of how those risks are quantified (if applicable) and possibly mitigated.
- Determine an application-specific risk catalog: Using the specific application’s properties, select risks from the risk pool to form an application-specific risk catalog. The application properties most likely to be used to create the risk catalog include functionality, life-cycle environments (e.g., manufacturing, shipping and handling, storage, operation, and possibly end-of-life), manufacturing characteristics (e.g., schedule, quantity, location, and suppliers), sustainment plans and requirements, and operational life requirements.
- Characterize the risk catalog: Generate application-specific details about the likelihood of occurrence, consequences of occurrence, and acceptable mitigation approaches for each of the risks in the risk catalog.
- Classify risks: Classify each risk in the risk catalog in one of two categories: functionality risks and producibility risks. Functionality risks impair the system’s ability to operate to the customer’s specification. They are risks for which the consequences of occurrence are loss of equipment, mission, or life. Producibility risks are risks for which the consequences of occurrence are financial (reduction in profitability). Producibility risks determine the probability of successfully manufacturing the product, which in turn refers to meeting some combination of economics, schedule, manufacturing yield, and quantity targets.
- Determine risk-mitigating factors: Factors may exist that modify the applicable mitigation approach for a particular part, product, or system. These factors include the type or technology of the part under consideration, the quantity and type of manufacturer’s data available for the part, the quality and reliability monitors employed by the part manufacturer, and the comprehensiveness of production screening at the assembly level.
- Rank and down-select: Not all functionality risks require mitigation. If the likelihood or consequences of occurrence are low, then the risk may not need to be addressed. The ranking may be performed using a scoring algorithm that couples likelihood and consequence into a single dimensionless quantity that allows diverse risks to be compared. Once the risks are ranked, those that fall below some threshold in the rankings can be omitted.
- Determine the verification approach: For the risks that are ranked above the threshold determined in the previous activity, consider the mitigation approaches defined in the risk catalog. The acceptable combination of mitigation approaches becomes the required verification approach.
- Determine the impact of unmanaged risk: Combine the likelihood of risk occurrence with the consequences of occurrence to predict the resources associated with risks that the product development team chooses not to manage proactively. (This assumes that all unmanaged risks are producer risks.)
- Determine the resources required to manage the risk: Create a management plan and estimate the resources needed to perform a prescribed regimen of monitoring the part’s field performance, the vendor, and assembly/manufacturability as applicable.
- Determine the risk impact: Assess the impact of functionality risks by estimating the resources necessary to develop and perform the worst-case verification activity allocated over the entire product life-cycle (production and sustainment). The value of the product that may be scrapped during the verification testing should be included in the impact. For managed producibility risks, the resources required are used to estimate the impact. For unmanaged producibility risks, the resources predicted in the impact analysis are translated into costs.
- Decide whether the risk is acceptable: If the impact fits within the overall product’s risk threshold and budget, then the part selection can be made with the chosen verification activity (if any). Otherwise, design changes or alternative parts must be considered.
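The “rank and down-select” step above couples likelihood and consequence into a single dimensionless score. A minimal sketch, assuming ordinal 1-5 scales and a simple product score (the risk names, scales, and threshold below are invented for illustration; real programs may use calibrated probabilities and cost-based consequences):

```python
# Hypothetical risk catalog: each risk carries a likelihood and a
# consequence, both on a 1-5 ordinal scale.
risk_catalog = [
    {"name": "counterfeit part",         "likelihood": 2, "consequence": 5},
    {"name": "solder fatigue",           "likelihood": 4, "consequence": 3},
    {"name": "supplier yield shortfall", "likelihood": 3, "consequence": 2},
    {"name": "label adhesive failure",   "likelihood": 1, "consequence": 1},
]

def rank_and_downselect(catalog, threshold):
    """Score = likelihood * consequence; keep risks at or above the
    threshold, ranked from highest to lowest score. Risks below the
    threshold are omitted from further mitigation planning."""
    scored = [(r["likelihood"] * r["consequence"], r["name"]) for r in catalog]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored if score >= threshold]

kept = rank_and_downselect(risk_catalog, threshold=6)
```

The risks that survive the threshold then feed the verification-approach and resource-estimation steps that follow in the process.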
A product’s health is the extent of degradation or deviation from its “normal” operating state. Health monitoring is the method of measuring and recording a product’s health in its life-cycle environment. Prognostics is the prediction of the future state of health of a system on the basis of current and historical health conditions as well as historical operating and environmental conditions.
Prognostics and health management consists of technologies and methods to assess the reliability of a system in its actual life-cycle conditions to determine the likelihood of failure and to mitigate system risk: for examples and further details, see Jaai and Pecht (2010) and Cheng et al. (2010a, 2010b). The application areas of this approach include civil and mechanical structures, machine-tools, vehicles, space applications, electronics, computers, and even human health.
Prognostics and health management techniques combine sensing, recording, and interpretation of environmental, operational, and performance-related parameters to indicate a system’s health. Sensing, feature extraction, diagnostics, and prognostics are key elements. The data to be collected to monitor a system’s health are used to determine the sensor type and location in a monitored system, as well as the methods of collecting and storing the measurements. Feature extraction is used to analyze the measurements and extract the health indicators that characterize the system degradation trend. With a good feature, one can determine whether the system is deviating from its nominal condition: for examples, see Kumar et al. (2012) and Sotiris et al. (2010). Diagnostics are used to isolate and identify the failing subsystems/components in a system, and prognostics carry out the estimation of remaining useful life of the systems, subsystems,
or components: for examples of diagnostics and prognostics, see Vasan et al. (2012) and Sun et al. (2012).
The prognostics and health management process does not predict reliability but rather provides a reliability assessment based on in-situ monitoring of certain environmental or performance parameters. This process combines the strengths of the physics-of-failure approach with live monitoring of the environment and operational loading conditions.
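One common prognostic pattern is to extract a health indicator from monitored data, fit its degradation trend, and extrapolate to a failure threshold to estimate remaining useful life. The sketch below uses an ordinary least-squares linear trend on invented monitoring data; real implementations rely on physics-of-failure models or more sophisticated statistical learning, so this illustrates only the basic idea.

```python
# Hypothetical prognostics sketch: fit a linear trend to a monitored
# health indicator (e.g., normalized deviation from nominal) and
# extrapolate to a failure threshold.
def remaining_useful_life(times, health, threshold):
    """Least-squares line through (time, health) pairs; return the
    time remaining until the fitted line reaches the threshold."""
    n = len(times)
    t_bar = sum(times) / n
    h_bar = sum(health) / n
    slope = (sum((t - t_bar) * (h - h_bar) for t, h in zip(times, health))
             / sum((t - t_bar) ** 2 for t in times))
    if slope <= 0:
        return float("inf")  # no degradation trend detected
    intercept = h_bar - slope * t_bar
    t_fail = (threshold - intercept) / slope
    return max(0.0, t_fail - times[-1])

# Hypothetical monitoring data: health index drifting toward a
# failure threshold of 1.0.
rul = remaining_useful_life([0, 10, 20, 30], [0.10, 0.20, 0.30, 0.40], 1.0)
```

Because the estimate is driven by in-situ measurements rather than handbook failure rates, it reflects the actual environmental and operational loading the system has experienced, which is the core strength of the prognostics and health management approach described above.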