Producing a reliable defense system is predicated on using proper engineering techniques throughout system development, beginning before program initiation, through delivery of prototypes, to fielding of the system. To start, one must develop requirements that are technically achievable, measurable, and testable; in addition, they need to be cost-effective when life-cycle costs are considered.
Once reasonable requirements have been determined, the development of reliable defense systems depends on having an adequate budget and time to build reliability in at the design stage and then to refine the design through testing that is focused on reliability. We make several recommendations geared toward ensuring the allocation of sufficient design and test resources in support of the development of reliable defense systems. We also offer recommendations on information sharing and other factors related to the oversight of the work of contractors and subcontractors, the acceptance of prototypes from contractors, developmental testing, reliability growth modeling, and the collection and analysis of data throughout system development.
The panel’s analysis and recommendations to the U.S. Department of Defense (DoD) cover the many steps and aspects of the acquisition process, presented in this chapter in roughly chronological order: analysis of alternatives; request for proposals; an outline reliability demonstration plan; raising the priority of reliability; design for reliability; reliability growth testing; design changes; information on operational environments; acquisition contracts; delivery of prototypes for developmental testing; developmental testing; and intermediate reliability goals.
We note that in several of our recommendations the panel designates the Under Secretary of Defense for Acquisition, Technology, and Logistics (USD AT&L) as the implementing agent. This designation reflects the panel’s understanding of DoD’s acquisition process and regulations and the flow of authority from USD AT&L, first through the Assistant Secretary of Defense for Networks and Information Integration (ASD NII) and the Director of Operational Test and Evaluation (DOT&E) and then through the component acquisition authorities (of each service) and program executive officers to program managers.1
We also note that some of our recommendations are partly redundant with existing acquisition procedures and regulations: our goal in including them is to emphasize their importance and to encourage more conscientious implementation.
The defense acquisition process begins when DoD identifies an existing military need that requires a materiel solution. The result can be a request either for the development of a new defense system or the modification of an existing one. Different suggestions for addressing the need are compared in an “analysis of alternatives.” This document contains the missions that a proposed system is intended to carry out and the conditions under which the proposed system would operate. Currently, the analysis of alternatives does not necessarily include the possible effects of system reliability on life-cycle costs (although many such analyses do). Clearly, those costs do need to be considered in the decision on whether to proceed.
After there is a decision to proceed, reliability requirements are first introduced and justified in the RAM-C (reliability, availability, maintainability, and cost) document, which lays out the reliability requirements for the intended system and contains the beginnings of a reliability model justifying that the reliability requirement is technically feasible.
If it is decided to develop a new defense system, possible contractors from industry are solicited using a request for proposal (RFP), which is based on both the analysis of alternatives and the RAM-C document. RFPs describe the system’s capabilities so that potential bidders fully understand what is requested. RFPs specify the intended missions the system needs to successfully undertake, the conditions under which the system will operate and be maintained during its lifetime, the requirements that the system needs to satisfy, and what constitutes a system failure. An RFP also contains, from the RAM-C document, the beginning of a reliability model so that the contractor can understand how DoD can assert that the reliability requirement is achievable.
1DoDI 5000.02 states that “Program Managers for all programs shall formulate a viable Reliability, Availability and Maintainability strategy that includes a reliability growth program.” Our recommendations, if implemented, will expand on this existing requirement and affect the work and authority of program managers and test authorities, but regulatory change is the responsibility of USD (AT&L), together with ASD (NII) and DOT&E.
RFPs generate proposals, and these need to be evaluated, among other criteria, to assess whether the contractor is likely to produce a reliable system. Therefore, proposals need to be explicit about the design tools and testing, including a proposed testing schedule, that will be funded in support of the production of a reliable system. When DoD selects a winning proposal, an acquisition contract is negotiated. This contract is critical to the entire process. Such contracts specify the level of effort devoted to reliability growth and the degree of interaction between the contractor and DoD, including the sharing of test and other information that informs DoD as to what the system in development is and is not capable of doing.
In making our recommendations, we consider first the analysis of alternatives. As noted above, there is currently no obligation in the analysis of alternatives to consider the impact of the reliability of the proposed system on mission success and life-cycle costs. Because such considerations could affect the decision as to whether to go forward with a new acquisition program, they should be required in every analysis of alternatives.
RECOMMENDATION 1 The Under Secretary of Defense for Acquisition, Technology, and Logistics should ensure that all analyses of alternatives include an assessment of the relationships between system reliability and mission success and between system reliability and life-cycle costs.
The next stage in the acquisition process is the setting of reliability requirements. Although these requirements should not necessarily be shared at the RFP stage, they are needed internally—even prior to the issuance of an RFP—to begin the process of justification and assessment of feasibility. The RAM-C report should justify the reliability requirements by showing that they are necessary either to provide a high probability of successfully carrying out the intended missions or to limit life-cycle costs.
In addition, the RAM-C report should include an estimate of the acquisition costs and an assessment of their uncertainty, which should include as a component the estimated life-cycle costs and an assessment of their uncertainty, with life-cycle costs expressed as a function of system reliability. (It is understood that life-cycle costs are a function of many system characteristics other than its reliability.) The RAM-C report should also provide support for the assertion that the reliability requirements are technically feasible, measurable, and testable. (A requirement is measurable
if there is a metric that underlies the requirement that is objectively determined, and it is testable if there is a test that can objectively discriminate between systems that have and have not achieved the requirement.)
DoDI 5000.02 requires
[a] preliminary Reliability, Availability, Maintainability and Cost Rationale (RAM-C) Report in support of the Milestone A decision. This report provides a quantitative basis for reliability requirements, and improves cost estimates and program planning. This report will be attached to the SEP at Milestone A, and updated in support of the Development RFP Release Decision Point, Milestone B, and Milestone C…. [The RAM-C report] documents the rationale behind the development of the sustainment requirements along with underlying assumptions. Understanding these assumptions and their drivers will help warfighters, combat developers, and program managers understand the basis for decisions made early in a program. When the requirements and underlying assumptions are not clearly documented, the project may be doomed to suffer from subsequent decisions based on incorrect assumptions.
We are aware of reliability requirements for proposed new defense systems that have been technically infeasible or that have not reflected a cost-efficient approach to acquisition. Furthermore, reliability requirements have at times not been measurable or testable. To address these deficiencies, DoD should be obligated to include technical justifications in the RAM-C document that support these assertions in a manner that most experts in the field would find persuasive. Given that estimates of life-cycle costs require considerable technical expertise to develop, it is important to ensure that such assessments are made by appropriate experts in reliability engineering. Furthermore, the assessment as to whether requirements are achievable, measurable, and testable also requires considerable expertise with respect to the proposed system. To ensure that the required report about the reliability requirements reflects input from people with the necessary expertise, DoD should require that an external panel examine the arguments behind such assertions prior to the issuance of an RFP. That assessment of reliability requirements should be delivered to the Joint Requirements Oversight Council (JROC) or, as appropriate, its Component analog. This assessment should also contain an evaluation of the feasibility of acquiring the system within the specified cost and time schedule. The JROC, based on this technical report and the external assessment of it, should be the deciding authority on whether or not DoD proceeds to issue an RFP for the system.2
2In forming these expert committees, it is important that the relevant requirements officer is either a member or is asked to present any relevant work on the development of the reliability requirement.
RECOMMENDATION 2 Prior to issuing a request for proposals, the Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics should issue a technical report on the reliability requirements and their associated justification. This report should include the estimated relationship between system reliability and total acquisition and life-cycle costs and the technical justification that the reliability requirements for the proposed new system are feasible, measurable, and testable. Prior to being issued, this document should be reviewed by a panel with expertise in reliability engineering, with members from the user community, from the testing community, and from outside of the service assigned to the acquisition. We recognize that before any development has taken place these assessments are necessarily somewhat speculative; the expectation is that as more about the system is determined, the assessments can be improved. Reliability engineers of the services involved in each particular acquisition should have full access to the technical report and should be consulted prior to the finalization of the RFP.
As argued above, requests for proposals should include reliability requirements and their justification—highlighting the reliability goals for specific subsystems that are believed to be keys to system reliability—by demonstrating that the requirements are necessary either to provide a high probability of successfully carrying out the intended missions or to limit life-cycle costs.3 We acknowledge here, too, that prior to any system development, the assessment of feasibility and the linking of the level of system reliability to life-cycle costs is at best informed guesswork. But, absent assessments of feasibility, requirements could be optimistic dreams. And absent linking the requirements to reliability-driven life-cycle costs, the decision could be made to, say, modestly reduce the cost of the system through what would be perceived as a modest reduction in system reliability, thereby producing a system that is substantially more expensive to field because of the increased life-cycle costs.
On the basis of the RAM-C document, the RFP should include a rough estimate of the acquisition costs and an assessment of their uncertainty, which should include as a component the estimated life-cycle costs and an assessment of their uncertainty, with life-cycle costs expressed as a function of system reliability. The RFP needs to provide support for the assertion that the reliability requirements are technically feasible, by reporting estimated reliability levels for specific subsystems thought to contribute substantially to system reliability that have either appeared in fielded systems or for which estimates are otherwise available, as well as support for the assertion that the reliability requirements are measurable and testable.
3Sometimes, system requirements are initially expressed optimistically to generate early support for the system. This is clearly counterproductive for many reasons, and the panel’s recommendation to provide technical justification in the RFP may help to eliminate this practice.
Clearly, analyses of feasibility and estimates of life-cycle costs as a function of system reliability are likely to be revised as development proceeds by the contractor. But including initial analyses and assessments of these quantities in the RFP will help to demonstrate the high priority given to such considerations. As the system design matures, such analyses and assessments will improve. As a start to this improvement, the proposal produced in response to the RFP should include the contractor’s review of the government’s initial reliability assertions, the degree to which the contractor agrees or differs (with rationale), and the consequences for the contractor’s proposed approach to designing, building, and testing the system.
In situations in which new technology is involved, DoD may instead issue a request for information to engage knowledgeable people from industry in the process of preparing the report on requirements. If new or developing technology may be needed in the system, the process of evolutionary acquisition needs to be considered.4 In this case, necessary, achievable, measurable, and testable reliability requirements for the system during each acquisition spiral need to be specified and justified.
Even when assessments of technical feasibility are made with due diligence, it may be that during development the reliability requirements turn out to be technically infeasible. This possibility can become clear as a result of the collection of new information about the reliability of components and other aspects of the system design through early testing. Similarly, an argument about whether to increase system reliability beyond what is necessary for mission success in order to reduce life-cycle costs could need reconsideration as a result of the refinement of estimates of the costs of repair and replacement parts.
If the requirement for reliability turns out not to be technically feasible, it could have broad implications for the intended missions, life-cycle costs, and other aspects of the system. Therefore, when a request is made by the contractor for a modification of reliability requirements, there is again a need for a careful review and issuance of an abbreviated version of the analysis of alternatives and the above report on reliability requirements, with input from the appropriate experts. In addition to updating the analysis of alternatives, if necessary, the RAM-C and associated logistics documents would need to be updated to identify and show the impacts of the reliability changes.
4For a description of this process, see DoD Instruction 5000.02, Operation of the Defense Acquisition System, available at http://www.dtic.mil/whs/directives/corres/pdf/500002_interim.pdf [December 2013].
RECOMMENDATION 3 Any proposed changes to reliability requirements by a program should be approved at levels no lower than that of the service component acquisition authority. Such approval should consider the impact of any reliability changes on the probability of successful mission completion as well as on life-cycle costs.
It is not uncommon for the DoD requirements generation process to establish one or more reliability requirements that differ from the reliability requirements agreed to in the acquisition contract. This can be due to the difference between mean time between failures in a laboratory setting and mean time between failures in an operational setting, or it can be due to negotiations between DoD and the contractor. In the first instance, these differences are due to the specifics of the testing strategy. To address this, we suggest that DoD archive the history of the development of the initial reliability requirement in the RFP and how that initial requirement evolved throughout development and even in initial fielding and subsequent use.
Knowing the design of the tests that will be used to evaluate a system in development is an enormous help to developers in understanding the missions and the stresses the system is expected to face. Given the importance of conveying such information as early as possible to developers, RFPs should provide an early overview of what will later be provided in much greater detail in an outline reliability demonstration or development plan. With respect to reliability, a test and evaluation master plan (TEMP) provides the types and numbers of prototypes, hours, and other characteristics of various test events and schedules that will take place during government testing. And with respect to reliability assessment, a TEMP provides information on any acceleration used in tests, the associated evaluations resulting from tests, and, overall, how system reliability will be tracked in a statistically defensible way. Thus, a TEMP provides a description of the various developmental and operational tests that will be used to identify flaws in system design and those tests that will be used to evaluate system performance. A TEMP also describes system failure and specifies how reliability is scored at test events and at design reviews.
It would be premature to lay out a TEMP in the RFP for a proposed new defense acquisition. However, having some idea as to the testing that
is expected to be done to support reliability growth and to assess reliability performance would be extremely useful in making decisions on system design. We therefore call on DoD to produce a new document, which we call an outline reliability demonstration plan, to be included in the RFP and serve as the overview of the TEMP for reliability, providing as much information as is available concerning how DoD plans to evaluate system performance—for present purposes, the evaluation of reliability growth. The outline should specify the extent of the tests (e.g., total hours, number of replications), the test conditions, and the metrics used. The outline should also include the pattern of reliability growth anticipated at various stages of development.5
Preliminary reliability levels that can serve as intermediate targets would be available early on since there is some empirical evidence as to the degree of reliability growth that can be expected to result from a test-analyze-fix-test period of a certain length for various kinds of systems (see Chapter 6). An outline reliability demonstration plan should also indicate how such comparisons will be used as input to decisions on the promotion of systems to subsequent stages of development—a specified threshold needs to include a buffer to reflect the sample size of such tests in order to keep the producer’s risk low.
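The buffer logic can be sketched with a simple calculation. The example below is purely illustrative (all numbers are hypothetical) and assumes exponentially distributed times between failures, so that the number of failures observed in a fixed-length test follows a Poisson distribution:

```python
from math import exp, factorial

def poisson_cdf(k, mu):
    # P(N <= k) for N ~ Poisson(mu)
    return sum(mu**i * exp(-mu) / factorial(i) for i in range(k + 1))

def pass_threshold(true_mtbf, test_hours, producer_risk=0.2):
    """Smallest failure-count threshold c such that a system that truly
    meets true_mtbf fails the demonstration (observes more than c
    failures) with probability at most producer_risk."""
    mu = test_hours / true_mtbf  # expected failures if requirement is met
    c = 0
    while 1.0 - poisson_cdf(c, mu) > producer_risk:
        c += 1
    return c

# Hypothetical numbers: 2,000 test hours against a 500-hour MTBF
# requirement.  A compliant system expects 4 failures, but the pass
# threshold must sit above 4 to keep the producer's risk low.
print(pass_threshold(true_mtbf=500, test_hours=2000, producer_risk=0.2))
```

Under these assumed numbers the threshold works out to 6 failures rather than the expected 4: that gap of 2 is exactly the buffer needed so that a system genuinely meeting the requirement is rejected no more than 20 percent of the time.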
As with the technical report on reliability requirements recommended above and because an outline reliability demonstration plan also has substantial technical content, it should be reviewed by an expert panel prior to its inclusion in an RFP. This expert panel should include reliability engineers and system users, members from the testing community, and members from outside of the service responsible for system acquisition. This expert panel should deliver a report reviewing the adequacy of the outline reliability demonstration plan, and it should include an assessment as to whether or not the system can likely be acquired within the specified cost and time schedule. Based on the technical report on reliability requirements and the outline reliability demonstration plan, the JROC would decide whether or not DoD will proceed to issue an RFP for the system.
RFPs currently contain a systems engineering plan, which lays out the methods by which all system requirements having technical content, technical staffing, and technical management are to be implemented on a program, addressing the government and all contractor technical efforts. The systems engineering plan is therefore a natural location for the additional material on reliability test and evaluation that we argue should be included in RFPs.
5DoD may also wish to include in an outline reliability demonstration plan the early plans for the overall evaluation of system performance.
RECOMMENDATION 4 Prior to issuing a request for proposal, the Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate the preparation of an outline reliability demonstration plan that covers how the department will test a system to support and evaluate system reliability growth. The description of these tests should include the technical basis that will be used to determine the number of replications and associated test conditions and how failures are defined. The outline reliability demonstration plan should also provide the technical basis for how test and evaluation will track in a statistically defensible way the current reliability of a system in development given the likely number of government test events as part of developmental and operational testing. Prior to being included in the request for proposal for an acquisition program, the outline reliability demonstration plan should be reviewed by an expert external panel. Reliability engineers of the services involved in the acquisition in question should also have full access to the reliability demonstration plan and should be consulted prior to its finalization.
RFPs are currently based on a statement of work that contains reliability specifications for the developer and obligations for DoD. We note that the Army Materiel Systems Analysis Activity (AMSAA) has issued documents concerning language for reliability specification and contractual language for hardware and software systems that can be used as a guide for implementing the above panel recommendation.
A key element in improving the reliability of DoD’s systems is recognizing the importance of reliability early and throughout the acquisition process. This point was emphasized in an earlier report of the Defense Science Board (2008). Many of our recommendations are consistent with the recommendations in that report (see Chapter 1). To emphasize the importance of these issues, we offer a recommendation on the need to increase the priority of reliability in the acquisition process.
At present, availability is the mandatory suitability key performance parameter, and reliability is a subordinate key system attribute. There is some evidence to suggest that when reliability falls short of its requirement, some defense acquisition personnel consider it a problem that can be addressed with more maintenance or expedited access to spare parts. Furthermore, there seems to be a belief that as long as the availability key performance parameter is met, DOT&E is likely to deem the system to be suitable. Yet DOT&E continues to find systems unsuitable because of poor
reliability (see Chapter 1). This continuing deficiency supports elevation of reliability to key performance parameter status.
RECOMMENDATION 5 The Under Secretary of Defense for Acquisition, Technology, and Logistics should ensure that reliability is a key performance parameter: that is, it should be a mandatory contractual requirement in defense acquisition programs.
As discussed throughout this report, there are two primary ways, in combination, to achieve reliability requirements in a development program: reliability can be “grown” by using a test-analyze-fix-test process, and the initial design can be developed with system reliability as an objective. The Defense Science Board’s report (2008) argued that no amount of testing would compensate for deficiencies in system design due to the failure to give proper priority in the design to attain reliability requirements (see Chapter 1). We support and emphasize this conclusion.
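As an illustration of how growth through test-analyze-fix-test is commonly projected, the sketch below uses the Crow-AMSAA (power-law) model from the reliability growth literature; the parameter values are hypothetical planning numbers, not estimates for any actual program:

```python
def crow_amsaa_mtbf(t, lam, beta):
    """Instantaneous MTBF at cumulative test time t under the
    Crow-AMSAA (power-law) model, in which the expected number of
    failures by time t is lam * t**beta.  beta < 1 indicates
    reliability growth; beta = 1 indicates no growth."""
    return 1.0 / (lam * beta * t ** (beta - 1))

# Hypothetical planning values: lam sets a low initial MTBF and
# beta = 0.7 represents a moderate growth rate.
lam, beta = 0.5, 0.7
for hours in (100, 500, 2000):
    print(hours, round(crow_amsaa_mtbf(hours, lam, beta), 1))
```

Under these assumed parameters the projected MTBF rises from roughly 11 hours after 100 hours of testing to roughly 28 hours after 2,000 hours, which illustrates why the anticipated pattern of growth, not just the end point, belongs in the demonstration plan.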
It is important for contractors to describe the reliability management processes that they will use. Those processes should include establishment of an empowered reliability review board or team for tracking reliability from design through deployment and operations, encompassing design changes, observed failure modes, and failure and corrective action analyses.
Similarly, the report of the Reliability Improvement Working Group (U.S. Department of Defense, 2008c) contained detailed advice for mandating reliability activities in acquisition contracts (see Appendix C). That report included requirements for a contractor to develop a detailed reliability model for the system that would generate reliability allocations from the system level down to lower levels, and to aggregate system-level reliability estimates based on estimates from components and subsystems. The reliability model would be updated whenever new failure modes are identified, failure definitions or load estimates are revised, or design and manufacturing changes occur throughout the system’s life cycle. The report further called for the analysis of all failures, either in testing or in the field, until the root-cause failure mechanism is identified. In addition, the report detailed how a contractor should use a system reliability model, in conjunction with expert judgment, for all assessments and decisions about the system.
Consistent both with the report of the Reliability Improvement Working Group and with ANSI/GEIA-STD-0009, we strongly agree that proposals should provide evidence in support of the assertion that the design-for-reliability tools suggested for use and the testing schemes outlined are consistent with meeting the reliability requirement during the time allocated for development. As part of this evidence, contractors should be required to develop, and share with DoD, a system reliability model detailing how system reliability is related to that of the subsystems and components. Proposals should also acknowledge that developers will provide DoD with technical assessments, at multiple times during development, that track whether the reliability growth of the system is consistent with satisfying the reliability requirements for deployment.
We acknowledge that it is a challenge for developers to provide this information and for DoD to evaluate it before any actual development work has started. However, a careful analysis can identify proposed systems and development plans that are or are not likely to meet the reliability requirements without substantial increases in development costs or extensive time delays. To develop a reasonable estimate of the initial reliability corresponding to a system design, one would start with the reliabilities of components and subsystems that have been used in previous systems, together with engineering arguments, and then combine this information using reliability block diagrams, fault-tree analyses, and physics-of-failure modeling. Then, given this initial reliability, a testing plan, and the improvements that have been demonstrated by recent, related systems during development, one can roughly ascertain whether a reliability goal is feasible.
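The kind of roll-up described above can be illustrated with a minimal reliability-block-diagram calculation. The structure and the component reliabilities below are hypothetical, chosen only to show how subsystem estimates combine into a full-system estimate:

```python
def series(*rs):
    # All blocks must work: reliabilities multiply.
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    # Redundant blocks: the system fails only if every block fails.
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Hypothetical mission reliabilities drawn from fielded subsystems:
# a power unit in series with dual-redundant radios and a processor.
system = series(0.98, parallel(0.90, 0.90), 0.95)
print(round(system, 4))  # 0.9217
```

Note that the redundant pair contributes 0.99 to the product even though each radio alone offers only 0.90, which is why a block diagram, rather than a simple list of component reliabilities, is needed to assess feasibility.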
RECOMMENDATION 6 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that all proposals specify the design-for-reliability techniques that the contractor will use during the design of the system for both hardware and software. The proposal budget should have a line item for the cost of design-for-reliability techniques, the associated application of reliability engineering methods, and schedule adherence.
RECOMMENDATION 7 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that all proposals include an initial plan for system reliability and qualification (including failure definitions and scoring criteria that will be used for contractual verification), as well as a description of their reliability organization and reporting structure. Once a contract is awarded, the plan should be regularly updated, presumably at major design reviews, establishing a living document that contains an up-to-date assessment of what is known by the contractor about hardware and software reliability at the component, subsystem, and system levels. The U.S. Department of Defense should have access to this plan, its updates, and all the associated data and analyses integral to their development.
The reliability plan called for in Recommendation 7 would start with the reliability case made by DoD (see Recommendation 1). Given that the contractor is responding to an RFP with a proposal that contains an initial argument that system reliability is technically feasible, the contractor should be able to provide a more refined model that supports the assertion that the reliability requirement is achievable within the budget and time constraints of the acquisition program. As is the case of the argument provided by DoD, this should include the reliabilities of components and major subsystems along with either a reliability block diagram or a fault-tree diagram to link the estimated subsystem reliabilities to produce an estimate of the full-system reliability.
Determining the reliability of new electronic components is a persistent problem in defense systems. Appendix D provides a critique of MIL-HDBK-217 as a method for predicting the reliability of newly developed electronic components. The basic problem with MIL-HDBK-217 is that it does not identify the root causes, failure modes, and failure mechanisms of electronic components. MIL-HDBK-217 provides predictions based on simple heuristics and regression fits to reliability data for a select number of components, as opposed to engineering design principles and physics-of-failure analyses. The result is a prediction methodology that has the following limitations: (1) the assumption of a constant failure rate is known to be false, since electronic components have instantaneous failure rates that are subject to various kinds of wear-out (due to several different types of stresses and environmental conditions) and are subject to infant mortality; (2) lack of consideration of root causes of failures, failure modes, and failure mechanisms does not allow predictions to take into consideration the load and environment history, materials, and geometries; (3) the approach taken is focused on component-level reliability prediction, therefore failing to account for manufacturing, design, system requirements, and interfaces; (4) the approach is unable to account for environmental and loading conditions in a natural way—instead they are accounted for through the use of various adjustment factors; and (5) the focus on fitting reliability data makes it impossible to provide predictions for the newest technologies and components. These limitations combine in different ways to cause the predictions from MIL-HDBK-217 to fail to accurately predict electronic component reliabilities, as has been shown by a number of careful studies, including on defense systems. 
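The practical difference between the constant-failure-rate assumption and a wear-out mechanism can be seen in a short numerical sketch. The Weibull parameters below are hypothetical, chosen only to contrast the two hazard shapes:

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous hazard rate of a Weibull life distribution:
    h(t) = (shape/scale) * (t/scale)**(shape - 1).
    shape == 1 reduces to the constant-rate exponential model;
    shape > 1 gives the increasing hazard typical of wear-out."""
    return (shape / scale) * (t / scale) ** (shape - 1)

scale = 1000.0  # characteristic life in hours (hypothetical)
for hours in (100, 500, 1000):
    const = weibull_hazard(hours, 1.0, scale)    # handbook-style constant rate
    wearout = weibull_hazard(hours, 2.5, scale)  # wear-out mechanism
    print(hours, const, round(wearout, 6))
```

Under the constant-rate model the hazard is 0.001 failures per hour at every age, while the wear-out hazard starts far lower and climbs well past it as the component ages. A single handbook constant cannot represent this crossover, which is the essence of limitation (1) above.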
Perhaps most troubling, the use of MIL-HDBK-217 has resulted in poor rankings of the predicted reliabilities of proposed defense systems in development.
To further support this assertion, we quote from the following articles, which argue strongly for eliminating MIL-HDBK-217 in favor of a physics-of-failure approach:
- … it appears that this application of the Arrhenius model was not rigorously derived from physics-of-failure principles. Also, current physics-of-failure research indicates that the relationship between microelectronic device temperatures and failure rate is more complex than previously realized, necessitating explicit design consideration of temperature change, temperature rate of change, and spatial temperature gradients. And, a review of acceleration modeling theory indicates that when modeling the effect that temperature has on microelectronic device reliability, each failure mechanism should be treated separately, which is also at odds with the approach used in MIL-HDBK-217 (Cushing, 1993).
- Traditionally, a substantial amount of military and commercial reliability assessments for electronic equipment have been developed without knowledge of the root-causes of failure and the parameters which appreciably affect them. These assessments have used look-up tables from US MIL-HDBK-217 and its progeny for component failure rates which are largely based on curve fitting of field data. However, these attempts to simplify the process of reliability assessment to the point of ignoring the true mechanisms behind failure in electronics, and their life and acceleration models, have resulted in an approach which provides the design team little guidance, and may in fact harm the end-product in terms of reliability and cost. The oversimplified look-up table approach to reliability requires many invalid assumptions. One of these assumptions is that electronic components exhibit a constant failure rate. For many cases, this constant failure rate assumption can introduce a significant amount of error in decisions made for everything from product design to logistics. The constant failure rate assumption can be most detrimental when failure rates are based on past field data which includes burn-in failures, which are typically due to manufacturing defects, and/or wear out failures, which are attributed to an intrinsic failure rate which is dependent on the physical processes inherent within the component. In order to improve the current reliability assessment process, there needs to be a transition to a science-based approach for determining how component hazard rates vary as a function of time. For many applications, the notion of the constant failure rate should be replaced by a composite instantaneous hazard rate which is based on root-cause failure mechanisms. A significant amount of research has been conducted on the root-causes of failure and many failure mechanisms are well understood (Mortin et al., 1995).
- Reliability assessment of electronics has traditionally been based on empirical failure-rate models (e.g., MIL-HDBK-217) developed largely from curve fits of field-failure data. These field-failure data are often limited in terms of the number of failures in a given field environment, and determination of the actual cause of failure. Often, components are attributed incorrectly to be the cause of problems even though 30-70% of them retest OK. In MIL-HDBK-217, crucial failure details were not collected and addressed, e.g., (1) failure site, (2) failure mechanism, (3) load/environment history, (4) materials, and (5) geometries. Two consequences are: (a) MIL-HDBK-217 device failure-rate prediction methodology does not give the designer or manufacturer any insight into, or control over, the actual causes or failure since the cause-and-effect relationships impacting reliability are not captured. Yet, the failure rate obtained is often used as a reverse-engineering tool to meet reliability goals; (b) MIL-HDBK-217 does not address the design and usage parameters that greatly influence reliability, which results in an inability to tailor a MIL-HDBK-217 prediction using these key parameters…. A central feature of the physics-of-failure approach is that reliability modeling used for the detailed design of electronic equipment is based on root-cause failure processes or mechanisms. When reliability modeling is based on failure mechanisms, an understanding of the root-causes of failure in electronic hardware is feasible. This is because failure-mechanism models explicitly address the design parameters which have been found to influence hardware reliability strongly, including material properties, defects, and electrical, chemical, thermal, and mechanical stresses. The goal is to keep the modeling, in a particular application, as simple as feasible without losing the cause-effect relationships that advance useful corrective action. Research into physical failure mechanisms is subjected to scholarly peer review and published in the open literature. The failure mechanisms are validated through experimentation and replication by multiple researchers. Industry is now recognizing that an understanding of potential failure mechanisms leads to eliminating them cost-effectively, and is consequently demanding an approach to reliability modeling and assessment that uses knowledge of failure mechanisms to encourage robust designs and manufacturing practices (Cushing et al., 1993).
The natural focus on the part level necessarily bounds how much attention can be given to an exhaustive consideration at the subsystem and system levels—even for a physics-of-failure approach. Therefore, some judgment has to be exercised as to where to conduct detailed analyses. But if some part, component, or system is determined to be “high priority,” then the best available tools for addressing it should be pursued. MIL-HDBK-217 falls short in that regard.
Physics of failure has often been shown to perform better, but it requires an a priori understanding of the failure mechanisms—or the development of such an understanding. In addition, MIL-HDBK-217 does not provide adequate design guidance or information regarding microelectronic failure mechanisms, and for the most part it does not address failure rates attributable to software, integration, manufacturing defects, and the like.
Because of the limitations of MIL-HDBK-217, we emphasize the importance of modern design-for-reliability techniques, particularly physics-of-failure-based methods, to support system design and reliability estimation, and we recommend that all versions of MIL-HDBK-217 be excluded from further use. Further, because there is nothing unique to electronic components in this regard, we recommend that such techniques be used to help design for and assess the reliability of all components in all subsystems early in system development. In particular, physics of failure should be used to identify potential wear-out failure modes and mitigations for enhancing long-term reliability performance.
RECOMMENDATION 8 Military system developers should use modern design-for-reliability (DFR) techniques, particularly physics-of-failure (PoF)-based methods, to support system design and reliability estimation. MIL-HDBK-217 and its progeny have grave deficiencies; instead, DoD should emphasize DFR and PoF implementations when reviewing proposals and reliability program documentation.
We understand that the conversion from MIL-HDBK-217 to a new approach based on physics-of-failure modeling cannot be done overnight and that guidance, training, and specific tools need to be developed to support the change. However, this conversion can be started immediately, because the approach is already fully developed in many commercial applications.
If a system is software intensive or if one or more major subsystems are software intensive, then the contractor should be required to provide information on the reasons justifying the selection of the software architecture and the management plan used in code development (e.g., use of agile development) to produce an initial code for testing that is reasonably free of defects. Given the current lack of expertise in software engineering in the defense acquisition community, the architecture, management plan, and other specifications need to be reviewed by an outside expert panel appointed by DoD that includes users, testers, software engineers, and members from outside of the service acquiring the system. This expert panel should also review the software system design and estimates of its reliability and the uncertainty of those estimates. The panel should report to JROC, which should use this information in awarding acquisition contracts.
Software reliability growth through the test-analyze-and-fix approach can be assessed using various metrics, including build success rate, code dependency metrics, code complexity metrics, assessments of code churn and code stability, and code velocity. To assist DoD in monitoring progress toward developing reliable software, a database should be developed by the contractor to provide a constant record of an agreed-upon subset of such metrics. In addition, the contractor should maintain a sharable record of all the categories of failure and how the code was fixed in response to each discovered failure.
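The form such a shared database might take need not be elaborate. The sketch below is purely illustrative: the record fields and metrics shown are hypothetical, not a DoD-specified schema, and the agreed-upon subset of metrics would be negotiated per contract:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical record structure for a shared contractor failure log.
# Field names are illustrative, not a DoD-mandated schema.

@dataclass
class FailureRecord:
    failure_id: str
    build: str
    category: str            # e.g., "crash", "data corruption", "interface"
    discovered: date
    resolved: Optional[date] # None while the failure remains open
    fix_summary: str = ""

@dataclass
class MetricsSnapshot:
    build: str
    build_success_rate: float  # fraction of builds that succeeded
    code_churn_loc: int        # lines added + deleted since last snapshot
    open_failures: int

log = [
    FailureRecord("F-001", "2.3", "crash", date(2024, 3, 1),
                  date(2024, 3, 9), "null-pointer guard added"),
    FailureRecord("F-002", "2.3", "interface", date(2024, 3, 4), None),
]

# Days to resolution for closed failures -- one metric DoD could monitor
durations = [(r.resolved - r.discovered).days for r in log if r.resolved]
print(durations)   # -> [8]
```

Maintaining such records continuously, rather than assembling them for reviews, is what makes the log useful for tracking reliability growth between design reviews.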
RECOMMENDATION 9 For the acquisition of systems and subsystems that are software intensive, the Under Secretary of Defense for Acquisition, Technology, and Logistics should ensure that all proposals specify a management plan for software development and also mandate that, starting early in development and continuing throughout development, the contractor provide the U.S. Department of Defense with full access to the software architecture, the software metrics being tracked, and an archived log of the management of system development, including all failure reports, time of their incidence, and time of their resolution.
Reliability growth models are statistical models that link time on test, and possibly other inputs, to increases in reliability as a system proceeds through development. Because reliability growth models often fail to represent the environment employed during testing, because time on test is often not fully predictive of growth in the reliability of a system in development, and because extrapolation places severe demands on such models, they should be validated before being used to predict either the time at which the required reliability will be attained or the reliability attained at some future point. An exception to this is the use of reliability growth models early in system development, when they can help determine the scope, size, and design of the developmental testing programs.6
RECOMMENDATION 10 The validity of the assumptions underlying the application of reliability growth models should be carefully assessed. In cases where such validity remains in question: (1) important decisions should consider the sensitivity of results to alternative model formulations and (2) reliability growth models should not be used to forecast substantially into the future. An exception to this is early in system development, when reliability growth models, incorporating relevant historical data, can be invoked to help scope the size and design of the developmental testing programs.
When using reliability growth models to scope developmental testing programs, there are no directly relevant data to validate modeling assumptions. Historical reliability growth patterns experienced for similar classes of systems can be reviewed, however. These should permit the viability of the proposed reliability growth trajectory for the subject system to be assessed. They should also support the allocation of adequate budget reserves that may become necessary if the originally envisioned reliability growth plan turns out to be optimistic.
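The kind of historical-pattern check described here can be made concrete with the power-law (Duane/Crow-AMSAA) model that underlies many reliability growth plans. In the sketch below, all values are hypothetical; it shows how two growth parameters that fit the same sparse early data equally well diverge substantially when projected an order of magnitude beyond the test period:

```python
# Sketch of the Crow-AMSAA (power-law) reliability growth model,
# showing how long-range projections depend on the estimated growth
# parameter beta. All parameter values are hypothetical.

def instantaneous_mtbf(t, lam, beta):
    """Expected failures N(t) = lam * t**beta, so the failure
    intensity is lam * beta * t**(beta - 1); MTBF is its inverse."""
    return 1.0 / (lam * beta * t ** (beta - 1))

t_test = 1_000.0    # hours of testing actually observed
t_goal = 10_000.0   # projection horizon, 10x beyond the data
m_observed = 50.0   # MTBF demonstrated at the end of testing

# Two values of beta that fit sparse early data about equally well
for beta in (0.6, 0.7):
    # Choose lam so the model reproduces the observed MTBF at t_test
    lam = 1.0 / (m_observed * beta * t_test ** (beta - 1))
    m_proj = instantaneous_mtbf(t_goal, lam, beta)
    print(f"beta={beta}: projected MTBF at {t_goal:.0f} h = {m_proj:.0f} h")
```

Both fits reproduce the observed 50-hour MTBF exactly, yet the ten-fold projections differ by roughly 25 percent, which is why extrapolated milestones should be treated as planning aids, not demonstrated values.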
One can certainly argue that one reason many defense systems fail to achieve their reliability requirements is that too many defects remain undiscovered when the system enters operational testing. Given the limitations in discovering reliability defects in both developmental and operational tests, most of the effort to find reliability problems prior to fielding needs to be assumed by the contractor. Although a great deal can be done at the design stage using design-for-reliability techniques, some additional reliability growth will need to take place through testing and fixing the reliability problems that are discovered, and the majority of the reliability growth through testing has to occur in contractor testing. Consequently, DoD has an interest in monitoring the testing that is budgeted for in acquisition proposals and in monitoring the resulting progress toward the system’s reliability requirements.
Because the contractor has control of the only direct test information on the reliability of both subsystems and the full system through early development, granting DoD access to such information can help DoD monitor progress on system development and the extent to which a system is or is not likely to satisfy its reliability requirements. In addition, such access can enable DoD to select test designs for developmental and operational testing to verify that design faults have been removed and to ensure that relatively untested components and subsystems are more thoroughly tested.

6 Elements of Recommendations 7, 9, and 10, which concern plans for design-for-reliability and reliability testing for both hardware and software systems and subsystems, are sometimes referred to as a “reliability case”: for details, see Jones et al. (2004).
Thus, it is critical that DoD be provided with information both on reliability test design at the proposal stage to examine whether such plans are sufficient to support the necessary degree of reliability growth, and on reliability test results during development to enable the monitoring of progress toward attainment of requirements. The information on reliability test design should include the experimental designs and the scenario descriptions of such tests, along with the resulting test data, for both the full system and all subsystem assemblies, as well as the code for and results of any modeling and simulation software that were used to assess reliability. The information should cover all types of hardware testing, including testing under operationally relevant conditions, and any use of accelerated or highly accelerated testing.7 The contractor should also provide DoD with information on all types of software testing, including the results of code reviews, automated testing, fault seeding, security testing, and unit test coverage.
In order to ensure that this information is provided to DoD, acquisition contracts will need to be written so that this access is mandated and proposals will need to state that contractors agree to share this information. This information sharing should occur at least at all design reviews throughout system development. This sharing of information should enable DoD to assess system reliability at the time of the delivery of system prototypes, which can help DoD make better decisions about whether to accept delivery of prototypes.
RECOMMENDATION 11 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that all proposals obligate the contractor to specify an initial reliability growth plan and the outline of a testing program to support it, while recognizing that both of these constructs are preliminary and will be modified through development. The required plan will include, at a minimum, information on whether each test is a test of components, of subsystems, or of the full system; the scheduled dates; the test design; the test scenario conditions; and the number of replications in each scenario. If a test is an accelerated test, then the acceleration factors need to be described. The contractor’s budget and master schedules should be required to contain line items for the cost and time of the specified testing program.

7 For further details on the information that should be provided to DoD, see National Research Council (2004).
RECOMMENDATION 12 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that contractors archive and deliver to the U.S. Department of Defense, including to the relevant operational test agencies, all data from reliability testing and other analyses relevant to reliability (e.g., modeling and simulation) that are conducted. This should be comprehensive and include data from all relevant assessments, including the frequency under which components fail quality tests at any point in the production process, the frequency of defects from screenings, the frequency of defects from functional testing, and failures in which a root-cause analysis was unsuccessful (e.g., the frequency of instances of failure to duplicate, no fault found, retest OK). It should also include all failure reports, times of failure occurrence, and times of failure resolution. The budget for acquisition contracts should include a line item to provide DoD with full access to such data and other analyses.
Similar to the panel’s concerns above about the use of reliability growth models for extrapolation, the models used in accelerated testing to link behavior under extreme conditions to behavior under normal use also rely on extrapolation and therefore need to be validated for this purpose. The designs of such tests are potentially complicated and would therefore also benefit from a formal review. Such validation and formal review are particularly important when accelerated testing inference is of more than peripheral importance: for example, when it is applied at the major subsystem or system level, when there is inadequate corroboration from limited system testing, and when the results are central to decision making on system promotion.
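As one common example, thermally accelerated life tests typically map stress-condition failures to use conditions through an Arrhenius acceleration factor. The sketch below (activation energy and temperatures are illustrative) shows why validation matters: a modest uncertainty in the assumed activation energy changes the extrapolated factor several-fold:

```python
import math

# Arrhenius acceleration factor between a stress temperature and a use
# temperature. Activation energy and temperatures are illustrative.

K_B = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """AF = exp[(Ea/k) * (1/T_use - 1/T_stress)], temperatures in kelvin."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_use - 1.0 / t_stress))

# Two plausible activation energies for the same failure mechanism
for ea in (0.5, 0.7):
    print(f"Ea={ea} eV: AF(55C -> 125C) = {arrhenius_af(ea, 55.0, 125.0):.1f}")
```

Here the factor moves from roughly 22 to roughly 78, which directly scales the predicted field life; this is the kind of sensitivity an external review panel would probe.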
RECOMMENDATION 13 The Office of the Secretary of Defense for Acquisition, Technology, and Logistics, or, when appropriate, the relevant service program executive office, should enlist independent external, expert panels to review (1) proposed designs of developmental test plans critically reliant on accelerated life testing or accelerated degradation testing and (2) the results and interpretations of such testing. Such reviews should be undertaken when accelerated testing inference is of more than peripheral importance—for example, if applied at the major subsystem or system level, there is inadequate corroboration provided by limited system testing, and the results are central to decision making on system promotion.
Software systems present particular challenges for defense acquisition. Complicated software subsystems and systems are unlikely to be comprehensively tested in government developmental or operational testing because of the current lack of software engineers at DoD. Therefore, such systems should not be accepted for delivery from a contractor until the contractor has provided sufficient information for an assessment of their readiness for use. To provide for some independent testing, the contractor should provide DoD with fully documented software that conducts automated software testing for all of its software-intensive subsystems and for the full system when the full system is a software system. This documentation will enable DoD to test the software with orders of magnitude more replications than would otherwise be possible in either developmental or operational testing.
RECOMMENDATION 14 For all software systems and subsystems, the Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that the contractor provide the U.S. Department of Defense (DoD) with access to automated software testing capabilities to enable DoD to conduct its own automated testing of software systems and subsystems.
Changes in design during the development of a system can have significant effects on the system’s reliability. Consequently, developers should be required to describe the impact of substantial system design changes and how such changes require modification of plans for design-for-reliability activities and plans for reliability testing. Any changes in fund allocation for such activities should be communicated to DoD. This information will help to support more efficient DoD oversight of plans for design for reliability and reliability testing.
RECOMMENDATION 15 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate the assessment of the impact of any major changes to system design on the existing plans for design-for-reliability activities and plans for reliability testing. Any related proposed changes in fund allocation for such activities should also be provided to the U.S. Department of Defense.
Inadequate communication between the prime contractor and subcontractors can be a source of difficulties in developing a reliable defense system. In particular, subcontractors need to be aware of the stresses and strains, loads, and other sources of degradation that the components they supply will face. Therefore, acquisition contracts need to include the contractor’s plan to ensure the reliability of components and subsystems, especially those that are produced by subcontractors and those that are commercial off-the-shelf systems. For off-the-shelf systems, the risks associated with using a system in an operational environment that differs from its intended environment should be assessed. To do so, the government has to communicate the operational environment to the contractor, and the contractor, in turn, has to communicate that information to any subcontractors.
RECOMMENDATION 16 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that contractors specify to their subcontractors the range of anticipated environmental load conditions that components need to withstand.
RECOMMENDATION 17 The Under Secretary of Defense for Acquisition, Technology, and Logistics should ensure that there is a line item in all acquisition budgets for oversight of subcontractors’ compliance with reliability requirements and that such oversight plans are included in all proposals.
The above recommendations would require contractors to lay out their intended design for reliability and reliability testing activities in acquisition proposals. The level of effort should be a factor in awarding acquisition contracts. In addition, to ensure that the general level of effort for reliability is sufficient, contractors should provide to DoD their budgets for these activities, and those budgets should be protected, even in the face of unanticipated problems.
RECOMMENDATION 18 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that proposals for acquisition contracts include appropriate funding for design-for-reliability activities and for contractor testing in support of reliability growth. It should be made clear that the awarding of contracts will include consideration of such fund allocations. Any changes to such allocations after a contract award should consider the impact on probability of mission success and on life-cycle costs, and at the minimum, require approval at the level of the service component acquisition authority.
We argue throughout this report that both developmental and, especially, operational testing as currently practiced are limited in their ability to discover reliability problems in defense system prototypes. We recommend ways in which government testing can be made more effective in identifying reliability problems, for instance, by adding aspects of operational realism to developmental testing. Also, targeting DoD developmental testing at those components that were problematic in development can make such testing more productive. And, for software, by acquiring capabilities for software testing from the contractor, DoD can play the role of an independent software tester.
However, even after implementing these recommendations, it is likely that most reliability growth through testing will need to be achieved by contractor testing, rather than through DoD’s developmental or operational testing. Furthermore, even though there will likely be appreciable reliability growth as a result of developmental and operational testing, not only is it limited by the lack of operational realism in developmental testing and the short time frame of operational testing, but there will also likely be reliability “decline” due to the developmental test/operational test gap (see Chapter 8). Although the magnitudes of these increases and decreases cannot be determined a priori, one can increase the overall reliability of all systems by requiring that prototypes achieve their reliability requirements on delivery to DoD.
RECOMMENDATION 19 The Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate that prior to delivery of prototypes to the U.S. Department of Defense for developmental testing, the contractor must provide test data supporting a statistically valid estimate of system reliability that is consistent with the operational reliability requirement. The necessity for this should be included in all requests for proposals.
This recommendation should not preclude the early delivery to DoD of subsystems that are considered final, while development work continues on other parts of the system, when doing so is considered beneficial by both the contractor and DoD.
The estimation of system reliability called for in this recommendation would likely need to combine information from full-system testing done late in development with component- and subsystem-level testing done earlier, and it could also use estimates from previous versions of the system at the full-system, subsystem, or component levels. In this way, the contractor will be able to justify delivery of prototypes. Such assessments will, at times, later be demonstrated by DoD to be high or low, but this approach will support a learning process about how to better merge such information in the future.
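One common way to express such an estimate so that its uncertainty is visible is a point estimate accompanied by a lower confidence bound. The sketch below uses hypothetical test data and assumes exponentially distributed times between failures (an assumption that should itself be checked); the chi-squared quantile is taken from standard tables:

```python
# Sketch: an 80% lower confidence bound on MTBF from a time-terminated
# test, assuming exponentially distributed times between failures.
# Test duration and failure count are hypothetical; the chi-squared
# quantile is taken from standard tables.

t_total = 5000.0    # total test hours (all units combined)
failures = 4        # failures observed

point_estimate = t_total / failures          # = 1250 hours

# For a time-terminated test, the 100(1-a)% lower bound is
#   2*T / chi2(1-a; 2r+2)
chi2_080_10df = 13.442                       # chi2 quantile: 0.80, 10 df
mtbf_lcb = 2.0 * t_total / chi2_080_10df

print(f"point estimate = {point_estimate:.0f} h, 80% LCB = {mtbf_lcb:.0f} h")
```

The gap between the point estimate (1,250 hours) and the lower bound (about 744 hours) is exactly the uncertainty that small contractor test programs leave unresolved.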
The monitoring of system reliability prior to operational testing is important because relying on operational tests to expose reliability problems is likely to result in too many defense systems exhibiting deficient reliability when fielded (see Chapter 7). Yet accounting for the differences in conditions between developmental and operational testing remains a challenge. One possible approach to meeting this challenge is to make greater use, to the extent possible, of test conditions in nonaccelerated developmental testing that reflect operational conditions. DoD must then estimate the system reliability likely to be achieved under operational conditions on the basis of results from developmental testing.
Schedule pressures, availability of test facilities, and testing constraints necessarily limit the capability of contractors to consistently carry out testing under circumstances that mimic operational use. It thus remains important for DoD to provide its own assessment of a system’s reliability at operationally relevant levels prior to making the decision to proceed to operational testing. This assessment is best accomplished through the use of a full-system test in environments that are as representative of actual use as possible.
RECOMMENDATION 20 Near the end of developmental testing, the Under Secretary of Defense for Acquisition, Technology, and Logistics should mandate the use of a full-system, operationally relevant developmental test in which the system must demonstrate reliability performance that equals or exceeds the required levels. If such performance is not achieved, then justification should be required to support promotion of the system to operational testing.
Operational testing provides an assessment of system reliability in as close to operational circumstances as possible. As such, operational testing provides the best indication as to whether a system will meet its reliability requirement when fielded. Failure to meet the reliability requirement during operational test is a serious deficiency for a system and should generally be the cause for delaying promotion of the system to full-rate production until modifications to the system design can be made to improve system reliability to meet the requirement.
RECOMMENDATION 21 The U.S. Department of Defense should not pass a system that has deficient reliability to the field without a formal review of the resulting impacts the deficient reliability will have on the probability of mission success and system life-cycle costs.
Reliability deficiencies can continue to arise after deployment, partly because however realistic an operational test strives to be, it will always differ from field deployment. Field operations can stress systems in unforeseen ways and reveal failure modes that were not likely to have been unearthed in either developmental or operational testing. In addition, feedback and system retrofits from field use can further improve reliability of a given system and can also improve the reliability of subsequent related systems if lessons are learned and communicated. Therefore, the support and enhancement of such feedback loops should be a DoD priority. One way to do so is through continuous monitoring of reliability performance in fielded systems.
RECOMMENDATION 22 The Under Secretary of Defense for Acquisition, Technology, and Logistics should put in place acquisition policies and programs that direct the services to provide for the collection and analysis of postdeployment reliability data for all fielded systems and to make those data available to support contractor closed-loop failure mitigation processes. The collection and analysis of such data should be required to include defined, specific feedback about reliability problems surfaced in the field in relation to manufacturing quality controls and to indicate measures taken to respond to such reliability problems. In addition, the contractor should be required to implement a comprehensive failure reporting, analysis, and corrective action system that encompasses all failures (regardless of whether failed items are restored, repaired, or replaced by a different party, e.g., a subcontractor or original equipment manufacturer).
Problems can arise when a contractor changes its subcontractors or suppliers. If this is done without proper oversight, then it can result in substantial reductions in reliability. Therefore, contractors should be required to document the reason for such changes and estimate the likelihood of mission success and modified life-cycle costs due to such changes in the fielded system. The document detailing the implications of such changes should be reviewed by an external panel of reliability and system experts. If the review finds that there is the potential for a substantial decrease in system reliability, then USD (AT&L) should not approve the change.
RECOMMENDATION 23 After a system is in production, changes in component suppliers or any substantial changes in manufacturing and assembly, storage, shipping and handling, operation, maintenance, and repair should not be undertaken without appropriate review and approval. Reviews should be conducted by external expert panels and should focus on the impact on system reliability. Approval authority should reside with the program executive office or the program manager, as determined by the U.S. Department of Defense. Approval of any proposed change should be contingent on either certification that the change will not have a substantial negative impact on system reliability or a formal waiver explicitly documenting the justification for the change.
This report is focused on activities that are undertaken prior to the end of operational testing. However, approaches to manufacturing and assembly, storage, shipping and handling, operation, maintenance, and repair also affect system reliability. In particular, it is crucial that supply-chain participants have the capability to produce parts and materials of sufficient quality to support meeting a system’s final reliability objectives. Because of changing technology trends and the evolution of complex supply-chain interactions, a cost-effective and efficient parts selection and management process is needed to perform this assessment.
Setting target values for tracking system reliability during development is important for discriminating between systems that are likely to achieve their reliability requirements and those that will struggle to do so. Through early identification of systems that are having problems achieving the required reliability, increased emphasis and resources can be placed on design for reliability or reliability testing, which will often provide a remedy. Given the difficulty of modifying systems later in development, it is critical that such problems are identified as early in the process as possible.
Target reliability values at specified times could be set both before and after delivery of prototypes from the contractor to DoD. Prior to delivery of prototypes for developmental testing, intermediate target values could be set by first determining the initial reliability level, based only on design-for-reliability activities, prior to most subsystem- or system-level testing. Then the contractor, possibly jointly with DoD, would decide what model of reliability as a function of time should be used to link this initial level of
reliability with the reliability requirement to support delivery of prototypes to DoD (see Chapter 4). Such a function could then be used to set intermediate reliability targets.
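One simple way to construct such a linking function is a Duane-style power-law curve that interpolates between the initial MTBF achieved by design alone and the required MTBF at a scheduled later time. The sketch below is illustrative only: the choice of a power law, the milestone times, and all numerical values are assumptions, not prescribed by DoD.

```python
import math

def duane_targets(m0, t0, m_req, t1, milestones):
    """Intermediate MTBF targets from a Duane-style power-law growth curve.

    m0         : initial MTBF (hours) achieved by design alone, at test time t0
    m_req      : required MTBF (hours) at cumulative test time t1
    milestones : cumulative test times at which intermediate targets are wanted
    """
    # Growth-rate exponent chosen so the curve passes through both endpoints.
    beta = math.log(m_req / m0) / math.log(t1 / t0)
    return {t: m0 * (t / t0) ** beta for t in milestones}

# Hypothetical program: 120-hour initial MTBF after 100 cumulative test hours,
# with a 300-hour requirement by 2,000 cumulative test hours.
targets = duane_targets(120.0, 100.0, 300.0, 2000.0, [250, 500, 1000, 2000])
for t, m in sorted(targets.items()):
    print(f"{t:>5} h: target MTBF {m:6.1f} h")
```

Discrepancies between assessed reliability and these intermediate targets would then flag programs whose growth is lagging.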
As noted throughout the report, the number of test replications carried out by a contractor is likely to be very small for any specific design configuration, so reliability estimates are likely to have large variances. This limitation needs to be kept in mind in setting up decision rules that aim to identify systems that are unlikely, absent additional effort, to improve sufficiently to meet their reliability requirement (see Chapter 7).
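The magnitude of this small-sample variability can be illustrated with a short Monte Carlo sketch. The true MTBF, the three-failure test design, and the exponential failure model below are all hypothetical assumptions chosen only to show how widely estimates scatter when each configuration sees only a handful of failures.

```python
import random
import statistics

random.seed(1)

def mtbf_estimate(true_mtbf, n_failures):
    """Failure-truncated test: run until n_failures occur, then estimate
    MTBF as total accumulated test time divided by the failure count."""
    total_time = sum(random.expovariate(1.0 / true_mtbf)
                     for _ in range(n_failures))
    return total_time / n_failures

# With only 3 failures per design configuration, estimates scatter widely:
# for exponential lifetimes the coefficient of variation is 1/sqrt(3) = 0.58.
estimates = [mtbf_estimate(300.0, 3) for _ in range(10_000)]
cv = statistics.stdev(estimates) / statistics.mean(estimates)
print(f"mean estimate {statistics.mean(estimates):.0f} h, "
      f"coefficient of variation {cv:.2f}")
```

A coefficient of variation near 0.6 means a single estimate can easily be half or double the true value, which is why decision rules must not treat point estimates as precise.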
After prototype delivery, the specified initial reliability level could be the reliability assessed by the contractor on delivery or during early full-system developmental testing; the final level would be the specified requirement, and its date would be the scheduled time for initiation of operational testing. Again, the contractor and DoD would have to decide what function should be used to link the initial level of reliability with the final value and the associated dates used to fit target values. Having decided on that, intermediate reliability targets can be easily determined. As noted above, the variances of such reliability estimates would need to be considered in any decision rules pertaining to whether a system is or is not ready to enter into operational testing.
In each of these applications, one is merely fitting a curve of hypothesized reliabilities over time that will associate the initial reliability to the reliability goal over the specified time frame. One can imagine curves in which most of the change happens early in the time frame, other curves with relatively consistent changes over time, and myriad other shapes. Whatever curve is selected, it is this curve that will be used to provide intermediate reliability targets to compare with the current estimates of reliability, with the goal of using discrepancies from the curve to identify systems that are unlikely to meet their reliability requirement in the time allotted. Experience with similar systems should provide information about the adequacy of the length, number, and type of test events to achieve the target reliability. Clearly, the comparisons to be made between the estimated system reliability, its estimated standard error, and the target values are most likely to occur at the time of major developmental (and related) test events or during major system reviews.
With respect to the second setting of target values, the appropriate time to designate target values for reliability is after delivery of prototypes because reliability levels cannot be expected to appreciably improve as a result of design flaws discovered during operational testing. As noted throughout the report, operational testing is generally focused on identifying deficiencies in effectiveness, not in suitability, and fixing flaws discovered at this stage is both expensive and risky (see Chapter 8). Unfortunately, late-stage full-system developmental testing, as currently carried out, may also be somewhat limited in its potential to uncover flaws in reliability
design because it fails to represent many aspects of operational use (see Chapter 8; also see National Research Council, 1998). As the Defense Science Board emphasized (U.S. Department of Defense, 2008a), testing cannot overcome a poor initial design. It is therefore important to insist that more be done to achieve reliability at the design stage, and the goals for initial reliability levels prior to entry into developmental testing should be set higher than is presently the case. Starting a growth trajectory from an initial design whose reliability is too low is a major reason that many systems fail to attain their requirement. Unfortunately, one cannot provide, a priori, a fixed rule for the initial system reliability level needed to have a reasonable chance of achieving the reliability requirement prior to delivery of prototypes or prior to operational testing. At the very least, one would expect such rules to be specific to the type of system.
More generally, several key questions remain unanswered concerning which design-for-reliability techniques and reliability growth tests are most effective for which types of defense systems, the order in which they are most usefully applied, and the total level of effort that is needed for either design for reliability or for reliability growth.
To help clarify these and other important issues, DoD should collect and archive, for all recent acquisition category I systems (see Chapter 1), the estimated reliability for at least five stages of development:
- the level achieved by design alone, prior to any contractor testing,
- the level at delivery of prototypes to DoD,
- the level at the first system-level government testing,
- the level achieved prior to entry into operational testing, and
- the level assessed at the end of operational testing.
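A minimal sketch of how such five-stage estimates could be archived and compared follows; the record structure, the system name, and all MTBF values are hypothetical illustrations, not an actual DoD schema.

```python
from dataclasses import dataclass

# The five development stages listed above, in chronological order.
STAGES = ("design_only", "prototype_delivery", "first_govt_test",
          "pre_ot", "post_ot")

@dataclass
class ReliabilityRecord:
    system: str
    requirement: float   # required MTBF, hours
    mtbf: dict           # stage name -> estimated MTBF, hours

    def growth_by_stage(self):
        """Stage-to-stage growth ratios, in the order defined in STAGES."""
        vals = [self.mtbf[s] for s in STAGES]
        return [later / earlier for earlier, later in zip(vals, vals[1:])]

# Hypothetical ACAT I system: most growth occurs before government testing,
# and the operational-test assessment comes in slightly lower than pre-OT.
rec = ReliabilityRecord(
    system="System A",
    requirement=300.0,
    mtbf={"design_only": 150.0, "prototype_delivery": 210.0,
          "first_govt_test": 240.0, "pre_ot": 290.0, "post_ot": 280.0},
)
print([round(r, 2) for r in rec.growth_by_stage()])
```

Aggregating such ratios across many programs of the same type is what would make the analyses described below possible.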
Analyses of the data would provide information on the degree of improvement toward reliability requirements that is feasible for different types of systems at each stage of the development process. Such an analysis could be useful input toward the development of rules as to what levels of reliability should be required for promotion to subsequent stages of development. (Such an analysis would require some type of taxonomy of defense systems in which the patterns of progression to requirements were fairly comparable for all the members in a cell.)
Analysis of these data may also identify the factors that play a role in achieving reliability requirements. For example, it would be of great importance to determine which design-for-reliability techniques, or what budgets for design for reliability, were predictive of higher or lower reliability levels at the start of full-system developmental testing. Similarly, it would be important to determine what testing efforts, including budgets for reliability testing and the kinds of tests used, were successful in promoting reliability growth. One could also consider the achieved reliabilities of related systems as predictors.
There are likely to be considerable additional benefits if DoD sets up a database of these and other variables viewed as having a potential impact on reliability growth and reliability attainment, drawn from recent past and current acquisition programs. For example, such a database could also be useful in preparing cost-benefit analyses and business case analyses to support the conduct of specific reliability design tasks and tests. Databases of this kind are commonplace among the best-performing commercial system development companies because they support investigation of the factors that are and are not related to the acquisition of reliable systems. Although it is more difficult for defense systems, any information on the reliability of fielded systems could also be added to such a database.
RECOMMENDATION 24 The Under Secretary of Defense for Acquisition, Technology, and Logistics should create a database that includes three elements obtained from the program manager prior to government testing and from the operational test agencies when formal developmental and operational tests are conducted: (1) outputs, defined as the reliability levels attained at various stages of development; (2) inputs, defined as the variables that describe the system and the testing conditions; and (3) the system development processes used, that is, the reliability design and reliability testing specifics. The collection of these data should be carried out separately for major subsystems, especially software subsystems.
Analyses of these data should be used to help discriminate in the future between development programs that are and are not likely to attain reliability requirements. Such a database could also profitably include information on the reliability performance of fielded systems to provide a better “true” value for reliability attainment. DoD should develop techniques by which researchers can use extracts from this database while protecting against disclosure of proprietary and classified information. Finally, DoD should identify leading examples of good practice in the development of reliable systems of various distinct types and collect them in a casebook for use by program managers.
Once it is better understood how near the initial reliability needs to be to the required level to have a good chance of attaining the required level prior to entry into operational testing, acquisition contracts could indicate the reliability levels that need to be attained by the contractor before a system is promoted to various stages of development. (Certainly, when considering whether demonstrated reliability is consistent with targets from
a reliability growth curve, the impact of any impending corrective actions should be factored into such assessments.)
We believe that collectively, the above recommendations will address what may have been a not uncommon practice of contractors’ submitting proposals that simply promised to produce a highly reliable defense system without providing details regarding the measures that would be taken to ensure this. Proposals were not obligated to specify which, if any, design-for-reliability methodologies would be used to achieve as high an initial reliability as possible prior to formal testing, and proposals were not obligated to specify the number, size, and types of testing events that would be carried out to “grow” reliability from its initial level to the system’s required level. Contractors were also not required to provide the associated budgets or impacts on schedule of delivery of prototypes.
Partly as a result of the absence of these details, there was no guarantee that reliability-related activities would take place. In fact, proposals that did allocate substantial budget, development time, and detail in support of specific design-for-reliability procedures and comprehensive testing have been implicitly penalized, because their development costs were higher and their delivery schedules longer than those of proposals that made less specific assertions as to how the reliability requirements would be met. Our recommendations above will level the playing field by removing any incentive to reduce expenditures on reliability growth or reliability testing in order to lower a proposal’s cost and so increase the chances of winning the contract.
Systems should have objective reliability thresholds that serve as “go/no-go” gates and are strictly enforced, preventing promotion to the next stage of development or to the field unless those thresholds have been attained. At each decision point in the development of a system, if the assessed level of reliability falls considerably short of the reliability growth curve, then the system should not be promoted to the next stage unless there is a compelling reason to do so.
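A gate decision of this kind must account for the large variances of small-sample reliability estimates discussed above. The sketch below illustrates one possible rule; the two-standard-error threshold, the return labels, and all numbers are illustrative assumptions, not DoD policy.

```python
def gate_decision(est_mtbf, std_err, target_mtbf, z=2.0):
    """Illustrative go/no-go gate: block promotion only when the estimate
    falls more than z standard errors below the growth-curve target, so that
    small-sample noise alone does not fail an otherwise healthy program."""
    shortfall = target_mtbf - est_mtbf
    if shortfall <= 0:
        return "go"                    # at or above the target
    if shortfall > z * std_err:
        return "no-go"                 # statistically credible shortfall
    return "go (marginal: monitor)"    # within noise of the target

# Shortfall of 30 h against a standard error of 40 h is within noise...
print(gate_decision(est_mtbf=250.0, std_err=40.0, target_mtbf=280.0))
# ...but a 130-hour shortfall is not.
print(gate_decision(est_mtbf=150.0, std_err=40.0, target_mtbf=280.0))
```

The "compelling reason" escape clause in the text corresponds to a documented waiver overriding a no-go result, not to relaxing the statistical rule itself.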
A number of the above recommendations either explicitly (by mentioning an expert external panel) or implicitly utilize expertise in reliability and associated methods and models. In our opinion, DoD currently does not have sufficient expertise in reliability to provide satisfactory oversight of the many ACAT I acquisition programs. Therefore, we recommend that DoD initiate steps to acquire additional expertise.
RECOMMENDATION 25 To help provide technical oversight regarding the reliability of defense systems in development, specifically, to help
develop reliability requirements, to review acquisition proposals and contracts regarding system reliability, and to monitor acquisition programs through development, involving the use of design-for-reliability methods and reliability testing, the U.S. Department of Defense should acquire, through in-house hiring, through consulting or contractual agreements, or by providing additional training to existing personnel, greater access to expertise in these five areas: (1) reliability engineering, (2) software reliability engineering, (3) reliability modeling, (4) accelerated testing, and (5) the reliability of electronic components.
Lastly, our statement of task asked the panel to explore ways in which reliability growth processes and various models could be used to improve the development and performance of defense systems. In its work to produce this report, the panel did identify a number of research areas that DoD might consider supporting in the future: reliability design support, comprehensive reliability assessment, and assessing aspects of reliability that are difficult to observe in development and operational testing.
With regard to reliability design support, research on the relationship between physical failure mechanisms and new technologies seems warranted. We note three difficult issues in this area:
- assessment of system reliability that is complicated by interdependence of component functionality and tolerances; the nonlinear nature of fault development and expression; and the variation of loading conditions, maintenance activities, and operational states;
- the relationship between reliability assessments made during system development and those made during full-rate manufacturing; and
- assessment of the impact of high-frequency signals across connections in a system.
There is much that could be gained from research on assessment methodologies. We suggest, in particular:
- reliability assessment of advanced commercial electronics, addressing next-generation high-density semiconductors and nanoscale electronic structures; copper wirebonds; environmentally friendly molding compounds; advanced environmentally friendly consumer materials; and power modules and batteries for vehicle and aerospace applications; and
- modeling the inherent uncertainty due to variation in supply and manufacturing chains in an approach similar to reliability block diagrams for the purpose of reliability prediction; and creation of traditional reliability metrics from physics-of-failure models.
There are also intriguing near- and long-term reliability issues that would be difficult to observe in development and operational testing. We note, particularly:
- the identification, characterization, and modeling of the effects of defects that can lead to early failures (infant mortality);
- reliability qualification for very long life cycles containing simultaneous impacts of multiple, combined types of stresses;
- built-in self-diagnosis of sensor degradation: as systems are increasingly instrumented with sensing functions, degradation of those sensors can lead to erroneous operation if it goes undetected; and
- long-term (e.g., space flight, storage) failure models and test methods.