statistically valid set of accident conditions that a protection system must guard against. However, we maintain that usage testing can build confidence in the reliability of the software (as long as no failures occur)" (Taylor and Faya, 1995).
In the United Kingdom, Nuclear Electric is carrying out extensive dynamic testing of at least substantial portions of Sizewell-B's software as part of its safety case for the reactor's primary protection system. A quantification of the reliability was reportedly not required for licensing, but Nuclear Electric has decided to continue the testing to more accurately estimate the reliability of its software as part of its research and development activity (Marshall, 1995).
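The statistical logic behind such usage testing can be sketched simply. If N independent demands, drawn from an operationally representative profile, all execute without failure, a classical upper confidence bound on the per-demand failure probability follows from solving (1 - p)^N = alpha. The sketch below uses illustrative numbers, not Sizewell-B data:

```python
def zero_failure_upper_bound(n_tests: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-demand failure probability
    after n_tests independent, failure-free random demands.

    Solves (1 - p)**n_tests = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_tests)

# A claim of p < 1e-4 per demand at 95% confidence requires roughly
# -ln(0.05)/1e-4, i.e., about 30,000 failure-free demands:
print(zero_failure_upper_bound(30_000))   # just under 1e-4
```

Because the bound scales roughly as -ln(alpha)/N, claims much below about 1e-4 or 1e-5 per demand rapidly become impractical to demonstrate by testing alone, which is the crux of the quantification debate discussed below.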
DEVELOPMENTS IN OTHER SAFETY-CRITICAL INDUSTRIES
In other safety-critical industries, the use of deterministic safety analysis methods is prevalent; the use of probabilistic analysis is mixed. The Federal Aviation Administration relies heavily on the DO-178B standard for software quality assurance (Software Considerations in Airborne Systems and Equipment Certification) and eschews probabilistic assessment of software failure. A representative of a developer of railway control systems reported to the committee on his industry's use of formal methods for safety assessment (requirements analysis, hazard analysis, failure modes and effects analysis), abstract modeling (Petri nets, VHDL simulations, Markov models), and detailed experimental fault injection (Profetta, 1996). Within the rail industry there is a trend toward PRA-based analysis, raising for that industry many of the same issues facing the nuclear industry. The manager of software engineering at a developer of implantable devices for cardiac rhythm management described his company's system development process, which included safety and reliability assessment and V&V at each stage (Elliott, 1996). Specification analysis included data flow diagrams, state charts, and other formal methods. Quantitative analysis included extensive use of field data and an assessment of the importance of software failure to overall system safety.
Techniques for deterministic analysis of safety and reliability are well accepted and are applicable to digital systems. Formal methods are not yet widely used but offer a good basis for safety analysis of digital systems (Leveson, 1995; Rushby, 1995).
When considering a probabilistic analysis of a system containing digital components, the analyst has three basic choices. First, one can estimate a probability of failure for the digital system, including software, using the best available data and the results of statistically meaningful random tests. An uncertainty analysis can then show how strongly the achievement of a reliability or safety goal depends on this uncertain input. The second choice is to assume that the software either never fails or always fails. The first assumption (that it never fails) coincides with leaving the software out of the fault tree. Many analysts who are hesitant to model software probabilistically do exactly this; because the omission is equivalent to assuming that the software cannot fail, the result may be unduly optimistic. Alternatively, one could assume that the software will certainly fail, assign a probability of one, and design the system to survive such a failure. If, instead, the analyst can subjectively determine a reasonable upper bound on the probability of failure (e.g., on the basis of quality assurance techniques and statistically meaningful random testing), the resulting analysis may be more meaningful. The third choice is to abandon the use of probabilistic analysis for reliability and safety of a nuclear power plant entirely. This choice seems impractical, since a PRA is a key component of nuclear power plant safety analysis and has been used effectively.
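The difference among these choices can be made concrete with a toy fault tree (hypothetical structure and numbers, not drawn from any cited analysis): a protection function that fails if both redundant hardware channels fail or if their common software fails.

```python
def top_event_probability(p_hw: float, p_sw: float) -> float:
    """Toy fault tree: TOP = (HW1 AND HW2) OR SW.
    Two redundant hardware channels with independent failure
    probability p_hw, plus common software with probability p_sw.
    """
    p_hardware = p_hw * p_hw                            # AND gate (independence assumed)
    return 1.0 - (1.0 - p_hardware) * (1.0 - p_sw)      # OR gate

p_hw = 1e-3
# Omitting the software (equivalent to p_sw = 0) is optimistic:
print(top_event_probability(p_hw, 0.0))      # about 1e-6
# A bounded estimate p_sw = 1e-4 dominates the top-event result:
print(top_event_probability(p_hw, 1e-4))     # about 1.01e-4
# Assuming the software always fails forces a design that survives it:
print(top_event_probability(p_hw, 1.0))      # 1.0
```

The middle case illustrates the sensitivity argument: even a modest software failure bound can dwarf the redundant-hardware contribution, which is exactly the kind of insight a PRA with software included can provide.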
However, if traditional fault tree analysis is used in a PRA, it must be recognized that it is limited in its ability to model some of the failure modes associated with digital systems, especially systems that incorporate fault tolerance. Other methods are available. Markov methods, for example, are generally accepted as appropriate for analyzing fault-tolerant digital systems (Johnson, 1989), and some mention of Markov models has appeared in the nuclear literature (Bechta Dugan et al., 1993; Sudduth, 1993), but their use within the nuclear community appears limited. Although Markov models are more flexible than fault tree models and are useful for modeling sequence dependencies, common-cause failures, and failure event correlations, they have the disadvantage of being hard to specify and of requiring very long solution times for large models.
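As a minimal sketch of the Markov approach, the following model of a duplex (one-out-of-two) system with an absorbing failed state integrates the Chapman-Kolmogorov equations numerically. The rates are illustrative, and a real analysis would use a proper stiff ODE solver rather than forward Euler.

```python
def duplex_availability(lam: float, mu: float, t_end: float, dt: float = 1e-3) -> float:
    """Continuous-time Markov chain for a duplex system with repair.

    States: 2 units up -> 1 up -> 0 up (absorbing system failure).
    Per-unit failure rate lam; repair rate mu from the degraded state.
    Forward Euler integration of the Chapman-Kolmogorov equations.
    """
    p2, p1, p0 = 1.0, 0.0, 0.0          # start with both units up
    steps = round(t_end / dt)
    for _ in range(steps):
        d2 = -2 * lam * p2 + mu * p1
        d1 = 2 * lam * p2 - (lam + mu) * p1
        d0 = lam * p1
        p2 += d2 * dt
        p1 += d1 * dt
        p0 += d0 * dt
    return p2 + p1                      # probability system is up at t_end

# With no repair (mu = 0), the closed form is R(t) = 2e^{-lt} - e^{-2lt}:
print(duplex_availability(lam=0.01, mu=0.0, t_end=10.0))   # about 0.9909
```

Adding a repair transition (mu > 0) raises the result, a sequence-dependent effect that a static fault tree cannot capture directly.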
Recent work (Bechta Dugan et al., 1992) has expanded the applicability of fault tree models to adequately handle the complexities associated with the analysis of fault-tolerant systems without necessitating the specification of a complex Markov model. This dynamic fault tree model integrates well with a traditional fault tree analysis of other parts of the system (Pullum and Bechta Dugan, 1996). In addition to the extensions of the fault tree model, other analysis techniques have been proposed, for example, dynamic flow graphs (Garrett et al., 1995; USNRC, 1996a).
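One sequence dependency that motivates such dynamic models is the priority-AND (PAND) gate, which fires only if one input fails before another; a static AND gate cannot express the ordering. The Monte Carlo sketch below (illustrative exponential failure rates, not taken from the cited work) estimates a PAND gate's firing probability:

```python
import random

def pand_probability(lam_a: float, lam_b: float, t_mission: float,
                     n_trials: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of a priority-AND (PAND) gate.

    The gate fires only if component A fails before component B
    and both fail within the mission time -- a sequence dependency
    a static AND gate cannot represent. Exponential lifetimes assumed.
    """
    rng = random.Random(seed)
    fires = 0
    for _ in range(n_trials):
        t_a = rng.expovariate(lam_a)
        t_b = rng.expovariate(lam_b)
        if t_a < t_b <= t_mission:
            fires += 1
    return fires / n_trials

# A static AND would give P(A fails) * P(B fails) = 0.3996 here; the
# PAND result is roughly half that, because it also requires the order:
print(pand_probability(0.1, 0.1, 10.0))   # near the analytic 0.1998
```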
Further, fault-tolerant digital systems are known to be susceptible to "coverage failures," a type of common-cause failure in which a single undetected or unhandled fault brings down the entire system. Coverage failures have been shown to dramatically affect the reliability analysis of highly reliable systems (Arnold, 1973; Bechta Dugan and Trivedi, 1989; Bouricius et al., 1969), so it is important to include them in a model. Paula (1993) provides data on coverage failures in PLC systems used in the chemical process and nuclear power industries.
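The effect can be illustrated with the classic duplex coverage model: each unit failure is detected and handled with coverage probability c, and an uncovered failure (probability 1 - c) fails the whole system. Solving the underlying Markov chain gives R(t) = e^{-2*lam*t} + 2c(e^{-lam*t} - e^{-2*lam*t}); the rate and mission time below are illustrative.

```python
import math

def duplex_reliability_with_coverage(lam: float, c: float, t: float) -> float:
    """Reliability of a duplex system with imperfect coverage c.

    A unit failure is detected and isolated with probability c;
    with probability 1 - c it is uncovered and fails the whole system.
    Closed form from the underlying Markov chain:
        R(t) = exp(-2*lam*t) + 2*c*(exp(-lam*t) - exp(-2*lam*t))
    """
    return (math.exp(-2 * lam * t)
            + 2 * c * (math.exp(-lam * t) - math.exp(-2 * lam * t)))

lam, t = 1e-6, 1000.0            # illustrative rate and mission time
for c in (1.0, 0.999, 0.99):
    # For a highly reliable system, a 1% coverage shortfall inflates
    # unreliability by roughly a factor of 20:
    print(c, 1.0 - duplex_reliability_with_coverage(lam, c, t))
```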
CONCLUSIONS AND RECOMMENDATIONS
Conclusion 1. Deterministic assessment methodologies, including design basis accident analysis, hazard analysis, and other formal analysis procedures, are applicable to digital systems.
Conclusion 2. There is controversy within the software engineering community as to whether an accurate failure probability can be assessed for software or even whether software fails randomly. However, the committee agreed that a software failure probability can be used for the purposes of performing probabilistic risk assessment (PRA) in order to determine the relative influence of digital system failure on the overall system. Explicitly including software failures in a PRA for a nuclear power plant is preferable to the alternative of ignoring software failures.
Conclusion 3. The assignment of probabilities of failure for software (and more generally for digital systems) is not substantially different from the handling of many other rare-event probabilities. A good software quality assurance methodology is a prerequisite for generating bounded estimates of software failure probability. Within the PRA, uncertainty and sensitivity analysis can help the analyst ensure that the results are not unduly dependent on uncertain parameters. As in other PRA computations, bounded estimates for software failure probabilities can be obtained by processes that include valid random testing and expert judgment.[1]
Conclusion 4. Probabilistic analysis is theoretically applicable in the same manner to commercial off-the-shelf (COTS) equipment, but the practical application may be difficult. The difficulty arises when attempting to use field experience to assess a failure probability, because the prior operating environment may not be equivalent to the intended application. For programmable devices, the software failure probability may be unique to each application. However, a set of rigorous tests may still be applicable to bounding the failure probability, as with custom systems. A long history of successful field experience may be useful in eliciting expert judgment.
Recommendation 1. The USNRC should require that the relative influence of software failure on system reliability be included in PRAs for systems that include digital components.
Recommendation 2. The USNRC should strive to develop methods for estimating the failure probabilities of digital systems, including COTS, for use in probabilistic risk assessment. These methods should include acceptance criteria, guidelines and limitations for use, and any needed rationale and justification.[2]
Recommendation 3. The USNRC and industry should evaluate their capabilities and develop a sufficient level of expertise to understand the requirements for gaining confidence in digital implementations of system functions and the limitations of quantitative assessment.
Recommendation 4. The USNRC should consider support of programs that are aimed at developing advanced techniques for analysis of digital systems that might be used to increase confidence and reduce uncertainty in quantitative assessments.
REFERENCES

AECB (Atomic Energy Control Board, Canada). 1996. Draft Regulatory Guide C-138: Software in Protection and Control Systems. Ottawa, Ontario: AECB.
Apostolakis, G. 1990. The concept of probability in safety assessments of technological systems. Science 250(Dec. 7):1359–1364.
Arnold, T.F. 1973. The concept of coverage and its effect on the reliability model of a repairable system. IEEE Transactions on Computers 22(3):251–254.
Bechta Dugan, J., and K.S. Trivedi. 1989. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Transactions on Computers 38(6):775–787.
Bechta Dugan, J., S.J. Bavuso, and M.A. Boyd. 1992. Dynamic fault tree models for fault tolerant computer systems. IEEE Transactions on Reliability 41(3):363–377.
Bechta Dugan, J., S.J. Bavuso, and M.A. Boyd. 1993. Fault trees and Markov models for reliability analysis of fault tolerant systems. Reliability Engineering and System Safety 39:291–307.
Bellcore. 1992. Reliability Prediction for Electronic Equipment. Report TR-NWT-000332, Issue 4, September.
Bertolino, A., and L. Strigini. 1996. On the use of testability measures for dependability assessment. IEEE Transactions on Software Engineering 22(2):97–108.
Bouricius, W.G., W.C. Carter, and P.R. Schneider. 1969. Reliability modeling techniques for self-repairing computer systems. Pp. 295–309 in Proceedings of the 24th Annual Association of Computing Machinery (ACM) National Conference, August 26-28, 1969. New York, N.Y.: ACM.
Butler, R.W., and G.B. Finelli. 1993. The infeasibility of quantifying the reliability of life-critical real-time software. IEEE Transactions on Software Engineering 19(1):3–12.
Cooke, R. 1991. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford: Oxford University Press.
Cox, R.T. 1946. Probability, frequency and reasonable expectation. American Journal of Physics 14(1):1–13.
DOD (U.S. Department of Defense). 1991. Reliability Prediction of Electronic Equipment. MIL-HDBK-217F. Griffiss Air Force Base, N.Y. December 1991.
Eckhardt, D.E., and L.D. Lee. 1985. A theoretical basis for the analysis of multiversion software subject to coincident errors. IEEE Transactions on Software Engineering 11(12):1511–1517.
[1] Committee member Nancy Leveson did not concur with this conclusion.
[2] See also Chapter 8, Dedication of Commercial Off-the-Shelf Hardware and Software.