design diversity does not appear to be a reasonable use of USNRC research funds.
Conclusion 5. Although many in the software community believe that there are more cost-effective techniques for achieving high software reliability than redundancy and diversity, there is no agreement as to what these alternatives may be. The most promising of these appear to be the extension of standard safety analysis and design techniques to software and the use of formal (mathematical) analysis. (See Recommendation 3 in Chapter 4.)
Conclusion 6. The use of self-checking to detect hardware failures and some simple software errors is effective and should be incorporated. However, care must be taken to assure that the self-checking features themselves do not introduce errors.
Recommendation 1. The USNRC should retain its position of assuming that common-mode software failure is credible.
Recommendation 2. The USNRC should maintain its basic position regarding the need for diversity in digital instrumentation and control (I&C) systems as stated in the draft branch technical position, Digital Instrumentation and Control Systems in Advanced Plants, and its counterpart for existing plants.
Recommendation 3. The USNRC should revisit its guidelines on assessing whether adequate diversity exists. The USNRC should not place reliance on different programming languages, different design approaches meeting the same functional requirements, different design teams, or using different vendors' equipment ("nameplate" diversity). Rather, the USNRC should emphasize potentially more robust techniques such as the use of functional diversity, different hardware, and different real-time operating systems.
Recommendation 4. The USNRC should reconsider the use of research funding to try to establish diversity between two pieces of software performing the same function. This does not appear to be possible. Specifically, it appears the USNRC funding of the Unravel tool is based on the use of this tool for this purpose and, as such, is unlikely to be useful.
AECB (Atomic Energy Control Board, Canada). 1996. Draft Regulatory Guide C-138, Software in Protection and Control Systems. Ottawa, Ontario: AECB.
Bowman, W.C., G.H. Archinoff, V.M. Raina, D.R. Tremaine, and N.G. Leveson. 1991. An Application of Fault Tree Analysis to Safety-Critical Software at Ontario Hydro. Presentation at Conference on Probabilistic Safety Assessment and Management (PSAM), Beverly Hills, Calif., April.
Brilliant, S., J.C. Knight, and N.G. Leveson. 1990. Analysis of faults in an N-version software experiment. IEEE Transactions on Software Engineering 16(2):238–247.
Brunelle, J.D., and D.E. Eckhardt. 1985. Fault-Tolerant Software: Experiment with the SIFT Operating System. Presentation at AIAA Computers in Aerospace Conference, Dallas, October.
Eckhardt, D.E., and L. Lee. 1985. A theoretical basis for the analysis of multiversion software subject to coincident errors. IEEE Transactions on Software Engineering 11(12):1511–1517.
Eckhardt, D.E., A.K. Caglayan, P. Lorczak, J.C. Knight, D.F. McAllister, M. Vouk, L. Lee, and J.P. Kelly. 1991. Robustness of software redundancy as a strategy for improving reliability. IEEE Transactions on Software Engineering 17(7):692–702.
DOD (U.S. Department of Defense). 1993. Military Standard 882C, System Safety Program Requirements. Washington, D.C.: U.S. Department of Defense.
FAA (Federal Aviation Administration). 1992. DO-178B, Software Considerations in Airborne Systems and Equipment Certification. Washington, D.C.: FAA.
FDA (Food and Drug Administration). 1991. Reviewer Guidance for Computer Controlled Medical Devices Undergoing 510(k) Review. Washington, D.C.: FDA.
Knight, J.C., and N.G. Leveson. 1986. An experimental evaluation of the assumption of independence in multi-version programming. IEEE Transactions on Software Engineering 12(1):96–109.
Leveson, N.G. 1995. Safeware: System Safety and Computers. New York: Addison-Wesley.
Leveson, N.G., S.S. Cha, J.C. Knight, and T.J. Shimeall. 1990. The use of self checks and voting in software error detection: An empirical study. IEEE Transactions on Software Engineering 16(4):432–443.
Miller, B.P., L. Fredrikson, and B. So. 1990. An empirical study of the reliability of UNIX utilities. Communications of the Association for Computing Machinery 33(12):32–44.
Reese, J.D. 1996. Software Deviation Analysis. Ph.D. dissertation, University of California, Irvine. January.
Sci.math. 1990. Various authors posting to this Usenet newsgroup, Feb. 3–8.
Scott, R.K., J.W. Gault, and D.F. McAllister. 1987. Fault tolerant reliability modeling. IEEE Transactions on Software Engineering 13(5):582–592.
USNRC (U.S. Nuclear Regulatory Commission). 1992. Draft Branch Technical Position on Digital Instrumentation and Control Systems in Advanced Plants. Washington, D.C.: USNRC.
USNRC. 1996. Draft Branch Technical Position on Defense-in-Depth and Diversity. Washington, D.C.: USNRC. (Also USNRC staff presentation to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C., October 1995.)
Safety and Reliability Assessment Methods
Appropriate methods for assessing (as distinct from achieving or assuring) safety and reliability are the key to establishing the acceptability of digital instrumentation and control (I&C) systems in nuclear plants. Methods must be available to support estimates of reliability, assessments of safety margins, comparisons of performance with regulatory criteria such as quantitative safety goals, and overall assessments of safety in which trade-offs are made on the basis of the relative importance of disparate effects such as improved self-checking acquired at the cost of increased complexity. These methods must be sufficiently robust, justified, and understandable to be useful in assuring the public that using digital I&C technology in fact enhances public safety.
Statement of the Issue
Effective, efficient methods are needed to assess the safety and reliability of digital I&C systems in nuclear power plants. These methods are needed to help avoid potentially unsafe or unreliable applications and aid in identifying and accepting safety-enhancing and reliability-enhancing applications. What methods should be used for making these safety and reliability assessments of digital I&C systems?
In nuclear power plants, reliability and safety are assessed using an interactive combination of deterministic and probabilistic techniques. The issues that the committee considered were the extent to which these assessment methods are applicable to digital I&C systems and the appropriate use of these methods.
Design basis accident analysis is a deterministic assessment of the response of the plant to a prescribed set of accident scenarios. This specific analysis constitutes a major section of the nuclear plant safety analysis report that is submitted to and reviewed by the U.S. Nuclear Regulatory Commission (USNRC) in the licensing process. In a design basis accident analysis, an agreed-upon set of transient events is imposed on analytical simulations of the plant. Then, assuming defined failures, the plant systems must be shown to be effective in keeping the plant within a set of defined acceptance criteria. Consider, for example, the analysis of the thermal response of the reactor following a postulated pipe rupture. In this case, the deterministic safety analysis considers:
the size of the rupture (the cross-sectional area of the pipe)
the geometry of the systems and components affected, such as volumes and elevations of pipes and vessels
the initial conditions (conditions at the time of the rupture), such as initial power, pressures, and temperatures
the response logic of the active and passive safety systems, such as the sensing of the event by the instrumentation systems, the subsequent actuation of valves that isolate the fault, and the subsequent opening of backup feedwater system valves
All these considerations are used as parameters or forcing functions in the equations that model the physical behavior of the affected systems (mainly nuclear, thermal, mass, and momentum conservation equations) to calculate the response of the system. Of particular importance is the calculation of the resultant pressures and temperatures in the cooling systems and in the core to assess the integrity of the fuel and the multiple physical barriers that contain radionuclides.
Probabilistic risk assessment (PRA) (or probabilistic safety assessment [PSA]) techniques are used to assess the relative effects of contributing events on system-level safety or reliability. Probabilistic methods provide a unifying means of assessing physical faults, recovery processes, contributing effects, human actions, and other events that have a high degree of uncertainty. These analyses are typically performed using fault tree analysis; but other methods, such as event trees, reliability block diagrams, and Markov methods, are also appropriate. In PRA, the probability of occurrence of various end events, both acceptable and unacceptable, is calculated from the probabilities of occurrence of basic events (usually failure events). For example, the USNRC has established a quantitative safety goal that the probability of a core damage event shall not exceed 10⁻⁵ per reactor year. The results of a particular PRA, of course, have wide bands of uncertainty; but on a relative basis they allow searching out the most important failure modes ("weak points") and allow the designer to balance the design appropriately between mitigation and prevention and to avoid unhealthy dependence on single systems or components.
The development of a fault tree model serves several important purposes. First, it provides a logical framework for analyzing the failure behavior of a system and for precisely documenting which failure scenarios have been considered and which have not. Second, the fault tree model has a well-defined Boolean algebraic and probabilistic basis that relates probability calculations to Boolean logic functions. That is, a fault tree model not only shows how events can combine to cause the end (or top) event, but at the same time defines how the probability of the end event is calculated as a function of the probabilities of the basic events. Thus the fault tree model can evolve as the system evolves and can at any time evaluate the effect of proposed changes on the reliability and safety of the nuclear power plant. In this manner the fault tree analysis can be used to support engineering tasks such as illuminating the design "weak points," facilitating trade-off analyses, or assessing relative risks.
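The way a fault tree ties Boolean structure to a probability calculation can be illustrated with a minimal sketch. It assumes that all basic events are statistically independent and that no basic event appears in more than one branch; real PRA tools relax both assumptions. The tree, the event names, and every probability below are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # assumed point estimate for the basic event


@dataclass
class Gate:
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]]


def event_probability(node):
    """Probability of a node, assuming independent, non-repeated basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [event_probability(child) for child in node.inputs]
    if node.kind == "AND":
        # All inputs must occur: multiply the probabilities.
        product = 1.0
        for p in child_probs:
            product *= p
        return product
    if node.kind == "OR":
        # At least one input occurs: 1 minus the chance that none occur.
        none_occur = 1.0
        for p in child_probs:
            none_occur *= 1.0 - p
        return 1.0 - none_occur
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the top event occurs if both sensor channels fail,
# or if the actuator fails. All numbers are made up for illustration.
top = Gate("OR", [
    Gate("AND", [BasicEvent("sensor channel A fails", 1e-2),
                 BasicEvent("sensor channel B fails", 1e-2)]),
    BasicEvent("actuator fails", 1e-4),
])

print(f"P(top event) = {event_probability(top):.6e}")
```

Changing one basic-event probability and re-evaluating the tree gives a simple form of the trade-off analysis described above.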
As mentioned above, the probabilistic analysis of reliability and safety is dependent upon an assignment of a probability of occurrence for each basic event in the fault tree. In addition to addressing the probability of an event, however, probability analysis may also address probabilities of variability and uncertainty. For example, an estimation may be made of the probability that a component will fail (probability of an event). But this failure probability may vary as a result of statistical variation in external conditions, such as temperature, or statistical characteristics of the source of the component. A second probability concept describes this variation as a probability distribution around a "point estimate" for the failure probability. Furthermore, the failure probability may not be known with perfect confidence. A third probability concept uses a distribution to express the degree of uncertainty associated with the point estimate reflecting the differences and uncertainties among experts solicited for judgments on probabilities (see below). Thus current risk assessment practice distinguishes between probabilities of events, variability, and uncertainty (NRC, 1994).
An uncertainty analysis using the fault tree model reflects the degree to which the output value is affected by the uncertainty in an input. This analysis helps the designer determine the extent to which an unknown input can affect the reliability or safety of the system and thus the extent to which the system must be able to withstand such uncertainty (Modarres, 1993).
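One common way to carry out such an uncertainty analysis is Monte Carlo sampling, sketched below under stated assumptions. Each basic-event probability is drawn from a lognormal distribution described by a median and an "error factor" (taken here as the ratio of the 95th percentile to the median), and the samples are pushed through a small hypothetical top-event function. The tree structure, medians, and error factors are illustrative assumptions, not values from any cited study.

```python
import math
import random


def top_event(p_a, p_b, p_c):
    """Hypothetical tree (A AND B) OR C, assuming independent events."""
    return 1.0 - (1.0 - p_a * p_b) * (1.0 - p_c)


def sample_lognormal(rng, median, error_factor):
    """Draw a probability from a lognormal; error_factor = 95th pct / median."""
    sigma = math.log(error_factor) / 1.645
    # Clamp so that a sampled value is always a valid probability.
    return min(1.0, rng.lognormvariate(math.log(median), sigma))


def propagate(n_samples=10000, seed=1):
    """Return the 5th, 50th, and 95th percentiles of the top-event probability."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        p_a = sample_lognormal(rng, 1e-2, 3.0)
        p_b = sample_lognormal(rng, 1e-2, 3.0)
        p_c = sample_lognormal(rng, 1e-4, 10.0)
        results.append(top_event(p_a, p_b, p_c))
    results.sort()

    def percentile(q):
        return results[int(q * (n_samples - 1))]

    return percentile(0.05), percentile(0.50), percentile(0.95)


low, median, high = propagate()
print(f"5th pct = {low:.2e}, median = {median:.2e}, 95th pct = {high:.2e}")
```

The width of the interval between the 5th and 95th percentiles indicates how strongly the uncertain inputs affect the output.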
But the fundamental concept in probabilistic analysis remains the concept of the probability of an event. There are several interpretations of the probability associated with an event (Cooke, 1991; Cox, 1946; McCormick, 1981; Modarres, 1993). The classic notion of the probability of an event is the ratio of the size of the subspace of sample points that include the event to the size of the sample space. A frequency interpretation of the probability of an event is the one most commonly understood; it defines the probability of an event as the limit of the ratio of the number of such events observed to the number of trials as the number of trials becomes large. Many events considered in a probabilistic safety assessment in the nuclear field are, however, classifiable as rare events, which complicates the estimation of occurrence probabilities for the basic events. If failure probabilities are to be estimated from life testing or field experience, many samples must be studied over long periods of time in order to gain any statistical significance in the data (Leemis, 1995). Several databases and handbooks exist to help with the estimation of failure probabilities for basic events (Bellcore, 1992; DOD, 1991; Gertman and Blackman, 1994; RAC, 1995). Within the nuclear engineering community, failure data for nuclear-specific systems and components are available from several sources, including summaries of licensee event reports (USNRC, 1980, 1982a, 1982b) and other handbooks (IEEE, 1983; USNRC, 1975). The existence and use of such handbooks helps address the problems associated with obtaining failure data for many of the basic events.
But for some basic events, where there are few or no applicable data on frequencies, subjective interpretations of probability may be used and may, in fact, be all that is available. Subjective probabilities may be sought in formal and informal processes in which groups of experts weigh available evidence and make judgments. This approach to probability is not of course based on relative frequencies and does not require samples or trials except as they may be available to inform subjective engineering judgment. Rather, subjective interpretations are commonly described as measures of the degree of belief that an event will occur. For example, Apostolakis (1990) states that "probability is a measure of belief." He continues: "The primitive notion is that of 'more likely': that is, we can intuitively say that event A is more likely than event B. Probability is simply a numerical expression for this likelihood." However, as more information becomes available, the subjective distribution (see discussion of uncertainty analysis above) can be adjusted to reflect the current state of knowledge.
There is extensive experience in nuclear risk studies and elsewhere with such elicitation of expert judgments on probabilities. Bayesian analysis (Leemis, 1995) tells how past observations (i.e., frequency data) influence the subjective judgment. Certain characteristic biases, such as tendencies toward overconfidence, are known to occur (Cooke, 1991). Notwithstanding its limitations, the subjective interpretation of probability is the usual basis for the analysis of rarely occurring events and forms the basis of many risk evaluations (McCormick, 1981). As such, it is important to the committee's consideration of the applicability of probabilistic analysis to digital systems.
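A minimal sketch of how observations can adjust a subjective estimate is the conjugate beta-binomial update. It assumes that the expert's belief about a per-demand failure probability is expressed as a Beta(a, b) distribution and that demands are independent trials. The prior parameters and the observed counts are hypothetical.

```python
def update_beta(prior_a, prior_b, failures, demands):
    """Posterior Beta parameters after observing failures in a number of demands."""
    if failures < 0 or demands < failures:
        raise ValueError("need 0 <= failures <= demands")
    return prior_a + failures, prior_b + (demands - failures)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior: the expert's belief is centered on 1e-3 per demand.
prior = (1.0, 999.0)

# Hypothetical evidence: 2000 demands observed with no failures.
posterior = update_beta(*prior, failures=0, demands=2000)

print(f"prior mean     = {beta_mean(*prior):.2e}")
print(f"posterior mean = {beta_mean(*posterior):.2e}")
```

With few observations the prior dominates the result. As operating experience accumulates, the data increasingly determine the posterior.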
Hazard analysis (i.e., experts thinking about what might go wrong) has been validated as effective for at least 50 years. Random testing has been suggested as an alternate approach. However, truly random testing is not particularly good for finding hazards as it is more of a "needle-in-a-haystack" approach. Tests might also be randomly generated from an abstract description of a rare-event scenario. However, significant expertise is needed to formulate such a description.
Applicability to Digital Systems
Deterministic analysis techniques for digital systems are a generalization of the design basis accident methodology used in the nuclear industry and include such techniques as hazard analysis and formal methods (Leveson, 1995; Rushby, 1995). The use of deterministic analysis techniques for the analysis of digital systems is not controversial, as long as they are applied with care to consider the failure modes attributable to digital systems. More controversial is the applicability of probabilistic models to digital systems. The committee spent much of its effort on this issue in assessing the applicability of probabilistic analysis methods to digital systems.
Although well-accepted techniques exist for the analysis of physical faults, probabilistic analysis of design faults in critical systems is more problematic. Because software faults are by definition design faults, the discussion will focus on probabilistic techniques for assessing software. It should be noted that much of the discussion is applicable to similar systems that may be implemented in hardware, using programmable devices or application-specific integrated circuits.
There is controversy within the software engineering community as to whether software fails, whether it fails randomly, and whether the notion of a software failure rate exists. Some would assert that software does not "fail" because it does not physically change when an incorrect result is produced. Others assert that software either works or does not work, and thus its reliability is either zero or one (see, e.g., Singpurwalla, 1995, and the published discussion accompanying that reference).
Some who accept the notion of software failure disagree as to whether software failure can be modeled probabilistically. Some argue that software is deterministic, in that given a particular set of inputs and internal state, the behavior of the software is fixed. The most common justification for the apparent random nature of software failures is the randomness or uncertainty of the input sequences (Eckhardt and Lee, 1985; Laprie, 1984; Littlewood and Miller, 1989). For example, Finelli (1991) identifies "error crystals" (regions of the input space that cause a program to produce errors); a software failure occurs when the input trajectory enters an error crystal. Recent experimental work (Goel, 1996) seems to suggest that the reliability of some software can be modeled stochastically as a function of the workload.
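The argument that a deterministic program can still exhibit apparently random failures can be illustrated with a toy simulation. The sketch below is a hypothetical stand-in, not a model from the cited work. The "program" fails for exactly the same inputs every time, and the randomness comes only from drawing inputs from an assumed uniform operational profile.

```python
import random


def program_fails(x):
    """Deterministic stand-in for a program: fails only inside one input region."""
    # Hypothetical failure region occupying 0.1% of the input space [0, 1).
    return 0.300 <= x < 0.301


def estimate_failure_probability(n_inputs, seed):
    """Fraction of randomly drawn inputs that land in the failure region."""
    rng = random.Random(seed)
    failures = sum(program_fails(rng.random()) for _ in range(n_inputs))
    return failures / n_inputs


# The same input always gives the same outcome.
print(program_fails(0.3005), program_fails(0.5))

# The observed failure frequency depends on the input distribution.
print(estimate_failure_probability(100_000, seed=0))
```

Under a different operational profile (for example, one that never produces inputs near 0.3), the same program would show a different failure frequency, which is why the input distribution matters to any probabilistic claim about software.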
For nonsafety-critical software systems, statistical analysis techniques are being used in the software reliability engineering process (Lyu, 1996). For example, the statistical analysis of the results (i.e., detected failures) of a good set of tests can, based on the operational profile, help managers answer questions such as "When can I release this version?" or "When can I consider this phase of testing complete?" The basic premise is that a set of random tests of a large software system provides data as to the probability of failure for a particular version of software.
Many of the methods developed for software reliability engineering of large-scale commercial systems are not directly applicable to embedded systems for critical applications. One problem with the software reliability engineering approaches is that a very large number of test cases must be considered to statistically validate a low probability of failure (Butler and Finelli, 1993).
For very reliable software, the software would be expected to pass every test, making statistical analysis even more difficult. If software for a safety-critical application were to fail a test, the software would be changed in such a way as to correct the error and the testing would be restarted. Thus, a point would be reached when the software would have passed a very large number of tests. Miller et al. (1992) describe several methods for estimating a probability of failure for software that, in its current version, has not failed during random testing. Bertolino and Strigini (1996) propose a method for estimating both the probability of failure and the probability of program correctness from a series of failure-free test executions. Parnas et al. (1990) describe a methodology for determining how many tests should be passed in order to achieve a certain level of confidence that the failure probability is below a specified upper bound. A similar approach is described in NUREG/CR-6113 (USNRC, 1993a). In this case, the operating range of a safety system is considered to be the transition region between safe and unsafe operation. Thus it is recommended that random tests be selected in this transition region, and a mathematical formula is given for determining the number of test cases needed for statistical confidence that the failure probability is below a given upper bound.
The validity of these methods is dependent on the quality of the test cases chosen. The test cases should be representative of the inputs encountered in practice and should certainly include all boundary conditions and known potentially hazardous cases. Random testing should, however, be only a part of a complete program for safety assessment and quality assurance, a program that includes formal methods (Rushby, 1995) or other analysis techniques throughout the development and assurance process. Testing and formal methods, besides being complementary, can be mutually supportive as well. Analysis can help determine potentially hazardous conditions that should be tested, and testing can help validate critical assumptions made in the analysis (Walter, 1990).
Some failure data from operational systems in the nuclear and other industries are available (Paula, 1993). Failure rates for microprocessor-based programmable logic controllers used in emergency shutdown systems are reported by Mitchell and Williams (1993). Fault-tolerant digital control systems failures are analyzed by Paula et al. (1993), who also present a quantitative fault tree analysis that helped a group of owners decide whether to replace existing analog control systems with fault-tolerant digital control systems. In 90 system years of operation, 279 single-channel failures and 55 multiple-channel failures were reported. Of the 55 multiple-channel failures, nine were attributed to software deficiencies. The fault tree analysis included such failure modes as inadvertent operator actions, software failures, physical damage from external events, lack of coverage, and hardware component and communication failures.
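The counts quoted above can be turned into rough point estimates with simple arithmetic, as in the sketch below. It assumes a constant failure rate over the 90 system years and treats the reported counts as complete; neither assumption is confirmed by the text, so the results are illustrative only.

```python
# Counts as reported in the text for fault-tolerant digital control systems.
SYSTEM_YEARS = 90
SINGLE_CHANNEL_FAILURES = 279
MULTIPLE_CHANNEL_FAILURES = 55
SOFTWARE_MULTIPLE_CHANNEL_FAILURES = 9

# Point estimates, assuming a constant rate over the observation period.
single_rate = SINGLE_CHANNEL_FAILURES / SYSTEM_YEARS
multiple_rate = MULTIPLE_CHANNEL_FAILURES / SYSTEM_YEARS
software_rate = SOFTWARE_MULTIPLE_CHANNEL_FAILURES / SYSTEM_YEARS
software_fraction = SOFTWARE_MULTIPLE_CHANNEL_FAILURES / MULTIPLE_CHANNEL_FAILURES

print(f"single-channel failures per system year:   {single_rate:.2f}")
print(f"multiple-channel failures per system year: {multiple_rate:.2f}")
print(f"software-attributed multiple-channel failures per system year: {software_rate:.2f}")
print(f"fraction of multiple-channel failures attributed to software:  {software_fraction:.1%}")
```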
CURRENT U.S. NUCLEAR REGULATORY COMMISSION REGULATORY POSITION AND PLANS
The criteria under which a utility can make plant changes without prior USNRC approval are established in 10 CFR 50.59. One of the specified criteria for determining whether a change requires approval (i.e., involves an unreviewed safety question) is whether the probability of occurrence or the consequences of an accident or malfunction of equipment important to safety previously evaluated in the safety analysis report may be increased.
The USNRC is increasingly incorporating probabilistic risk assessment into all of its rulemaking activities as it develops a risk-informed, performance-based stance (Newman, 1995). The current USNRC regulatory position on the probabilistic analysis of digital systems, however, is not clearly established or well documented. In an October 1995 presentation to the committee, USNRC staff described their position as follows (USNRC, 1995a): "It is the responsibility of the licensees to ensure appropriate reliability and safety of the digital I&C system. The design life-cycle activities permit both qualitative and quantitative methodologies for assessing reliability and are sufficiently adaptable to consider the evolving aspect of digital technology." However, although qualitative software assurance techniques are presented in several NUREG publications prepared by the Lawrence Livermore National Laboratory (USNRC, 1993b, 1995b), these contain no discussion of probabilistic analysis. In fact, in the October 1995 presentation, the USNRC staff also stated that "quantitative reliability assessment methods for digital systems are not believed to be sufficiently developed to be acceptable as standard practice" (USNRC, 1995a). In further discussions with the committee in April 1996, in addressing the evaluation of relative frequencies of occurrence for use in 10 CFR 50.59 determinations, the USNRC staff indicated they did not consider current evaluation methods to be sufficiently accurate to be meaningful (USNRC, 1996b).
DEVELOPMENTS IN THE U.S. NUCLEAR INDUSTRY
In the U.S. nuclear industry, the use of probabilistic analysis for digital systems (particularly software) is mixed. The analysis of a fault-tolerant digital control system by Paula et al. (1993) used a fault tree and included software failures; however, this approach is not common. A discussion of key assumptions and guidelines for PRA from the Electric Power Research Institute's Utility Requirements Document (EPRI, 1992) shows no mention of software or of failure modes peculiar to digital systems. When several industry representatives were asked by the committee about the use of probabilistic analysis, the responses were mixed or inconclusive. Asked about the probabilistic risk assessment for the General Electric (GE) Advanced Boiling Water Reactor design, the GE representative told the committee that the GE analysis assumed that the software quality assurance and V&V (verification and validation) methodologies addressed the software failure issue (Simon, 1996). Thus software failures were not explicitly included in the PRA. However, it is interesting to note that the PRA for the protection and safety monitoring system of the (Westinghouse) AP600 used a software common-mode unavailability of 1.1 × 10⁻⁵ failures per demand for any particular software module, and a software common-mode unavailability of 1.2 × 10⁻⁶ failures per demand for software failures that would manifest themselves across all types of software modules derived from the same basic design program in all applications (Westinghouse/ENEL, 1992).
DEVELOPMENTS IN THE FOREIGN NUCLEAR INDUSTRY
As discussed in earlier chapters, the Canadian Atomic Energy Control Board (AECB) is currently formalizing an approach for software assessment in a new regulatory guide (AECB, 1996). The AECB assessment of software focuses on four aspects: review of software requirements specifications, systematic inspection of software development and implementation, review of software testing, and confirmation of software development process and management. The AECB approach requires an analysis of software criticality to assess the role of software in plant safety. A probabilistic analysis is not required since it "is difficult to produce a