about the independence of components may have been made in the original licensing basis. If these independence assumptions can be invalidated by the introduction of the digital components, then the safety evaluation must be redone using the new assumptions. In new plants, if the use of digital components can invalidate standard assumptions and procedures for achieving and assessing independence and high reliability, then new procedures may be needed.
U.S. NUCLEAR REGULATORY COMMISSION POSITION
The U.S. Nuclear Regulatory Commission (USNRC) staff has developed the following position with respect to diversity, as stated in the draft branch technical position, Digital Instrumentation and Control Systems in Advanced Plants (USNRC, 1992):
1. The applicant shall assess the defense-in-depth and diversity of the proposed instrumentation and control system to demonstrate that vulnerabilities to common-mode failures have been adequately addressed. The staff considers software design errors to be credible common-mode failures that must be specifically included in the evaluation.
2. In performing the assessment, the vendor or applicant shall analyze each postulated common-mode failure for each event that is evaluated in the analysis section of the safety analysis report (SAR) using best-estimate methods. The vendor or applicant shall demonstrate adequate diversity within the design for each of these events.
3. If a postulated common-mode failure could disable a safety function, then a diverse means, with a documented bases [sic] that the diverse means is unlikely to be subject to the same common-mode failure, shall be required to perform either the same function or a different function. The diverse or different function may be performed by a nonsafety system if the system is of sufficient quality to perform the necessary function under the associated event conditions. Diverse digital or nondigital systems are considered to be acceptable means. Manual actions from the control room are acceptable if time and information are available to the operators. The amount and types of diversity may vary among designs and will be evaluated individually.
4. A set of displays and controls located in the main control room shall be provided for manual system-level actuation and control of critical safety functions and monitoring of parameters that support the safety functions. The displays and controls shall be independent and diverse from the safety computer system identified in items 1 and 3 above.
The position for existing plants is the same except that item 4 is not required.
Because the regulatory requirement depends on providing a diverse means of carrying out a safety function, the USNRC has also recently issued guidelines to assess whether sufficient diversity exists between digital systems. The guidelines state (USNRC, 1996) that adequate diversity is assumed to exist if:
All of the following are different:
design (including design team), or
The digital systems provide a different function but are developed using the same programming language and by the same vendor, or
The digital systems have a different vendor but perform the same function ("nameplate" diversity), or
Case-by-case review is required for other implementations of diversity.
DEVELOPMENTS IN THE FOREIGN NUCLEAR INDUSTRY
The Canadian Atomic Energy Control Board (AECB) has also recently been developing a position on this issue. Its draft regulatory guide C-138, Software in Protection and Control Systems (also discussed in Chapter 4 above), contains the following policy (AECB, 1996):
To achieve the required levels of safety and reliability, the system may need to be designed to use multiple, diverse components performing the same or similar functions. For example, AECB Regulatory Documents R-8 and R-10 require two independent and diverse protective shutdown systems in Canadian nuclear power reactors. It should be recognized that when multiple components use software to provide similar functionality, there is a danger that design diversity may be compromised. The design should address this danger by enforcing other types of diversity such as functional diversity, independent and diverse sensors, and timing diversity.
Thus, the AECB draft regulatory guide agrees with the USNRC with respect to recognizing the possibility of common-mode software failure and requiring steps to be taken to reduce that possibility. The difference appears to be that the AECB accepts functional diversity as one means of addressing the common-mode software failure issue but does not mandate it. The USNRC accepts digital systems performing the same function but provided by different vendors.
DEVELOPMENTS IN OTHER SAFETY-CRITICAL INDUSTRIES
Regulatory agencies in fields other than nuclear power do not, in general, have equivalent policies about common-mode software failure because of the different approach to safety assurance in other industries. Thus, simple comparisons can be misleading. In general, in other industries, all components are considered potentially safety-critical and no distinction is made between safety and nonsafety systems except with respect to their potential to contribute to hazards identified in a system hazard analysis. Components whose operation or failure could cause hazards (such as control systems) are treated in the same way as those that could mitigate hazards and, in fact, are considered more important because hazard prevention is given a higher priority than hazard mitigation. No assumptions are made or requirements levied to use protection or shutdown systems; the design approach used must be justified for each system according to the hazard analysis and characteristics of the particular system.
The Federal Aviation Administration (FAA) meets the need for guidance on satisfying airworthiness requirements for airborne systems through a series of industry-generated and accepted guidelines reflecting best practices: DO-178B, Software Considerations in Airborne Systems and Equipment Certification. These guidelines are in the form of objectives for software life-cycle processes, descriptions of activities and design considerations for achieving these objectives, and descriptions of the evidence that indicates that the objectives have been achieved. The guidelines are applied in a graded manner that depends on the assessed level of criticality of the software component.
Redundancy or diversity in the software is not required by DO-178B. If the licensee wants to take credit for it, that is, reduce the set of normally required activities for their software development process, they must argue the case and get approval from the FAA. Specifically, DO-178B states with respect to using software design diversity (FAA, 1992):
The degree of dissimilarity and hence the degree of protection is not usually measurable. Probability of loss of system function will increase to the extent that the safety monitoring associated with dissimilar software versions detects actual errors or experiences transients that exceed comparator threshold limits. Multiple software versions are usually used, therefore, as a means of providing additional protection after the software verification process objectives for the software level have been satisfied.
In summary, the FAA position on the use of software diversity is that the degree of dissimilarity and protection provided by design diversity is not usually measurable and therefore is usually counted only as additional protection above a required level of assurance.
The defense and aerospace industries use MIL-STD-882C (DOD, 1993) or variations of it (for example, NASA standards are based on MIL-STD-882C). This standard requires the use of a formal safety program that stresses early hazard identification and elimination or reduction of associated risk to a level acceptable to the managing authority. Rather than specify a particular safety design approach, such as defense-in-depth, or design features, such as redundancy or diversity, MIL-STD-882C requires that contractors establish a system safety program that includes specific tasks (such as hazard tracking, reviews and audits, hazard analyses, and safety verification) and criteria (such as the use of qualitative risk assessment and an order of precedence for resolving hazards). In contrast to the FAA and some nuclear power standards, software components are not graded as to their criticality and then subjected to different software development procedures, but rather the hazards themselves are assessed and either eliminated or controlled. Earlier versions of this defense standard included tasks that were specific to software, but the latest version (MIL-STD-882C) has integrated the software tasks with the nonsoftware tasks and does not distinguish them.
The U.S. Office of Device Evaluation of the Center for Devices and Radiological Health of the U.S. Food and Drug Administration has issued Reviewer Guidance for Computer Controlled Medical Devices Undergoing 510(k) Review (FDA, 1991). This guidance applies to the software aspects of premarket notification (510(k)) submissions for medical devices. It provides (1) an overview of the kind of information about software that FDA reviewers may expect in company submissions and (2) specification of the approach that FDA reviewers should take in reviewing computer-controlled devices, such as some key questions that will be asked during the review.
The FDA guidance does not dictate any particular approach to safety, as does the USNRC, or specific software development or quality assurance procedures, as does the FAA. Instead, it focuses attention on the software development process to assure that potential hazardous failures have been addressed, effective performance has been defined, and means of verifying both safe and effective performance have been planned, carried out, and properly reviewed. The FDA believes that in addition to testing, device manufacturers should conduct appropriate analyses and reviews in order to avoid errors that may affect operational safety.
The depth of review is dictated by both the risk to the patient of using (and not using) the device and the role that software plays in the functioning of the device. The three levels of concern are (FDA, 1991):
MAJOR: The level of concern is major if operation of the device or software function directly affects the patient so that failures or latent design flaws could result in death or serious injury to the patient, or if it indirectly affects the patient (e.g., through the action of a care provider) such that incorrect or delayed information could result in death or serious injury of the patient.
MODERATE: The level of concern is moderate if the operation of the device or software function directly affects the patient so that failures or latent design flaws could result in minor to moderate injury to the patient, or if it indirectly affects the patient (e.g., through the action of a care provider) where incorrect or delayed information could result in injury of the patient.
MINOR: The level of concern is minor if failures or latent design flaws would not be expected to result in death or injury to the patient. This level is assigned to a software component that the manufacturer can show to be totally independent of other software or hardware that may be involved in a potential hazard and would not directly or indirectly lead to a failure of the device that could cause a hazardous condition to occur.
The FDA does not specify particular software assurance or development procedures. Instead, the FDA specifies what information should be included in the review documents and what types of questions will be asked during the review for each level of concern. The submission must include a hazard analysis that identifies the potential hazards associated with the device, the method of control (hardware or software), the safeguards incorporated, and the identified level of concern. Because there is no specification of how safety should be achieved, there is no guidance provided on redundancy or diversity.
U.S. NUCLEAR REGULATORY COMMISSION RESEARCH ACTIVITIES
The Office of Nuclear Regulatory Research of the USNRC indicated that they currently fund only one research project on common-mode failure potential. This research project is developing a software tool called Unravel for program slicing.
Program slicing is a technique that was developed to assist with software debugging. Basically, program slicing extracts the statements that might affect the value of a specified variable before execution reaches a particular statement in the program. Thus, if one is trying to fix an error in statement N, it is helpful to know what other statements in the software can affect the values of the variables in that statement.
To perform the slicing, the program is first represented as a flow graph annotated with the variables referenced and defined at each node (roughly, a node is a programming language statement). Unravel works only on programs written in the (ANSI [American National Standards Institute]) C programming language, without any extensions to the language. Some features of C cannot be handled, including calls to the C standard library.
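As an illustration only, the idea of a backward slice can be sketched for a toy straight-line program in which each statement is annotated with the variables it defines and references. This is not Unravel's algorithm, and it works on a Python list rather than on C source. The sample program, the `PROGRAM` table, and the function name `backward_slice` are all hypothetical, and the sketch ignores branches, loops, and pointers, which a real slicer must handle through the flow graph.

```python
# Each statement: (source text, variables defined, variables referenced).
# The program below is a made-up example, not taken from any real system.
PROGRAM = [
    ("a = read()", {"a"}, set()),
    ("b = read()", {"b"}, set()),
    ("c = a + 1",  {"c"}, {"a"}),
    ("d = b * 2",  {"d"}, {"b"}),
    ("e = c + d",  {"e"}, {"c", "d"}),
    ("f = d - 1",  {"f"}, {"d"}),
    ("print(e)",   set(), {"e"}),
]


def backward_slice(program, stmt_index, variables):
    """Return indices of statements before stmt_index that may affect
    the given variables when execution reaches stmt_index.

    Assumes straight-line code (no branches or loops)."""
    relevant = set(variables)
    selected = []
    # Walk backward from the statement just before the slicing point.
    for i in range(stmt_index - 1, -1, -1):
        _, defs, refs = program[i]
        if defs & relevant:
            selected.append(i)
            # The defined variables are now explained by this statement;
            # what it reads becomes relevant instead.
            relevant = (relevant - defs) | refs
    return sorted(selected)
```

For this sample program, `backward_slice(PROGRAM, 6, {"e"})` selects statements 0 through 4 and omits statement 5, which only computes `f`, while slicing on `f` at the same point selects statements 1, 3, and 5.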
The USNRC argument for the usefulness of this tool is that it can assist auditors in evaluating functional diversity in safety-critical software and in conducting a thread audit. The committee has not previously seen any argument that the technique could be used for evaluating diversity and is skeptical about this (see the evaluation later in this chapter).
When multiple digital components are used to provide diversity, the potential for common-mode software failure exists, requiring consideration of two relevant issues: (1) whether failure independence can be assumed or under what conditions it can be assumed (Issue 1); and (2) whether failure independence can be verified, that is, whether there are any ways to determine that the digital components are adequately independent or diverse in their failure behavior (Issue 2). Both issues are examined in turn, considering both digital hardware and software.
Is the failure independence assumption justified for independently produced digital components? For the purposes of discussing this question, design diversity is separated from functional diversity. Also, operating systems are grouped with hardware unless the operating system functions have been specially written for a particular application or digital device. In the latter case, operating systems are considered as application software.
Case 1: Digital Hardware and Operating Systems. For hardware, the prevalence of a very few processors and real-time operating systems invalidates the use of simple "nameplate" diversity assumptions. Many computers with different manufacturers in fact have identical internal components or use the same operating system.
Although the committee knows of no data to support generally rejecting the assumption of independence between failures of diverse digital hardware devices, there are three concerns in assuming independent failures between digital hardware components providing the same function but produced by different manufacturers. The first is that many of the well-publicized errors found in processors have involved similar functions, for example, floating point operations. The second is the increasing complexity of chip designs, which has led to a lowered ability to adequately test the designs before using them. Testing and verification techniques originally developed for software are now being adapted for use in digital hardware because the complexity of these hardware designs is approaching that of software, thus defying exhaustive testing. A third consideration is the use of common design environments, libraries, and fabrication facilities.
Therefore, the question of whether hardware design errors can be assumed to be independent is beginning to have a close relationship to the same question with respect to software. Currently, however, when the design is different there exists no evidence to invalidate the assumption that failures of digital hardware components due to design errors will be independent.
Similarly, assuming intended differences in design, there also is little current evidence to invalidate an assumption of independence of failure between different real-time operating systems. Note, however, that this assumption applies only to operating systems developed by different companies. Different versions of an operating system by one vendor often include the reuse of much of the same code. In addition, evidence does exist of similar failure modes and errors being found in UNIX operating systems built by different vendors (Miller et al., 1990).
However, the above restrictions may be relaxed if analysis has shown that there is functional diversity. This would allow a single company to design functionally diverse operating systems. Similarly, functional diversity needs to be assured when using different companies for operating systems and hardware. Licensing agreements between companies can destroy assumptions of functional diversity based on different vendors.
However, even operating systems and library functions produced by different companies can have common-mode software errors. For example, in 1990, a mathematician reported on a computer bulletin board that he had found a serious bug in MACSYMA, a widely used program that computes mathematical functions (Sci.math, 1990). This program incorrectly computes the integral from 0 to 1 of the square root of (x + 1/x - 2) to be -(4/3) instead of the correct value of 4/3. Other readers of the bulletin board became curious and tried the same problem on other math packages. The result was that four packages (MACSYMA, Maple, Mathematica, and Reduce) got the same wrong answer while only one (Derive) got the correct answer. These mathematical packages were all developed separately in different programming languages, and even in different countries, and had been widely used for many years and yet contained the same error.
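The correct value can be checked independently. Because x + 1/x - 2 = (1 - x)^2 / x, the integrand equals (1 - x)/sqrt(x) on (0, 1], whose integral is 2 - 2/3 = 4/3. The sketch below confirms this numerically with a midpoint rule after the substitution x = t^2, which removes the singularity at x = 0. It also integrates (x - 1)/sqrt(x), the other sign choice for the square root, to show that this branch yields -4/3. Whether the packages actually erred in this way is an assumption; the source does not say. The function names are hypothetical.

```python
import math


def integrand(x):
    # sqrt(x + 1/x - 2). The radicand equals (1 - x)**2 / x, so it is
    # nonnegative on (0, 1]; max() guards against round-off near x = 1.
    return math.sqrt(max(0.0, x + 1.0 / x - 2.0))


def wrong_branch(x):
    # Negative branch of the square root: (x - 1) / sqrt(x).
    # Assumed here only to illustrate how a sign error could arise.
    return (x - 1.0) / math.sqrt(x)


def integrate_0_1(f, n=2000):
    """Midpoint rule for the integral of f over (0, 1], using the
    substitution x = t*t (dx = 2*t dt) to tame the 1/sqrt(x) behavior."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += f(t * t) * 2.0 * t
    return total * h
```

Calling `integrate_0_1(integrand)` gives a value close to 4/3, and `integrate_0_1(wrong_branch)` gives a value close to -4/3.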
Case 2: Application Software. The effectiveness of design diversity in increasing software reliability rests on the assumption of statistical independence of failures in separately developed software versions (including both application software and specially constructed operating system functions), such as separately developed digital protection systems. This assumption is important in evaluating whether software design diversity satisfies the USNRC requirements for diversity and independent failures.
Several scientific studies have experimentally evaluated the hypothesis that software separately developed to satisfy the same functional requirements will fail in a statistically independent manner (Brilliant et al., 1990; Eckhardt et al., 1991; Knight and Leveson, 1986; Scott et al., 1987). All these studies have rejected the hypothesis with a high confidence level, i.e., concluded that the number of correlated (common-mode) failures that actually occurred for the programs in the various experiments could not have resulted by chance. The implication is that although design diversity might be able to increase reliability, increased reliability cannot be assumed.
In two of the experiments, the programming errors causing correlated (common) failures were examined to better understand the nature of faults that lead to coincident failures and to determine methods of development for multiple software versions that would help avoid such faults. The first experiment (Knight and Leveson, 1986) found that, as anticipated, in some cases the programmers made equivalent logical errors. More surprising, there were cases in which apparently different logical errors yielded correlated failures in completely different algorithms or in different parts of similar algorithms. For example, in order to satisfy the requirements, the programs needed to compute the size of an angle given three points. Most of the programs worked correctly for the normal case. However, eight of the 27 programmers had difficulty in handling the case where three points were collinear, even though the algorithms used and the actual errors made were quite different. Five of the eight mishandled or failed to consider one or both of the possible subcases (i.e., angle equal to zero degrees and angle equal to 180 degrees). One handled all the cases, but used an algorithm that was inaccurate over certain parts of the input space. Another had machine round-off problems. The final programmer had an apparent typo in an array subscript that, seemingly by chance, resulted in an error only when the points were collinear. Knight and Leveson concluded that there are some input cases (i.e., parts of the problem space) that are more difficult to handle than others and are therefore likely to lead to errors, even though the algorithms used and the actual errors made may be very different. 
The second experiment (Scott et al., 1987) examined the errors made in the programs in their experiment and also concluded that dependence was related to a "difficulty factor": If one program gave a wrong answer for a particular input, then it was likely that other programs would also produce an incorrect answer, even though the errors made were different and the programs used different algorithms.
In another experiment, Brunelle and Eckhardt (1985) took a portion of an operating system (SIFT) and ran it in a three-way voting scheme with two other operating systems written for the same computer. The results showed that although no errors were found in the original version, there were instances where the two new versions outvoted the correct original version to produce a wrong answer.
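How two faulty versions can outvote a correct one is easy to see in a toy three-way voter. The sketch below is purely illustrative. The requirement ("trip when the reading is at or above the setpoint"), the setpoint value, and all function names are hypothetical. Versions 2 and 3 use different code and contain different mistakes, yet both give the wrong answer on the same boundary input.

```python
from collections import Counter

SETPOINT = 100  # hypothetical trip setpoint


def trip_v1(reading):
    # Matches the assumed requirement: trip at or above the setpoint.
    return reading >= SETPOINT


def trip_v2(reading):
    # Off-by-one comparison: misses the case reading == SETPOINT.
    return reading > SETPOINT


def trip_v3(reading):
    # Different algorithm (integer division) with a wrong divisor;
    # it also misses the case reading == SETPOINT.
    return reading // (SETPOINT + 1) >= 1


def vote(reading):
    """Majority vote over the three versions (two of three always agree)."""
    ballots = [trip_v1(reading), trip_v2(reading), trip_v3(reading)]
    return Counter(ballots).most_common(1)[0][0]
```

At a reading of exactly 100 the correct version votes to trip, but the other two outvote it, so `vote(100)` returns `False`. For every other reading from 0 to 299 the voted result agrees with the correct version, which is why a failure of this kind can survive extensive testing that happens not to exercise the boundary.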
Following these experiments, Eckhardt and Lee (1985) produced a mathematical model that explains the results. Their model also shows that even small probabilities of correlated failures, i.e., deviation from statistically independent failures, cause a substantial reduction in potential reliability improvement when using diverse software components.
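The effect can be sketched numerically in the spirit of this model. Assume a "difficulty" value for each class of inputs, namely the probability that a randomly chosen version fails on an input of that class. Two separately developed versions then fail together with probability equal to the average of the squared difficulty, which is never smaller than the square of the average difficulty (the figure predicted if failures were independent). The input profile below is invented for illustration, and the function names are hypothetical.

```python
# Each entry: (fraction of inputs in the class,
#              probability that a randomly chosen version fails there).
# Hypothetical profile: 1% of inputs are "hard", the rest are "easy".
PROFILE = [(0.01, 0.1), (0.99, 0.0001)]


def single_version_failure(profile):
    """Probability that one randomly chosen version fails on a random input."""
    return sum(weight * theta for weight, theta in profile)


def independent_prediction(profile):
    """Coincident-failure probability if two versions failed independently."""
    return single_version_failure(profile) ** 2


def coincident_failure(profile):
    """Coincident-failure probability when both versions face the same
    input and fail independently only given that input's difficulty."""
    return sum(weight * theta * theta for weight, theta in profile)


def penalty_factor(profile):
    """How many times more likely a coincident failure is than the
    independence assumption predicts."""
    return coincident_failure(profile) / independent_prediction(profile)
```

With this profile, `penalty_factor(PROFILE)` is well above 1 because the failures concentrate on the hard inputs. When every input is equally difficult, for example `[(1.0, 0.001)]`, the factor is 1 and the independence assumption holds.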
In summary, the experiments conducted on this issue indicate that statistically correlated failures result from the nature of the application, from similarities in the difficulties experienced by individual programmers, and from special cases in the input space. The correlations seem to be related to the fact that the programmers are all working on the same problem and that humans do not make mistakes in a random fashion.
There is no reason to expect that the use of different development tools or methods, or any other simple technique, will significantly reduce the incidence of errors giving rise to correlated failures in multiple-version software components. All evidence points to the fact that independently developed software that uses different programmers, programming languages, and algorithms but computes the same function (satisfies the same functional requirements) cannot be assumed to fail in an independent fashion. Thus the USNRC position that allows "nameplate" diversity or design diversity to be used to assure independence is not supported by the extensive scientific evidence that is available. Other regulatory agencies, such as the FAA and the AECB, do not accept design diversity as evidence of failure independence.
In contrast with design diversity, no assumptions about the independence of the code are made when using functional diversity, only about whether the functional requirements are independent and different. The problem here really reduces to the same problem that is found with functionally diverse analog components, and no new procedures are necessary except to determine whether any new failure modes have been introduced that might violate the system-level independence assumptions. Thus, the current USNRC position on functional diversity is consistent with the scientific evidence.
Can the independence of multiple versions of software be evaluated? That is, if statistical independence cannot simply be assumed for independently developed software, can software diversity be evaluated or assessed in some way?
Procedures have been developed for evaluating the potential for common-mode failure of analog hardware components. In addition, the number of states and the continuity of behavior over the total state space for analog components allow either exhaustive testing or much more confidence in the testing. In contrast, only a small fraction of the state space for digital systems can usually be tested and the lack of continuity in behavior does not allow any assumptions about the behavior of the software for any inputs or input sequences not specifically tested.
Verifying diversity between two algorithms is impossible in general. Equivalence between two algorithms (and thus also lack of equivalence) has been proven to be mathematically undecidable. But even if diversity cannot be assessed formally, perhaps it can be evaluated informally. The problem reduces to determining what is meant by design diversity between two computer programs. Syntactic diversity (differences in the syntax or lexical structure of the programs) is not the relevant issue: Two programs can be syntactically different (look very different) and yet compute identical mathematical functions.
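A small example makes the point about syntactic diversity concrete. The two functions below (hypothetical names, written only for illustration) share no loop structure, no intermediate variables, and no algorithm, yet they compute the same mathematical function for every nonnegative integer.

```python
def sum_to_n_loop(n):
    # Iterative version: add 1, 2, ..., n one term at a time.
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_to_n_formula(n):
    # Closed-form version: n * (n + 1) / 2, computed in exact integers.
    return n * (n + 1) // 2
```

Comparing the two over a finite range of inputs, as a test might, shows agreement on those inputs only. It is not a proof of equivalence, which is consistent with the undecidability result noted above.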
Even if one could verify diversity between two algorithms, that would not be adequate, because different algorithms may compute the same functions and therefore behave identically. Basically, what is sought is two programs that compute the same function except where they are incorrect (i.e., where they differ from the requirements). Evaluating for independence of failure behavior would require proving that the two programs were different only in their failure behavior (or that they were not identical in their total function computed). To accomplish this would require the same logical power as that required to identify design errors (at which point they would just be removed). Thus, if it were possible to verify effective design diversity, diversity would not be needed. In summary, there is no way to verify or evaluate the diversity of two software versions or to determine whether they will fail independently.
As discussed earlier, the USNRC currently is funding a research project at the National Institute of Standards and Technology to build a tool called Unravel for program slicing. A stated goal for this tool is to assist USNRC auditors in evaluating functional diversity in nuclear power plant safety system software. The developers of the tool say that it can be used to "identify code that is executed in more than one computation and [that] thus could lead to a malfunction of more than one logical software component."
In general, evaluating functional diversity is not possible by simply identifying the code related to a particular computation, as done by program slicing. The probability that separately developed programs will contain the same code is extremely small. If there is any attempt to make the software diverse, then the programs will almost certainly use different variables, data structures, and algorithms. In addition, the experiments described above found that programs failed in a statistically dependent manner even when they used completely different algorithms and had unrelated programming errors. The only relationship needed between software errors to cause statistically dependent failures is that the errors occur on the execution paths for each program that will be followed by the same input data. The errors can appear anywhere on those paths, and the computations and errors on the paths may be different.
The second proposed use of program slicing is for thread audits. However, a technique like program slicing that works backward from a particular statement to find any statements that might affect it seems to have much less relevance for thread audits than a tool that will identify paths through the code starting from particular inputs. Other techniques, such as symbolic execution, are more precise (provide more information to the analyst) and are probably less costly. Slicing can work backward from an output to identify statements affecting the output and thus all paths to that output, but cannot distinguish feasible from infeasible paths and identifies all such paths, not just those related to a particular input. The analyst must then by hand determine which paths relate to the thread being investigated and determine whether the path is feasible (a difficult task). Symbolic execution, on the other hand, can start from specific inputs and identify feasible paths through the code, evaluating the particular predicates that must be true for the path to be taken. Another potentially useful technique related somewhat to symbolic evaluation, called Software Deviation Analysis (Reese, 1996), also does a forward analysis from inputs to determine the effect on outputs, but starts with likely or possible deviations in the inputs from their expected values and determines whether hazardous outputs can be generated.
Alternatives to Diversity for Software
In addition to the two main diversity issues discussed above (Issue 1 and Issue 2), one final question is whether redundancy and diversity are the most effective way to increase reliability for digital systems or whether there are more effective alternatives. Potential alternatives include mathematical verification techniques, self-checking software, and safety analysis and design techniques.
While mathematical verification of software is potentially effective in finding programming errors, these techniques are difficult to use and have only been applied to very small programs by mathematically sophisticated users. Writing the required formal specifications and doing the proofs have not been shown to be less error prone in practice than using less formal techniques. In fact, little or no comparative evaluation with the alternatives has been done. Despite these caveats, the committee notes that mathematical verification has been used by Ontario Hydro on their Darlington and Pickering plant protection system software. The committee understands that the Canadian experience shows that mathematical verification can be very costly but is far more cost effective if it is built into the development process from the beginning rather than being imposed at the end.
Digital systems have the capability to provide self-checking to detect digital hardware failures and some software errors during execution. This has proven effective for random hardware failures but not for software design errors. Built-in tests for some programming errors, such as attempting to divide by zero, are easily implemented and effective. However, checking for more subtle errors is more difficult and may, in itself, add so much complexity that it leads to errors. For example, a licensee event report about a problem at the Turkey Point plant in Florida in 1994 described a software error that could result in a real emergency signal being ignored if it is received 15 seconds or more after the start of particular test scenarios (see discussion in Chapter 4). An experiment by Leveson et al. (1990) in writing self-checks for software found that very few of the known errors in the code were found by the self-checks. Even more discouraging, the self-checks themselves introduced more errors than they found.
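The simple kind of built-in test mentioned above can be sketched in a few lines. The example is hypothetical: the function `average_reading`, its parameters, and the exception type are invented for illustration and do not describe any plant software.

```python
class SelfCheckError(Exception):
    """Raised when a built-in run-time check fails."""


def average_reading(total, sample_count):
    # Built-in test: refuse to divide by zero rather than let the
    # arithmetic fault propagate.
    if sample_count == 0:
        raise SelfCheckError("sample_count is zero; refusing to divide")
    return total / sample_count
```

A check of this kind needs no knowledge of what the correct answer should be. Checks for subtler errors do, and that added logic is where the extra complexity, and the opportunity for new errors, comes from.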
Safety analysis and design techniques (see Leveson, 1995) extend standard system safety techniques to software. Software-related hazards are identified and then eliminated or controlled. In this approach, not all potential errors are targeted but simply those that could lead to hazards or accidents. As such, this approach is potentially less costly than a full formal verification. A type of safety verification procedure (called software fault tree analysis) was used (in addition to formal verification) during the licensing of the Darlington shutdown system (Bowman et al., 1991). The information provided during the analysis was used to change the code to be more fault-tolerant and to design 40 self-checks that were added to the software.
Although many in the software engineering community believe that there are more cost-effective techniques (including both those described here and others) for achieving high software reliability than redundancy and diversity, there is no agreement among them about what these alternatives are.
CONCLUSIONS AND RECOMMENDATIONS
Conclusion 1. The USNRC position of assuming that common-mode software failure could occur is credible, conforms to engineering practice, and should be retained.
Conclusion 2. The USNRC position with respect to diversity, as stated in the draft branch technical position, Digital Instrumentation and Control Systems in Advanced Plants, and its counterpart for existing plants, is appropriate.
Conclusion 3. The USNRC guidelines on assessing whether adequate diversity exists need to be reconsidered. With regard to these guidelines: (a) The committee agrees that providing digital systems (components) that perform different functions is a potentially effective means of achieving diversity. Analysis of software functional diversity showing that independence is maintained at the system level and no new failure modes have been introduced by the use of digital technology is no different from that for upgrades or designs that include analog instrumentation. (b) The committee considers that the use of different hardware or real-time operating systems is potentially effective in achieving diversity provided functional diversity has been demonstrated. With regard to real-time operating systems, this applies only to operating systems developed by different companies or shown to be functionally diverse. (c) The committee does not agree that use of different programming languages, different design approaches meeting the same functional requirements, different design teams, or different vendors' equipment used to perform the same function is likely to be effective in achieving diversity. That is, none of these methods is a proof of independence of failures. Conversely, neither is the presence of these proof of dependence of failures.
Conclusion 4. There appears to be no generally applicable, effective way to evaluate diversity between two pieces of software performing the same function. Superficial or surface (syntactic) differences do not imply failure independence, nor does the use of different algorithms to achieve the same functions. Therefore, funding research to try to evaluate