in helping arrive at a satisfactory resolution of these issues would be very useful.
In Phase 2 of the study, the committee was charged to identify criteria for review and acceptance of digital I&C technology in both retrofitted reactors and new reactors of advanced design; characterize and evaluate alternative approaches to the certification or licensing of this technology; and, if sufficient scientific basis existed, recommend guidelines on the basis of which the USNRC can regulate and certify (or license) digital I&C technology, including means for identifying and addressing new issues that may result from future development of this technology. In areas where insufficient scientific basis exists to make such recommendations, the committee was to suggest ways in which the USNRC could acquire the required information.
In carrying out its Phase 2 charge, the committee limited its work to the issues identified in Phase 1. These issues were chosen because they are difficult and controversial. Further, the committee recognized that, by law, the responsibility for setting licensing criteria and guidelines for digital I&C applications in nuclear plants rests with the USNRC. Thus, the reader should not expect that the committee has provided a complete set of principles, design guidelines, and specific requirements ready for use by the USNRC to assess, test, license, and/or certify proposed systems or upgrades. Rather, the results of the study are presented not as simple generic criteria statements (i.e., at a high level of abstraction) but as conclusions and recommendations related to each issue, addressed primarily to the USNRC for its consideration and use. In the committee's view, substantial further work remains to be accomplished. The committee expects the USNRC and the nuclear industry to extend the work of criteria development beyond where this Phase 2 report leaves it. To guide further work on the eight key issues studied, the committee's report offers findings and recommendations in four broad categories: (a) current practice (of the USNRC and the U.S. commercial nuclear industry) that is essentially satisfactory or requires only fine tuning, (b) points of weakness in the USNRC's approach, (c) issues that merit further inquiry and research before satisfactory regulatory criteria can be developed, and (d) criteria and guidelines that are unreasonable to expect in the near future.
Conduct of the Study
In conducting its study, the committee reviewed a large number of documents made available by the USNRC and a variety of other sources. It interviewed selected personnel from the USNRC, from the two advisory committees discussed above (ACRS, NSRRC), from the nuclear industry, and from other industries using digital systems in safety-critical applications, and it sought the views of individuals from academia and research organizations. In addition, the committee visited control room simulators, a nuclear plant, and a fossil-fueled power plant with extensive digital I&C systems (see Appendix B). The committee had frequent and detailed internal discussions, both face-to-face and via paper and electronic communications, and it brought to bear a wide range of experience in and knowledge of the field (see Appendix A).
Carrying Out the Charge
The committee took seriously the charge that it identify criteria for review and acceptance of digital I&C technology and that it recommend guidelines for regulation and certification. In carrying out its charge, the committee recognized that:
In order to develop useful guidance, the committee could address only a limited number of issues in the relatively brief duration of the study.
General, high-level criteria would not be particularly useful.
The final criteria are legally the USNRC's responsibility. Further, since the nuclear power industry is heavily regulated in the public interest, the licensing criteria should be forged in a detailed interaction among the regulators, the industry, and the public.
The committee has a wide range of expertise and experience in digital systems and nuclear power plants, but it is not a surrogate for this interaction among the stakeholders. Hence, the committee could best serve by clearly delineating and defining the issues and providing guidance for resolving them, rather than by developing specific licensing criteria.
Accordingly, the committee selected eight issues for study. These eight issues address the two major intertwined themes associated with the use of digital instrumentation and control in nuclear power plants. These are:
Dealing with the specific characteristics of digital I&C technology as applied to nuclear power plants.
Dealing with a technology that is more advanced than the one widely in use in existing nuclear power plants. This technology is advancing at a rate and in directions largely uncontrolled by the nuclear industry, yet it is likely to have a significant impact on the operation and regulation of that industry.
The technical issues of this report are primarily related to digital technology itself (Theme 1), while the strategic issues are primarily related to the process of adopting advanced technology (Theme 2). The committee concentrated on reviewing the current approaches being taken by the nuclear industry and its regulators toward the selected key issues. The committee also tried to learn from the experience of the international nuclear industry, as well as to gather and evaluate information about how other safety-critical industries and their regulators have dealt with these issues. Also, through the technical expertise and knowledge of its various members, the committee explored work done by the digital systems community at large, including both research activities and academic work.
As the committee worked through the issues, it discovered a major impediment to progress: the communication barriers that exist among the key technical communities and individuals involved. The basic reason for the difficulty is apparent. Work is going on simultaneously in many areas, each with its own technology, research focus, and agenda. Unfortunately, although many of these areas use common terms, the terms often have different meanings to different groups, making communication difficult or preventing it altogether. This is particularly troublesome for the nuclear power industry and its regulators, who are not dominant in this technology and must synthesize information and experience from a variety of sources and apply it in power plants where safety hazards must be dealt with rigorously, under public scrutiny. In Chapter 11 the committee discusses this communication problem in more detail and provides suggestions for a way forward. Substantial progress in this area should have a multiplicative effect, easing the resolution of many specific technical and strategic issues.
Overall, while important steps remain to be taken by the USNRC and industry, as addressed in this report, the committee found no insurmountable barriers to the application of digital instrumentation and control technology in nuclear power plants. The committee also believes that a forward-looking regulatory process, with good and continuing communication and interaction between regulators and industry, will help. All participants must recognize that crisp, hard-edged criteria are particularly difficult to come by in this rapidly moving area and that good practices and engineering judgment will continue to be needed and relied upon.
For the key technical issues (systems aspects of digital I&C technology; software quality assurance; common-mode software failure potential; safety and reliability assessment methods; human factors and human-machine interfaces; and dedication of commercial off-the-shelf hardware and software) the committee provides specific conclusions and recommendations, which include a number of specific criteria. These are listed in each chapter (see Chapters 3 through 8). But recognizing the difficulty of defining specific criteria, and the need for the nuclear technology stakeholders, particularly the USNRC, to make the final decisions, the committee focused on (a) providing process guidance, both for developing guidelines and for the short-term acceptance of the new technology; (b) identifying promising approaches to developing criteria and suggestions for avoiding dead ends; and (c) identifying mechanisms for improving communication and strengthening the technical infrastructure.
For the key strategic issues (the case-by-case licensing procedure and adequacy of the technical infrastructure) the committee:
Emphasizes guidance to implement a generically applicable framework for regulation that follows current USNRC practice and which in particular draws a distinction between major and minor safety modifications. The committee also provides guidance for the evaluation and updating of this regulatory framework (see Chapter 9).
Identifies a need to upgrade the current USNRC technical infrastructure and suggests specific research activities that will support the needed regulatory program. The committee also suggests several improvements to the technical infrastructure to build and maintain technical capabilities in this rapidly moving, technically challenging area.
The specific recommendations made by the committee thus offer guidance toward implementing and maintaining the currency of a generically applicable framework for regulation that follows current USNRC practice and draws a distinction between major and minor safety modifications. The committee suggests specific research activities that will support this program and makes a number of suggestions for improving USNRC capabilities for addressing these issues.
Contents of This Report
This report contains 11 chapters and six short appendices. Chapter 1 (this chapter) briefly discusses the scope, basis, and context for the study. Chapter 1 also discusses use of digital I&C systems in nuclear plants in some detail so the reader has the necessary background to follow the more detailed discussions and evaluations in the remainder of the report. Chapter 2 briefly describes how the original issues were derived and places the specific issues in overall context, explaining their interrelationships and the relative priorities assigned to them by the committee. Chapters 3 through 10 discuss each of the individual issues in turn. The detailed discussions in these chapters include the committee's conclusions and recommendations regarding each issue. Chapter 11 presents an overview and summary of the committee's findings. Appendices A through F provide useful information too detailed to include in the body of the text.
ACRS (Advisory Committee on Reactor Safeguards to the U.S. Nuclear Regulatory Commission). 1991. Minutes of ACRS Subcommittee Meeting on Computers in Nuclear Power Plant Operations, February 6, 1991. Washington, D.C.
ACRS. 1992a. Digital Instrumentation and Control System Reliability. Letter to I. Selin, Chairman, USNRC, September 16, 1992. Washington, D.C.
ACRS. 1992b. Minutes of ACRS Subcommittee Meeting on Computers in Nuclear Power Plant Operations: Special International Meeting, September 22, 1992. Washington, D.C.
ACRS. 1993a. Computers in Nuclear Power Plant Operations. Letter to I. Selin, Chairman, USNRC, March 18, 1993. Washington, D.C.
ACRS. 1993b. Minutes of ACRS Subcommittee Meeting on Computers in Nuclear Power Plant Operations: Quantitative Software Assessment and Analog-to-Digital Industry Experience, February 9, 1993. Washington, D.C.
ACRS. 1994. Proposed National Academy of Sciences/National Research Council Study and Workshop on Digital Instrumentation and Control Systems. Letter to I. Selin, Chairman, USNRC, July 14, 1994. Washington, D.C.
EPRI (Electric Power Research Institute). 1992a. Advanced Light Water Reactor Utility Requirements Document. EPRI NP-6780-L. Palo Alto, Calif.: EPRI.
EPRI. 1992b. Advanced Light Water Reactor Utility Requirements Document. EPRI NP-6780-L, Vol. 2 (ALWR Evolutionary Plant) and Vol. 3 (ALWR Passive Plant), Ch. 10: Man-Machine Interface Systems. Palo Alto, Calif.: EPRI.
EPRI. 1993. Guideline on Licensing Digital Upgrades. EPRI TR-102348. Palo Alto, Calif.: EPRI.
Gill, W., D. Harmon, T. Rozek, and S. Wilkosz. 1994. Nuplex 80+ Advanced Control Complex: Enhanced Safety Through Digital Instrumentation and Control. 9th Annual Korean Atomic Industrial Forum and Korean Nuclear Society (KAIF/KNS) Conference, April 6–8, 1994.
Kletz, T. 1995. Computer Control and Human Error. Houston: Gulf Publishing.
Mauck, J. 1995. Regulating Digital Upgrades. Presentation to the Committee on Applications of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C., January 31.
NRC (National Research Council). 1995. Digital Instrumentation and Control Systems in Nuclear Power Plants: Safety and Reliability Issues, Phase 1. Board on Energy and Environmental Systems, National Research Council. Washington, D.C.: National Academy Press.
NSRRC (Nuclear Safety Research Review Committee to the U.S. Nuclear Regulatory Commission). 1992. Summary of April 29, 1992, Meeting. Letter to E. Beckjord, USNRC, November 16, 1992. Washington, D.C.
Nucleonics Week. 1995. Outlook in I&C: Special Report to the Readers of Nucleonics Week, Inside the N.R.C. and Nuclear Fuel. September and October.
Palo Verde Nuclear Generating Station. 1993. NRC Inspection Report 50-528, 50-529, and 50-530/93-07 Related to Amendment to Operating Licenses No. NPF-41, NPF-51, and NPF-74, Implementation Inspection for Anticipated Transients Without Scram (ATWS) Systems: Palo Verde Nuclear Generating Station Units 1, 2, and 3. Dockets Nos. 50-528, 50-529, and 50-530, April 9, 1993. Washington, D.C.
Prairie Island Nuclear Generating Plant. 1993. Supplemental Safety Evaluation by the Office of Nuclear Reactor Regulation: Revision 1 of Design Report for Station Blackout/Electrical Safeguards Upgrade Project, Amendment to Facility Operating License No. DPR-42 and DPR-60: Prairie Island Nuclear Generating Plant, Units 1 and 2. Dockets Nos. 50-282 and 50-306, January 4, 1993. Washington, D.C.
Title 10 CFR (Code of Federal Regulations) Part 50, Appendix A. 1995. General Design Criteria for Nuclear Power Plants.
Title 10 CFR Part 50, Appendix B. 1995. Quality Assurance Criteria for Nuclear Power Plants and Fuel Reprocessing Plants.
Turkey Point Plant. 1990. Safety Evaluation Report by the Office of Nuclear Reactor Regulation of the Load Sequencers in the Enhanced Power System at Turkey Point Plant, Units 3 and 4, Amendment to Operating Licenses DPR-31 and DPR-41, Dockets Nos. 50-250 and 50-251, November 5, 1990. Washington, D.C.
USNRC (U.S. Nuclear Regulatory Commission). 1981. USNRC Standard Review Plan (SRP), NUREG-0800, section 7.1, Instrumentation and Controls. Other sections applicable to instrumentation and control technology include: 3.10, Seismic and Dynamic Qualification of Mechanical and Electrical Equipment; 3.11, Environmental Qualification of Mechanical and Electrical Equipment; 4.4, Thermal and Hydraulic Design; 7.2, Reactor Trip System; 7.3, Engineered Safety Features Systems; 7.4, Safe Shutdown Systems; 7.5, Information Systems Important to Safety; 7.6, Interlock Systems Important to Safety; 7.7, Control Systems; 8.1, Electric Power; 8.2, Offsite Power System; 8.3.1, A-C Power Systems (Onsite); 8.3.2, D-C Power Systems (Onsite); 15.0, Review of Anticipated Operational Occurrences and Postulated Accidents; 15.1.5, Steam System Piping Failures Inside and Outside of Containment. Washington, D.C.: USNRC.
USNRC. 1991. Digital Computer Systems for Advanced Light Water Reactors. USNRC SECY-91-292. Washington, D.C.: USNRC.
USNRC. 1992. Safety Evaluation Report Related to Amendment No. 127 to Facility Operating License No. DPR-48: Zion Nuclear Power Station, Unit 2. Docket No. 50-304, June 9, 1992. Washington, D.C.: USNRC.
USNRC. 1993a. Proceedings of the Digital Systems Reliability and Nuclear Safety Workshop, U.S. Nuclear Regulatory Commission, September 13–14, 1993, Gaithersburg, Md. NUREG/CP-0136. Washington, D.C.: U.S. Government Printing Office.
USNRC. 1993b. Safety Evaluation Report by the Office of Nuclear Reactor Regulation Related to Amendment No. 84 to Facility Operating License No. DPR-80 and Amendment No. 83 to Facility Operating License No. DPR-82: Eagle 21 Reactor Protection System Modification with Bypass Manifold Elimination: Diablo Canyon Power Plant. Dockets Nos. 50-275 and 50-323, October 7, 1993. Washington, D.C.: USNRC.
USNRC. 1994a. Final Safety Evaluation Report: Related to the Certification of the System 80+ Design. NUREG-1462, Vols. 1–2. Washington, D.C.: USNRC.
USNRC. 1994b. NRC Review of Electric Power Research Institute Advanced Light Water Reactor Utility Requirements Document. NUREG-1242, Vol. 3, Parts 1–2. Washington, D.C.: USNRC.
USNRC. 1995. Use of NUMARC/EPRI Report TR-102348, "Guideline on Licensing Digital Upgrades," in Determining the Acceptability of Performing Analog-to-Digital Replacements Under 10 CFR 50.59. NRC Generic Letter 95-02. Washington, D.C.: USNRC.
Wermiel, J. 1995. Update of Instrumentation and Control Systems Section of the Standard Review Plan, NUREG-0800. Presentation to the Advisory Committee on Reactor Safeguards to the U.S. Nuclear Regulatory Commission, Rockville, Md., April 7.
White, J. 1994. Comparative Assessments of Nuclear Instrumentation and Controls in the United States, Canada, Japan, Western Europe, and the Former Soviet Union. JTEC/WTEC Annual Report and Program Summary 1993/94. Baltimore, Md.: World Technology Evaluation Center, Loyola College.
Digital instrumentation and control systems for nuclear power plants have technological characteristics (equipment, response time, input and output range, and accuracy) very similar to those of digital instrumentation and control systems for other safety-critical applications, such as chemical plants and aircraft. What distinguishes digital I&C (instrumentation and control) applications in nuclear power plants from other digital I&C applications is the need to establish very high levels of reliability under a wide range of conditions. Because of the potentially far greater consequences of accidents in nuclear power plants, the I&C systems must be relied upon to reduce the likelihood of even low-probability events. The U.S. Nuclear Regulatory Commission (USNRC) has developed a regulatory process with the goal of achieving these high levels of reliability and thus assuring public safety. This process is subject to public scrutiny.
DEVELOPING THE KEY ISSUES (PHASE 1)
In Phase 1 of the study, the committee identified eight key issues associated with the use of digital I&C systems in existing and advanced nuclear power plants. In the committee's view, these issues need to be addressed and a working consensus needs to be established regarding these issues among designers, operators and maintainers, and regulators in the nuclear industry. The process the committee followed to identify these issues in Phase 1 is discussed in the Phase 1 report (NRC, 1995) and is only briefly summarized here.
In essence, the committee considered the impact of digital I&C systems against a set of standard regulatory approaches to assessing and ensuring safety (defense-in-depth, safety margins, environmental qualification, requisite quality assurance, and failure invulnerability). From this analysis, the committee identified a number of questions, issues, and facets of issues (see Appendix D). After a number of deliberations, the committee winnowed the list down to eight key issues.
The eight issues separate into six technical issues and two strategic issues. The six technical issues are systems aspects of digital I&C technology, software quality assurance, common-mode software failure potential, safety and reliability assessment methods, human factors and human-machine interfaces, and dedication of commercial off-the-shelf hardware and software. The two strategic issues are the case-by-case licensing process and the adequacy of technical infrastructure (i.e., training, staffing, research plan). The committee recognizes that these are not the only issues and topics of concern and debate in this area (see Appendix D). Nevertheless, the committee reaffirms its judgment, initially formed during Phase 1, that developing a consensus on these eight issues will be a major step forward and accelerate the appropriate use and licensing of digital I&C systems in nuclear power plants.
At the end of Phase 1, it became clear to the committee that the software-related issues and the regulating process would be particularly challenging aspects of the study. Accordingly, the committee strengthened its capability by adding to its numbers two experts in these areas (see Appendix A).
ADDRESSING THE KEY ISSUES (PHASE 2)
In Phase 1, the committee largely operated as a single group. In approaching Phase 2, the committee recognized that deeper study of each issue would be needed to provide a firm foundation for developing specific conclusions and recommendations. The committee accordingly formed working subgroups associated with each area. These subgroups, each led by a member of the committee particularly knowledgeable in that area, were charged with studying the issues in detail, developing topic papers, identifying and reviewing key reference documents, and arranging for presentations by those active in the field to the full committee. However, the committee recognized that several issues had close interrelations, requiring that the committee also work as an integrated body to achieve a balanced perspective and forge a committee consensus. Thus, each issue received significant attention by the entire committee.
PRESENTING THE KEY ISSUES
The issues are discussed individually in Chapters 3 through 10 of this report. The committee has maintained the separation between technical issues and strategic issues in the Phase 2 report, even though as work proceeded in Phase 2 it became increasingly apparent that the technical issues and the strategic issues are tightly interwoven. The technical issue discussions (Chapters 3 through 8) generally focus on the technical basis of the issue and how pertinent technical knowledge (or the lack thereof) affects how the issue is addressed in U.S. nuclear plants, foreign plants, and other industries and their regulators. For each issue, the committee draws conclusions and provides recommendations.
Discussion of the two strategic issues (Chapters 9 and 10) focuses on the licensing process and a key underlying area, the way in which the USNRC has developed and continues to develop its technical infrastructure (staffing, training, and research plans) in the digital I&C area. In Phase 1, the committee became convinced that even if the six technical issues were resolved and no controversy or lack of consensus existed, these strategic issues would still need to be carefully considered and addressed. Concern with these two strategic issues reflects the recognition that rapidly moving and evolving technologies present a special difficulty for an industry and its regulators where licensing and certification processes generally move more slowly than the technology they are intended to regulate.
Because the issues are highly interrelated and are relatively general, the committee debated their relative importance and their order of presentation, which warrants the following brief discussion of their arrangement in this report.
The committee chose to present the technical issues first to provide a basis and context for the strategic issues presented later. Of all the technical issues, systems aspects of digital I&C technology is addressed first (in Chapter 3) because it is a broad issue that encompasses many others. Next (in Chapters 4, 5, and 6), the committee has chosen to present the three issues primarily related to software.1 Software constitutes a major difference between analog and digital I&C applications, and its use raises some concerns. Software is a design artifact, and as such it is difficult to show definitively that it contains no critical errors. Software is also more amenable to the addition of features and enhancements (so-called "creeping complexity") not needed for its basic function, making the system more difficult to understand. As the most general of the three software issues, software quality assurance is discussed first (Chapter 4). The issue of software common-mode failures is discussed next (Chapter 5). Common-mode failure in software is closely related to software quality assurance but warrants discussion as a separate topic because of its significance to safety-critical digital applications, with their emphasis on independence, redundancy, and diversity. The final issue discussed in the primarily software-related group is quantitative safety and reliability assessment methods (Chapter 6).
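To make the diversity argument concrete, the following sketch compares two independently written implementations of a single trip condition. The threshold value, function names, and the 2-of-2 vote are hypothetical, chosen purely for illustration and not drawn from any actual protection system; the point is that a common-mode software fault would have to defeat both diverse versions at once.

```python
# Illustrative sketch: two diversely implemented checks of one trip condition.
# The threshold and all names here are hypothetical. Real protection systems
# apply this idea across redundant channels, typically with 2-out-of-3 or
# 2-out-of-4 voting rather than the simple 2-of-2 agreement shown here.

TRIP_PRESSURE = 15.5  # hypothetical trip setpoint (MPa)

def trip_version_a(pressure_mpa):
    """First implementation: direct threshold comparison in MPa."""
    return pressure_mpa > TRIP_PRESSURE

def trip_version_b(pressure_kpa):
    """Diverse implementation: different units and comparison form.

    Written, in this illustration, by a separate team against the same
    requirement, so a single coding error is unlikely to appear in both.
    """
    return not (pressure_kpa <= TRIP_PRESSURE * 1000.0)

def voted_trip(pressure_mpa):
    """Trip only when both diverse versions agree (illustrative 2-of-2 vote).

    Disagreement between diverse implementations signals a possible fault
    in one channel and is flagged for operator review.
    """
    a = trip_version_a(pressure_mpa)
    b = trip_version_b(pressure_mpa * 1000.0)
    if a != b:
        raise RuntimeError("diverse channels disagree; demand operator review")
    return a

print(voted_trip(16.0))  # both diverse versions demand a trip
print(voted_trip(15.0))  # both versions agree that no trip is needed
```

In practice, protection systems vote across redundant channels rather than requiring unanimity, trading some fault coverage for availability; the sketch shows only how diverse implementations can cross-check one another.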
The committee then turns to the issue of human factors and the human-machine interface (Chapter 7), an issue important in both analog and digital systems. Digital I&C technology has the potential to greatly improve human factors and human-machine interfaces, so that the combination of the human operator and the computer could provide greatly improved process control and enhanced safety. There are, however, unique design challenges that digital I&C technology presents.
The last technical issue discussed is dedication and use of commercial off-the-shelf (COTS) digital I&C systems and equipment in nuclear power plants (Chapter 8). This topic is important because much of the existing I&C equipment in nuclear power plants is becoming obsolete and vendor support is waning. The nuclear plant market is relatively small and COTS offers a potentially cost-effective way to address this problem. Other industries have reached the same conclusion and are reportedly finding some success (Loral, 1996). This is a relatively new area for nuclear plants, particularly in safety system applications, but there is considerable industry activity and regulatory involvement.
Finally, the committee turns to the two strategic issues, case-by-case licensing and adequacy of the technical infrastructure (discussed in Chapters 9 and 10). Both the Advisory Committee on Reactor Safeguards and the Nuclear Safety Research Review Committee share the committee's view that successful resolution of these issues is a necessary prerequisite to successfully applying digital I&C systems in nuclear power plants.
Loral (Loral Space Information Systems). 1996. Mission Control Center Upgrade at NASA Johnson Space Center. Houston, Texas: Loral Corporation press release.
NRC (National Research Council). 1995. Digital Instrumentation and Control Systems in Nuclear Power Plants: Safety and Reliability Issues, Phase 1. Board on Energy and Environmental Systems, National Research Council. Washington, D.C.: National Academy Press.
Systems Aspects of Digital Instrumentation and Control Technology
Digital instrumentation and control (I&C) systems have proven to be useful and beneficial in a wide range of applications including fossil-fueled power generation, electric power distribution, petroleum refining, petrochemical production, aerospace, and some nuclear power plant applications (e.g., core protection calculators, diesel generator load sequencers, a few digital reactor protection systems, and plant radiation monitoring systems). This usefulness is evidenced by the trend over the last 20 years toward investment in digital I&C applications in the process industries.
However, digital I&C systems were not an instant success; early on it became clear that careful attention to systems aspects1 would be necessary to avoid unanticipated failure modes. In the late 1960s, there was mixed success using central computers in the so-called "direct digital control" architecture (commonly referred to as DDC) for process control. A transition was soon made to the so-called "supervisory control" architecture, in which minicomputers were used to transmit "supervisory" commands to analog controllers that performed continuous process regulation.
Eventually, this transition led to today's modern multilayered architectures in which (a) local controllers perform component control functions, (b) higher- (system-) level control stations coordinate in a supervisory mode the operations of multiple components in a system or multiple systems in a unit, and (c) higher-level stations perform plant-level supervisory functions and data analyses.
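As a purely illustrative sketch of the layered division of function just described, the three layers might be modeled as follows. All class and signal names here are hypothetical, and real plant architectures add redundancy, diverse communication paths, and independent protection channels.

```python
# Illustrative sketch of a three-layer supervisory control hierarchy.
# All names are hypothetical and the control law is deliberately trivial.

class LocalController:
    """Layer (a): regulates a single component toward its setpoint."""
    def __init__(self, name, setpoint=0.0):
        self.name = name
        self.setpoint = setpoint
        self.measurement = 0.0

    def step(self):
        # Simple proportional action toward the setpoint.
        error = self.setpoint - self.measurement
        self.measurement += 0.5 * error
        return self.measurement

class SystemStation:
    """Layer (b): coordinates several local controllers in supervisory mode."""
    def __init__(self, controllers):
        self.controllers = controllers

    def command(self, setpoints):
        for ctrl, sp in zip(self.controllers, setpoints):
            ctrl.setpoint = sp  # supervisory command, not direct actuation

    def step(self):
        return [c.step() for c in self.controllers]

class PlantSupervisor:
    """Layer (c): plant-level supervision and data analysis."""
    def __init__(self, stations):
        self.stations = stations

    def run(self, demands, cycles=10):
        # Issue plant-level demands downward, then let the lower layers
        # regulate; return the last cycle's readings for analysis.
        for station, setpoints in zip(self.stations, demands):
            station.command(setpoints)
        for _ in range(cycles):
            readings = [s.step() for s in self.stations]
        return readings

# Example: one system of two components driven toward setpoints 10 and 20.
plant = PlantSupervisor([SystemStation([LocalController("pump"),
                                        LocalController("valve")])])
print(plant.run(demands=[[10.0, 20.0]]))
```

The sketch shows only the direction of command flow (supervisory setpoints downward, measurements upward); the design questions listed below, such as function allocation and communication schemes, concern how such a hierarchy is realized in practice.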
There are many options by which to implement these architectures. Selecting among these options involves addressing considerations such as (a) allocations of functions to different layers of the system, and to hardware and to software; (b) communications schemes within and between layers; (c) methods for achieving timely execution of data acquisition, analyses, and control functions; and (d) provisions for redundancy and diversity. One possible application of such a multilayered architecture to a nuclear generating station is described in Chapter 1 of this report (see Figure 1-1). Notice in Figure 1-1 the multiple horizontal layers of functionality that are typical in today's digital I&C systems, along with the traditional nuclear plant features of vertical independence between protection and control and the use of independent manual backup trips. Figure 1-1 also illustrates the use of redundancy in sensing and communication lines and the extensive use of data buses in the control system compared to the extensive use of deterministic point-to-point communications in the protection system.
Recent experience with large-scale, fully integrated digital I&C systems at nuclear power plants has also had its difficulties. There have been problems, apparently related to systems aspects, that have caused substantial delays and increased costs. In addition, there is increasing use of open systems, in which multiple vendors provide components that must successfully interact. Open systems are used because they foster competition and standardization and avoid dependence on single suppliers. However, the presence of multiple vendors may make successfully dealing with systems aspects more difficult because of the increased number of interfaces.
Statement of the Issue
Along with important benefits, digital I&C systems introduce potential new failure modes that can affect operations and margins of safety. Therefore, digital I&C systems require rigorous treatment of the systems aspects of their design and implementation.2 What methods are needed to address this concern? How can the experience and best practices of the various technical communities involved in applying digital I&C technologies best be integrated and applied to nuclear power plants? What procedures can be put in place to update the methods and the experience base as new digital I&C technologies and equipment are introduced in the future?

1 "Systems aspects" refers to those issues that transcend the particular component(s) that comprise the system and possibly even the function that the system performs. Such issues include architecture, communications, allocation of functions, real-time processing, and distributed computing.

2 Licensing aspects are discussed in Chapter 9.
New Plants and Retrofits
Successfully dealing with the systems aspects of digital I&C applications is critically important to both new plant applications and retrofits. However, there are substantial differences between the two applications. For new plants, a large system is conceived and designed as such. The designers have relative freedom in configuring the system architecture and creating the various subsystems, which can be implemented on a plant-wide, fully integrated basis (see for example Figure 1-1 and companion description in Chapter 1). The size of the design task is usually matched by a large pool of available resources and the presence of a dedicated design team. Extensive testing of the subsystems and of the integrated system is also likely to be possible.
For retrofits or modifications, typically there will be a narrower focus and fewer resources available. The systems aspects of the particular application are likely to be relatively limited in scope, and in any case the designer is limited by the requirement of integrating the retrofit subsystem into an existing plant. For example, the designer will likely make more use of one-for-one digital-for-analog replacements. The customized nature of retrofits or modifications can make it difficult to carry out a series of changes in a consistent manner, unless there is an integrated, plant-wide plan.
The systems aspects of I&C systems in nuclear plants need to be considered from two perspectives: the plant (i.e., the nuclear, fluid, mechanical, and electrical systems) and the I&C systems themselves. More specifically, this includes:
definition of the I&C systems, integration of these systems into the overall plant, and specification of the key high-level requirements applicable to all the I&C systems
design of the individual I&C systems themselves, i.e., selection of design features intended to meet the high-level requirements
Interactively addressing the systems aspects from these two perspectives is essential in order for the design of the plant and the I&C systems to be adequately integrated, and to achieve (a) reliable plant operation, (b) reliable plant investment protection, and (c) reliable worker and public health protection. This is consistent with the normal approach used to design such systems; see, for example, Johnson (1989) and Pradhan (1996). These authors discuss the design process in terms of the high-level functions of problem definition, system requirements, and system partitioning. Once these steps are accomplished, the overall I&C system will be defined and divided into manageable systems or subsystems with defined top-level requirements.
The committee recognizes that individual digital systems are an important part of the successful implementation of large systems and that their design can be difficult. But there is a large body of experience, including numerous standards, with designing and successfully implementing these systems (see, for example, Center for Chemical Process Safety, 1995). There is also an extensive body of technical literature to guide this work. Therefore, the committee has focused on the higher-level aspects of digital I&C applications in nuclear power plants. It should be noted that there are several key areas in the design of digital systems that need to be carefully addressed, and these are summarized in Appendix F.
CURRENT U.S. NUCLEAR REGULATORY COMMISSION REGULATORY POSITIONS AND PLANS
In general, the U.S. Nuclear Regulatory Commission (USNRC) approach for addressing systems aspects is consistent with the above approach (looking at the I&C systems from two perspectives) and is generally described in Chapter 1 of this report. That is, high-level regulatory requirements are supplemented by more specific USNRC guidance and endorsements of industry standards. In discussion with the committee in October 1995, the USNRC staff called attention to top-level systems aspects requirements addressed in:
10 CFR 50.55a(h), endorsing the use of IEEE Standard 279–1971, particularly in paragraph 3, Design Basis, and paragraph 4.1, General Functional Requirement
10 CFR 50, Appendix A, Criterion 10, Reactor Design
10 CFR 50, Appendix A, Criterion 13, Instrumentation and Control
10 CFR 50, Appendix A, Criterion 20, Protection System Functions
10 CFR 50, Appendix A, Criterion 21, Protection System Reliability and Testability
10 CFR 50, Appendix A, Criterion 22, Protection System Independence
10 CFR 50, Appendix A, Criterion 23, Protection System Failure Modes
10 CFR 50, Appendix A, Criterion 24, Separation of Protection and Control Systems
10 CFR 50, Appendix A, Criterion 25, Protection System Requirements for Reactivity Control Malfunctions
10 CFR 50, Appendix A, Criterion 29, Protection Against Anticipated Operational Occurrences
In addition to these basic, high-level criteria, the USNRC staff noted that the existing review guidance includes:
IEEE Standard 279–1971 and its alternate, IEEE Standard 603–1991
IEEE 7-4.3.2–1993, in particular, Annexes E and F
The USNRC has recognized the need to revise and update their regulatory guidance documents to better address digital I&C systems, and it has an extensive revision under way (see Chapters 1 and 9). In the systems aspects area, the USNRC (1995) indicated the revision includes several items specifically directed at systems aspects. These include preparation of (a) a new branch technical position on digital systems architecture and real-time performance, which provides guidance on verifying limiting response times and architectural details; and (b) a new Standard Review Plan section, Section 7.9, Data Communications, which provides acceptance criteria and review guidance for data communications or multiplexers.
Applicability to Existing Plants
For existing plants the primary emphasis will be on digital upgrades and modifications. Thus, in addition to the documents listed above, the use of 10 CFR 50.59 will be very important. (See discussion in Chapter 1 and Chapter 9 regarding 10 CFR 50.59).
Applicability to New Plants
There are three new plant designs being proposed by the U.S. nuclear industry, one from each of the major vendors, and these designs are being reviewed by the USNRC. All of these plant designs use I&C systems that are completely digital-based and fully integrated into the overall plant design. The USNRC review is being conducted under an alternative process set forth in 10 CFR Part 52. The basic technical requirements for licensing the plants are essentially the same as for existing plants, but the overall licensing review process defined in 10 CFR Part 52 is intended to be more streamlined and to result in the approval of standardized designs that can potentially be used at multiple sites.
An important part of the process of developing and documenting the design basis for these new plants has been the preparation and use of the Electric Power Research Institute's Utility Requirements Document (URD) (EPRI, 1992), which documents the requirements the utilities and vendors have agreed to impose on the new plant design. Chapter 10 (Man-Machine Interface Systems) of the URD sets forth requirements that specify the design approach for the digital I&C systems, requirements for the systems aspects, and requirements for specific systems.
To ensure the eventual licensability of the new plant design, the industry has sought formal review by the USNRC of the URD. The USNRC has reviewed the URD and has written formal Safety Evaluation Reports in which the USNRC agrees that a plant that meets the URD will likely meet the licensing requirements. USNRC review and acceptance of these requirements and their subsequent use in the design certification of the new plants has provided a way for the nuclear industry and the USNRC to reach agreement on many of the systems aspects of digital I&C. (See additional discussion of the URD in Chapter 1 above).
DEVELOPMENTS IN THE U.S. NUCLEAR INDUSTRY
Existing I&C systems in U.S. nuclear plants are analog-based and are approaching or exceeding their life expectancy, resulting in increased maintenance efforts and costs to sustain system performance (see, e.g., a survey by Cross, 1992, indicating that I&C maintenance costs are disproportionately high). As a result there is a strong interest in upgrading and modifying these systems. Many individual utilities are making upgrades, and an industry-wide initiative, led by the Electric Power Research Institute, is under way to promote cost-effective digital I&C upgrades (Chexal et al., 1991). The importance of systems aspects has been recognized in this initiative. For example, the EPRI initiative includes systems aspects in its retrofit implementation guidelines, which include guidance for defining equipment and interface requirements for plant data communications, architecture, systems requirements, and configuration management (see Machiels et al., 1995).
No new U.S. nuclear plants have begun construction in almost 20 years. As discussed above, however, three new nuclear plant designs have been proposed and are under review by the USNRC. All of these plants have fully digital-based I&C systems, and the specifications and other documents submitted for licensing review emphasize assuring that the design process and systems aspects are correctly defined so that the eventual detailed design and implementation will be successful. There is at least some indication that this approach is effective. An advanced nuclear power plant recently completed in Japan (Kashiwazaki-Kariwa unit 6) was started up with only very minor I&C system problems. This plant's design meets the bulk of the requirements for the equivalent U.S. advanced boiling-water reactor plant design being reviewed in the United States and, in fact, was used as a basis for developing many of the requirements contained in the Utility Requirements Document.
DEVELOPMENTS IN THE FOREIGN NUCLEAR INDUSTRY
There have been several other nuclear plants completed in the last few years that use completely digital-based I&C systems and represent significant digital I&C integration efforts. These plants are in the United Kingdom (Sizewell-B plant), France (Chooz-B plant), Canada (Darlington plant), and Japan (Kashiwazaki-Kariwa unit 6). The committee has not reviewed these plant designs in detail. However, what is
known about actual progress of the work on these plants and some of the problems that have occurred is instructive with respect to the importance of systems aspects of the design.
Sizewell-B includes a distributed digital control system for control and data acquisition, of a product family that has been extensively used in process control applications, including fossil fired generating stations. It also includes elements of a nuclear safety-grade product family for protection that has been used in some nuclear applications. Redundancy is provided at all levels, including the use of dual redundant conductors for data buses, and two diverse protection systems. Hard-wired controls and instruments provide backup for the computer-based systems (Boettcher, 1994).
Electricite de France (EDF) uses a three-level architecture for its N4 PWR series used at Chooz-B. One level is the digital protection system. Its mission is to bring the plant to a safe, stable status, maintaining core and containment integrity. A second level uses off-the-shelf hardware to provide functions such as boron control, pressure and temperature control, and monitoring of secondary feedwater supply. The third level is the human-machine interface in the control room, which includes hardwired controls connected directly to the lowest possible level of the I&C system (Nucleonics Week, 1995).
The Canadian nuclear program led the world in the use of digital technology. The CANDU reactors are physically large, and significant computations are required to maintain adequate neutron flux distribution and stability. As a result, digital systems have been extensively used in CANDU plants. Each new plant has had a greater scope of digital technology than the previous one. Darlington has digital systems in almost 100 percent of its control systems and over 70 percent of its plant protection system. Necessity and sound engineering have made digital I&C acceptable in the CANDU reactors (White, 1994).
As explained above, Kashiwazaki-Kariwa unit 6 in Japan meets the bulk of the requirements for the equivalent U.S. advanced boiling-water reactor plant design under review in the United States.
All of these plants are now producing power on the grid. Because the I&C systems are used extensively in the testing and startup phases as well as during operation at power, there are now several plant-years of experience with these large systems. The implication of this experience is that such systems are clearly practical. Further, operation to date has been safe, although, as noted by Suri et al. (1995), large systems with long mission times are challenging undertakings and may be subject to subtle failures that can take a long time to appear.
Three of the four plants have had systems aspects problems that were costly and caused delays. Sizewell-B and Chooz-B were affected by a problem that resulted in the need to both change the basic system design and change the control system suppliers in the middle of the design (Nucleonics Week, 1991). For the Darlington plant, a careful review of the software as part of the licensing process indicated that the software in its present form is satisfactory for use but will eventually need to be rewritten as changes inevitably arise (Joannou, 1995). The plant in Japan reported problems in a single part of the control system, but this was resolved in the startup program. On the basis of this experience it appears that systems aspects of nuclear plant I&C systems continue to warrant attention.
DEVELOPMENTS IN OTHER SAFETY-CRITICAL INDUSTRIES
Safety-critical applications of digital I&C are widely used in the aerospace industry. Systems aspects have been the focus of many studies, particularly those addressing the role of the digital I&C systems in accidents and the lessons to be learned. Many of these deal with human-machine interfaces, task allocation, and levels of automation. One major finding closely related to systems aspects is the importance of operator confusion caused by automatic changes in operating modes (Aviation Week and Space Technology, 1995; IEEE Spectrum, 1995).
The chemical industry has great similarity to the nuclear industry in that it is a process industry that deals with (a) similar fluid conditions in terms of temperatures, pressures, and physical phase changes; (b) similar rotating machinery and mechanisms; and (c) significant latent energy storage, albeit of a different kind. Digital I&C systems have been extensively used in the chemical industry since the late 1970s. The industry has developed Guidelines for Safe Automation of Chemical Processes (Center for Chemical Process Safety, 1995), which contains details on important systems aspects, such as integrity of process control systems, process hazards, control strategies and schemes, safety considerations, data communications media, data reliability, and administrative considerations for system integrity.
During Phase 1 of its study, the committee recognized that a great many of the issues and problems being discussed and addressed in the nuclear industry were of a relatively specific nature that missed capturing the systems aspects of the application of digital I&C technologies. This preponderance of relatively specific issues is reflected in the discussions the committee chose to focus on in Chapters 4 through 8 below and in the many other candidate issues and topics considered by the committee (see Appendix D). Several members of the committee, however, had had personal experience in which the various specific parts of a system were apparently designed correctly but the ensemble or overall system did not perform satisfactorily. The committee therefore feels that relatively specific I&C issues and problems are best addressed in the context of overall I&C system requirements and interfaces with the rest of the plant. For example, the committee was very much aware of the problems at the Sizewell-B, Chooz-B, and Darlington plants, which were higher-level problems.
Sizewell-B and Chooz-B had to change their common original system supplier in the middle of the design effort. The problem seems to have been the result of under-specification of the Chooz-B system and the complexity of the design: the original supplier found itself developing hardware and software in parallel to ever-escalating requirements, and technical problems seem to have been created by the lack of adequate capacity to process the mass of acquired reactor data with the original architecture (Nucleonics Week, 1991). At Darlington, despite the high availability and safety record of the Canadian plants, the Canadian Atomic Energy Control Board undertook a more stringent review of the software engineering process, and the operation of Darlington's first two units was delayed, with a resulting economic penalty on the utility.
The major lesson learned from these cases is that control of the design process is important, but equally important is the need for clear, complete, and stable requirements from the beginning of the project. To be clear, requirements must be quantified: functions define what the system must do and must not do, while requirements define how well those functions must be performed.
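The function/requirement distinction drawn above can be made concrete with a short sketch. The latency figure and measurement data below are hypothetical, not drawn from any plant specification; they simply illustrate how a quantified requirement, unlike a vague one, can be checked directly:

```python
# Hypothetical illustration of the function/requirement distinction.
# The *function* is "issue a trip signal when the setpoint is exceeded";
# the *requirement* quantifies how well: "within 100 ms, on every demand."
# All figures are illustrative, not taken from any actual specification.

MAX_TRIP_LATENCY_MS = 100.0  # quantified requirement (assumed value)

def requirement_met(measured_latencies_ms: list[float]) -> bool:
    """Check the quantified requirement against measured demand-to-trip times."""
    return all(t <= MAX_TRIP_LATENCY_MS for t in measured_latencies_ms)

# A vague requirement ("the trip shall be fast") cannot be checked;
# the quantified form can be tested directly against measurements.
assert requirement_met([42.0, 87.5, 63.1]) is True
assert requirement_met([42.0, 120.4]) is False
```

The point of the sketch is that only the quantified form supports objective verification: a reviewer, a test program, and a regulator can all apply the same pass/fail criterion.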
The definition of clear, complete, and stable I&C requirements requires (a) an in-depth understanding of plant processes; (b) an in-depth understanding of the proposed I&C technology to be used; (c) a vision of what new features may be needed or desired in the new system, e.g., security or on-line maintenance aids; and (d) an ability to visualize and articulate the requirements in a top-down approach while avoiding conflicting requirements. The last component implies being able to look ahead for consistency as detailed "lower-level" requirements are developed from the more global "top-level" requirements.
Finally, as noted above, the technical literature identifies the systems aspects of a design as being very important to achieving satisfactory performance, particularly as systems grow in size and complexity. There is thus a need to focus on the issue of systems aspects.
In dealing with systems aspects in U.S. nuclear power plants, there are some important factors to be taken into account. First, although three new U.S. plant designs are being reviewed by the USNRC, it is unlikely that any new nuclear plants will be built in the next few years in the United States. The U.S. plant experience will be limited to modifications or upgrades of limited scope, with the bulk of the upgrades and modifications involving component change-outs or small subsystems. Second, dealing with the systems aspects is not solely a USNRC responsibility, because systems aspects apply to both the safety and nonsafety systems and only a relatively small subset of the overall I&C systems in a nuclear plant falls under regulatory control. Industry must assure that systems aspects are properly dealt with for the nonsafety systems. The lessons learned and problems seen in foreign nuclear plants indicate that nonsafety systems can cause problems. Note, for example, that the problems at Chooz-B and Sizewell-B occurred in the nonsafety portion of the plant (Nucleonics Week, 1995); nevertheless they were costly and should be avoided. Both the USNRC and the industry recognize that failures in the nonsafety systems can challenge the plant's design envelope and that the safety systems must be appropriately designed to withstand these challenges and keep the plant within its safety envelope. Third, the existing U.S. nuclear plant I&C technology is largely analog-based, and there is very little regulatory guidance regarding systems aspects for digital-based components. (As noted above, the USNRC has recognized this and has begun an upgrade of their requirements.)
Taking into account these realities of the situation in the United States, the committee discerned several activities that could be undertaken by the U.S. nuclear industry and the USNRC to better address systems aspects. The principle underlying these activities is that a proactive approach is appropriate for drawing on the available experience and expertise in other countries, comparable industries, and other government agencies. First, to assess whether new regulatory guidance documents have the needed specificity in the systems aspects area, a trial application could be made to the existing foreign plant experience that is already available and to new experience as additional foreign nuclear plants come on line. These new plants all use digital I&C technology throughout. These trial applications could be made both retrospectively to the existing plants and during development of new plants to see if the guidance is appropriate, effective, and of the desired specificity. Second, a systematic review could be made of the experience, techniques, and regulatory and industry guidance documents used in other comparable industries in the United States. Based on its own brief review, the committee has identified at least one candidate approach, one used in the chemical process industry, that merits consideration (Center for Chemical Process Safety, 1995). The committee expects that there are other likely sources of important experience and expertise, such as the aerospace industry, where large, fault-tolerant, safety-critical I&C systems are in wide use. For example, it would be useful for the USNRC to compare their new guidance documents with those available from the Federal Aviation Administration. Third, as digital systems continue to grow in power and complexity, and particularly in view of the probable lack of any new U.S. 
nuclear plants, action by the USNRC to maintain currency in systems aspects may also be warranted (Chapter 10 of this report discusses the general topic of technical infrastructure). Examples include:
USNRC staff training and participation in key conferences in particularly germane technologies, such as fault-tolerant, distributed systems
Participation by USNRC staff in the work of other domestic or foreign regulatory agencies (perhaps on a reciprocal loan basis) that are actively dealing with large-scale digital I&C systems
Finally, it is essential to pay careful attention to the specific design features of the individual I&C systems that are evaluated and licensed. It is necessary to consider the details of the I&C system implementation; it is not sufficient to concentrate on general, high-level features. However, the committee's brief review of the applicable USNRC guidance found little specificity in these requirements regarding either level of the systems aspects, that is, the high-level systems aspects or the system design considerations covered in Appendix F. It appears that the USNRC should carefully consider the level of specificity provided in their regulatory guidance documents, to be sure that the lessons learned from prior experience and embodied in good design practice are adopted and followed. Appendix F is pertinent to this point.
CONCLUSIONS AND RECOMMENDATIONS
Conclusion 1. Continued effort is warranted by the USNRC and the nuclear industry to deal with the systems aspects of digital I&C in nuclear power plants.
Conclusion 2. The lack of actual design and implementation of large I&C systems for U.S. nuclear power plants makes it difficult to use learning from experience as a basis for improving how the nuclear industry and the USNRC deal with systems aspects.
Conclusion 3. The USNRC's intent to upgrade their regulatory guidance in the systems aspects of digital I&C applications in nuclear power plants is entirely supported by the committee's observations about systems aspects.
Conclusion 4. Existing regulatory guidance lacks the specificity needed to be effective, and the revision should address this shortcoming.
Recommendation 1. The USNRC should make a trial application of the proposed regulatory guidance documents on systems aspects to foreign nuclear plant digital systems, both existing and in progress. In particular, this review should focus on assessing whether or not the revised guidance documents have the necessary level of specificity to adequately address the systems aspects of nuclear plant digital I&C implementations.
Recommendation 2. The USNRC should identify and review systems aspects guidance documents provided in other industries, such as chemical processing and aerospace, where large-scale digital I&C systems are used. The focus of this review would be to compare these other guidance documents with those being developed by the USNRC, paying due attention to common problems and application-specific differences.
Recommendation 3. To obtain practical experience, the USNRC should loan staff personnel, perhaps on a reciprocal basis, to other agencies involved in regulating or overseeing large safety-critical digital I&C systems.
Recommendation 4. The USNRC should require continuing professional training for appropriate staff in technologies particularly germane to systems aspects, such as fault-tolerant, distributed systems.
Aviation Week and Space Technology. 1995. Automated Cockpits: Who's in Charge? January 30 and February 6.
Boettcher, D. 1994. State-of-the-Art at Sizewell-B. Atom 433 (Mar–Apr):34–38.
Center for Chemical Process Safety. 1995. Guidelines for Safe Automation of Chemical Processes. New York: American Institute of Chemical Engineers.
Chexal, V., F. Lang, T. Marston, and K. Stahlkopf. 1991. An Industry Vision for the 1990s and Beyond. Nuclear Energy International 36(446):22–24, 26.
Cross, A.E. 1992. Analysis of corrective actions applied to nuclear power plant operations. Nuclear Safety 33(4): 586.
Electric Power Research Institute (EPRI). 1992. Advanced Light Water Reactor Utility Requirements Document. EPRI NP-6780-L, Vol. 2 (ALWR Evolutionary Plant) and Vol. 3 (ALWR Passive Plant), Ch. 10: Man-Machine Interface Systems. Palo Alto, Calif.: EPRI.
Institute of Electrical and Electronics Engineers (IEEE) Spectrum. 1995. The Glass Cockpit. September.
Joannou, P. 1995. Presentation to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C., December.
Johnson, B.W. 1989. Design and Analysis of Fault Tolerant Digital Systems. New York: Addison-Wesley.
Machiels, A., R. Torok, J. Naser, and D. Wilkinson. 1995. The Digital Challenge: An Update on EPRI's I&C Upgrade Initiative. Nuclear Engineering International 40(489):44–46.
Nucleonics Week. 1991. British Support French I&C System That EDF Has Abandoned for its N4. January 10 and April 11.
Nucleonics Week. 1995. Outlook on I&C: Special Report to the Readers of Nucleonics Week, Inside the N.R.C. and Nuclear Fuel. September and October.
Pradhan, D.K. 1996. Fault-Tolerant Computer System Design. Upper Saddle River, N.J.: Prentice-Hall.
Suri, N., C.J. Walter, and M.M. Hugue. 1995. Advances in Ultra-Dependable Distributed Systems. Los Alamitos, Calif.: IEEE Computer Society Press.
U.S. Nuclear Regulatory Commission (USNRC). 1995. USNRC Staff (J. Wermeil) presentation to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C., October.
White, J. 1994. Comparative Assessments of Nuclear Instrumentation and Controls in the United States, Canada, Japan, Western Europe, and the Former Soviet Union. JTEC/WTEC Annual Report and Program Summary 1993/94. Baltimore, Md.: World Technology Evaluation Center, Loyola College.
Software Quality Assurance
Software in nuclear power plants can be used to execute relatively simple combinational logic, such as that used for reactor trip functions, or more elaborate sequential logic, such as that used for actuating engineered safety features or for process control and monitoring. In either case, it must be ensured that required actions are taken and unnecessary trips are avoided.1
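As a rough illustration of the combinational case, the sketch below implements two-out-of-three coincidence voting, a common arrangement for trip logic that balances the two goals just stated (taking required actions while avoiding unnecessary trips). The setpoint value and channel layout are hypothetical, not taken from any specific plant design:

```python
# Hypothetical sketch of simple combinational trip logic: a
# two-out-of-three coincidence vote on redundant pressure channels.
# The setpoint and channel arrangement are illustrative only.

TRIP_SETPOINT_PSIA = 2385.0  # assumed high-pressure setpoint

def channel_trip(reading_psia: float) -> bool:
    """A single channel votes to trip when its reading exceeds the setpoint."""
    return reading_psia > TRIP_SETPOINT_PSIA

def reactor_trip(ch_a: float, ch_b: float, ch_c: float) -> bool:
    """Trip when at least two of three redundant channels agree.

    Coincidence voting tolerates one failed-high channel (avoiding a
    spurious trip) and one failed-low channel (still tripping on a
    genuine demand).
    """
    votes = sum(channel_trip(r) for r in (ch_a, ch_b, ch_c))
    return votes >= 2

# One failed-high channel alone does not trip the reactor...
assert reactor_trip(2400.0, 2200.0, 2210.0) is False
# ...but two concurring channels do, even if the third has failed low.
assert reactor_trip(2400.0, 2395.0, 0.0) is True
```

Even this tiny example shows why both requirements matter at once: the voting threshold simultaneously encodes "trip when required" and "do not trip on a single faulty channel."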
One way of assuring software quality is by examining and approving the process used to produce it. The assumption behind assessing the process by which software is produced is that high-quality software development processes will produce software products with similar qualities. An alternate approach to quality assurance is to directly evaluate properties of the software. Software properties include correctness, reliability, and safety.
Software is defined as correct if it behaves according to its requirements. Assurance of software correctness is sought either experimentally via program testing or analytically through formal verification techniques. Software may be correct but still not perform as intended, however, because of flaws in requirements (e.g., inconsistencies or incompleteness) or assurance techniques (e.g., failing to consider or design for all significant parts of the software's input space).
Software reliability is "the probability that a given program will operate correctly in a specified environment for a specified duration" (Goel and Bastani, 1985). Several models have been proposed for estimating software reliability (Musa et al., 1987).
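As an illustration of the kind of model surveyed in this literature, the sketch below implements the exponential (Goel-Okumoto) reliability-growth model. The parameter values are purely illustrative; in practice both would be estimated from observed failure data, and the choice of model is itself an assumption:

```python
import math

# Sketch of the exponential (Goel-Okumoto) reliability-growth model.
# a = expected total latent faults; b = per-fault detection rate.
# Parameter values here are illustrative, not fit to real data.

def expected_failures(t: float, a: float, b: float) -> float:
    """Mean cumulative failures observed by test time t (hours)."""
    return a * (1.0 - math.exp(-b * t))

def reliability(t: float, x: float, a: float, b: float) -> float:
    """Probability of failure-free operation for duration x after test time t."""
    return math.exp(-(expected_failures(t + x, a, b) - expected_failures(t, a, b)))

a, b = 100.0, 0.02  # assumed: 100 latent faults, 0.02 detections/fault-hour
# Estimated reliability over a 10-hour mission grows as testing accumulates:
assert reliability(0.0, 10.0, a, b) < reliability(500.0, 10.0, a, b)
```

Note that such a model estimates reliability growth from observed failures; it cannot by itself demonstrate the very high reliabilities required of safety-critical software, which is one reason these models remain only one input to quality assurance.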
Software is safe if it does not exhibit behaviors that contribute to a system hazard (i.e., a state that can lead to an accident given certain environmental conditions). Safety analysis and assurance techniques have been developed for all stages of the software life cycle (i.e., systems analysis, requirements, design, and code verification) (Leveson, 1995).
Complexity is an important aspect of assessing the correctness, reliability, and safety of software. (The committee notes that complexity is of critical importance to the use of digital instrumentation and control [I&C] systems, and it is addressed in numerous places in this report.) Notably, the committee is not aware of any software complexity metrics that are reliable and definitive.
Analog and digital systems should be analyzed differently because the assumptions underlying their design and production are different. Reliability estimation for analog systems primarily measures failures caused by parts wearing out, whereas for digital systems it seeks to address failures primarily caused by latent design flaws. Analog systems can be modeled using continuous and discrete functions, whereas digital systems must be modeled using discrete mathematics only. Although analog systems could contain similar latent design flaws, such flaws are believed to be accommodated by existing evaluation techniques. When an analog system functions correctly on two "close" test points and continuous mathematics is applicable, it can be assumed that it will also function correctly on all points between the two test points. This is not necessarily true for digital systems, which may produce very different results for similar test points.
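A minimal sketch can make this concrete. The scaling routine below is hypothetical; it shows how a coding flaw confined to one branch produces a discontinuity that testing at two "close" points on the same side of the branch would never reveal:

```python
# Hypothetical sensor-scaling routine that switches conversion formulas
# at an internal breakpoint. A deliberate coding slip in the second
# branch (wrong offset constant) is invisible at any test point below
# the breakpoint, so interpolating between "close" passing tests is unsound.

def scaled_reading(raw: int) -> float:
    """Convert a raw sensor count to engineering units (illustrative)."""
    if raw < 2048:
        return raw * 0.05
    # Bug for illustration: the offset should be 2048, not 2480.
    return (raw - 2480) * 0.05 + 102.4

# Two adjacent inputs straddling the breakpoint. An analog device would
# behave almost identically at the two points; the software jumps.
low, high = scaled_reading(2047), scaled_reading(2048)
assert abs(high - low) > 10 * 0.05  # far larger than one count's worth
```

An analog scaling circuit tested at 2047 and 2049 counts could reasonably be assumed correct at 2048; here the discrete branch makes that inference invalid, which is precisely why digital test coverage must be reasoned about over the input space rather than interpolated.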
Statement of the Issue
The use of software is a principal difference between digital and analog I&C systems. Quality of software is measured
in terms of its ability to perform its intended functions. This, in turn, is traced to software specifications and compliance with these specifications. Neither of the classic approaches of (a) controlling the software development process or (b) verifying the end-product appears to be fully satisfactory in assuring adequate quality of software, particularly for use with safety-critical systems. How can the U.S. Nuclear Regulatory Commission (USNRC) and the nuclear industry define a generally accepted, technically sound solution to specifying, producing, and controlling software needed in digital I&C systems?
High quality software results from the use of good software engineering practices during development to minimize the probability of introducing errors into the software, and a rigorous verification process to maximize the probability of detecting errors. Good software engineering practices (e.g., structured programming and data abstraction) reduce the amount of information that developers must remember when writing, analyzing, or changing software. However, good software engineering methods are not easy to apply, and the methods only reduce rather than eliminate errors (Parnas, 1985). Thus software verification activities remain a key concern.
Software verification seeks to determine that the software being built corresponds to its specification, and software validation seeks to demonstrate that the system meets its operational goals. Verification and validation (V&V) activities may focus on either the process or the product. Process-oriented V&V focuses on the process by which the software is produced. It typically involves performing and observing inspections and evaluating test results. Product-oriented V&V focuses on testing and evaluating the final product, independent of the process.
Different techniques for assessing software quality have been developed. These techniques fall into two broad categories, analytic and experimental, each of which encompasses a large number of methods. Analytic techniques include inspections or walk-throughs and formal analysis methods based on mathematics. Program testing is the most common experimental analysis technique.
In software reviews or inspections, teams of software developers examine software artifacts for defects. Participants may be given lists of questions about the artifact that they must answer in order to ensure that they are sufficiently prepared for an inspection, and they may be given lists of potential errors to check for. Inspections have proved to be an effective method for detecting software defects (Fagan, 1976). Requirements inspections catch errors before they propagate into designs and implementations, making them less costly to repair. Also, inspections subject a software artifact to the scrutiny of several people, some of whom did not participate in the artifact's design. Successful inspections depend on the experience levels of the participants and the quality of the artifacts inspected (Porter et al., 1996). They also depend on the requirements being expressed in a precise, unambiguous manner so that reviewers can check the document without having to make assumptions about how the system will be implemented. This can be challenging in practice because it is difficult to find a notation that lets reviewers effectively check the correctness of the requirements rather than focusing on the details of the notation itself. Furthermore, the notation must be "readable" by both users and developers.
Formal methods2 use mathematical techniques to assess if an artifact is consistent with a more abstract description of its general and specific properties (Rushby, 1993). General properties derive from the form of the artifact's description (e.g., that functions are total, that axioms are consistent, or that variables are initialized before they are referenced). Specific properties derive from the problem domain and are captured in an abstract description. Verification using formal methods involves the comparison of a more detailed description of a software system with the more abstract description of its properties. Verifying specific properties of programs using formal methods has proved to be very difficult (Gerhart and Yelowitz, 1976; Rushby and von Henke, 1991, 1993). Furthermore, making mathematical proofs does not guarantee the software will function correctly. Even if one could perform the verification using formal methods, testing would still be necessary to validate the assumptions in the proofs. These assumptions would include that the model matches the real world and that the code statements will behave as modeled when executed on the target hardware. Moreover, errors are often made in proofs.
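One of the proof assumptions noted above, that code statements behave as modeled when executed on the target hardware, can be illustrated with a small sketch (the 16-bit wraparound below is simulated, and the values are illustrative): a property that holds over the mathematical integers used in a proof can fail under the fixed-width arithmetic of the target machine.

```python
# A property "proved" over mathematical integers can fail on target
# hardware with fixed-width arithmetic -- one reason testing must still
# validate the assumptions behind a formal proof.

def wrap_int16(x: int) -> int:
    """Model two's-complement 16-bit storage of x."""
    v = x & 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

# Over the mathematical integers, doubling a positive value stays positive:
assert 2 * 20_000 > 0
# On simulated 16-bit hardware the same computation overflows and changes sign:
print(wrap_int16(2 * 20_000))   # -25536
```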
Testing is used to expose program flaws and to estimate software reliability. Black-box testing seeks to determine if software has functional behavior that is consistent with its requirements. Black-box testing is concerned only with inputs and outputs. White-box testing addresses the internal structure of software (e.g., the outcome of its logical tests) and seeks to exercise the internal structure:
Some engineers believe one can design black box tests without knowledge of what is inside the box. This is, unfortunately, not completely true. If we know that the contents of a black box exhibit linear behavior, the number of tests needed to make sure it would function as specified could be quite small. If we know that the function can be described by a polynomial of order 'N,' we can use that information to determine how many tests are needed. If the function can have a large number of discontinuities, far more tests are needed. That is why a shift from analogue technology to software brings with it a need for much more testing (Parnas et al., 1990).
In testing, practitioners seek to find suitable test cases so that if the software exhibits acceptable behavior for these cases it can be inferred that it will work similarly for other cases. However, complex software systems have large numbers of states and irregular structure. Testing can only sample a fraction of these states, and it cannot be inferred that untested states are free from errors if none are exhibited in tested states. As Dijkstra (1970) points out, "Program testing can be used to show the presence of bugs, but never to show their absence!"
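Dijkstra's point can be made concrete with a minimal sketch (the functions and test points are invented for illustration). Per the Parnas quotation above, a linear function would be pinned down by two tests; a function with even one discontinuity can agree with its specification on every sampled point and still fail in an unsampled state.

```python
# Two implementations agree on every sampled test point yet differ in an
# unsampled state: passing tests cannot show the absence of errors.

def spec(x: int) -> int:
    return 2 * x                        # intended behavior

def buggy(x: int) -> int:
    return 2 * x if x != 7 else 15      # latent flaw at one untested input

test_points = [0, 1, 2, 3, 10, 100]
assert all(spec(x) == buggy(x) for x in test_points)   # every test passes

print(buggy(7))   # 15, not 14: the defect lay in an unsampled state
```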
Software standards can help achieve acceptable levels of software quality. Because software development practices are constantly improving, standards should not require developers to use particular techniques. However, standards can include definitive acceptance criteria. An example of definitive and objective acceptance criteria in existing standards is the requirement for white-box structural coverage in the Federal Aviation Administration standard, Software Considerations in Airborne Systems and Equipment Certification (DO-178B). Depending on the safety category, software logic must be test-exercised until the specified acceptance criteria have been met.
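A minimal sketch of why structural coverage makes a useful acceptance criterion (the trip function, setpoint, and test-mode behavior are hypothetical, not drawn from DO-178B or any plant design): a requirements-based test suite can pass in full without ever exercising a faulty branch, whereas a branch-coverage criterion forces a test case that reaches it.

```python
# Hypothetical trip logic with a faulty test-mode branch. Black-box tests
# of normal operation all pass; only a case forced by branch coverage
# reaches the flaw.

def trip_signal(pressure_psi: float, test_mode: bool) -> bool:
    if test_mode:
        return False                  # flaw: a real demand during a test is ignored
    return pressure_psi > 2250.0

# Requirements-based tests, normal operation only -- all pass:
assert trip_signal(2300.0, test_mode=False) is True
assert trip_signal(2200.0, test_mode=False) is False

# A branch-coverage criterion requires exercising the test_mode branch:
print(trip_signal(2300.0, test_mode=True))   # False -- the flaw is now visible
```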
There are several existing standards for the production of safety-critical software for nuclear power plants. These include IEC 880, Software for Computers in the Safety Systems of Nuclear Power Stations (IEC, 1986) and IEEE 7-4.3.2–1993, Standard Criteria for Digital Computers in Safety Systems of Nuclear Power Generating Stations (IEEE, 1993). IEC 880 outlines the software development techniques to be used in the development of software for the shutdown systems of nuclear power plants. Rather than mandate particular techniques, IEC 880 states the requirements on the product; it is up to the developer to meet those requirements using whatever techniques the developer considers suitable. There are guidelines presented in an appendix to IEC 880 that describe the effects that particular techniques are expected to achieve.
IEEE 7-4.3.2–1993 advocates choosing a combination of the following V&V activities: independent reviews, independent witnessing, inspection, analysis, and testing. Some of these activities may be performed by developers, but independent reviews must subsequently be performed. Walk-throughs of design, code, and test results are recommended inspection techniques. Analysis includes, but is not limited to, formal proofs, Petri net and other graphical analysis methods, and related techniques. Functional and structural testing are recommended for any software artifact that is executable or compilable. Testing of nonsafety functions may be required to provide adequate confidence that nonsafety failures do not adversely impact safety functions. The standard points out that functional testing cannot be used to conclusively determine that there are no internal characteristics of the code that would cause unintended behavior unless all combinations of inputs, both valid and invalid, are exhaustively tested.
IEEE standards have been criticized as "ad hoc and unintegrated" because they have been developed in a piecewise fashion (Moore and Rada, 1996). Generally, IEEE 7-4.3.2–1993 does not suggest which V&V activities are most effective, nor does it discriminate between activities that are mainly actuarial (e.g., witnessing) and those that are technical (e.g., analysis and testing). In addition, the standard states that path testing is feasible; numerous references have shown, however, that for all but extremely simple programs path testing requires an infeasible number of tests (e.g., Myers, 1979). Furthermore, even if path testing were feasible and were performed, the resulting program could still have undetected errors: for example, there could be missing paths, the program might not satisfy its requirements (the wrong program could have been written), and there could be data-sensitivity errors. (As an example of a data-sensitivity error, suppose a program has to compare two numbers for convergence, that is, to see whether the difference between them is less than some predetermined value. One could write: "If A - B < ε then…" But this is wrong, because the comparison should have been made with the absolute value of A - B. Detection of this error depends on the values used for A and B and would not necessarily result from simply executing every path through the program.)
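The data-sensitivity error in the parenthetical example can be written out directly (the tolerance value is illustrative): both versions execute the same path through the code, and only particular data reveal the flaw.

```python
# The convergence check should compare |A - B| with the tolerance, not the
# signed difference. Same control-flow path, different data sensitivity.

EPSILON = 1e-6

def converged_wrong(a: float, b: float) -> bool:
    return a - b < EPSILON          # wrong: any a < b "converges"

def converged_right(a: float, b: float) -> bool:
    return abs(a - b) < EPSILON

print(converged_wrong(1.0, 1.0000001))   # True (correct, by accident)
print(converged_wrong(1.0, 5.0))         # True (wrong: far from converged)
print(converged_right(1.0, 5.0))         # False
```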
Once high-quality software has been prepared initially, it is likely to undergo continuous change to accommodate new hardware, fix latent errors, or add new functions to existing systems. Configuration control requires rigorous review and formal approval of software changes. Managing multiple versions of software systems and assuring that changes do not degrade system reliability and safety is a difficult problem.
CURRENT U.S. NUCLEAR REGULATORY COMMISSION REGULATORY POSITIONS AND PLANS
The USNRC regulatory basis for software quality assurance is given in:
10 CFR 50.55a(h), Protection Systems, which mandates the use of IEEE Standard 279–1971, Criteria for Protection of Systems for Nuclear Power Generating Stations
Title 10 CFR Part 50, Appendix A, General Design Criteria for Nuclear Power Plants (Criterion 1, Quality Standards and Records; Criterion 21, Protection System Reliability and Testability; Criterion 22, Protection System Independence; and Criterion 29, Protection Against Anticipated Operational Occurrences)
Title 10 CFR Part 50, Appendix B, Quality Assurance Criteria for Nuclear Power Plants and Fuel Reprocessing Plants (Section III, Design Control; Section V, Instructions, Procedures, and Drawings; and Section VI, Document Control)
To provide more specific guidance, the USNRC uses Regulatory Guide 1.152, Criteria for Programmable Digital Computer System Software in Safety-Related Systems of Nuclear Power Plants, and ANSI/IEEE/ANS 7-4.3.2–1982, Application Criteria for Programmable Digital Computer Systems in Safety Systems of Nuclear Power Generating Stations (promulgated jointly by the American National Standards Institute, the Institute of Electrical and Electronics Engineers, and the American Nuclear Society), in conducting software reviews. Other standards are used as reference, e.g., IEEE 1012–1986, IEEE Standard for Software Verification and Validation Plans, and ASME [American Society of Mechanical Engineers] NQA-2A–1990, Part 2.7, Quality Assurance Requirements of Computer Systems for Nuclear Facility Applications. The Standard Review Plan cites and makes use of these standards and is an attempt to integrate their various requirements.
USNRC staff reviews of the V&V processes used during software development seem quite thorough. One particularly good example is the staff review of the V&V process for the Eagle 21 reactor protection system installed at Zion Units 1 and 2 (USNRC, 1992). Staff activities included comparing V&V to ANSI/IEEE/ANS 7-4.3.2–1982, verifying the independence of V&V personnel, reviewing the development of functional requirements and subsequent software development documents, and reviewing software problem reports and fixes. They also performed a thread audit by picking sample plant parameters and tracing the software development from requirements development through the writing and testing of code. This review included reviewing code on a sample basis, comparing software development documents and code, and examining software problem reports and corrections. The entire system was also examined for potential timing problems between the software and hardware.
The staff noted: "Experience with computer projects has demonstrated that the development of computer system functional requirements can have a significant impact on the quality and safety of the implemented system" (USNRC, 1992). The staff randomly sampled 56 of the 408 problem reports and found that 21 percent had significant implications (e.g., equations that did not match requirements or logic defects). Discovery of this type of error raised the staff's concerns regarding the potential for common-mode failures in digital electronics and convinced the staff that rigorous V&V activities were needed to augment the developer's functional tests. The staff's thread audit discovered three discrepancies between the requirements and the design documents (e.g., a piece of source code that the requirements seemed to mandate but that was omitted in the design). The staff concluded that although there were problems in implementation of the V&V plan, the basic plan was sound. The staff also considered whether the use of different releases of compilers affected the correctness of the software. They also considered Commonwealth Edison's configuration management program for the software. The USNRC approved the approach taken on both of these issues.
Research and Plans
The seven existing sections of Chapter 7 of the 1981–1982 version of the Standard Review Plan (SRP) are being updated (project completion expected in June 1997) to incorporate digital technology aspects. Two new sections are being added (Section 7.8, Diverse I&C Systems, to deal with the ATWS [anticipated transients without scram] rule and the defense-in-depth and diversity analysis of digital safety I&C systems, and Section 7.9, Data Communications, to deal with new issues like multiplexing). New branch technical positions are also being developed for inclusion in the SRP update, including ones on software development process, software development outputs, and programmable logic controllers.
As part of the SRP update process, the USNRC is developing regulatory guides to endorse (with possible exceptions) 10 industry software standards:
IEEE 7-4.3.2–1993, Standard Criteria for Digital Computers in Safety Systems of Nuclear Power Generating Stations (an update of the 1982 version)
IEEE 603–1991, Standard Criteria for Safety Systems in Nuclear Power Generating Stations (follow-on to IEEE 279–1971)
IEEE 828–1990, Standard for Configuration Management Plans
IEEE 829–1983, Standard for Software Test Documentation
IEEE 830–1984, Guide for Software Requirements Specifications
IEEE 1008–1987, Standard for Software Unit Testing
IEEE 1012–1986, Standard for Software Verification and Validation Plans
IEEE 1028–1988, Standard for Software Reviews and Audits
IEEE 1042–1987, Guide to Software Configuration Management
IEEE 1074–1991, Standard for Developing Life Cycle Processes
The USNRC also has ongoing research programs. One of these, called Review and Assessment of Software Languages for Use in Nuclear Power Plant Safety Systems, is assessing advantages and disadvantages of programming languages used in safety systems. Another, called Measurement Based Dependability Analysis for Digital Systems, is analyzing operational failure data to estimate failure probabilities.
Finally, as a member of the Halden Reactor Project, the USNRC is following research being conducted at the project on the use of formal methods in development and in quality assurance/licensing issues.
DEVELOPMENT IN THE U.S. NUCLEAR INDUSTRY
During the course of Phase 2 activities, the committee talked with three digital I&C vendors about software quality assurance: Foxboro Controls, General Electric Nuclear Engineering, and Westinghouse. Vendors reported developing systems containing at least 10,000 lines of code in a mixture of high-level and assembly languages. Their software quality assurance programs were generally modeled after IEEE 7-4.3.2–1993 and IEC 880 and had been audited and approved by USNRC staff.
In Phase 2, the committee also talked with a number of nuclear utilities engaged in digital I&C upgrades: Baltimore Gas and Electric Company, Public Service Electric and Gas (PSE&G) Company, Northeast Utilities, and Pacific Gas and Electric Company. Representatives from several of the utilities mentioned that strong requirements analysis and configuration control were keys to producing high-quality software. The representatives noted that strong requirements analysis and configuration control should be applied to both safety-critical and nonsafety software, even though nuclear plant designs routinely separate the hardware and software so that nonsafety software does not run on the same computer as the safety-critical applications. High standards clearly must be applied to software running on safety-critical computers, since any such program has the potential to cause a safety-critical failure. The utility representatives emphasized that the same strong requirements should also be applied to nonsafety software, because even nonsafety applications could malfunction in ways that require safety systems to respond and thus have safety implications. They also noted that hazard/failure analyses should be part of a V&V program. PSE&G described a four-stage review that considers hardware-software interactions, the software development process, thread analysis of a small number of functions, and component-based failure analysis.
DEVELOPMENTS IN THE FOREIGN NUCLEAR INDUSTRY
During the course of Phase 2 activities, the committee also talked with representatives from the Canadian and Japanese nuclear power industries and had access to information on the British experience with digital I&C systems pertaining to software quality assurance. A representative from Mitsubishi Heavy Industries stated that the company relies on the IEC 880 standard for software quality assurance. British Nuclear Electric issued Nuclear Electric Programmable Electronic Systems guidelines for the quality assurance of digital I&C systems.
Considerable controversy surrounds the results of British Nuclear Electric's tests of the Sizewell B primary protection system (PPS). These tests were not part of system validation testing, but rather a set of tests concentrated on infrequent fault scenarios that were designed to support safety claims made for the PPS (W.D. Ghrist III, personal communication to the committee, May 1996). Most test results were to be resolved automatically by use of a test driver that compared them to responses predicted from a model, and the remainder were to be resolved manually. However, only half of the first 50,000 tests were resolved automatically, resulting in reports that the PPS failed 50 percent of its tests. Manual inspections of test results were necessary because of timing problems between the PPS and the test driver. For example, inputs were not being provided to the PPS fast enough to prevent it from indicating failures of incoming data link communications, or the PPS responded at a rate much faster than input values were changing. In fact, only three or four errors were found in time delay and setpoint levels because of specification discrepancies.
One conclusion that could be drawn about this experience is that there were problems with the completeness and configuration control of the requirements: Understanding the response time of the PPS required knowledge of the system design as well as the requirements; hysteresis information was in the original functional specification but not the specification provided to the test group; and default actions on some input quantities were omitted from the specifications.
Canada's Atomic Energy Control Board (AECB) licensed a computerized shutdown system at Atomic Energy of Canada Limited's (AECL) Darlington plant operated by Ontario Hydro. The AECB had originally raised objections about the lack of a widely accepted definition of what constituted "good enough" for safety-critical software. Ontario Hydro used formal methods to verify the consistency of the software and the requirements and also used tests randomly chosen to model one of six accident scenarios to demonstrate the system's reliability (Joannou, 1993).
Ontario Hydro and AECL embarked on an effort to develop standards for the software engineering process, the outputs from the process, and the requirements to be met by each output. The standards, called OASES, use a graded approach based on categories of criticality. For each category, OASES defines a software engineering process, procedures used to perform activities within each step of a process, and guidelines defining how to qualify already developed software in each category. OASES is a more unified approach to developing standards than the USNRC approach of developing standards for individual process activities.
The AECB has also developed a draft regulatory guide, C-138, Software in Protection and Control Systems, for software assessment (AECB, 1996). They stress that "evidence of software completeness, correctness, and safety will have to be reviewed and understood by people other than those who prepared it." Several aspects were identified as critical for providing evidence of high-quality software:
software requirements specification
systematic inspection of software design and implementation
the software development process and its management
The AECB draft regulatory guide (AECB, 1996) includes a number of acceptance criteria:
Software requirements should be unambiguous, consistent, and complete. Requirements should be precise enough to distinguish between correct and incorrect implementations, and mechanical rules should exist for resolving disputes about the meanings of requirements. These attributes suggest that a formal notation be used. The notation should define how continuous quantities in the environment can be represented by discrete values in software.
Systematic inspections should include functional analysis to provide evidence that the software does what it is defined to do, and software safety analysis to provide evidence that the software does not initiate unsafe actions. Functional analysis should be based on formally defined notations and techniques so that mathematical models and automated tools can be used. A system-level hazard analysis should determine the contribution of software to each hazard, and analysis should extend into the software to increase confidence that hazardous states cannot occur.
Both functional and random testing should be employed. Functional tests should be chosen to expose errors in normal and boundary cases, and measures of test coverage should be reported for them. Random tests selected from input conditions should be used to demonstrate that the software will function without failure under specific conditions.
Software design and implementation methods are rapidly improving. Instead of mandating a single set of methods, the guide specifies that software be developed "by properly qualified people following a controlled and accepted software development and quality assurance plan." Methods selected should enable software designs and implementations to be reviewed to determine if quality attributes (e.g., completeness, consistency, etc.) have been attained.
Configuration management should be used to control change. Changes should be justified and reviewed, and all artifacts (e.g., designs and test plans) relating to the component being changed should also be updated. Changed release versions (with indicated changes) should be distributed to all holders of the original versions, including the regulatory agency.
DEVELOPMENTS IN OTHER SAFETY-CRITICAL INDUSTRIES
During the course of Phase 2 activities (see Appendix B), the committee heard presentations from John Rushby of SRI International, committee member Michael DeWalt of the Federal Aviation Administration, Joseph Profetta of Union Switch and Signal Inc., and Lynn Elliott of Guidant Cardiac Pacemakers. The committee also examined the circumstances surrounding problems in other applications.
Dr. Rushby summarized his experience with a number of high-assurance software systems by stating that mishaps are generally due to requirements errors rather than coding errors. Current techniques for quality assurance are adequate for later software development stages (e.g., coding). However, early stages have weak V&V methods because of a lack of adequate validation techniques, particularly for systems with complex interactions (e.g., concurrent, fault-tolerant, reactive, real-time systems). Dr. Rushby suggested that formal methods could be used to specify assumptions about the environment in which a system operates, the requirements of the system, and a design to accomplish the requirements. If these specifications were written, they could be analyzed for certain forms of consistency and completeness and validated by posing "challenges" as to whether a specification satisfies a requirement or whether a design implements a specification.
Committee member DeWalt presented the FAA's Software Considerations in Airborne Systems and Equipment Certification (DO-178B), which provides guidelines for the production of software for airborne systems. These guidelines represent an industry and regulator consensus document. The guidelines used by the FAA identify 66 objectives covering the entire software development process. These objectives represent a distillation of best industry practices and do not rely on or reference other standards or guidelines. The number of objectives that must be satisfied and the associated rigor applied is a function of five different severity categories of safety. These objectives for the most part have objective acceptance criteria understood by the regulators and industry. The compliance of a specific software product with the guidelines is established by examining data products produced by the software process and interviewing developer personnel. The guidelines recognize that objectives can be satisfied by alternative methods (e.g., service experience) provided that equivalent levels of confidence can be demonstrated. The FAA also has a delegation system that allows industry representatives to make compliance findings on behalf of the FAA.
Mr. Profetta described distributed process control systems in which control signals from remote controllers could be overridden by local signals in trains or switches. Critical software is developed following IEEE standards and development processes. Quality is assured via extensive testing on a simulation of a train control system. The application has very well-defined safety problems, and only six events are needed to characterize the problems. Extensive testing is undertaken using seeded faults to estimate the probability that test cases expose faults.
Mr. Elliott stated that his most difficult software development problem was writing and reviewing requirements specifications. Food and Drug Administration (FDA) regulators expect natural language requirements, but Mr. Elliott has found that describing systems with Statemate (Harel et al., 1990), a notation for describing event-driven reactive systems, is superior to either natural language or data flow diagrams. Guidant Cardiac Pacemakers developers use fault tree analysis to analyze the safety of their system and dynamic testing to ensure the software's behavior. FDA regulators specify guidelines for these activities but do not prescribe particular development methods.
A prior NRC study of space shuttle avionics software (NRC, 1993) identified shortcomings of inspections with respect to assumptions reviewers made about hardware and software platforms on which their implementations execute. Inspections focus on the development of software by a single contractor, and do not probe beyond the descriptions of interfaces supplied by other contractors. As a result, implementations are vulnerable to errors arising from assumptions made about erroneously documented interfaces.
The Ariane 5 failure (Lions et al., 1996) offers a cautionary note for drawing conclusions about the reliability or safety of software based on prior operating experience. The first flight of the Ariane 5 launcher ended in a failure caused by responses to erroneous flight data provided by alignment software in its Inertial Reference System. Part of the data contained a diagnostic bit pattern which was erroneously interpreted as flight data. The alignment software computes meaningful results only before lift-off. After lift-off, this software serves no purpose.
The original requirement for continued operation of the alignment software after lift-off had been retained through 10 years of earlier Ariane models in order to cope with a hold in the countdown. The period selected for continued alignment operation was based on the time needed for the ground equipment to resume full control of the launcher in the event of a hold.
The same requirement does not apply to Ariane 5, but was maintained for commonality reasons, presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4.
REVIEW OF EXPERIENCE
In order to better understand what types of software quality assurance problems have occurred, the committee reviewed a number of licensee event reports (LERs, which are submitted to the USNRC) and summaries of LERs reporting problems with computer-based systems in nuclear power plants. LERs describing events at Diablo Canyon (LER 92-028-00), Salem (LER 92-107-00), and Turkey Point (LER 94-005-02) identify instances of software design errors, inadequate review of requirements and designs, excessive reliance on testing as a V&V method, and problems with configuration control. The Turkey Point incident illustrates several problems that can occur.3
The Florida Power and Light (FPL) Company's Turkey Point LER describes an upgrade to the Turkey Point Unit 3 and 4 emergency power system (EPS) using commercial-grade programmable logic controllers (PLCs) in the EPS load sequencer. FPL stated that these new load sequencers would replicate the functions of the old sequencers, with some improvements to the sequence timing for loading of safety equipment. In response to USNRC review, FPL committed to follow the verification and validation program in IEEE 1012-1986, Standard for Software Verification and Validation Plans, and the guidelines in Regulatory Guide 1.152, which endorses ANSI/IEEE/ANS 7-4.3.2–1982, Application Criteria for Programmable Digital Computer Systems in Safety Systems of Nuclear Power Generating Stations. Additionally, the contractor responsible for developing and installing the load sequencer performed independent V&V of the PLCs and the load sequencer logic.
FPL qualified the PLCs as Class 1E through dedication of the commercial-grade equipment based on guidance provided in EPRI [Electric Power Research Institute] NP-5652, Guideline for the Utilization of Commercial Grade Items in Nuclear Safety Related Applications. FPL guaranteed that all logic functions would be tested under the guidelines of the above-mentioned V&V program, particularly to ensure that there were no common-mode failures between the redundant trains of load sequencers. FPL stated that in addition to the regularly scheduled startup and bus load tests, the load sequencer would be tested "continuously" using an automatic self-test mode. This enhancement was approved by the USNRC (Newberry, 1990).
On November 3, 1994, Turkey Point Unit 3's sequencer failed to respond to Unit 4's safety injection (SI) signal because of a defect in the sequencer software logic. The defect could inhibit any or all of the four sequencers from responding to input signals. The problem arose in trying to design the sequencers so that if a "real" emergency signal is received while the sequencer is being tested, the test signal clears and the engineering safety features controlled by the sequencer are activated.
As actually implemented, if an SI signal is received 15 seconds or later into particular test scenarios, the test signal is cleared but the inhibit signal preventing actuation is
maintained by latching logic. The test signal initiates the latching logic, but an input signal maintains the latching logic if the signal arrives prior to the removal of the test signal. Thus, if a real signal arrives more than 15 seconds into the test scenario, the test signal clears but the inhibit logic is held locked in and actuation is prevented. As the result of erroneous inhibit signals, any sequencer output might be blocked. The outputs blocked are determined by a combination of factors, including which test scenario was executing, the length of time the test was running, and which other inputs were received.
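The inhibit/test interaction described above can be reduced to a toy model (the timing behavior follows the LER narrative, but the function and variable names are ours, not the plant's actual ladder logic):

```python
# Toy reconstruction of the Turkey Point sequencer flaw: a safety-injection
# (SI) signal arriving 15 s or later into a test scenario keeps the inhibit
# latch energized after the test signal clears, blocking actuation.

def sequencer_responds(si_time_s: float, test_length_s: float = 30.0) -> bool:
    """True if the sequencer actuates on an SI signal arriving si_time_s
    seconds into a test scenario (illustrative timings)."""
    if si_time_s >= test_length_s:
        return True                      # no test in progress: actuation occurs
    # Intended design: a real SI clears the test signal and actuates.
    # As implemented, a late-arriving SI input itself maintains the
    # inhibit latch that the test signal initiated.
    inhibit_latched = si_time_s >= 15.0
    return not inhibit_latched

print(sequencer_responds(si_time_s=5.0))    # True: early SI clears the test
print(sequencer_responds(si_time_s=20.0))   # False: latch holds, no actuation
```

Testing that drove each scenario only from its start would pass; the failure appears only for particular arrival times of the real signal, which is why analysis of the requirements and design, not testing alone, was needed to find it.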
The designer and independent verifier failed to recognize the interactions between the inhibit and test logic. An independent assessment team found that logic diagrams contained information not reflected in the ladder diagrams, and that the V&V was not comprehensive enough to test certain aspects of the logic. In its review, the USNRC stated, "The plan was weak in that it relied almost completely on testing as the V&V methodology. More emphasis on the analysis of the requirements and design would have increased the likelihood of discovering the design flaw." This incident illustrates many of the potential problems with digital systems: added design complexity from self-testing software components, incomplete requirements, and inadequate testing.
Two recent studies by Lee (1994) and Ragheb (1996) provide data on digital application experiences in the United States and Canada. Lee reviewed 79 LERs for digital failures and classified them according to their root causes. With respect to the U.S. experience, Lee found that electromagnetic interference (EMI), human-machine interface error, and software error caused a "significant number of failures" (where "significant" is not defined in the report) in digital components during the three-year period studied (1990–1993). Fewer digital system failures involved random component failure. The actual numbers are shown in Table 4-1. The report concludes that the root causes of these failures were (1) poor software V&V, (2) inadequate plant procedures, and (3) inadequate electromagnetic compatibility of the digital system for its environment.4
Although the study is not yet completed, the Canadian AECB has been reviewing data from the United States, Canada, and France on software failures in nuclear power plants (Ragheb, 1996). The reviews include only events that resulted in consequences that meet reporting criteria of the government agency and do not necessarily include all digital system failures. The results of this study are in draft form only and may change before final publication. It is also important to note again that classification of errors is very
TABLE 4-1 U.S. Software-Related LERs between 1990 and 1993

Cause of Events                    Number of Events
Human-machine interface error      —
Random component failure           —

Source: Lee (1994).
difficult and may be subject to the classifier's biases or personal definitions.
In the AECB study, 459 event reports from 22 reactors over 13 years are being evaluated. The AECB found all trends either decreasing or flat, except those attributable to inappropriate human actions (which have shown a marked increase in the last five years). Hardware problems overall were found to be decreasing with time, although peaks can be found in some recent years. The number of software faults appears to be relatively constant over time.5
A large majority of the computer-related events occurred in digital control systems, which is not surprising given that they have been in operation the longest (since 1970) and perform a complex and continuous task: 363 computer system failures were in control systems, 29 in shutdown systems, and 65 in other systems. Table 4-2 shows the distribution of the failure types.
Of the problems classified as relating to software, 104 involved application software, four involved the executive or operating system, four a database or table, and five were classified as other.
We emphasize that the classification of the errors in this report was subjective, and thus the data should be used with caution. However, it does appear that a number of software errors have been found in operating nuclear power plant software, and more extensive evaluation and collection of data would be useful in making decisions about most of the issues in this report.
Finally, Ragheb notes that introducing modern digital I&C systems may not alleviate software quality assurance concerns. He points out: "Programmable logic controllers (PLCs) are being introduced as a cost-effective method of replacing older analogue or digital controls. PLCs have resulted in a number of incidents within the plants and it must be recognized that they are themselves digital computers."
TABLE 4-2 Summary of Canadian Software-Related Event Reports, 1980-1993

A study of PLCs used in a U.S. phenol plant (Paula, 1993) reported a processor failure rate of approximately two per year. The plant operators also reported a total of four complete PLC failures (both primary and secondary processors) for all PLCs over seven years of plant operation. No PLC failures were reported because of errors in the software (including operating systems and applications software) or because of operator error. For PLCs with fault-tolerant redundant architectures installed to perform control interlocks in several nuclear power plants of French design, Paula found there were 58 failures of both processors out of a total of 1,200 PLCs over a three-year period (Paula, 1993).
In evaluating these data, Paula warns that system size and complexity are important factors. The PLCs considered are relatively simple, generally accepting a few input signals and performing only a few control functions. In a study of fault-tolerant digital control systems that are much larger and more complex than these PLCs, the failure rates were about 15 to 50 times higher (Paula et al., 1993). In these fault-tolerant digital control systems, software errors were an important contributor to system failure. In several of the systems studied, failure due to software errors occurred as often as hardware failures, and the authors further (Paula et al., 1993) conclude that software errors tended to be difficult to prevent because they may occur only when an unusual set of inputs exists. Inadvertent operator actions, particularly during maintenance, also contributed significantly to the frequency of failures of these fault-tolerant digital control systems.
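The failure rates quoted in the preceding paragraphs can be reduced to comparable per-unit figures with simple arithmetic (a sketch using only the figures reported in the text above):

```python
# Per-unit failure rates implied by the figures quoted above (Paula, 1993).

# French-design plants: 58 failures of both processors among 1,200
# fault-tolerant PLCs over three years.
dual_failure_rate = 58 / (1200 * 3)      # dual-processor failures per PLC-year
# roughly 0.016 per PLC-year

# Phenol plant: four complete PLC failures over seven years, across all PLCs
# (the PLC population is not stated, so a per-PLC rate cannot be derived).
complete_failures_per_year = 4 / 7       # roughly 0.57 per year, plant-wide
```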
CONCLUSIONS AND RECOMMENDATIONS
Conclusion 1. Software quality assurance procedures typically monitor process compliance rather than product quality. In particular, there are no generally accepted evaluation criteria for safety-related software; rather, existing standards and guidelines serve mainly to codify best practices. Because most software qualities related to system safety (e.g., maintainability, correctness, and security) cannot be measured directly, it must be assumed that a relationship exists between measurable variables and the qualities to be ensured. To deal with this limitation, care must be taken to validate such assumed relationships, e.g., against past development activities, and to assure that the measurements being made are appropriate and accurate in assessing the desired software qualities.
Conclusion 2. Prior operating experience with particular software does not necessarily ensure reliability or safety properties in a new application. Additional reviews, analysis, or testing by a utility or third-party dedicator may be necessary to reach an adequate level of assurance.
Conclusion 3. Testing must not be the sole quality assurance technique. In general, it is not feasible to assure software correctness through exhaustive testing for most real, practical I&C systems.
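A back-of-the-envelope calculation shows why exhaustive testing is out of reach even for a modest interface (the interface size and test throughput below are hypothetical):

```python
# Even three 32-bit inputs yield an input space far beyond any test campaign,
# before accounting for internal state or the ordering of inputs over time.
NUM_INPUTS = 3                       # hypothetical 32-bit sensor inputs
cases = 2 ** (32 * NUM_INPUTS)       # distinct input combinations: 2**96
TESTS_PER_SECOND = 10 ** 6           # optimistic test throughput
years_to_enumerate = cases / TESTS_PER_SECOND / (3600 * 24 * 365)
# on the order of 10**15 years to enumerate every input combination
```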
Conclusion 4. USNRC staff reviews of the verification and validation process used during software development seem quite thorough.
Conclusion 5. Exposing software flaws, demonstrating reliable behavior of software, and finding unintended functionality and flaws in requirements are different concepts and should be assessed by a combination of techniques including:
Systematic inspections of software and planned testing with representative inputs from different parts of the systems domain can help determine if flaws exist in the software.
Functional tests can be chosen to expose errors in normal and boundary cases, and measures of test coverage can be reported for them.
Testing based on large numbers of inputs randomly selected from the operational profiles of a program can be used to assess the likelihood that software will fail under specific operating conditions.
Requirements inspections can be an effective method for detecting software defects, provided requirements are studied by several experienced people who did not participate in their construction. The effectiveness of these reviews also depends on the quality of the requirements.
A system-level hazard analysis can identify states that, combined with environmental conditions, can lead to accidents. The analysis should extend into software components to ensure that software does not contribute to system hazards.
Conclusion 6. The USNRC research programs related to software quality assurance appear to be skewed toward investigating code-level issues, e.g., coding in different languages to achieve diversity and program slicing to identify threads containing common code.
Conclusion 7. Rigorous configuration management must be used to assure that changes are correctly designed and implemented and that relationships between different software artifacts are maintained.
Conclusion 8. Software is not more testable simply because the design has been implemented on a chip. Use of any technology requiring design effort equivalent to software requires commensurate quality assurance. For example, this conclusion applies to ASICs (application-specific integrated circuits), PLCs (programmable logic controllers), and FPGAs (field-programmable gate arrays). However, the committee notes that the use of these technologies may be useful in addressing some configuration management problems.
Recommendation 1. Currently, the USNRC's path is to develop regulatory guides to endorse (with possible exceptions) a variety of industry standards. The USNRC should develop its own guidelines for software quality assurance that focus on acceptance criteria rather than prescriptive solutions. The draft regulatory guide, Software in Protection and Control Systems, by Canada's Atomic Energy Control Board is an example of this type of approach. The USNRC guidelines should be subjected to a broad-based, external peer review process including (a) the nuclear industry, (b) other safety-critical industries, and (c) both the commercial and academic software communities.
Recommendation 2. Systems requirements should be written in a language with a precise meaning so that general properties like consistency and completeness, as well as application-specific properties, can be analyzed. Cognizant personnel such as plant engineers, regulators, system architects, and software developers should be able to understand the language.
Recommendation 3. USNRC research in the software quality assurance area should be balanced in emphasis between early phases of the software life cycle and code-level issues. Experience shows the early phases contribute more frequently to the generation of software errors.
Recommendation 4. The USNRC should require a commensurate quality assurance process for ASICs, PLCs, and other similar technologies.
REFERENCES

AECB (Atomic Energy Control Board, Canada). 1996. Draft Regulatory Guide C-138, Software in Protection and Control Systems. Ottawa, Ontario: AECB.
Dijkstra, E.W. 1970. Structured programming. Pp. 84–88 in Software Engineering Techniques, J.N. Buxton and B. Randell (eds.). Brussels: Scientific Affairs Division, NATO.
Fagan, M.E. 1976. Design and code inspections to reduce errors in program development. IBM Systems Journal 15(3):182–211.
Gerhart, S., and L. Yelowitz. 1976. Observations of fallibility in applications of modern programming methodologies. IEEE Transactions on Software Engineering 1(2):195–207.
Goel, A.L., and F.B. Bastani. 1985. Foreword: Software reliability. IEEE Transactions on Software Engineering 11(12):1409–1410.
Harel, D., H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and M. Trakhtenbrot. 1990. STATEMATE: A working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering 16(4):403–414.
IEC (International Electrotechnical Commission). 1986. Software for Computers in the Safety Systems of Nuclear Power Stations, IEC 880. Geneva, Switzerland: IEC.
IEEE (Institute of Electrical and Electronics Engineers). 1993. IEEE Standard Criteria for Digital Computers in Safety Systems of Nuclear Power Generating Stations, IEEE Std 7-4.3.2–1993. New York: IEEE.
Joannou, P.K. 1993. Experiences for application of digital systems in nuclear power plants. NUREG/CP-0136. Pp. 61–77 in Proceedings of the Digital Systems Reliability and Nuclear Safety Workshop, U.S. Nuclear Regulatory Commission, September 13–14, 1993, Gaithersburg, Md. Washington, D.C.: U.S. Government Printing Office.
Lee, E.J. 1994. Computer-Based Digital System Failures. Technical Review Report AEOD/T94-03. Washington, D.C.: USNRC. July.
Leveson, N.G. 1995. Safeware: System Safety and Computers. New York: Addison-Wesley.
Lions, J.L., L. Lubeck, J.-L. Fauquembergue, G. Kahn, W. Kubbat, S. Levedag, L. Mazzini, D. Merle, and C. O'Halloran. 1996. Ariane 5 Flight 501 Failure: Report by the Inquiry Board. Paris: European Space Agency. July 19.
Moore, J.W., and R. Rada. 1996. Organizational badge collecting. Communications of the Association for Computing Machinery 39(8):17–21.
Musa, J.D., A. Iannino, and K. Okumoto. 1987. Software Reliability: Measurement, Prediction, Application. New York: McGraw-Hill Book Company.
Myers, G. 1979. The Art of Software Testing. New York: John Wiley and Sons.
Newberry, S. 1990. SSICB Review of the Load Sequencers in the Enhanced Power System at Turkey Point Plant, Units 3 & 4. Docket Nos. 50-250 and 50-251, November 5, 1990. Washington, D.C.
NRC (National Research Council). 1993. An Assessment of Space Shuttle Software Development Processes. Aeronautics and Space Engineering Board, National Research Council. Washington, D.C.: National Academy Press.
Parnas, D.L. 1985. Software aspects of strategic defense systems. Communications of the Association for Computing Machinery 28(12):1326–1335.
Parnas, D.L., A.J. van Schouwen, and S.P. Kwan. 1990. Evaluation of safety-critical software. Communications of the Association for Computing Machinery 33(6):636–648.
Paula, H.M. 1993. Failure rates for programmable logic controllers. Reliability Engineering and System Safety 39:325–328.
Paula, H.M., M.W. Roberts, and R.E. Battle. 1993. Operational failure experience of fault-tolerant digital control systems. Reliability Engineering and System Safety 39:273–289.
Porter, A., H.P. Sly, and L.G. Votta. 1996. A review of software inspections. Pp. 40–77 in Software Process, Advances in Computers 42, M.V. Zelkowitz (ed.). San Diego: Academic Press.
Ragheb, H. 1996. Operating and Maintenance Experience with Computer-Based Systems in Nuclear Power Plants. Presentation at International Workshop on Technical Support for Licensing of Computer-Based Systems Important to Safety, Munich, Germany. March.
Rushby, J. 1993. Formal Methods and the Certification of Critical Systems. Menlo Park, Calif.: SRI International. November.
Rushby, J., and F. von Henke. 1991. Formal Verification of the Interactive Convergence Clock Synchronization Algorithm Using EHDM. Technical Report SRI-CSL-89-3R. Menlo Park, Calif.: SRI International. August.
Rushby, J., and F. von Henke. 1993. Formal verification of algorithms for critical systems. IEEE Transactions on Software Engineering 19(1):13–23.
USNRC. 1992. Safety Evaluation by the Office of Nuclear Reactor Regulation Related to Amendment No. 138 to Facility Operating License No. DPR-39 and Amendment No. 127 to Facility Operating License No. DPR-48, USNRC, June 1992. Washington, D.C.: USNRC.
Common-Mode Software Failure Potential
INTRODUCTION AND BACKGROUND
Safety systems in nuclear power plants must reliably satisfy their functional requirements. To help achieve this goal, safety systems are designed to be single-failure proof, i.e., no single failure is to prevent safety system actuation if needed, nor shall a single failure cause a spurious activation. Various forms of redundancy are commonly used to achieve this design goal, i.e., to achieve the functional goals in the presence of component failures.
There are two approaches to providing redundant components: active redundancy and standby redundancy. In active redundancy, the outputs of multiple identical components or strings of components, operating in parallel, are compared or selected in some way to determine which outputs will actually be used. If the individual components are each highly reliable and fail independently, then a correct output can be assured with high probability.
To avoid the problem of spurious scrams in a nuclear power plant, the active redundancy may involve multiple channels, all carrying the same kind of information and connected so that no protection action will be taken unless a certain number of these channels trip simultaneously. For example, the output from four parallel strings of identical components might be combined using Boolean logic in such a way that the safety systems are activated when two of the four channels exceed the preset threshold level. In this way, a single channel failure cannot prevent or cause safety system activation.
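The two-out-of-four arrangement described above reduces to a simple vote; a minimal sketch:

```python
def two_out_of_four_trip(channels):
    """Actuate the safety system when at least two of the four redundant
    channels have tripped.

    channels: sequence of four booleans (True = channel above its preset
    threshold).
    """
    if len(channels) != 4:
        raise ValueError("expected exactly four channels")
    return sum(channels) >= 2

# A single failed channel can neither cause a spurious actuation ...
assert not two_out_of_four_trip([True, False, False, False])
# ... nor block actuation on a genuine demand seen by the other channels.
assert two_out_of_four_trip([False, True, True, True])
```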
The second type of redundant design uses standby (or backup) redundancy. In this scheme, one or more spares are available to replace failed components. An example of standby redundancy is switching to an alternate or backup power supply when loss of electrical power is detected. Combinations of active and standby redundancy can also be used.
In both active and standby redundancy, components are designed to implement the same function. If the components are identical, this is called component duplication. Component duplication provides protection against independent failures caused by physical degradation (e.g., wearing out) of the components.
The benefits of component duplication can be defeated by common-cause or common-mode failures. Common-cause failures are multiple component failures having the same cause. Common-mode failures denote the failure of multiple components in the same way, such as stuck open or fail as-is. Common-cause and common-mode failures occur when the assumption of independence of the failures of the components is invalid.
Common-cause failures can occur owing to common external or internal influences. External causes may involve operational, environmental, or human factors. The common cause may also be a (dependent) design error internal to the supposedly independent components.
To protect against common design errors, components with a different internal design (but performing the same function) may be used. This approach is called ''design diversity" in this report. Multiple versions of software that are written from equivalent requirements specifications are examples of design diversity. That is, the component requirements are the same, but the way the requirement is achieved within the component may be different. Two pieces of software that compute a sine function but use different algorithms to do so are an example of design diversity. As another example, consider two algorithms where the required function is to determine whether two numbers are equal. One algorithm may compute the ratio of the numbers and the other may compare their differences to some number epsilon which has a value close to zero.
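The equality-check example can be made concrete. The two routines below satisfy the same requirement ("decide whether two numbers are equal, to within epsilon") by different algorithms; the epsilon value and the zero-divisor fallback are illustrative assumptions, not part of the report's example.

```python
EPSILON = 1e-9   # illustrative tolerance

def equal_by_difference(x, y):
    """Version A: compare the difference of the numbers against epsilon."""
    return abs(x - y) <= EPSILON

def equal_by_ratio(x, y):
    """Version B: compare the ratio of the numbers against 1."""
    if y == 0.0:
        return abs(x) <= EPSILON     # ratio undefined; fall back to version A
    return abs(x / y - 1.0) <= EPSILON
```

Because the two versions share a requirements specification, a defect in that specification defeats the diversity; and the versions can legitimately disagree near the tolerance boundary, which is one reason diverse versions still need validation against common acceptance criteria.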
A second type of diversity, which is called "functional diversity" in this report, involves components that perform completely different functions at the component level (although the components may be related in that they are used to satisfy higher-level system requirements). The crucial point is that the component requirements are different. An example of functional diversity is the use of high reactor power to flow ratio to cause a reactor trip using control rods, and high coolant temperature to cause a reactor trip using
boron concentration. Diversity in this case involves using different principles of operation or physical principles to satisfy the same or different system-level requirements. In the case of software, functional diversity means that the behavioral requirements for the software are different. For example, one program may check to see whether two numbers are equal and another, functionally diverse, program might select the larger of two numbers.
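The trip-system example above can be sketched as two components with different behavioral requirements monitoring different physical variables; the thresholds and units below are illustrative assumptions only.

```python
def power_to_flow_trip(power_mw, flow_kg_s, limit=2.5):
    """Requirement A: trip on high reactor power to coolant flow ratio
    (actuation via control rods).  The limit is illustrative."""
    return power_mw / flow_kg_s > limit

def coolant_temperature_trip(temp_c, limit_c=320.0):
    """Requirement B: trip on high coolant temperature (actuation via
    boron injection).  The limit is illustrative."""
    return temp_c > limit_c
```

The two components are functionally diverse because their behavioral requirements differ, not merely their implementations; a requirements defect in one does not automatically afflict the other.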
Note that the components must have different functional requirements to count as functionally diverse. Digital components that have the same functional requirements are not functionally diverse and do not make two separate systems diverse. An example of the latter case is the use of a digital component or components to provide the same protection functions where a diverse means to actually shut down the reactor (such as control rods and soluble neutron absorption) is used. The system-level actuation functions may be physically different (dropping the control rods or injecting a soluble neutron absorber), but if the digital components are performing the same protection functions (detection of the conditions to signal the need for a reactor scram), then the digital components do not have functional diversity.
Redundancy is the use of duplication or diversity to provide alternate means of performing a required function in the event of failure of an individual item (single failure).
Redundancy may be active (all results, or components, are used) or standby (some results, or components, are not used until failure occurs).
Duplication is the use of multiple copies of the same component to provide protection against independent failures caused by physical degradation.
Design diversity is the use of two or more components with a different internal design but performing the same function.
Functional diversity is the use of two or more components to achieve different functions at the component level, although the functions may be related in terms of higher-level system requirements.
Design diversity and functional diversity are used to protect against common-cause or common-mode failures.
This chapter is concerned only with digital components. Design diversity, as defined above, is not extensively practiced in nuclear power plants for analog instrumentation and control; identical components and devices are used in redundant channels. This practice results from a conscious decision that design diversity of the nature suggested for software would introduce counter-productive complexity into the hardware environment. Analog systems are believed to fail in more predictable and obvious ways than do the more hidden and insidious failure mechanisms in software. This fact has allowed assessment and protection against common-mode analog failure potential without use of diversity except in a very limited way. It also allows the industry to collect operating experience on failure modes over a large application base.
Digital technology introduces a possibility that common-mode software failure may cause redundant safety systems to fail in such a way that there is a loss of safety function. Arguments for independence in redundant or functionally diverse hardware designs are often based either on the failures being related to different physical principles or causes, and therefore being acceptably independent, or on the ability to build in a particular failure mode, e.g., a valve that is designed to fail open. These same arguments and methods do not apply to software. When considering common-mode software failure, the issue is whether assumptions about independence could be compromised when digital components are substituted for analog components.
Although the committee found that some people use the term "common-mode software error" to mean any software error, the term as used here specifically denotes errors that involve dependencies between two or more digital components. When only one of a set of diverse components is digital, i.e., when a digital component is used in conjunction with analog devices or human backups (e.g., when a relay system, a digital device, and a manual actuator are used together to provide design diversity and adequate reliability), there appear to be no additional issues raised over current practice. The committee sees nothing special concerning the common-mode failure problem in this situation that is not covered by current procedures to evaluate the potential for common-mode failure between different types of devices.
Statement of the Issue
Digital technology introduces a possibility that common-mode software failures may cause redundant safety systems to fail in such a way that there is a loss of safety function. Various procedures have been developed and evolved for evaluating common-mode failure potential in analog devices. Do these same procedures apply to computers and software or are different approaches to ensuring reliability needed? What does software diversity mean? Can it be achieved and assessed and, if so, how? Do techniques exist for assessing common-cause failure and common-mode failure when computers are involved? What are the implications of common-mode software failure for the licensing process and the use of component diversity? Are redundancy and diversity the most effective way to achieve reliability for digital systems?
Applicability to Existing and New Plants
The problem of common-mode software failure is important in both retrofits of digital components into existing plants and in new plant design. In older plants where digital components are being substituted for analog ones, assumptions