
APPENDIX D
PROBABILISTIC RISK ASSESSMENT

1. THE APPROACH TO QUANTITATIVE RISK MANAGEMENT

The output of a quantitative risk management function is a quantification and prioritization of issues, the controlling of which leads to optimal decisions involving safety, reliability, quality, performance, and cost. The approach is to implement a methodology that interprets, synthesizes, and integrates all elements of a product assurance program into a form suitable for decision making. The input would be the results from the various safety, reliability, and quality assurance programs of the field offices. The transformation of this information into a useful basis for decision making is the step that enables meaningful risk management to occur.

The National Aeronautics and Space Administration (NASA) has a variety of documents covering the approach to be taken in the discipline areas of safety, reliability, maintainability, and quality assurance. These documents, subject to revisions, would be the basic guides to be implemented by the various centers. It is the task of the risk assessment function to systematically process the output of the centers into a form suitable for meaningful risk management. The key requirements for this critical information processing and assessment step are as follows:

· The figures of merit must be explicit and quantitative.
· The information processing must be based on an integrated systems engineering approach (see also Section 5).
· The quantification of uncertainty must be an integral part of the information processing (see also Appendix E).
· The contributors to risk must be explicit, prioritized, and defined in terms that enable measurable corrective actions.
· Finally, the results should provide the basis for rational analysis of alternatives for reducing and controlling risk.

The logic engine for carrying out the information processing is a risk-based model of each space system.
The model should be structured to give perspective to the importance of the various tasks associated with the product assurance activity. The model must be a living model with continuous input into and from the design process. While this approach probably is not warranted in many cases, such as small automated spacecraft, it should be considered in large, complex programs, especially those with potential risk to human life such as the STS or the Space Station.

2. TWO KINDS OF CONFIDENCE

The essential objective of the risk management effort is "confidence": confidence that each space mission will perform substantially as planned, and confidence that it will not be destroyed or rendered significantly less useful by accidents or unforeseen problems (including excessive cost).

Now, what is meant by confidence? One way we humans increase our confidence is to believe that we are highly competent. We shall call this "psychological" confidence. It can be extremely important for the effectiveness of an organization. NASA has done an excellent job in this area in the past, and this needs to continue.

There is another kind of confidence that we shall call "engineering" confidence. This comes from in-depth understanding of the system under consideration, from deep knowledge of the design and testing program, and from knowing how to achieve quality in manufacturing, maintenance, operation, and flight readiness.

There is another dimension to this notion of gaining engineering confidence. This comes from acknowledging that nothing ever built by man is 100% reliable. It comes from knowing that risks are always present. The objective, therefore, is to know just how large the risk is. Thus, engineering confidence and success come not from eliminating risk, which is impossible, but from controlling it and managing it. That means knowing what it is, measuring it, knowing its size, shape, structure, etc., and taking steps to reduce the risk to acceptable levels.
Thus, the idea of engineering confidence is essentially equivalent to the quantification of risk. This equivalence makes engineering confidence an objective quantity, as distinct from psychological confidence, which is subjective. Psychological confidence is a matter of good feeling. Engineering confidence is objectively and logically related to the evidence available, that is, to the information, experience, test data, calculations, and, indeed, to the consensual judgments of the experts involved. Engineering confidence is the quantitative expression of that evidence. That expression is formulated according to strict, logical, invariable rules. It is not a matter of opinion or mood.

When a satisfactory level of engineering confidence has been established, then those involved in the program indeed will have a "good feeling." Therefore, engineering confidence produces psychological confidence. The reverse, as we know too well, is not necessarily true.

3. HOW IS CONFIDENCE GAINED OR REGAINED?

The public and Congress, based on past technological failures in the nation's space programs, are probably not going to be moved by psychological confidence in the future. Engineering confidence needs to be created. The issue of quantification needs to be faced. Those responsible for a program such as the NSTS need to be willing to ask themselves: "How confident are we that this design, this mission, this launch will succeed?" This is a powerful question, if it is properly used.

How is this question used properly? The first step is to provide the format in which the answer is to be given. This makes the question into a workable tool. The proposed format is as follows, taking the STS as an example:

Let us project ourselves into the future to a time when we can imagine that many thousands of Shuttle missions have been launched. One can now look back at the record and ask the following question: "In what fraction of these launches was the vehicle lost?" Let this fraction be $\phi_{LOV}$. This parameter would then be a very meaningful figure of merit describing the success, safety, and effectiveness of the program.
At the present time, of course, the numerical value of this parameter is not known. One can only tell the state of knowledge about what this value will be. This is done in the form of a probability density curve against $\phi_{LOV}$, using a logarithmic scale, as shown in Figure D-1.

FIGURE D-1 State of knowledge probability curve for frequency of loss of vehicle. (Axes: probability density $P_0(\phi_{LOV})$ versus $\phi_{LOV}$ on a logarithmic scale.)

This curve expresses the current knowledge about $\phi_{LOV}$ based on all the information and evidence available. The width of the curve reflects the degree of uncertainty about the value of $\phi_{LOV}$. The whole shape and location of the curve is a portrayal of the current state of confidence in the vehicle. Therefore, this "state of knowledge" curve can be adopted as the format for quantitative expression of confidence. This curve is also the bottom-line output of a risk analysis of the vehicle.

With curves of this type, together with an orderly compilation of the evidence on which the curve is based, NASA can build confidence in a tangible form. It can then communicate that confidence convincingly to the whole technical and management team, and also to Congress, to review committees, and to the public at large.

4. DOCUMENTING CONFIDENCE THROUGH A QUANTITATIVE RISK MODEL

At any point during the life of a project it is desirable to be able to reach for a document that presents the current risk status of the project in a compact, succinct, and quantitative form. This document should contain the bottom-line figures of merit and the numbers, tables, graphs, and diagrams that would capture and characterize the risk of the project. It also should make clear the main contributors to risk and the main sources of unreliability, doubt, and uncertainty at that time.

The document, which might be called the Risk Summary Report, would be updated regularly and might be the basic document upon which the risk management function would draw.
It would contain in an organized way the combined knowledge of the entire technical team on issues of risk. It would spell out what is known and not known on each point and would quantify all uncertainties so that decision makers could clearly understand the trade-offs among costs, benefits, and risks.

Such a document can only be generated as the summary output report of an ongoing quantitative

risk model (QRM) of the project. This model and this report, properly handled, could become an extremely useful mechanism, a primary channel for communication between management and the technical team. Indeed, it could become an important framework and mechanism for communication and coordination among all parts of the technical team. If used in this way, the report would make a major contribution to the success of the project.

The Risk Summary Report may be thought of as the final stage of an information machine. This machine is depicted in Figure D-2 as a kind of megaphone. At the right end in the figure are represented the working levels of the project and the design, fabrication, testing, and research organizations. The information from all these activities, relevant to risk, is continually gathered into the machine at the right. This information is digested and processed, through the logic of the QRM, and emerges finally as the Risk Summary Report.

The primary information flow is thus from right to left in this figure. However, there is also a very important reverse flow, a kind of "back EMF." The fact that this machine exists, that it is organizing and processing the information in certain ways, and that people are reading the output in certain ways, exerts a valuable orderly discipline on the working levels. Questions move from left to right, forcing the working levels to continually structure and organize their data and their thinking about risk.

If the information machine is properly constructed, it establishes not only an orderly calculating and recording mechanism but, perhaps even more importantly, it establishes a language and a conceptual framework that unifies and organizes the thinking, communication, and decision making of the whole project.
Not only are better design decisions thus made, but enormous savings in time and talent can result simply from the fact that everybody is using the same language so that, to a great extent, all participants mean the same things by the same words.

The QRM approach can provide an extremely valuable integrating framework for the Safety, Reliability, and Quality Assurance (SR&QA) activities. This framework would include the Failure Modes and Effects Analyses (FMEAs) and hazard analysis work, which would become in effect part of the QRM. Indeed, one of the benefits of the QRM approach is that it would help to ensure that the results of the FMEA and hazard work are fully recognized and acted on at the decision level.

One of the ways this benefit is achieved is through the discipline of quantification, which forces the major items to the surface, where attention must be paid to them. A second way is through the quantification of uncertainty, an even more stringent discipline, which forces an organization (for example), before it dismisses an item as an "acceptable" risk, to show quantitatively that the evidence available provides sufficient confidence to support that decision. The quantification of uncertainty also helps decision makers to know when a change in the hardware is needed or when the problem is just lack of confidence so that perhaps more testing is needed, rather than new designs.

FIGURE D-2 The Risk Summary Report as the final stage of an information machine. (Diagram labels: Risk Summary Report; project management decision-making; risk report proper (information machine); information flow; back EMF; project contractors; work packages, design, fabrication, test, etc.; outside experts.)

5. THE ELEMENTS OF PROBABILISTIC RISK ANALYSIS

5.1 The "Set of Triplets" Definition of Risk

In contemplating the design or operation of a project, those involved should say to themselves: "We know how things are supposed to work out; we know our plan. Now we would like to know what are the possible departures from that plan." Specifically, they would ask three questions:

· What can go wrong?
· What is the likelihood of that happening under the current plan?
· If it does happen, what are the consequences; i.e., what is the damage?

The answers to these questions constitute a risk and reliability analysis. The answers might be arranged in a table as in Figure D-3. The first column contains descriptions and names of scenarios. This is the answer to the first question above. The second column contains the likelihoods, $l_i$, of the scenarios, $s_i$. Here we use the word likelihood in a generic sense. How to quantify likelihood will be discussed in Section 5.2. The third column contains the "damage index," $x_i$, which is a measure of the consequences of the ith scenario.

Each row of the table thus constitutes a triplet giving a scenario, its likelihood, and consequences. This triplet constitutes then one answer to the three questions. The table itself, i.e., the set of all triplets denoted by the outer brackets, provides the total risk; in particular, $R = \{\langle s_i, l_i, x_i \rangle\}$ is the complete answer to the questions.

Answers to: (1) What can go wrong? (2) What is the likelihood? (3) What is the damage?

    SCENARIO    LIKELIHOOD    DAMAGE
    s_1         l_1           x_1
    s_2         l_2           x_2
    s_3         l_3           x_3
    ...         ...           ...
    s_N         l_N           x_N

    RISK = $\{\langle s_i, l_i, x_i \rangle\}$

FIGURE D-3 Quantitative definition of risk.

Therefore this set of triplets is adopted as the definition of risk, R. This definition becomes the organizing principle for the QRM and, thus, for the SR&QA work on the project. What is being sought in this work is the identification of all possible significant scenarios and the characterization of their likelihood and consequences.
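The set-of-triplets definition maps naturally onto a small data structure. The Python sketch below is illustrative only: the class name, field names, scenario descriptions, and numbers are assumptions made for the example and are not taken from any actual analysis. It also uses a single point value for each likelihood, whereas Section 5.2 expresses likelihood as a state of knowledge curve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One row of the risk table: (scenario, likelihood, damage)."""
    scenario: str      # s_i: what can go wrong
    likelihood: float  # l_i: here a point frequency per mission (hypothetical)
    damage: float      # x_i: damage index, larger means worse


# The risk R is the whole set of triplets.
# Every entry below is a made-up placeholder.
risk = [
    Triplet("valve sticks open", 1e-3, 10.0),
    Triplet("pump fails during ascent", 2e-4, 60.0),
    Triplet("structural failure", 5e-5, 100.0),
]


def scenarios_at_or_above(risk_set, damage_level):
    """Return the triplets whose damage index is at least damage_level."""
    return [t for t in risk_set if t.damage >= damage_level]


severe = scenarios_at_or_above(risk, 60.0)
print([t.scenario for t in severe])
```

Keeping each triplet immutable makes it straightforward to regroup the same set by damage level, subsystem, or flight phase without altering the underlying table.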
5.2 Quantifying Likelihood

The idea of likelihood can be expressed quantitatively in different ways. For NASA-type risk work the most useful way might be what is called the "probability of frequency" approach. In this approach, one can imagine a "model" in which a vehicle is launched, or a facility operated, under specified conditions many, many times. In this thought experiment the scenario, $s_i$, will occur with a certain "frequency," which is denoted $\phi_i$, and which is measured in occurrences per mission, per launch, per year, or other appropriate unit.

These frequencies $\phi_i$ may be thought of as abstract in the sense that, since the experiment cannot be run completely, the $\phi_i$ cannot be measured precisely. The $\phi_i$ actually are parameters of the model, and they can be usefully adopted as figures of merit indicating the safety and reliability of the system. We would like then to know the numerical values of these parameters, $\phi_i$.

As mentioned above, these values will never be known precisely. However, we are not totally at a loss either. There is always a certain body of evidence and information relevant to these values. So now one can ask, "What inferences can be drawn from this evidence about the values of these parameters, and with what degrees of confidence can those inferences be drawn?"

The answers to this question can be expressed in the form of probability curves against the possible values of the parameters (as in Figure D-1). These curves are called state of knowledge curves. They become the final quantitative expression of risk and reliability.

The remaining question is how these curves are developed from evidence available, considering that the evidence may be of very differing types: test data, actual flight experience, calculations, judgment of experts, experience of other similar equipment, etc. The answer is that the development of these curves makes heavy use of the fundamental theorem of inference, Bayes' theorem. The use of this theorem is partly art and partly science, but it always can be done in a way that is meaningful for decision making purposes.

In order for the individual state of knowledge curves on the $\phi_i$ to be a complete specification of the knowledge available, certain assumptions must be made. One is that the scenarios are approximately mutually exclusive; i.e., only one can happen at a time. Another is that, conditional on the data, different $\phi_i$ are statistically independent. If these assumptions are not satisfied, more complex applications of Bayes' theorem are required. However, for this discussion, we make these simplifying assumptions.

5.3 Structuring and Categorizing the Triplets

Since the number of possible scenarios for a system can be very large, it is important in carrying out a Probabilistic Risk Assessment (PRA) to organize and categorize the set of triplets. This can be done in many ways.

Perhaps the most important categorization of triplets is by the magnitude of the consequent damage. For this, one wants to know what scenarios lead to destruction or inactivation of the space mission. What is the total probability of such scenarios? What scenarios lead to substantial decreases in the system's performance or usefulness? What is the probability of that outcome?

A second way would be to categorize scenarios by the part of the system complex in which they originate. This would give us a picture of the risk of the various elements and subsystems.
Another important way of looking at the problem is to categorize the triplets by the phase of the flight in which they take place, thus making visible the risks attendant on each flight phase.

5.4 Pictorial Representation of Risk

It may be useful for some purposes to express the damage $x_i$ on an index scale, [0, 100]. The value $x_i = 0$ represents no damage and the value $x_i = 100$ represents loss of vehicle (LOV). Intermediate values of $x_i$ represent partial loss of mission or vehicle. With this idea a useful pictorial presentation of risk can be developed in the following way:

In the risk table, Figure D-3, the scenarios can be numbered in order of increasing damage, that is, such that $x_{i+1} \ge x_i$, and let N be the total number of scenarios. Then we can define

$\Phi(x_i) = \sum_{j=i}^{N} \phi_j$

Thus defined, $\Phi(x_i)$ is the total frequency of all scenarios having damage level $x_i$ or greater. If these $\Phi(x_i)$ are plotted on a log scale versus $x_i$ and the resulting step-function is smoothed, a curve, $\Phi(x)$ vs. $x$, is obtained which is known variously as the "risk curve," the Rasmussen curve, or the "frequency of exceedance" curve, as in Figure D-4. Its ordinate over any $x$ is the frequency with which scenarios occur having damage equal to or greater than $x$. This curve also may be viewed as a figure of merit of the system.

FIGURE D-4 Risk curve. (Axes: $\Phi(x)$ on a logarithmic scale versus damage $x$ from 0 to 100.)

As before, since the $\phi_i$ are not known exactly, one will not know the risk curve exactly. But from the uncertainty in the individual $\phi_i$, the uncertainty in

$\Phi(x)$ can be calculated. This uncertainty can then be presented in the form of a family of risk curves $\{\Phi_P(x) : 0 \le P \le 1\}$, shown, for example, in Figure D-5. This graph is called a "risk diagram." For a fixed $x$, the uncertainty about $\Phi(x)$ can be quantified by

$\Pr\{\Phi(x) \le \Phi_P(x)\} = P$

Suppose, for example, that $\Phi_{0.99}(100) = 10^{-2}$. This means a confidence level of 99% that the frequency of LOV [i.e., $\Phi(100)$] is less than or equal to 0.01.

From a portrayal of such risk diagrams one can gain a rapid understanding of the contributions that various sources make to the overall risk of a system or program.

FIGURE D-5 Risk diagram. (Axes: frequency of exceedance versus damage $x$.)

5.5 Use of Risk Diagrams in Decision Making

Like everything else in life, large engineered systems, such as the STS, necessarily involve a degree of risk. In the case of engineered systems, however, intelligent design decisions can control the amount of risk.

Sometimes through a flash of insight it is possible to change or simplify a design in a way that not only reduces risk but also improves performance and reduces the cost. This does happen, and these are happy occasions. More often, however, the situation is that risk can be made, in principle, as small as one likes, but the price for this is diminished performance and increased cost of the system.

The task of management, therefore, is to strike an optimal balance between risk, cost, and performance. The balance is struck and fine-tuned continuously through day-to-day decisions, as the design evolves. In the "flash of insight" cases, the decisions are easy to make. In the more usual case, trade-offs are required. In these situations, it is useful and necessary to have quantitative input so that the amount of risk can be weighed against the levels of cost and performance.

The situation in such cases is portrayed in Figure D-6, which shows the anatomy of a general decision problem. Each option brings with it a certain risk, cost, and performance.
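As a rough illustration of how the risk curve and the risk diagram of Section 5.4 could be computed before being used in such decisions, the Python sketch below forms the frequency of exceedance as a reverse cumulative sum over scenarios ordered by increasing damage, then builds percentile curves from sampled scenario frequencies. The function names, the lognormal uncertainty model, the value of `sigma`, and all numbers are assumptions made for the example; the text does not prescribe them.

```python
import math
import random


def exceedance(freqs):
    """Phi(x_i) = sum of phi_j for j >= i, scenarios ordered by increasing damage."""
    out = []
    running = 0.0
    for phi in reversed(freqs):
        running += phi
        out.append(running)
    out.reverse()
    return out


def percentile_curve(sampled_curves, p):
    """For each damage level, the value not exceeded in a fraction p of the samples."""
    n_levels = len(sampled_curves[0])
    curve = []
    for i in range(n_levels):
        values = sorted(c[i] for c in sampled_curves)
        k = min(len(values) - 1, int(math.ceil(p * len(values))) - 1)
        curve.append(values[max(k, 0)])
    return curve


damage = [10.0, 60.0, 100.0]     # x_i, increasing; 100 represents loss of vehicle
median_phi = [1e-3, 2e-4, 5e-5]  # hypothetical median frequencies per mission

print(exceedance(median_phi))

# Assumed uncertainty model: each phi_i is lognormal about its median.
rng = random.Random(0)
sigma = 1.0  # hypothetical log-standard deviation
samples = [
    exceedance([m * math.exp(rng.gauss(0.0, sigma)) for m in median_phi])
    for _ in range(2000)
]
phi_05 = percentile_curve(samples, 0.05)
phi_95 = percentile_curve(samples, 0.95)
print(phi_95[-1])  # 95th-percentile frequency of exceedance at x = 100
```

Plotting `phi_05` and `phi_95` against `damage` on a logarithmic ordinate would give two members of the family of curves that make up a risk diagram.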
If these three factors were precisely known, it would be easy to make the decision. What makes that problem interesting in real life is that these factors are never known with complete certainty. It is important, then, to quantify these uncertainties as part of the input to the decision analysis. Figure D-6 shows the uncertainties in cost and performance quantified in the form of probability curves.

Each option, therefore, can be characterized by a triplet, $\langle C, B, R \rangle$, of diagrams. The decision maker must then choose which triplet (i.e., which option) he prefers. In the language of decision theory his degree of preference, as a function of the triplet, is called a utility function, U.

The role of quantitative risk analysis, as shown, is to provide the assessment of risk, including uncertainty, as part of the input to decision problems. Strictly speaking, PRA per se is limited to the risk part of the problem, but the same quantitative way of thinking, the same probabilistic methodology, can be and should be applied to the cost and performance factors as well.

5.6 Assembly and Disassembly of Risk

5.6.1 Identifying Scenarios

According to the definition of risk noted above, the first and most important step in risk assessment is to identify the scenarios. In this connection, the following are some key ideas.

First of all, note that any scenario that can be described is actually a category of scenarios. Thus, "the pipe breaks" is a category that includes as sub-categories, "the pipe breaks longitudinally," "there is a double-ended guillotine break," "the pipe breaks in such and such location," etc.

A second point is that since the objective is to identify all possible significant scenarios, any method

that helps one do that is good. Any new way of looking, any new way of categorizing that helps to be sure that no significant scenarios have been overlooked is good, so it is perfectly acceptable to use more than one approach to scenario identification.

FIGURE D-6 Decision model. (Each option, A through N, is characterized by probability curves for cost, c, and performance, b, and by a risk diagram from the probabilistic risk assessment. The utility of option A is $U_A = U(\langle C_A, B_A, R_A \rangle)$ and that of option N is $U_N = U(\langle C_N, B_N, R_N \rangle)$; the optimal decision is $\max(U_A, U_B, \ldots, U_N)$.)

One approach that is quite useful is to break the overall engineered system into parts and subparts. Each part can be examined in detail and the questions asked: "What can go wrong with this part? What scenarios can originate here?" This approach would seem to be particularly appropriate for space systems. "Parts" could be interpreted successively as physical segments of the total system, as functional subsystems in the system; they could also mean different phases of the system's mission life. Again, all different ways are helpful.

Another point of interest is that some scenarios are single-event scenarios. Something fails and the system is damaged or destroyed. Other scenarios require several different events to happen coincidentally, sometimes referred to as multiple failures. Other scenarios are "chains" of events. These are "cascade" or "domino" scenarios. Something happens initially and because of that something else fails, which causes a chain of propagating events resulting in overall system failure.

Each of these types of scenarios requires its own type of analytical tools. Failure modes and effects analyses (FMEAs) are useful for single-event scenarios; event trees and event sequence diagrams for chains of event-type scenarios; and fault trees for coincident failures.
In space systems and missions, one can expect all these types of scenarios to be present and expect all these analytic tools, and others, to be useful. The specific mix of methods and approaches should be determined by what is contributing to the risk.

5.6.2 Quantification of Scenarios

In a methodology that has worked well, long run frequency is used as the measure of likelihood of the scenario. Thus, an underlying Poisson-type random process model is used as the framework for discussing the risk and reliability behavior of the system. The scenario frequencies are then viewed as parameters in the Poisson model, and these parameters are used as figures of merit to indicate the safety and reliability of the system.

The values of these scenario frequencies are determined from the frequencies of all the component events (the "elemental" events) in the scenario, such as failure of valves, pumps, human errors, etc. The results of the modeling logic are thus to

express the frequencies of the scenarios in terms of the frequencies, $\lambda_j$, of these elemental events:

$\phi_i = F_i(\lambda_1, \lambda_2, \ldots, \lambda_j, \ldots)$    (1)

Now, the discipline of data analysis and statistical inference is applied. The question is asked: How big are the numbers $\lambda_j$? Again, the state of knowledge probability curves are used to provide the answer (see Figure D-7). These curves must reflect all of the evidence and information available which are relevant to the $\lambda_j$: all operating experience, test data, calculations, etc.

FIGURE D-7 State of knowledge probability curve for elemental parameter $\lambda_j$. (Axes: $P_j(\lambda_j)$ versus $\lambda_j$ on a logarithmic scale.)

In putting together this information, the logic of Bayes' theorem is used to help evaluate and combine the various types of evidence correctly. The discipline of this theorem forces one to organize and codify the evidence and helps to curb wishful thinking.

To apply Bayes' theorem one needs two basic ingredients. The first ingredient is a "prior" state of knowledge curve $P_{0,j}(\lambda_j)$, which quantifies the available qualitative information about $\lambda_j$. Qualitative information may be in the form of precise knowledge of related components or expert engineering judgment. The fact that this qualitative information can be quantified as a probability density is the major result of the theory of subjective probability that has been developed since the 1950s.

The second ingredient is the "likelihood function" associated with the available data that contains information about $\lambda_j$. These data could be industry data, test data, and/or field data. Let $D = (D_1, D_2, \ldots)$ be the vector of data available. The likelihood function, $L(\lambda_j, D)$, is proportional to the conditional probability of observing the data D given $\lambda_j$. For example, if the data are observed defects, then the likelihood function may be derived from the Poisson distribution.

Bayes' theorem integrates these sources of information. The state of knowledge curve for $\lambda_j$ given all information is $P_j(\lambda_j)$, which is proportional to

$P_{0,j}(\lambda_j) \, L(\lambda_j, D)$

The proportionality constant is chosen so that $P_j(\lambda_j)$ is a probability density (i.e., it integrates to 1).

Having the curves $P_j(\lambda_j)$, they can now be "propagated" through equation (1) to obtain curves for the $\phi_i$ (Figure D-8). Finally, since the total loss-of-vehicle frequency is the sum of the $\phi_i$,

$\phi_{LOV} = \sum_i \phi_i$

the curves $P_i(\phi_i)$ (through a mathematical convolution) are simply aggregated to obtain a new curve, $P_1(\phi_{LOV})$, for the LOV frequency. This curve, in relation to the initial curve, $P_0(\phi_{LOV})$ from Figure D-1, might appear as in Figure D-9. Curve $P_1$ is a more satisfactory state of knowledge than $P_0$, and thus is a better basis for a "go" decision.

FIGURE D-8 State of knowledge probability curve for scenario frequency. (Axes: $P_i(\phi_i)$ versus $\phi_i$ on a logarithmic scale.)

This aggregation should be done in stages, so the results can be viewed at various levels of aggregation such as system, subsystem, unit. In this way, one could answer macroscopic questions like: "What is the total frequency of events that could destroy or inactivate the system?" By proceeding downward in the aggregation, one could then see, at successively greater levels of detail, where the bulk of this frequency is coming from. This draws management's attention to the aspects of the design needing further attention.

FIGURE D-9 States of knowledge (confidence) before and after PRA. (Curves $P_0(\phi_{LOV})$ and $P_1(\phi_{LOV})$ versus $\phi_{LOV}$ on a logarithmic scale.)
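The update and aggregation steps of Section 5.6.2 can be sketched numerically. The Python example below is a minimal sketch under stated assumptions: it uses a gamma prior with a Poisson likelihood (a conjugate pair chosen here for convenience, not prescribed by the text), treats each scenario frequency as a single elemental frequency (the simplest possible form of equation (1)), assumes the frequencies are independent, and aggregates by Monte Carlo sampling in place of an analytic convolution. All names and numbers are hypothetical.

```python
import random


def gamma_poisson_update(prior_shape, prior_rate, failures, exposure):
    """Posterior gamma parameters for a Poisson rate lambda.

    Prior: lambda ~ Gamma(shape, rate). Data D: `failures` events observed in
    `exposure` missions. The posterior is proportional to prior times
    likelihood, which for this conjugate pair is
    Gamma(shape + failures, rate + exposure).
    """
    return prior_shape + failures, prior_rate + exposure


def sample_total_frequency(posteriors, n_samples, seed=0):
    """Monte Carlo samples of the sum of independent gamma-distributed frequencies."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_samples):
        # random.gammavariate takes (shape, scale); scale = 1 / rate.
        totals.append(sum(rng.gammavariate(a, 1.0 / b) for a, b in posteriors))
    return totals


# Hypothetical priors (shape, rate) and data (failures, missions) per scenario.
priors = [(0.5, 50.0), (0.5, 500.0)]
data = [(1, 100.0), (0, 100.0)]

posteriors = [
    gamma_poisson_update(a, b, k, t) for (a, b), (k, t) in zip(priors, data)
]
post_means = [a / b for a, b in posteriors]
print(post_means)

# Sorted samples approximate the state of knowledge curve for the total frequency.
totals = sorted(sample_total_frequency(posteriors, 5000))
median = totals[len(totals) // 2]
p95 = totals[int(0.95 * len(totals))]
print(median, p95)
```

Summing samples by subsystem first, and only then across subsystems, would reproduce the staged aggregation described above.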

5.6.3 Design Improvement

The improvement between curves $P_0$ and $P_1$ in Figure D-9 is simply an improvement in knowledge and confidence coming from study and analysis (PRA). It does not reflect any actual changes to the design of the system. If one now recognizes that, in the course of such a study and analysis, many areas of the design or maintenance/operation practices will surely be discovered where we can do better, and if those improvements are then implemented, the probability curve will change again, hopefully to something like the curve $P_2$ in Figure D-10.

With repeated cycles of this type of analysis and with continued experience and technology improvement, one may hope ultimately to achieve something like curve $P_3$, which perhaps is what is needed to support a viable manned space program.

FIGURE D-10 Evolutionary system improvements are reflected in changes in the state of knowledge curves. (Curves $P_0(\phi_{LOV})$, $P_2(\phi_{LOV})$, and $P_3(\phi_{LOV})$ versus $\phi_{LOV}$ on a logarithmic scale.)
