4 Risk Assessment and Risk Management: The Committee's View

4.1 GENERAL CONCEPT

Almost lost in the strong public reaction to the Challenger failure was the inescapable fact that major advances in mankind's capability to explore and operate in space (indeed, even in routine atmospheric flight) will only be accomplished in the face of risk. The risks of space flight must be accepted by those who are asked to participate in each flight as well as by those who are responsible for the program. The Committee believes that the basis for NASA's acceptance of those risks should stem as much as possible from rationally derived criteria. This acceptance also should depend very heavily on the quality of the methodology and the degree of objectivity by which the risks are determined, as well as the rigor by which the risks are controlled (i.e., managed).

The Committee began its audit activities by focusing specifically on the FMEA, the CIL, and the hazard analysis process. However, very early in the data gathering phase it became clear that NASA's processes for analyzing failure modes, effects, and hazards could only be understood and evaluated intelligently when viewed as elements of an overall program of risk assessment and risk management. In the Committee's view, any such program should include the following basic elements:

1. A comprehensive method for identifying potential failure modes and hazards associated with the system.

2. A specific, quantitative methodology for identifying and assessing (or estimating) the safety risks of the system.

3. A risk management process by which the safety risks can be brought to levels or values that are acceptable to the final approval authority. Risk management includes: establishment of acceptable risk levels; institution of changes in system design or operational methods to achieve such risk levels; system validation and certification; and system quality assurance.
In this usage, we define a "safety risk" as the probability (likelihood or chance) of suffering a particular consequence of a failure mode, mishap, or hazard. For a large, complex system such as the STS, there is a set of system risks, each of which is comprised of many contributing risks. Thus, we use the plural "safety risks" of the system, since one may choose to manage these risks to different levels.

There are actually two major functions present in the listing above. Risk assessment is comprised of the first two elements: identification and assessment of both the failure modes and hazards, and the safety risks associated with them. Risk assessment is, or should be, a staff function, the results of which are provided as input to management. Risk management, on the other hand (the third element above), must primarily be a line management function. Within NASA, SRM&QA at Headquarters and SR&QA at the centers are staff organizations. The Associate Administrator for SRM&QA reports to the NASA Administrator. Line management authority for NSTS extends from the Administrator to the Level I Associate Administrator for Space Flight to the NSTS Program Director and thence through the Level II Program Office to the Level III project managers.

The concept of risk assessment and risk management is employed very explicitly within some private industries and public enterprises engaged in the engineering development of complex systems. The nuclear power industry is one such, and the commercial aerospace industry is another. Within the USAF Systems Command (including the Space Division, which develops military launch vehicles and spacecraft), risk assessment consists of a wide range of qualitative and quantitative tools, including the FMEA and hazard analysis. Risk management is viewed as a formal process involving the establishment, assessment, and control of risk to predetermined acceptable levels.

Figure 4-1 illustrates a generic type of program planning and tracking chart that is used in risk management by the USAF. Levels of risk in the system, as evaluated by a specific risk assessment methodology, are plotted against time (and the cost) to correct the problems contributing to risk. In this generic example, actual risk lags and exceeds the planned levels of risk for each category of risk, and throughout most of the program. The planned risk presents a target toward which the system risk is actively managed. The risk levels assessed at the conceptual design stage must eventually be evolved, through engineering, down to levels acceptable to the approval authority (i.e., high level, program line management). This is accomplished through a "systems safety engineering" function that is an integral part of the engineering design and development process from its inception.

4.2 NASA'S PROCESS: OVERALL COMMENTS

The fundamental view of risk assessment and management discussed above took shape over the first few months of the Committee's activities. It formed a framework within which the Committee could conduct the subsequent stages of the audit and more confidently evaluate NASA's STS safety program, of which the FMEAs, CILs, and hazard analyses are only a few important parts. Much of the remainder of this report reflects the results of our inquiry into specific aspects of the ways in which NASA assesses and manages risks in the NSTS program. But we believe it is important, before plunging into specifics, to provide a sense of the "big picture" within which the Committee conducted its audit, and to give a general assessment of how NASA's current process (as described in Section 3) relates to that picture.

4.2.1 NASA Risk Assessment

NASA defines risk as: "the chance (qualitative) of loss of personnel capability, loss of system, or damage to or loss of equipment or property." [NHB 5300.4 (1D-2), p. A-4]

To identify potential failure modes and hazards, NASA uses input from many different sources: analyses, data gathering processes, design reviews, etc. Figure 4-2, obtained from the SR&QA Office at JSC, lists most of these sources for the NSTS. (However, the Committee is not aware of any FMEAs or hazard analyses being conducted on software.) If employed rigorously, these tools provide a good basis for achieving element 1 of the three specified in Section 4.1. However, this list of sources might more appropriately be titled "Identify Potential Failures and Hazards," because most of the activities listed do not deal with risk. For example, the failure modes analysis identifies possible hardware failure modes, but usually says little about the risk associated with each of them. When the effects analysis is added in, then part of the input needed to establish risk has been gained, but still nothing is inferred about the probability of occurrence of either the failure itself or the various possible effects that might result. A similar situation occurs in the identification of hazards.
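The missing step can be made concrete with a minimal sketch. It assumes, purely for illustration, that each failure mode has been assigned a per-mission probability of occurrence and that each possible effect has a conditional probability given the failure. The class `FailureMode`, the function `consequence_risks`, and every number below are hypothetical and are not drawn from any NASA analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA-style worksheet."""
    name: str
    p_failure: float               # assumed probability of the failure per mission
    p_effect_given_failure: dict   # consequence -> assumed conditional probability


def consequence_risks(mode: FailureMode) -> dict:
    """Return P(consequence) = P(failure) * P(consequence | failure) per effect."""
    return {
        consequence: mode.p_failure * p_cond
        for consequence, p_cond in mode.p_effect_given_failure.items()
    }


# Illustrative numbers only; they are not taken from any real analysis.
valve = FailureMode(
    name="hypothetical valve fails closed",
    p_failure=1e-3,
    p_effect_given_failure={"loss of mission": 0.5, "loss of vehicle": 0.01},
)

risks = consequence_risks(valve)
for consequence, p in risks.items():
    print(f"{consequence}: {p:.1e}")
```

In this sketch the failure modes and effects analysis supplies only the names and the list of consequences; the two probability fields are the additional inputs that a quantitative assessment would have to estimate.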
One can categorize failure modes on the basis of the consequences of their worst-case effects, as is done in a very rough way in the Critical Items List, for failure modes whose worst-case effects lead (for example) to loss of life or vehicle. Such a categorization is useful for calling urgent attention to certain failure modes and their attendant hazards. Nevertheless, the listing of such items does not establish their contribution to the various risks of the system. In the NASA safety process, each item on the CIL has a retention rationale written for it. These retention rationale statements usually contain information which could, if used properly, contribute to a process for estimating the associated risk. However, the rationales appear to be used strictly as arguments for a waiver of the NSTS requirement that no single-point Criticality 1 or 1R failure modes be present when a mission is launched (see Sections 3.4 and 5).

[FIGURE 4-1: plot of system risk levels against time; graphic not recoverable from the source text.]

HAZARD ANALYSES
DESIGN & ENGINEERING STUDIES
DEVELOPMENT & ACCEPTANCE TESTING
SAFETY STUDIES AND ANALYSES
FMEAs, CILs, & EIFA
CERTIFICATION TEST AND ANALYSIS
SNEAK CIRCUIT ANALYSES
MILESTONE REVIEWS
FAILURE INVESTIGATIONS
WAIVERS AND DEVIATIONS
WALK-DOWN INSPECTIONS
MISSION PLANNING ACTIVITIES
SOFTWARE REVIEWS
ASTRONAUT DEBRIEFINGS AND CONCERNS
OMRSD/OMI
FLIGHT ANOMALIES
FLIGHT RULES DEVELOPMENT
AEROSPACE SAFETY ADVISORY PANEL
LESSONS LEARNED (OTHER PROGRAMS)
ALERTS
CRITICAL FUNCTIONS ASSESSMENT
INDIVIDUAL CONCERNS
HOT LINE
PANEL MEETINGS
SOFTWARE HAZARD ANALYSIS
FAULT TREE ANALYSIS
INSPECTIONS
CHANGE EVALUATION
REVIEW OF MANUFACTURING PROCESS
HUMAN FACTORS ANALYSIS
SIMULATIONS
PAYLOAD HAZARD REPORTS
REAL TIME OPERATION
PAYLOAD INTERFACES

FIGURE 4-2 Techniques for the identification of potential sources of risk in the NSTS Program (after NASA JSC SR&QA).

Similarly, in NASA's hazard analysis process, hazards are categorized as to level and status. Hazards are defined as either critical or catastrophic, depending on whether or not there is time for any possible emergency action to be taken. Each "closed" hazard is categorized as being eliminated, controlled, or an "accepted risk." Rationales are written to justify accepting the uncontrolled hazards; many times the same rationale is employed that was used for retaining the critical failure modes (see Section 5.3 for elaboration). However, as in the case of the CILs, these justifications do not establish the risk levels of the hazards.

Thus, although the term "risk assessment" is used in many different ways and places in NASA documents and presentations, the Committee found that nowhere was the total activity described that is needed to accomplish element 2 in Section 4.1 above (i.e., a quantitative methodology for assessing safety risks).
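To illustrate why a worst-case categorization alone does not establish each item's contribution to system risk, the following minimal sketch combines several contributing risks into one system-level risk for a single consequence. It assumes the contributors are statistically independent, which is a simplification introduced here and not a method described by NASA or the Committee; the item names, the probabilities, and the function `system_risk` are all hypothetical.

```python
import math


def system_risk(contributing_probs):
    """Probability that at least one contributor produces the consequence.

    Assumes the contributors are statistically independent; this is a
    simplifying assumption of the sketch, not a claim from the report.
    """
    return 1.0 - math.prod(1.0 - p for p in contributing_probs)


# Hypothetical items that would all fall in the same worst-case category on a
# critical items list; the probabilities are invented for illustration.
items = {
    "item A": 1e-2,
    "item B": 1e-4,
    "item C": 1e-6,
}

# Rank the items by their estimated contribution rather than by category.
ranked = sorted(items, key=items.get, reverse=True)
total = system_risk(items.values())
share_of_a = items["item A"] / sum(items.values())

print("ranking:", ranked)
print(f"system risk: {total:.4e}")
print(f"share attributable to item A: {share_of_a:.1%}")
```

All three hypothetical items would carry the same criticality label, yet under these assumed numbers one of them accounts for nearly all of the system risk. That distinction is visible only once probabilities are estimated.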
In NASA's definition of risk (above), the word "chance" is used as the measure (or basis of comparison) of the risk. The definition clearly implies evaluation of a set of risks based on the chance of occurrence of each of the various consequences described. However, NASA acknowledges, and our reviews have confirmed, that these "chances" are not formally or specifically estimated; nor are they documented. Rather, STS risks are assessed based on subjective judgments and the approval of qualitative rationales by various board and panel chairmen, and Level II and I authorities, as described in Section 3. However, many quantitative engineering analyses and test data relevant to risk assessment are available and often are used in arriving at what are finally qualitative subjective judgments. With such a non-specific (i.e., non-value based) risk acceptance process there is little basis for making objective comparisons of the several major risk categories associated with the STS, nor for carrying out risk evaluations by independent agencies. Neither can one systematically evaluate the results of efforts to reduce the risk of the various possible losses. Without more objective, quantifiable measures of relative risk it is not clear how NASA can expect to implement a truly effective risk management program.

4.2.2 NASA Risk Management

The various NASA documents identified in Section 3 (including Section 3.4), with some of their key provisions noted, basically describe a framework within which to operate an effective risk management program. At the core of such a program is the idea of risk management through the control of hazards. Residual hazards (risks) that cannot be designed away would be controlled at least to levels consistent with program objectives and cost constraints. The definition and analysis of hazards and levels of risk associated with a system and its operation was to be performed within a system safety function. Since the effective level of hazard control was not always expected to be perfect, a "residual hazard risk analysis" would be performed to provide the retention rationale for accepting such hazards and for continuing to operate (perhaps with constraints).

In parallel with and providing inputs to this system safety function is a reliability activity. This function was to be basically concerned with establishing a data base for selection of components which would meet allocated failure probability requirements; performing failure mode and effects analyses; establishing redundancy criteria and configuration definitions, maintainability criteria, and life limits; and preparing critical items lists containing items with single-point failure modes which could cause catastrophic results.

A third element in the overall safety and risk management program is quality assurance. This function, as defined by NASA, would be responsible for assuring that the hardware and software produced for the system was produced in a controlled way and met all requirements of the quality control criteria documents. This assurance role also includes supervision of personnel certification and establishment of non-destructive testing methods to detect flaws in components and non-conforming materials.

These functions provide the basic staff capability which line management can bring to bear on the management of risk in the NSTS Program. NASA's own explicit view of risk management for the NSTS was described to the Committee at JSC. It is conceived to be a synthesis of activities in four broad categories:

• Programmatic
• Engineering/development
• Mission operations
• Product assurance

As depicted in Figure 4-3, activities in all categories are conducted throughout all phases of the NSTS Program, from concept definition to flight operations. The risk management process is said to be characterized by top-down direction and control, with "bottom-up" response and accountability from the staff organizations and line management at the NASA centers. The process of risk assessment and management is described as one of "independent but integrated participation" by Program management, design/development (project engineering), operations (Astronaut Office and Mission Operations Directorate), and SR&QA. These terms are key: the degree of independence and integration of organizations and functions within the overall process comprise a major, recurring theme of the discussion presented in the following Section 5.

[FIGURE 4-3: risk management activities in the four categories across the phases of the NSTS Program; graphic not recoverable from the source text.]

4.3 SUMMARY

The basic organizational elements are in place within NASA for assessing and managing risk; however, there is a need for a change in the scope of functions and the way that they are carried out. Certain shortcomings in process and methodology exist which are discussed in the following section. In particular, there is a fundamental problem in the nature of and the methods used to develop the overall assessments on which NASA line management bases its decisions about how to reduce and control risk in the STS. Also, it appears to the Committee that there is no clear, formal, and rigorous view among NASA line managers (at least on any consistent basis) of the nature and goals of risk management.

To reiterate what was said earlier, the Committee believes that risk management for any system involving complex engineering must be the responsibility of line management, i.e., (in the case of the NSTS) the system Program Manager, the Associate Administrator for Space Flight and, ultimately, the Administrator of NASA. Only this program management, not the safety organizations, can make judicious use of the means available to achieve the operational goals while evolving the safety risks down to acceptable levels, as described earlier.

The safety organizations at NASA centers and Headquarters are staff organizations; i.e., they can and should be responsible for providing the assessments of the system's risks. They should also be responsible for assuring that the activities associated with controlling the risks to the levels assessed have been carried out and documented. Safety organizations cannot, however, assure safe operation; they can only assure that the safety risks have been evaluated by approved, proper, rigorous, quantitative, and objective methods, and that the system configuration and its operation are being controlled to those risk levels.