
Post-Challenger Evaluation of Space Shuttle Risk Assessment and Management (1988)

Chapter: SPACE TRANSPORTATION SYSTEM RISK ASSESSMENT AND RISK MANAGEMENT: DISCUSSION AND RECOMMENDATIONS

Suggested Citation:"SPACE TRANSPORTATION SYSTEM RISK ASSESSMENT AND RISK MANAGEMENT: DISCUSSION AND RECOMMENDATIONS." National Research Council. 1988. Post-Challenger Evaluation of Space Shuttle Risk Assessment and Management. Washington, DC: The National Academies Press. doi: 10.17226/10616.
5 National Space Transportation System Risk Assessment and Risk Management: Discussion and Recommendations

5.1 CRITICAL ITEMS LIST RETENTION RATIONALE REVIEW AND WAIVER PROCESS

The Committee views the NASA critical items list (CIL) waiver decision-making process as subjective, with little in the way of formal and consistent criteria for approval or rejection of waivers. Waiver decisions appear to be driven almost exclusively by the design-based FMEA/CIL retention rationale, rather than by an integrated assessment of all inputs to risk management. The retention rationales appear biased toward proving that the design is "safe," sometimes ignoring significant evidence to the contrary. Although the Safety, Reliability, and Quality Assurance (SR&QA) organizations of NASA collect, verify, and transmit all data related to FMEA/CIL and hazard analysis results, the Committee has not found an independent, detailed analysis or assessment of the CIL retention rationale which considers all inputs to the risk assessment process.

As set forth in the NASA documents identified in Section 3.1, both the performance of the Failure Modes and Effects Analysis (FMEA) and the identification of critical items are intended to be carried out under the aegis of the reliability function. In principle, the FMEA should be both a design tool to provide an impetus for design change, and a tool for the evaluation of the final configuration in order to define the necessary control points on the hardware. The identified critical items would require supporting retention rationale and waivers as appropriate in order to be included in the overall as-flown system configuration. How this retention rationale was to be generated, who developed it, and who evaluated it against what safety criteria became crucial questions for the Committee's review of the whole process.
According to prescribed procedures, the hazard analyses performed by the safety function of SR&QA, and the FMEA and CIL identification performed by the reliability function, were to come together in the generation of Mission Safety Assessment (MSA) reports, which would contain analyses and justification of the retention rationale for the critical items and their associated hazards, as well as a safety-risk assessment of the resulting units, subsystems, and systems. The hazard analysis and Mission Safety Assessment parts of this overall safety and risk assessment process, as it was supposed to be done prior to 1986, are shown in Figure 5-1, obtained from JSC's SR&QA.

[Figure 5-1: Pre-1986 hazard analysis and Mission Safety Assessment process (obtained from NASA JSC SR&QA).]

As Figure 5-1 indicates, according to specified NASA procedure the CIL retention rationale is to be used as one of many inputs to the more comprehensive hazard analysis. In reality, however, the hazard analysis is often simply a derivative of the CIL and its retention rationale, and is not used as a major basis for waiver decisions. Examination by the Committee showed that often these retention rationales were simply discussions of the hardware's specifications, design, and testing. They were generated primarily by the functional development engineers responsible for the design. They are intended to be justifications, and do not, in our view, provide a true assessment of the risk of the hazards. Sometimes the rationale appears to be simply a collection of judgments that a design should be safe, emphasizing positive evidence at the expense of the negative, and thus does not give a balanced picture of the risk involved.

For example, the CIL retention rationale of December 1982 for the Solid Rocket Motor (SRM) indicated in support of retention that: there had been no failures in three qualification, five development, and ten flight motors; there had been no leakage in eight static firings and five STS flights; 1076 Titan III joints (presumably of similar design) were tested successfully; etc. Missing from the retention rationale was, among other points, any discussion of the dissimilarities between the SRM and Titan III (e.g., insulation design and combustion pressure on the O-ring); the O-ring erosion observed in the Titan III program and on the second STS flight; a failure during an SRM burst test; and, since the rationale was not updated, all of the O-ring anomalies seen after December 1982.

Furthermore, in many cases we reviewed:

· No specific methodology or criteria are established against which these justifications can be measured.
· The true margins against the failure modes often are not defined or explicitly validated.
· The probability of the failure mode is never established quantitatively.
· Design "fixes" are accepted without being analyzed and compared with the configuration they are replacing on the basis of relative risk.

The point is worth reiterating: the retention rationale is used to justify accepting the design "as is"; Committee audits of the review process discovered little emphasis on creative ways to eliminate potential failure modes.

Since 51-L, there has been a major increase in the attention and resources given to STS SR&QA and risk assessment and management functions at all levels of NASA and its contractors.
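The criteria gaps identified above are, in effect, the acceptance tests that a formal waiver screen would apply. Purely as an illustration of what such a screen might check (the record fields, item names, and wording below are our own assumptions, not NASA practice), a waiver package could be tested for the quantitative elements the Committee found missing:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RetentionRationale:
    """Hypothetical summary of the quantitative content of a waiver package."""
    item: str
    failure_mode: str
    margin_validated: bool = False         # true margin defined and validated?
    p_failure: Optional[float] = None      # failure-mode probability, if quantified
    fix_compared_to_baseline: bool = True  # design "fix" compared on relative risk?
    contrary_evidence: List[str] = field(default_factory=list)

def screening_findings(r: RetentionRationale) -> List[str]:
    """Return the Committee-style criteria that the package fails to meet."""
    findings = []
    if not r.margin_validated:
        findings.append("margin against the failure mode not validated")
    if r.p_failure is None:
        findings.append("failure probability not established quantitatively")
    if not r.fix_compared_to_baseline:
        findings.append("design fix not compared with prior configuration")
    if not r.contrary_evidence:
        findings.append("no contrary evidence discussed")
    return findings

# Hypothetical package containing only qualitative design justification.
pkg = RetentionRationale(item="hypothetical seal", failure_mode="leak past seal")
for finding in screening_findings(pkg):
    print("REJECT:", finding)
```

The point of the sketch is only that each criterion becomes an explicit, checkable condition rather than an implicit judgment made differently by each review board.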
In 1986, NASA appointed an Associate Administrator at Headquarters for Safety, Reliability, Maintainability, and Quality Assurance (SRM&QA) and charged him with establishing a NASA-wide safety and risk management program. To implement this program, policy directives are being developed relating to various procedures and operational requirements. Specific instructions and methodologies to be used in the conduct of various analyses and assessments, such as hazard analyses, are being developed. Independent institutional assessments and audits will be made of SR&QA activities and technical effectiveness at each NASA center.

Some important elements of this revamped NASA safety program, including hazard analysis and mission safety assessment, are depicted in Figure 5-2, which was obtained from the JSC SR&QA organization in May 1987. Several things shown in the figure should be noted. First, there is now a specific new set of NSTS instructions to all contractors and NASA organizations for conducting hazard analyses, and for preparing FMEAs and CILs for the NSTS (these new instructions affect the activities in the boxes in Figure 5-2 marked with an asterisk). Second, it can be seen that the FMEA/CIL documents are intended to be one of many inputs into the hazard analysis and Hazard Report, which in turn are shown as an input into the Mission Safety Assessment. However, since (as discussed in Section 4.2) the Hazard Reports do not provide a comprehensive risk assessment, nor are they even required to be an independent evaluation of the retention rationale stated in the CILs, the Committee believes that NASA plans, at least for the near term, to continue using the retention rationale of the CILs directly and individually as the basis for Criticality 1 and 1R waiver justifications to Levels II and I. We have indicated this by adding the Criticality 1 and 1R waiver path within the dashed lines on the left side of Figure 5-2.

The current plan is to take the critical item waiver requests to the PRCB and Level I via a data package prepared by JSC SR&QA. It is our impression, however, that most of the arguments in this data package will still basically be those contained in the original CIL retention rationale. Thus, we see too little in the way of an independent detailed analysis, critique, or assessment of the risk inherent in Engineering's rationale.

Since mid-1986, NASA and its contractors have been performing a massive rework of all STS program FMEAs, updating the resulting CILs, and reviewing all prior HAs. This new FMEA/CIL effort has had value in identifying new failure modes that were missed earlier or introduced through past changes, and those resulting from new changes made mandatory before next flight. However, the new NSTS instructions for preparing FMEA/CILs (NSTS 22206) have also resulted in a large increase in the number of Criticality 1 and 1R items.

[Figure 5-2: NASA JSC safety analysis, hazard reports, and safety assessment process in 1987, as modified by the Committee (adapted from NASA JSC SR&QA). Inputs to the hazard analysis include previous experience, design engineering studies, safety analyses and studies, critical functions assessments, FMEAs/CILs, the certification program, sneak analyses, milestone review data/RIDs, panel meetings, change evaluations, failure investigations, waivers/deviations, OMRSDs/OMIs, walkdown inspections, mission planning activities, flight anomalies, ASAP inputs, individual inputs, and payload interfaces. Dashed boxes were added by the Committee; asterisks mark procedures new since 51-L.]
The Committee believes this new complexity will pose additional severe problems for both the mechanics and credibility of the CIL and waiver processes. The strong dependence on the CIL retention rationales in waiver decisions makes it critical that they be comprehensive and up to date. It is not clear to the Committee whether, in the pre-51-L environment, changes in the STS configuration or the operational experience base led directly and surely to review and appropriate updating of the relevant CIL retention rationale. In the wake of the 51-L accident, the NSTS program issued a document (NSTS 22206) which is intended to strengthen the process for updating the retention rationale. Once a retention rationale has been accepted and a waiver granted for a critical item, any changes to the item itself, the FMEA, or the CIL that could affect the retention rationale mean that the CIL must be resubmitted to the Level II/I PRCB for its approval (NSTS 22206, p. 2-7, para. 2.2.6). Any change, whether it be to the test environment, level, procedures, methods, or frequency, is to be reflected in changes to the retention rationale. If crew procedures are changed to reduce risk, corresponding changes are also to be made in the retention rationale.

The question is whether this updating is conducted regularly and in a consistently rigorous fashion. Although this policy is new and may not yet have been fully imposed in all quarters, NASA and contractor personnel interviewed by the Committee seemed variously uncertain about or unaware of these requirements and how they are met. Updating the retention rationale seems to many to be considered a routine bookkeeping chore, of secondary importance, yet these rationales are the primary basis for granting waivers.

During its audit the Committee developed a concern that the FMEA and associated retention rationale on a given critical item may sometimes fail to provide data in various important categories of information, such as the effects of environmental parameters. The lack of data in a certain case may or may not be significant with respect to the threat that item represents. Yet the absence of such data, even though it resulted in uncertainty, in the past has sometimes had the effect of bolstering the rationale for retention and providing unwarranted confidence in readiness reviews.
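Where such environmental data do exist, even a simple statistical treatment can expose what a qualitative retention rationale leaves unquantified. The sketch below fits a logistic model of distress probability against temperature by gradient ascent on the log-likelihood. The incident data are invented stand-ins for illustration, not the actual flight record, and the logistic form is only one plausible model choice; Appendix E of the report contains its own analysis.

```python
import math

# Hypothetical illustration: (launch temperature in deg F, O-ring distress
# observed?). These numbers are invented stand-ins, not the flight record.
DATA = [(53, 1), (57, 1), (63, 1), (66, 0), (67, 0), (68, 0),
        (70, 1), (70, 0), (72, 0), (75, 0), (76, 0), (79, 0), (81, 0)]

def fit_logistic(data, lr=0.001, steps=50000, center=65.0):
    """Fit p(distress | T) = 1 / (1 + exp(-(a + b*(T - center)))) by
    gradient ascent on the Bernoulli log-likelihood.

    Temperatures are centered to keep the ascent numerically stable."""
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for t, y in data:
            x = t - center
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # d(log-likelihood)/da
            grad_b += (y - p) * x    # d(log-likelihood)/db
        a += lr * grad_a
        b += lr * grad_b
    return a, b

A, B = fit_logistic(DATA)

def p_distress(t, center=65.0):
    """Model estimate of distress probability at temperature t (deg F)."""
    return 1.0 / (1.0 + math.exp(-(A + B * (t - center))))

# Extrapolating well below the coldest observation (e.g., toward the 51-L
# launch temperature) yields a high point estimate, but one resting on no
# nearby data: exactly the kind of uncertainty a rationale should disclose.
```

The model's slope is negative (distress becomes more likely as temperature falls), so a prediction far below the observed range is an extrapolation whose uncertainty should itself be reported to decision makers.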
This problem was especially in evidence with Mission 51-L. Data suggesting that temperature was a factor in the erosion of the O-rings did exist, but (according to the Rogers Commission) the relevant analyses apparently were considered to be inconclusive by those responsible, and these data did not appear in the retention rationale. Thus, the rationale implied that there were no data to suggest that temperature was a problem. Strengthening and closing the problem reporting loop since the accident may well reduce the likelihood of similar future occurrences. Still, we note that the "negative answer" indicates uncertainty about the issue at hand. If the uncertainty is crucial to the decision process, then it implies the need for more experiments, tests, or analyses to reduce the uncertainty. (Appendix E includes an analysis of the O-ring temperature effect and the uncertainty implied by extrapolation to low temperatures.)

Thus, the Committee's central concerns here are the reliance on and quality of the retention rationale, and the fact that we can perceive no documented, objective criteria for approving or rejecting proposed waivers. CIL waiver decision making appears to be subjective, with no consistent, formal basis for approval or rejection of waivers. All items are considered and discussed at length during the CCB and PRCB reviews. It appears that, if no action item is generated as a result of the review, the critical item waiver is approved. There was no formal "approved or disapproved" step in meetings audited by the Committee, although we are informed that such approvals do appear in the minutes of the meetings. NASA managers emphasize that Level III engineers and their "Level IV" contractors are accorded a high level of responsibility and accountability throughout the program, and that their opinions and analyses are the real bases for making retention decisions; these engineers bear the burden of proving that the rationale is strong enough to justify retention and waiver of the item.

However, the Committee believes that engineering judgment on these matters is not enough. Such judgment is crucial, but it is often too susceptible to vagaries of attention, knowledge, opinion, and extraneous pressures to be the sole foundation for decision making. We are concerned that, for all the reasons discussed above, without professional, detailed evaluation against specific criteria for reducing risk (not just review by panels and boards), the retention rationales can be misleading or even incorrect regarding the true causes and probabilities of the failure modes for which retention waivers are being requested (see discussion of probabilistic risk assessment in Section 5.6).

Recommendation (1): The Committee recommends that NASA establish an integrated review process which provides a comprehensive risk assessment and an independent evaluation of the rationale justifying the retention

of Criticality 1/1R and 2/2R items. This integrated review should include detailed consideration of the results of hazard analyses and all other inputs to the risk assessment process, in addition to the FMEA/CIL retention rationale. Further, the review process should assure that the waivers and supporting analyses fully reflect current data and designs. Finally, NASA should develop formal, objective criteria for approving or rejecting critical item waivers.

5.2 CRITICAL ITEMS LIST PRIORITIZATION AND DISPOSITION

At present, in NASA instructions all Criticality 1 and 1R items are formally treated equally, even though many differ substantially from each other in terms of the probability of failure or malperformance, and in terms of the potential for the worst-case effects postulated in the FMEA to be seen if the particular failure occurs. The large number of Criticality 1 and 1R items at the time of the 51-L accident has since been substantially increased due to changes in ground rules for classification and the complete reevaluation of the entire STS. The Committee believes that giving equal management attention to all Criticality 1 and 1R potential failures could be detrimental to safety if, as is the case, some are extremely unlikely to occur, or if the probability is very low that the postulated worst-case consequences of the failures will result. Treating all such items equally will necessarily detract from the attention senior management can give to the most likely and most threatening failure modes.

Critical items in the Shuttle system are categorized according to the consequences of worst-case failure of that item. However, it has been the case that within each criticality category no further ranking is formally made. In practice, managers do sometimes discriminate within a category, e.g., in their decisions regarding those STS items which should be fixed prior to next flight.
Prior to the 51-L accident there were already 2369 Criticality 1 and 1R items (the most critical) present in the Shuttle system. There has since been a substantial increase in the number of such items, now estimated by NASA to be 4686, of which 2148 have been approved by the PRCB (Director, JSC/SR&QA, personal communication, November 10, 1987). This increase resulted from the reevaluation of the entire Space Shuttle system and the new ground rules specified for the preparation of FMEAs, e.g., the carrying of analyses down to the individual component level (even where multiple, identical components are involved) and the inclusion of pressure vessels which were formerly excluded (see Section 3.5.2). To take just one example, the number of Criticality 1 and 1R items in the SSME turbomachinery rose from 8 to 67 under the new ground rules. In view of this problem, NASA is now taking steps to prioritize the most critical items and will reevaluate the current scheme for defining levels of criticality. Initially, the reassessment process seemed to the Committee to be too heavily focused on Level I.

The presence of a very large number of Criticality 1 and 1R items, even admitting that many are clustered with identical items, obviously places a heavy demand on the time and attention of key NASA decision makers and could prevent their penetrating deeply enough into the analyses surrounding each item to make a valid decision on all of them. We were concerned not only about the workload placed on Level I management, but also about the danger that crucial technical details might be lost or obscured as the rationale for retention was presented at successively higher levels. Although the same information is presented at the Level II and I PRCBs, it seemed entirely possible that technical debates occurring at lower levels might not be adequately relayed to Level I.
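One simple form such a prioritization could take is a ranking of critical items by estimated risk contribution rather than by criticality category alone. The sketch below is our own illustration: the item names and probability figures are hypothetical, and in practice the estimates would have to come from the kind of probabilistic analysis discussed in Section 5.6.

```python
from dataclasses import dataclass

@dataclass
class CriticalItem:
    name: str       # hypothetical item names, for illustration only
    category: str   # criticality category ("1", "1R", ...)
    p_fail: float   # estimated probability of the failure mode per mission
    p_worst: float  # estimated conditional probability of the worst-case effect

    def risk(self) -> float:
        """Estimated contribution to catastrophic risk per mission."""
        return self.p_fail * self.p_worst

# Hypothetical entries: all are "most critical," yet their risk
# contributions differ by orders of magnitude.
items = [
    CriticalItem("hypothetical seal A", "1R", 1e-2, 0.5),
    CriticalItem("hypothetical turbine blade B", "1", 1e-3, 0.8),
    CriticalItem("hypothetical valve C", "1", 1e-5, 0.1),
]

# Ranking by risk, not category, tells management where attention pays most.
for item in sorted(items, key=CriticalItem.risk, reverse=True):
    print(f"{item.name:30s} {item.category:3s} {item.risk():.1e}")
```

The design point is that the ranking key combines both probabilities the Committee identifies, so two items in the same category no longer compete equally for senior management's time.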
A post-51 L organizational change that shifter] the Level I] NSTS Program Director at JSC to Level at Headquarters has Deviated these concerns to some extent. NASA recognized that the waiver clecision-making flow was not icleal especially from I=eve! I! to Level I. Consequently, the Level NSTS Director (who also chairs the Level ~ PRCB) now participates in the Level Il reviews as a basis for sign-off at Level I. Thus, there is now a more direct "hand-off" of concerns and rationales from Level Ill to Level I, via Level Il. Nevertheless, the process still places a heavy workloacl on Level T. and there is still a cianger that important technical information might be Lost in transmission. The organizational change streamlinecl the waiver decision-making process, but it flick not help in 45

handling the large number of Criticality 1 and 1R items. Many of these items differ substantially from each other in terms of the probability of failure or malperformance, and in terms of the possibility that the worst-case effects postulated in the FMEA will be seen in the event the particular failure does occur. (In this connection it might be noted that, prior to 51-L, 56 Criticality 1 failures occurred on the Orbiter during flight without any of the postulated worst-case effects resulting.) Thus, the items vary considerably in their potential impact on Shuttle operational safety, i.e., on risk.

Early in its audit the Committee began urging NASA to find a way to prioritize the Criticality 1 and 1R items (see Appendix C, first interim report). NASA managers tended to assert that, since all Criticality 1 and 1R items are (by definition) equally catastrophic in their consequences, all should be treated equally; and, indeed, we saw evidence in our audits that they were handled with equal attention. But it is the position of the Committee that giving equal management attention to all such items could be detrimental to safety if (as is the case) some are extremely unlikely to fail, or the probability is very low that the postulated worst-case consequences of the failures will result. The most likely and most threatening failure modes merit the most attention. It is illogical to dissociate the probability of an event or its consequences from decisions about the management of risk.

For example, in the development of a probabilistic risk assessment for a modern nuclear power plant, fault tree and event tree analyses typically identify several million potential sequences of events (including multiple independent failures and cascading failures) that can lead to core melt-down. However, only 20 to 50 of these sequences contribute significantly to the risk, with five to ten of them contributing 90% of the risk.
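The screening step in that example (finding the few sequences that dominate total risk) can be sketched in a few lines; the sequence labels and frequencies below are invented for illustration.

```python
# Sketch of dominant-sequence screening in a PRA. Given per-sequence
# frequencies (illustrative numbers only), select the smallest set of
# sequences, taken in decreasing order of frequency, whose combined
# frequency reaches a target fraction of the total risk.

def dominant_sequences(freqs, coverage=0.90):
    total = sum(freqs.values())
    covered, chosen = 0.0, []
    for seq, f in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(seq)
        covered += f
        if covered >= coverage * total:
            break
    return chosen

freqs = {"S1": 5e-5, "S2": 3e-5, "S3": 8e-6, "S4": 1e-6, "S5": 5e-7}
print(dominant_sequences(freqs))   # a short list dominates the total
```

Here three of the five hypothetical sequences cover 90% of the summed frequency; in a real PRA the same calculation is what focuses detailed analysis on a handful of sequences out of millions.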
These particular sequences are then exhaustively analyzed to identify ways to substantially reduce the overall risk.

A secondary consideration of the Committee was the possible impact of the disclosure that, as the resumption of Shuttle operations nears, there are more Criticality 1 and 1R items (with all of them being waived) than there were before the accident. That perception would not be justified by, and would not fairly reflect, the real strides in system safety that have been made since 51-L.

Responding to suggestions on the part of the Committee, NASA developed and tested a number of techniques that could be used to prioritize the CIL on the basis of the relative risk each item represents. One such scheme, termed the Critical Item Risk Assessment (CIRA) procedure, was selected, and instructions for its implementation have now been promulgated throughout the NSTS program (NSTS 22491, June 19, 1987).

The CIRA procedure is currently qualitative in nature, although it employs reliability and test data to some extent. It is based instead on judgments about the degree of threat inherent in different risk factors. The Committee is concerned about the potential negative impact on the CIRA of ambiguous measures of risk and probability. However, the technique does lend itself to the incorporation of more rigorous quantitative measures of risk and probability of occurrence as these measures are developed for use within NASA. (See Appendix E for a discussion of CIRA and one approach to quantitative measures suggested by the Committee.)

Current plans for the implementation of CIRA, spelled out by the NSTS Deputy Director (Program) in a memorandum dated July 21, 1987, are for STS project managers to prioritize the Criticality 1, 1R, and 1S items in each project after completing the FMEA/CIL reevaluation and presenting the CIL at the Level III CCB. By two weeks before the Design Certification Review, each project manager will provide the NSTS Deputy Director (Program) with a list of "the 20 items in his project that represent the greatest risk to the program." The Deputy Director will then compile and distribute a report. This assessment effort will run parallel to, and may not actually affect, the preparations for STS-26 (the next scheduled Shuttle flight). However, "an alternate course of action" may be chosen for subsequent missions.

The Committee views this implementation procedure with concern. It does not appear to reflect a serious concern on the part of the NSTS Program for the need to prioritize the CIL by assessing relative risks.

Recommendations (2):

The Committee recommends that the formal criteria for approving waivers include the probability of occurrence and the probability that the worst-case failures will result. We further recommend that NASA establish priorities now among Criticality 1 and 1R items, taking care not to use ambiguous measures of risk and probability. NASA should also modify the definitions of criticality in

terms of the probability of failure and probability of worst-case effects. Finally, we recommend that NASA Level I management pay special attention to those items identified as being of highest priority, along with the rationale that produced the priority rating. Responsibility for attending to lower-priority items within the present Criticality 1 and 1R categories, when reclassified, should be distributed to Levels II and III for detailed evaluation and decision.

5.3 HAZARD ANALYSIS AND MISSION SAFETY ASSESSMENT

NASA hazard analyses currently do not address the relative probabilities of a particular hazardous condition arising from failure modes, human errors, or external situations. The hazard analysis and the mission safety assessment do not: address the relative probabilities of the various consequences which may result from hazardous conditions; provide an independent evaluation of the retention rationales stated in the input CILs; or provide an overall risk assessment on which to base the acceptance and control of residual hazards.

Hazard analysis (HA) is intended to be a key part of NASA's safety and risk management process. Because it considers hazardous conditions, whatever their source, it is a top-down analysis that should encompass the FMEA and other bottom-up analyses and cover the safety gaps that these other analyses might leave. In reality, however, the HA has not played the central role it was designed to play. Instead, the main focus has been on the FMEA and its corresponding CIL retention rationale. These are design-based analyses, prepared by the project engineering staff. (See Section 5.1.)

The Committee's audit of the FMEA/CIL reevaluation and hazard analysis review produced, at first, a somewhat confusing and contradictory set of perceptions about the relationships between these safety analyses and the nature of the overall risk assessment and management process of which they are a part. Gradually, it became clear that there were differences between the officially prescribed process and the real process, as well as differences in the way the process is perceived by various NASA personnel, depending on their function and point of view. Beyond that, there were also differences among the NASA centers in the implementation at the detail level.

Figure 5-1 (shown earlier), which was prepared by the Safety Division at JSC, depicts fairly accurately the process, as the Committee has come to understand it, that was prescribed by NASA policy at the time of the Challenger accident. Here, the HA is clearly an important element, buttressed by a number of complementary analyses including the FMEA/CIL. The ultimate product of the safety analysis is the Mission Safety Assessment (MSA), feeding into the deliberations of the various engineering and readiness review boards. Figure 5-3, also prepared by the Safety Division at JSC, shows the process from the perspective of that Division, focusing on the HA as the central activity. Note that the FMEA/CIL is listed as one of many inputs to the hazard analysis.

The actual process appears to be quite different from the one suggested by the preceding two figures. During the latter part of 1986 and the first few months of 1987, our audit led to the impression that, although some of the FMEA/CILs were inputs into the HA function, the real risk acceptance process within NASA operated essentially as shown in Figure 5-4 (obtained from JSC). One can see from the diagram that the "Hazard Analysis As Required" box is a dead end, with inputs but no output with respect to waiver approval decisions. Our impression was supported by subsystem project managers, engineers and their functional management at JSC. Many of them believed that the CIL path shown in Figure 5-4 was the actual approval route for retention of designs with Criticality 1 and 1R failure modes.

A key problem, in our view, is that the risk assessment shown in the box entitled "Retention Rationale and Risk Assessment" was not really an independent assessment of the risk levels by professional system safety engineers; such individuals (and they are few in number within NASA) were "left out of the loop." Neither did the assessment contain an evaluation of how system hazards resulting from critical item failure modes would be controlled. In practice, in most cases reviewed by the Committee, the retention rationales written on the CIL forms were simply transferred to the hazard analysis reports and became the basis for final acceptance of residual hazards, and for decision-making at Flight Readiness Reviews (FRRs).

FIGURE 5-3 The hazard analysis process from the perspective of the Safety Division at JSC (NASA).

FIGURE 5-4 The risk acceptance process as actually practiced (JSC).

NASA does not use the HAs and (in turn) the MSAs as the basis for the Criticality 1 and 1R waivers. In fact, HAs for some important subsystems were not updated for years at a time even though design changes had occurred or dangerous failures were experienced in subsystem hardware. (An example is the 17-inch disconnect valves between the ET and Orbiter.) The Committee's audit showed that standards and detailed instructions for the conduct of HAs were not consistent throughout the STS program; NSTS 22254 was issued to correct that problem.

In summary, the Committee found in its review of the HA process that:

1. HAs were done for only the largest subsystems of the STS; they addressed certain overlays of hazards but were not traceable to all failures in units within the subsystems.

2. HAs were not done routinely for each major subsystem.

3. The HA assumed worst-case consequences and simply categorized hazard levels (catastrophic or critical) based on whether there was time for counteractions.

4. The HA process called for an independent evaluation of the HA results. Analyses of catastrophic and critical hazards were to be verified using risk assessment techniques. However, the HAs did not address the relative probability of occurrence of various failures, based on actual flight and test information, nor did they evaluate the validity of the CIL retention rationale against any formal set of criteria.

We found that many engineering personnel, functional managers, and some subsystem managers were unaware of what tasks must be done to complete the hazard analysis, did not know whether they had actually been done, and did not contribute to them. Some, in fact, believed that HAs were just an exercise done by reliability and/or safety people and that they were redundant to the FMEA/CILs. Their belief appears to be justified, in that these HA activities did not seem to be authoritatively in-line as part of a true hazard control and risk management process. It appears they were carried out in a relatively sterile environment outside the mainstream of engineering.

The safety personnel did use the HAs along with the FMEA/CILs to create Mission Safety Assessments for the major elements of the STS and for the overall missions. These MSAs were to provide "a formal, comprehensive safety report on the final design of a system." However, in practice, the MSA reports essentially served as process assurance reports. They listed the hazards and stated whether they were eliminated or controlled; compared hardware parameters with safety specifications; specified precautions, procedures, training or other safety requirements; and generally documented compliance with the various reliability and safety tasks. They did not provide in-depth quantitative risk assessments, and relied almost exclusively on the CILs and HA reports for justification of acceptable risks.

New design changes and/or flight data were "examined" and "judged" for safety by various personnel and boards at NASA Levels III, II, and I; the vehicles for the approval of changes appear to have been the FRRs and various special reviews. The HA and MSA reports were not viewed as controlling documents on a specific system configuration which was judged to be safe by the safety organizations. The initial waivers to fly Criticality 1 and 1R items were not always redone in a timely way after new data were obtained. Thus, our audit supports the impression that the hazard analysis is not used to its fullest advantage and that overall system safety assessments, based on test and flight data and on quantitative analyses, are not a part of the process of accepting critical failure modes and hazards.

Since the Hazard Report does not provide a comprehensive risk assessment, or even an independent evaluation of the retention rationale stated in the input CILs, we believe the overall process shown in Figure 5-2, representing NASA's current plans, has serious shortcomings. The isolation of the hazard analysis within NASA's risk assessment and management process to date can be seen as reflecting the past weakness of the entire safety organization. For that reason, this issue of the role of hazard analysis drives to the heart of our most sweeping conclusion: the information flow, task descriptions, and functional responsibilities implied by Figure 5-2 must be modified if NASA is to achieve a truly effective risk management process. The reordering of functions which the Committee recommends is described in detail in Section 5.1.

Recommendation (3):

The Committee recommends that the FMEA/CILs be used as one of many inputs considered in the hazard analysis and system safety assessment. We also recommend that the overall system safety assessment encompass a quantitative risk assessment which in turn uses the CILs and hazard analyses as input. Finally, the Committee recommends that this risk assessment be the primary basis for retention or rejection of residual hazards as well as critical items.

5.4 RELATIONSHIP OF FORMAL RISK ASSESSMENT PROCESS TO SPACE TRANSPORTATION SYSTEM ENGINEERING CHANGES

Elements of formal risk assessment, such as FMEA/CILs and hazard analyses, appear to have had little direct impact on the STS recovery engineering process, as they have not figured prominently in the majority of engineering change decisions made by NASA management.

The foregoing sections have addressed the relationship between the FMEA/CIL and hazard analysis, and their relationship to the CIL retention rationale review and waiver decision-making process. It is important also to take a broader perspective and examine the relationship of the risk assessment process, as a whole, to the actual STS engineering redesign activity and recovery process.

Shortly after the Challenger accident, groups representing various parts of NASA (design centers, Astronaut Office, etc.) presented the NSTS Program Manager at JSC with their lists of items deemed to require attention. All were Criticality 1 or 1R items. From these lists, the JSC Level II Program Requirements Control Board selected 90 (consisting of hardware, software, and procedures) to undergo redesign, test, or analysis before the next flight of the Shuttle. These decisions were made without formal reference to the FMEA. Since that time, the number of mandatory next-flight changes across the STS system has grown to 159. Of these, only a handful have the FMEA/CIL retention rationale (or the hazard analysis) listed as the original source of the change (e.g., 1 out of 23 on the SSME, 4 out of 48 on the Orbiter). Only a few of the mandatory changes have arisen out of the current FMEA/CIL reevaluation. Indeed, the redesign activity has, for the most part, preceded these reevaluations. Most of the mandatory changes were longstanding concerns, identified before the 51-L accident, which were derived from flight experience, engineering analysis, etc. NASA and contractor personnel told the Committee that the stand-down provided an opportunity to address known hazards, things that were already "in the mill" before the accident. Thus, the FMEA/CIL and hazard analyses seem not to have affected STS engineering very significantly. Yet the FMEA/CIL reevaluation and the hazard analyses were the heart of the mandate the Committee (via NASA) received from the Rogers Commission in its recommendation III (see Appendix B). For this reason, the Committee was concerned as it gained an increasing impression that the FMEA/CIL and hazard analyses are fairly narrow parts of the overall STS risk management/reliability picture.

The special System Design Review Boards established in March 1986 to review design changes slated for completion before the next flight apparently did not take the FMEA/CILs formally into account. As discussed in Section 5.3, the hazard analyses in actual practice appear to have little or no influence on the waiver decisions to accept Criticality 1 and 1R designs for flight. Also, the original scheduling of the first flight some six months after completion of the FMEA/CIL and hazard analysis reevaluations seemed to presuppose that no substantial design change requirements would result from the process.

NASA and contractor personnel explained to the Committee that the FMEA/CIL is primarily a design tool, used as an input to the Preliminary Design Review in the early days of the Shuttle program. In their view, the current reevaluation is essentially a design validation effort; thus, they say, the fact that it has disclosed few new critical items confirms the strength of the original design. Furthermore, they assured the Committee, engineering changes are processed through the same configuration control boards that review the FMEA/CIL, and the total process is not complete until the last change to be implemented before flight has undergone an FMEA and been dispositioned by the board.

The Committee accepts this explanation. However, accepting it forces us to conclude that NASA

may have overemphasized the importance of the FMEA/CIL reevaluation while simultaneously not giving sufficient attention to its results. Also of concern is the Committee's continuing impression that the extensive FMEA/CIL effort has focused on a "moving target," as the redesign work goes forward without adequate feedback into that process. For example, the contractor conducting an independent FMEA on the Orbiter (McDonnell Douglas) reported, and JSC confirmed, that personnel conducting the FMEAs have had to utilize old "as-built" hardware drawings as a data base, telephoning engineers whenever they believe an item might have been modified since the original design.

In its first interim report to NASA (see Appendix C), the Committee recommended that NASA take steps to ensure a close linking between the STS engineering change activities and the FMEA/CIL-hazard analysis processes. A subsequent revision in the change review procedure appears to be helping in that regard. It requires an assessment of each proposed design change to determine if any Criticality 1 or 2 hardware is affected. Furthermore, NASA's Administrator has assured the Committee that flight schedule considerations will not be allowed to reduce the rigor with which reviews and analyses are conducted. The Committee is substantially reassured regarding the strengthened relationship between the risk assessment process and STS engineering changes. However, concerns remain regarding the long-term outlook for a strong connection between these activities as Shuttle operations resume and engineering improvements continue.

Recommendation (4):

The Committee recommends that NASA take firm steps to ensure a continuing and iterative linkage between the formal risk assessment process (e.g., FMEA/CIL and HA) and the STS engineering change activities.

5.5 TIMELY FEEDBACK OF DATA INTO THE RISK ASSESSMENT AND MANAGEMENT PROCESSES

The Committee has found many indications that data from STS inspection, test and repair, and inflight operations do not always feed back rapidly enough or effectively enough into the risk assessment and management processes.

One of the key failures that led to the Challenger disaster was that data regarding O-ring erosion in earlier flights had not surfaced with enough visibility or in a timely enough fashion to impact the O-ring CIL retention rationale or the Flight Readiness Review for that ill-fated mission. The Committee has found numerous indications that data from STS inspection, test and repair, and inflight operations do not always feed back rapidly enough or effectively enough into the risk management process. For example, with a high Shuttle flight rate (such as the rate of one per month being experienced just prior to 51-L), there may be a lag of two or more flights before in-flight anomalies are reviewed by the responsible NASA managers.

A primary issue here is the feedback of operational experience, inspection, test and repair reports, data and anomalies into the FMEA and the CIL retention rationale, and their impact on waiver and commit-to-launch decisions. Information that could affect the CIL waiver retention rationale often appears in other parts of the system long before it finds its way into the rationale for retention. For example, the SSME prime contractor has set up a board (Rocketdyne's Engineering Review Board) to disposition every item identified as troublesome by the project engineers. However, the relevant CIL number and document is identified only after disposition is made. Similarly, the effects of activities such as inspection, test and repair, and inflight operations appear not to be adequately accounted for in hazard analyses.

Furthermore, it is not clear to the Committee what processes exist for methodically incorporating operational experience into performance analysis programs and the system change process, or into the FMEA/CIL. Mission Operations Directorate (MOD) personnel at JSC have been heavily involved in the FMEA/CIL and hazard analysis reevaluations, and 14 astronauts have been assigned to safety functions such as FMEA/CIL. This involvement in reviews leads to the development of flight rules, which, as one astronaut noted, is an effort to address a problem through procedural changes when it is too late for design changes. However, flight rules and procedures development often do lead to system design changes. (The Director of

MOD described a number of such changes made during 1985 and 1986.)

Another critical problem is the need to provide rapid feedback of information on anomalies detected during inspections, tests, and repairs, as well as those occurring in flight, into the Flight Readiness Review (FRR) and the commit-to-launch decision. For example, in the past, information from the previous STS flight was not available in time to influence the decision to launch the next mission.

There is a well-established process for handling and reporting in-flight anomalies. Once detected, an anomaly is evaluated and tracked by a Mission Evaluation Team (MET) (or the equivalent). A Problem Report (PR) is prepared on each anomaly which includes data and analysis regarding the fault isolation and its possible resolution, and potential effects on future flights and schedules. The PR is then reviewed, evaluated, and approved by the relevant project organizations, SR&QA, and the NSTS Deputy Director (Program). The PRs and the status of their resolution are tracked in the Problem Reporting and Corrective Action (PRACA) System. Finally, all reported anomalies and other concerns are compiled into a list which is available to the FRR Board for the next scheduled flight.

The problem has been the delays in the feedback from anomaly detection on one flight to the FRR for the next flight. NASA has a "quick look" procedure for expediting the reporting of significant anomalies up the management chain, but some data will simply entail an irreducible lag. NASA intends, for the initial flights of the Shuttle after its resumption, to reduce all the data from each flight before launching the next one.
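The closed loop described above can be sketched as a simple gating check: every problem report from earlier flights must be closed or dispositioned before the FRR commits to the next launch. The record fields and statuses below are invented for illustration and are not the actual PRACA schema.

```python
# Hypothetical sketch of a PRACA-style gate before Flight Readiness
# Review: list anomaly reports from earlier flights still open, and
# treat any such report as blocking the commit-to-launch decision.

def open_anomalies(reports, next_flight):
    """Return reports from flights before `next_flight` still marked open."""
    return [r for r in reports
            if r["flight"] < next_flight and r["status"] == "open"]

reports = [
    {"id": "PR-101", "flight": 24, "status": "closed"},
    {"id": "PR-102", "flight": 25, "status": "open"},
    {"id": "PR-103", "flight": 25, "status": "dispositioned"},
]

blockers = open_anomalies(reports, next_flight=26)
ready_for_frr = not blockers   # False while PR-102 remains open
```

The point of the sketch is the gate itself: launch readiness is computed from the data base, so a lagging report blocks the decision instead of silently going unnoticed.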
However, after the first few flights, NASA plans to increase the flight rate to a point where the data stream from postflight activities will once again lag. Although vigilance will certainly remain higher for some time in the wake of the Challenger accident, the Committee is nonetheless concerned that the same dangerous preconditions will once again be present.

FIGURE 5-5 The NASA NSTS System Integrity Assurance Program (NASA).

NASA is now establishing a new closed-loop accounting and review system known as the System Integrity Assurance Program (SIAP). (See Figure 5-5.) Among other things, this system will tie all Criticality 1, 1R, and 1S items (defined in Section 3.4.1 and Table 3-1) to findings in the field. A key feature of SIAP is its Program Compliance Assurance Status System (PCASS). This is essentially a computer-based information system for the SIAP. Still being developed, the PCASS will function as

a central data base that integrates a number of existing information systems and sources across the NSTS (Figure 5-6). For example, the PRACA system mentioned above will be a part of it, speeding the transmission of data on flight anomalies. The PCASS has the potential to provide in near real time, to decision makers such as the participants in the FRRs, an integrated view of the status of problems with the STS, including trends, anomalies and deviations, and closure information. However, the PCASS will be ineffective unless inspection, repair, test, flight, and other data are fed into the system in a timely manner, and the data are available promptly in convenient, usable form. For example, delays in reporting on anomalies and trends from previous flights can jeopardize proper decisions to launch the next flight.

The Committee believes that the SIAP, including the PCASS as an integrated data base, can and should become a central element of STS risk assessment and management. However, great care must be taken to assure that the data base is correctly and adequately maintained.

Essential to the successful assessment and management of risk is the certain and timely feedback of preflight, flight, and postflight system performance data, along with inspection, test and repair data; test results; and failure or degradation reports. Thus, a prime need recognized by NASA managers is to ensure that all problem actions are promptly placed in the PRACA/PCASS system. In many cases this involves a strong reliance on the thoroughness of maintenance and handling personnel as well as project engineers. The paperwork burden on NASA technical and safety personnel is already enormous. But the timely and diligent reporting and the proper evaluation of such data are among the most important tasks they can perform. It is precisely here that the system broke down in the months preceding 51-L.
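The integrating role intended for the PCASS can be illustrated with a toy join across two of its feeder data bases, keyed by critical-item number, so that a reviewer sees an item's waiver status and its problem history in one view. All identifiers, field names, and records below are hypothetical.

```python
# Toy PCASS-style integration (hypothetical data): merge CIL status
# with problem reports so each critical item carries its own history.

cil = {"CI-0042": {"criticality": "1R", "waiver": "approved"}}

problem_reports = [
    {"item": "CI-0042", "id": "PR-55", "status": "open"},
    {"item": "CI-0099", "id": "PR-56", "status": "closed"},
]

def integrated_view(cil, problem_reports):
    """Attach each problem report to its critical-item record."""
    view = {num: dict(rec, reports=[]) for num, rec in cil.items()}
    for pr in problem_reports:
        if pr["item"] in view:
            view[pr["item"]]["reports"].append(pr["id"])
    return view

view = integrated_view(cil, problem_reports)
# view["CI-0042"] now carries criticality, waiver status, and PR-55
```

The value of such a join is exactly what the text argues: if any feeder stream lags, the integrated picture looks complete while silently omitting the latest problems, so timely input matters as much as the merge itself.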
FIGURE 5-6 Data base elements of the NASA NSTS Program Compliance Assurance and Status System (NASA).

Recommendations (5):

The Committee recommends that high-level NASA management attention and priority be given to increasing the efficiency of the flow, analysis, and use of inspection, test and repair, test results, and in-flight operations data throughout the decision-making process. The Committee also recommends that full implementation of the System Integrity Assurance Program (SIAP), including its Program Compliance Assurance Status System (PCASS), be given a high priority. Diverse professionals (e.g., design and development engineers, operating personnel, statistical analysts) should be used in the development of this program, with maximum possible early involvement by potential users and key decision makers. The Committee further recommends that procedures be implemented to ensure that all mission anomalies detected in real time and from recorded events, and those detected during the near-term inspection of recovered hardware, also are fed into the formal risk assessment and management processes for action prior to committing to the next flight. Finally, the Committee

recommends that all such anomalies be called to the immediate attention of launch decision makers, who will justify in writing their decisions regarding the disposition of the anomalies.

5.6 THE NEED FOR QUANTITATIVE MEASURES OF RISK

Quantitative assessment methods, such as probabilistic risk assessment, have not been used to directly support NASA decision making regarding the STS, although quantitative analyses and test data often are used in arriving at qualitative subjective judgments in reaching decisions. Powerful methods of statistical inference are now available which allow the integration of all sources of information on risk, including data on partial degradations and failures as well as engineering models of failure modes.

NASA is not adequately staffed with specialists and engineers trained in the statistical sciences to aid in the transformation of complex data into information useful to decision makers, and for use in setting standards and goals.

The key technical decision makers in NASA operate as chairmen of bodies that review relevant technical information. Their decisions involve requirements, design, waivers, launch decisions, etc. Much of this information is in the form of complex engineering data. Data are routinely collected from flight and ground tests, part changeout and failure histories, anomaly reports, computer simulations, and other sources. Some of these data are used in various ways for design qualification, system certification, and configuration control. They are also used to establish or verify redlines and safety margins. They are sometimes employed in the FMEA to support rationales for retention, and in the hazard analyses to support classification of a hazard. They may come into play in the waiver process and the Flight Readiness Reviews. In other words, numbers and statistics appear throughout the risk management process, but they are generally used as raw data, and in a qualitative way.
Numerical data have not normally been used directly to generate indicators of risk or reliability. Even trend analysis, a relatively simple statistical technique for anticipating failures, has not been employed routinely or to maximum effectiveness.

The Committee was informed by a number of NASA persons during discussions that early in the history of the Apollo program a decision was made not to use numerical probability analyses in NASA's decision-making process. This disinclination still prevails today. As a result, NASA has not had the benefit of more modern and powerful analytical assessment tools that have been developed in recent years, and that are used by other high-technology organizations, such as in the communications and nuclear power industries. Without such tools, it would be very difficult at best for safety engineers to transform the massive data base which has developed in the STS program into specific information regarding what was truly known and what was not known. In addition, the failure to use numerical probability analyses had the unfortunate effect of denying NASA designers the required statistical data base on various types of failures, along with the better understanding of the mechanisms of failures that can be obtained from such data.

Quantitative approaches to the overall analysis of risk in complex systems are known by various names, such as quantitative risk assessment and probabilistic risk assessment (PRA); we use the latter here. Using modern techniques of statistical inference in combination with engineering models of failure modes and system models, these approaches have become sophisticated and powerful in recent years. They are employed by the nuclear power, aircraft, and communications industries, the military aerospace sector, and other developers and operators of complex systems. While these quantitative approaches are not a panacea, since not everything affecting flight safety can be rigorously quantified, they can permit more objective assessment of the varying types and quality of information and data which are available, as well as reflect the uncertainties introduced by incomplete data or knowledge.

An approach to statistical inference that is particularly useful for assessing risk is the Bayesian approach (using, for example, Weibull, binomial, or Poisson likelihood functions). This allows the integration of information from a variety of sources, such as industrial data on components and materials, test data, analytical engineering models, field data, and qualitative engineering judgment. The Bayesian approach (see Appendix D for more details) produces a "State of Knowledge Curve" (technically a probability density) for the parameter of interest, such as the frequency of a Criticality 1 failure. The curve provides an estimate of the frequency and measures the uncertainty in the estimate. If only the data from the few or zero observed failures during flights were used, then the uncertainty would be too large to be useful. But the relevant information goes well beyond that scant data base. For example, it may include a model of the mechanism which would cause the failure mode. This cause model may involve loads and safety margins whose uncertainties have been well characterized by existing engineering data bases or carefully designed margin validation tests.

Suppose, however, that after a complete analysis, the uncertainty about the frequency spans both the safe and unsafe regions of the frequency scale. This is not a sign that the analysis has failed, but it is an indicator that more (carefully designed) tests are needed. The experience and intelligence of the subject matter experts has already been fully reflected in the Bayesian analysis; so it is inappropriate to ask them now to resolve the uncomfortable uncertainty. Only new information will do. If the State of Knowledge Curve spans primarily the unsafe region of the frequency scale, then a design or procedure change is required. But if the safe region of the frequency scale carries all the uncertainty, then the uncertainty itself is of little consequence because the risk is now low enough to fly.

Probabilistic risk assessment identifies all possible failure scenarios along with their probabilities of occurrence and their consequences. The methods used in PRA to identify and organize these scenarios into a structured pattern variously include the use of master logic diagrams, fault trees, event trees, and FMEAs, among others.
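The "State of Knowledge Curve" described above can be sketched numerically. The fragment below is an illustration only, not NASA's method: it assumes, hypothetically, a Beta(0.5, 50) prior distilled from engineering judgment, combines it with a binomial likelihood for 24 flights and zero observed failures, and reads off the posterior mean and a 90 percent credible interval for the per-flight failure frequency. All numbers are invented for the example.

```python
# Sketch: a Bayesian "State of Knowledge Curve" for a per-flight failure
# frequency. The Beta(0.5, 50) prior and the 24-flight, zero-failure data
# set are hypothetical; the posterior density is computed on a grid.
import math

def beta_pdf(p, a, b):
    """Un-normalized Beta(a, b) density at p."""
    return p ** (a - 1) * (1 - p) ** (b - 1)

def state_of_knowledge(a, b, flights, failures, grid=10_000):
    # Binomial likelihood: p^failures * (1 - p)^(flights - failures)
    ps = [(i + 0.5) / grid for i in range(grid)]
    dens = [beta_pdf(p, a, b) * p ** failures * (1 - p) ** (flights - failures)
            for p in ps]
    total = sum(dens)
    dens = [d / total for d in dens]          # normalize to a probability mass
    mean = sum(p * d for p, d in zip(ps, dens))
    # 90% credible interval from the cumulative distribution
    cum, lo, hi = 0.0, None, None
    for p, d in zip(ps, dens):
        cum += d
        if lo is None and cum >= 0.05:
            lo = p
        if hi is None and cum >= 0.95:
            hi = p
    return mean, (lo, hi)

mean, (lo, hi) = state_of_knowledge(a=0.5, b=50, flights=24, failures=0)
print(f"posterior mean frequency: {mean:.4f}")
print(f"90% credible interval: [{lo:.5f}, {hi:.4f}]")
```

If the resulting interval straddled a safety threshold, the text's prescription would apply: not further expert elicitation, but new, carefully designed tests.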
Since NASA has a great deal of experience with FMEAs in the design process, it is logical that they be a principal input to the PRA. Among the pay-offs to NASA from using PRA is that literally thousands of scenarios and their associated risks can be eliminated from further consideration in the hazard analysis and other risk assessment processes, if their contributions to total risk and/or their probability of occurrence are extremely low. (The specific limits should be set by the top management of NASA. However, failure scenarios that contribute less than 0.01 percent of the total risk or have a probability of occurrence of less than 10^-7 per flight would appear to be reasonable candidates for removal from further consideration.) Thus the proper use of PRA methods could significantly reduce the time and effort expended on risk assessment activities while, at the same time, identifying in a quantitative manner the most important contributors to overall risk. By concentrating on these priority items, NASA can reduce the overall risk and perhaps the total cost of risk assessment.

Quantitative methods of analysis rely on the modeling of statistical data of many kinds. For an example of the application of a statistical technique called logistic regression to reveal a statistically significant trend and predict the probability of an STS event while specifying the prediction uncertainty, see Appendix E. It is essential that such analyses be performed with the advice of professionals who understand the full range of analytic tools available through the modern statistical sciences. There currently are not enough professionals in the statistical/analytical sciences among NASA's civil service and contractor personnel to fully analyze such data on a regular basis.
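The screening described above is, at bottom, a simple filter over a scenario list. The sketch below uses the Committee's suggested cutoffs (contribution under 0.01 percent of total risk, or probability under 10^-7 per flight); the scenario names and numbers are hypothetical and chosen only to exercise both cutoffs.

```python
# Sketch: screening PRA failure scenarios against the cutoffs suggested in
# the text. Scenario names, probabilities, and consequence weights are
# invented for illustration.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    prob_per_flight: float   # estimated probability of occurrence
    consequence: float       # relative consequence weight

def screen(scenarios, risk_frac_cutoff=1e-4, prob_cutoff=1e-7):
    """Split scenarios into those retained for hazard analysis and those
    screened out as negligible contributors."""
    total_risk = sum(s.prob_per_flight * s.consequence for s in scenarios)
    retained, screened_out = [], []
    for s in scenarios:
        contribution = s.prob_per_flight * s.consequence / total_risk
        if contribution < risk_frac_cutoff or s.prob_per_flight < prob_cutoff:
            screened_out.append(s)
        else:
            retained.append(s)
    return retained, screened_out

fleet = [
    Scenario("seal erosion at low temperature", 1e-2, 1.0),
    Scenario("turbopump blade fracture",        1e-3, 1.0),
    Scenario("double meteoroid strike",         1e-9, 1.0),
    Scenario("minor telemetry dropout",         1e-3, 1e-4),
]
retained, dropped = screen(fleet)
print([s.name for s in retained])
print([s.name for s in dropped])
```

The design point is the one the text makes: the effort freed by discarding negligible scenarios can be concentrated on the dominant contributors to total risk.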
One result of NASA's early decision not to use a specific reliability or risk analysis approach (apparently because of the lack of a large statistical data base) was that NASA safety organizations were not staffed with professional statisticians or safety-risk analysts, and project engineers were not trained in modern statistical analysis techniques.

Partly in response to the Committee's interim reports (Appendix C), NASA has begun taking tentative steps toward the use of modern probabilistic analysis and other analysis techniques. A NASA handbook on PRA is being written. Contractor studies have been initiated to conduct trial PRAs of the Orbiter Auxiliary Power Unit and the similar Hydraulic Power Unit in the SRB, as well as on the Shuttle main propulsion pressurization system. In addition, the Jet Propulsion Laboratory is conducting for NASA a study of ways to improve the SSME certification process. They are using a Bayesian approach with a Weibull likelihood function. The prior distribution is derived from engineering models of failure mode life. The idea of integrating engineering models with techniques of statistical inference is very promising. Based on the results of these studies, NASA plans to assess the benefits and applicability of PRA to the STS risk management process. The new Associate Administrator for SRM&QA has indicated that he will personally evaluate the technique and develop and pursue a strategy for introducing it throughout NASA.

The Committee is concerned that the test with this very limited sample, particularly with the evaluation criterion stated in the NASA response to our first interim report (see Appendix C), namely comparison of the PRA results with the (current) "mainline FMEA/CIL activity," could give a distorted result and lead NASA not to introduce PRA. We have cautioned NASA not to evaluate PRA merely by comparing the results of two or three disparate tests of PRA with the results obtained earlier through the FMEA/CIL process. The criterion should not only be whether a significant new problem is identified by the PRA. What should be asked is whether PRA would have helped in making NASA's original decisions (e.g., regarding the waiver on a Criticality 1 item), or would have given increased confidence in the decisions that were made. The PRA also should improve the understanding of the nature of the failure modes, and increase the confidence in and objectivity of the assessment of risk.

The judgment of experienced engineering practitioners is crucial for ensuring system safety. However, a complex risk assessment process can actually obscure some of the prime contributors to risk. Probabilistic risk-analytic modeling techniques can provide decision makers with an input that clarifies the key choices facing them. Numbers and accompanying analyses should not drive decisions directly, but they can help ensure that system weaknesses and problems "bubble up" for consideration and decision. Also, having available a detailed quantitative breakdown of risk does provide experienced decision makers with a better basis for intelligently managing risk. Clearly, however, the Committee does not wish to suggest that NASA subordinate sound technical judgment to numerical analysis. Such an approach would be, in our opinion, unrewarding and
perhaps counterproductive.

Recommendations (6): The Committee recommends that probabilistic risk assessment approaches be applied to the Shuttle risk management program at the earliest possible date. Data bases derived from STS failures, anomalies, and flight and test results, and the associated analysis techniques, should be systematically expanded to support probabilistic risk assessment, trend analyses, and other quantitative analyses relating to reliability and safety. Although the Committee believes that probabilistic risk assessment approaches will greatly improve NASA's risk assessment process, it recognizes that these approaches should not be a substitute for good engineering and quality control practices in design, development, test, manufacturing, and operations, all of which must continue to receive high-priority emphasis by NASA and its contractors. The Committee further recommends that NASA build up its capability in the statistical sciences to provide improved analytical inputs to decision making.

5.7 THE NEED FOR INTEGRATED SPACE TRANSPORTATION SYSTEM ENGINEERING ANALYSIS IN SUPPORT OF RISK MANAGEMENT

NASA safety-related analyses tend to focus primarily on single-event, worst-case failures to the relative exclusion of possible multiple and synergistic failures in different subsystems or elements of the STS. In addition, the connection between the various analyses appears tenuous. There does not appear to be an adequate integrated-system view of the entire STS.

NASA's risk management process provides some mechanisms for identifying cross-element interface effects and failure modes, including propagation of failure modes to interfacing or physically adjacent modules or subsystems. One mechanism is the Element Interface Functional Analysis (EIFA), described in Section 3.4.3. There are three EIFAs: Orbiter/ET, Orbiter/SSME, and Orbiter/SRB-ET (a fourth EIFA, on ground/flight systems, is now being generated).
The hazard analysis is intended to be a top-down analysis that addresses cascading failures. Interface Control Documents are a third mechanism concerned with safety at the subsystem interfaces. Finally, a Critical Functions Assessment (CFA), conducted initially in 1978 to identify critical functions during each mission phase, is currently being reevaluated by Rockwell International. The CFA can include multiple and cascading failure combinations.

The NSTS Engineering Integration Office at JSC is responsible for managing system integration activities, the systems analysis and interface design effort, and analysis of integrated structural loads and thermal effects. As part of this responsibility, a series of Level II Systems Integration Review (SIR) panels are assigned to review the FMEAs on both sides of an interface. The Office is supported by Rockwell International in the provision of Space Shuttle integration analyses, although Rockwell's support responsibility apparently does not extend to some areas (e.g., on-orbit or reentry phases) or elements. The Engineering Integration Office, with the support of Rockwell, also produces Integrated Hazard Analyses (IHA) bridging two or more STS elements.

To the extent that the hazard analysis is a top-down analysis, it is important that its output lead to the generation or modification of the FMEAs. But there is no indication that this is happening. For example, a member of the Committee audited the FMEA/CILs and hazard analyses related to potential interactions between the Orbiter fuel cells, water management, active thermal control, and life support subsystems; in particular, he looked for indications of possible effects of the presence of hydrogen in the cooling or potable water which would result from a failure of the hydrogen separator. The FMEA/CILs identified only two possible effects: degradation of the performance of the flash evaporator and a reduction of water storage capability. Other, potentially more damaging effects not covered in the FMEA include: the effect of the possible shutdown of flash evaporators between 140,000 and 100,000 feet on the active thermal control system; the violation of water quality standards, with resultant crew discomfort; and the inability to accurately assess the amount of water onboard. It should be noted that no hazard analysis seems to exist related
to the potential presence of hydrogen in water; the Element Interface Functional Analysis is not applicable because all of the subsystems of concern are within the same element (the Orbiter).

Although the FMEA/CIL is a bottom-up analysis, it should be able to expose cascading failures initiated by the subject failure. However, at present the FMEA process usually does not consider the cascading of failures beyond the first occurrence. For example, it will not consider propagation of a failure in the hydrogen separator into the flash evaporator and the subsequent propagation into the thermal protection subsystem. The FMEA/CIL ground rules restrict the analysis to individual subsystems. Contractor personnel do analyze the effects of a failure in the subject subsystem on other subsystems, but no further. External failures are considered in the redundancy screen,9 but not in the FMEA. The Committee notes the dichotomy between the concern with failure of redundant items, contrasted with the lack of concern in the FMEA over nearly simultaneous failures in separate subsystems which could have an equally critical effect.

The prevailing impression of the Committee is that, although there are several mechanisms that take a partial systems view, and although the level of effort is much greater than it was prior to 51-L, the various analyses do not add up to a truly integrated, total-systems analysis in support of risk assessment. Nor are they linked to the FMEA/CIL in such a way as to compensate for its limitations. The existing risk management process consists primarily of separate, bottom-up lines of analysis, without a thorough top-down, integrated systems analysis.

The Associate Administrator for SRM&QA has been directed by the Administrator to develop a new agency-wide risk management system that integrates the various parts of the risk assessment and management process. This is a promising development.
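The multi-level propagation that the FMEA ground rules stop short of can be pictured as reachability in a dependency graph. The sketch below is purely illustrative: the subsystem names loosely follow the hydrogen-separator example in the text, but the edges are hypothetical, not taken from any NASA analysis.

```python
# Sketch: tracing cascading failures through a hypothetical subsystem
# dependency graph, rather than stopping at first-order effects.
# "A -> B" means a failure in A can propagate an effect into B.
from collections import deque

propagates_to = {
    "hydrogen separator": ["potable water", "flash evaporator"],
    "flash evaporator": ["active thermal control"],
    "active thermal control": ["thermal protection subsystem"],
    "potable water": [],
    "thermal protection subsystem": [],
}

def cascade(start):
    """Return every subsystem reachable from a failure at `start`
    (breadth-first traversal of the propagation graph)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in propagates_to.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(cascade("hydrogen separator")))
```

A first-order-only analysis would report just the two direct effects; the traversal surfaces the second- and third-order consequences as well, which is the gap the Committee identifies.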
It is important for NASA to call attention to the totality of "risk management" as the sum of various processes, including total STS risk assessment, that ultimately must be considered on an integrated basis by line management as well as by SRM&QA.

It may be noted that, of all the organizations and groups observed by the Committee, operations personnel (astronauts and flight controllers) appear to have the broadest and most integrated perspective of the Shuttle system. Flight controllers in training have actually found real problems on spacecraft while performing cross-element analyses. The continuous development and updating of flight rules and procedures is an important source of this perspective. For example, the Mission Operations Directorate (MOD) flight rules sheet now

9 The redundancy screen is a method for documenting the capabilities for redundancy verification:
A - capable of checkout during normal ground turn-around between flights.
B - loss of redundant element is readily detectable in flight.
C - there is a possible single event (e.g., contamination or explosion) which can cause loss of all redundancy.

lists the relevant hazards, FMEAs, and CILs in a matrix format. An experimental system being developed by MOD, the Shuttle Configuration Analysis Program (SCAP) and Failure Analysis Program (FAP), is able to simulate multiple failures and their effects. This system could be useful in integrated risk analysis.

Another strong example of the integrated, systems engineering approach is the Avionics Audit, a series of studies performed by Rockwell since 1979 on selected avionics hardware, software, and Orbiter functions. An audit looks at failures across the STS, including cascading failures and interactions. The output of the audit is fed back into the FMEA/CIL/retention rationale, hazard analysis, etc., to ensure that they are consistent and complete or that a design change is implemented, with all relevant documents being revised accordingly. Both the Avionics Audit and the Critical Functions Assessment are promising techniques. However, they are presently not scoped broadly enough, nor are there enough highly skilled engineers available, with an understanding of both the STS and the audit techniques, to do the job. (We understand that there are tentative plans to expand the Avionics Audit to embrace the entire STS.)

The expansion of effort on integrated analysis is a positive sign. However, the Committee remains concerned that we have not found at Level II a consolidated, integrated STS systems engineering analysis, including system safety analysis, that views the sum of the Shuttle elements as a single system. We hope that, in attempting to develop an agency-wide risk management system, NASA will devise an integrated STS system analysis and assessment process which is closely coupled with the FMEA/CIL and other components of risk management, to ensure assessment of the truly critical safety items in the STS. This would include all combinations of hardware, software, and procedural failures and malperformances, and cascading failures.
Operations personnel should be brought heavily into play in the development of such an integrated system evaluation. Finally, the safety/risk management process should be reviewed to identify ways to improve both the coordination of analysis efforts and the efficiency of the overall process. Care must be taken to assure that each part of the process is necessary and contributes significantly to the overall STS risk management system.

A "top-down" integrated system engineering analysis, including a system safety analysis, that views the sum of the STS elements as a single system should be performed to help identify any gaps that may exist among the various "bottom-up" analyses centered at the subsystem and element levels.

5.8 INDEPENDENCE OF THE SPACE TRANSPORTATION SYSTEM CERTIFICATION AND SOFTWARE VALIDATION AND VERIFICATION PROGRAM

In general, hardware certification and verification, and software validation and verification of STS components are managed and conducted primarily by the same organizational elements responsible for the design and fabrication of the units. Thus, the independence of the certification, validation, and verification processes is questionable. For example:

- The contractor that builds the Orbiters (Rockwell International, STS Division) is also responsible for preparing the documentation and performing the work involved in certification, but does not answer to an entity independent of the NSTS Program with regard to the certification function.

- At Marshall Space Flight Center (MSFC), the Engineering Directorate has the prime responsibility for design requirements for the propulsion elements of STS and also has responsibility for the review and approval of their certification. The Program Office is responsible for the design and development phase as well as for performing the certification activities.
- At the Johnson Space Center (JSC), prime responsibility for design requirements, design and development, and certification for the Orbiter all rest with the Program Office, supported by the Engineering and Operations Directorates of the Center.

- "Independent" validation and verification (IV&V) of software is carried out by the

same contractor (IBM) that produces the STS software, with some checks being made by the Johnson Space Center (JSC).

STS certification methods and responsibilities are described in the Shuttle Master Verification Plan (NSTS 07700-10-MVP-01). This plan now is being revised to define reverification requirements which must be met prior to the return to flight. Figure 5-7 depicts the phases of the process and responsibilities for preparation, review, and approval (i.e., by the contractor or NASA). Figure 5-8 shows the time sequence for the various aspects of the certification-verification process for a subsystem, from the establishment of requirements to operations.

According to the NASA Associate Administrator for SRM&QA, his office is responsible for developing certification plans, reviewing the results, and approving the certification of STS. However, as the following discussion points out, the certification process is actually carried out by the NASA centers and their contractors who are building the STS. Although the general approach to certification is the same at the three centers involved in the STS program (JSC, MSFC, and KSC), there are several differences in detail, especially with respect to the degree of involvement of the SR&QA organizations (Director, JSC SR&QA, personal correspondence).

At MSFC, the Engineering Directorate has the prime responsibility for establishing design requirements and also for reviewing and approving certification. The Program Office has responsibility for the design and development phase as well as for the performance of certification activities. Under the cognizance of the MSFC Chief Engineer, a lead engineer is designated for each element (ET, SRB, SSME) to oversee the certification activity. The MSFC SR&QA office reviews and approves all certification and
verification documentation, and performs an independent verification assessment to insure that all STS elements for which MSFC is responsible are properly certified and qualified for flight.

For the Orbiter, the JSC Program Office subsystem managers (supported by the Engineering and Operations Directorates of the Center) have prime responsibility for design requirements, design and development, and also the review and approval of all aspects of certification of hardware. However, the JSC SR&QA office is responsible for assuring the adequacy of all flight equipment through review and approval of all certification requirements, plans, and test reports. In the case of unresolved differences between the Orbiter Project Manager and the JSC Manager of SR&QA regarding a certification issue, the appeal route is to the Director of JSC. As shown in Figure 5-7, the Orbiter element contractor (Rockwell International, STS Division) is responsible for preparing the documentation and performing the work involved in certification.

At KSC, the verification program used during the establishment of the Shuttle Launch and Landing Site (LLS) was, because of the nature of that facility, quite different from that used for flight hardware. The LLS project at KSC certified that critical ground systems meet design performance requirements. KSC SR&QA and operating personnel also participate in facilities, systems, and equipment certification.

STS Orbiter flight software is developed by IBM under contract to NSTS/JSC. Another group of the same contractor, but not reporting to the development manager, carries out the independent validation and verification (IV&V) of the software produced by the development group. NASA personnel consider the multi-organizational, multi-facility participation in software testing and verification to be a strong feature of their procedure.
They consider that IV&V is adequately performed in two stages: (1) by a group in IBM separate from the development group, and (2) through testing in the Shuttle Avionics Integration Laboratory (SAIL) at JSC. However, the Committee noted very close collaboration at JSC among NASA personnel and support contractors involved in software development, with little clear differentiation of roles and responsibilities. While such an atmosphere promotes teamwork and cooperation, it does not tend to promote the maintenance of adequate checks and balances required for truly independent IV&V.

The Committee agrees that the existing software validation and verification process is well run, with good quality control, and we believe it should be retained. Indeed, performance of STS software has never created a problem in STS operations. However, the Committee questions whether independent validation and verification by a second group within the development contractor is sufficiently independent. The degree of independence certainly would lead to serious questioning by outsiders if significant problems were to develop in the flight software. The Committee further believes that the SAIL, while it may be a good end-to-end test, is not adequate to fulfill the purposes of IV&V. Also,


FIGURE 5-8 The time sequence for the hardware certification-verification process in the NSTS Program (NASA). [Diagram phases: requirements, design, fabrication, acceptance, flight test, operations; certification activities include development verification, qualification, major ground tests, and subsystem certification for the approach and landing test, first manned orbital flight, and the operational phase.]

members of the Committee were told by JSC representatives that, because of limited staff, the JSC SR&QA organization now provides little independent review and oversight of the software activities in the NSTS program.

Based on the Committee's review of STS certification-validation-verification processes, it appears that the work is managed and conducted primarily by the same organizational elements responsible for the design and fabrication of the STS units. The SR&QA organizations seem to have a secondary role. Thus, the degree of independence of the SR&QA hierarchy in the certification process is questionable. This situation is in stark contrast to that which prevails for military aircraft, in which a totally separate organization is responsible for both certification and software IV&V. It also is in contrast with the process prevailing in the commercial aircraft industry, where the Federal Aviation Administration is responsible for certification.
The FAA uses "Designated Engineering Representatives" (DERs) who are employed by the airframe manufacturer but are responsible to the FAA while serving as DERs. This approach provides for independence of the certification process from the design, development and production of the airplanes, while bringing to bear the experience of hands-on engineering practitioners.

Recommendation (8): Responsibility for approval of hardware certification and software IV&V should be vested in entities separate from the NSTS Program structure and the centers directly involved in STS development and operation. However, these organizations should continue to conduct activities supporting certification and IV&V.

5.9 OPERATIONAL ISSUES

Operational aspects of the NSTS program require considerable attention in risk assessment and management. Three aspects are focused on here: launch commit criteria waiver policy, human error as a contributor to risk, and cannibalization of spare parts at KSC.

5.9.1 Launch Commit Criteria Waiver Policy

An average of two Launch Commit Criteria (LCCs) are waived by NASA in the course of each launch. The Committee questions the validity of an operational procedure that "institutionalizes" waivers by routinely permitting established criteria to be violated.

Launch Commit Criteria (LCCs) are technical requirements and conditions pertaining to the STS system, ground systems, and the physical environment that must be met before a launch can proceed. NASA divides LCCs into three classes: mandatory, highly desirable, and desirable. However, all LCCs are subject to waiver based on the judgment of responsible NASA managers, and typically a few (an average of two) are waived for each launch.

To date, no LCC waiver has ever produced a problem on a Shuttle mission. However, Committee members questioned the validity of an operational procedure that "institutionalizes" waivers by routinely permitting established criteria to be violated. There was a general feeling that "waivable" criteria are not valid criteria. NASA officials told the Committee that an average of 2,000 LCCs come into play on a given Shuttle launch, so that the number waived per launch is an insignificant percentage of the total. The great majority of these are apparently not critical.
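The waiver policy discussed in this section reduces to a small piece of decision logic: an LCC's class and the time remaining before launch determine whether a waiver may even be considered. The sketch below is an illustration of that logic, not a NASA procedure; the field names, the 300-second cutoff, and the example criteria are hypothetical, loosely following the three classes and the revised ground rules described in the text.

```python
# Sketch: gating LCC waiver decisions by criterion class and time to launch.
# Classes follow the three named in the text; everything else is invented
# for illustration.
from dataclasses import dataclass

MANDATORY, HIGHLY_DESIRABLE, DESIRABLE = "mandatory", "highly desirable", "desirable"

@dataclass
class LCC:
    name: str
    category: str
    has_contingency_procedure: bool = False   # pre-approved before launch day

def may_waive(lcc, seconds_to_launch, cutoff=300):
    """Return True if a waiver of this LCC may even be considered."""
    if lcc.category == MANDATORY:
        return False            # never waivable, per the Committee's view
    if seconds_to_launch <= cutoff:
        # inside the cutoff, only pre-planned contingencies permit a waiver
        return lcc.has_contingency_procedure
    return True

srb_joint_temp = LCC("SRB field-joint temperature", MANDATORY)
tank_press_rate = LCC("ET pressurization rate", DESIRABLE,
                      has_contingency_procedure=True)

print(may_waive(srb_joint_temp, seconds_to_launch=600))
print(may_waive(tank_press_rate, seconds_to_launch=120))
```

Encoding the gate this way makes the Committee's point concrete: a mandatory criterion is simply not reachable by the waiver path, so no judgment made under time pressure can override it.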
Furthermore, they explained, in most cases NASA engineers know that there is some extra margin of safety between the LCC and the actual reasonable limits of safety, because they have learned more about the systems involved since the time the LCC was established. Thus, a typical LCC waiver represents fine-tuning, for example, a slight deviation in leak rates or pressurization rates. Few such waivers have ever led to design changes. The Committee is not persuaded by these arguments.

As a result of the 51-L accident, NASA has begun revising the ground rules for waivers and reassessing the LCCs across the board. A time will be selected (probably launch minus 5 min.) beyond which waiver of an LCC cannot be executed unless contingency procedures are prescribed in advance, thus forcing a launch scrub. Furthermore, each waiver will now trigger a formal reassessment of the particular LCC that was waived, perhaps resulting in a change to it. Although these changes in policy are appropriate, there are aspects of LCC policy that the changes do not address.

The Committee is uncertain about what criteria are used to establish LCCs initially, especially in the weather and environmental area. For example, ice on the pad at the time of mission 51-L was later shown by films to be a serious hazard; yet there was no LCC governing icing. Similarly, there was not an LCC on temperature at the SRB O-rings, only an unrealistic (as it turned out) LCC on ambient air temperature. The Flight Readiness Review Board for that mission was aware of SRB O-ring erosion on past flights, but did not recognize the effects of temperature on the O-ring.

At the same time, there is a concern that too much faith may be placed in the LCCs. A possible case in point is the Atlas Centaur launch failure of March 1987, in which a decision was made to launch the vehicle into a storm because lightning strikes at the time of launch appeared to be beyond the 5-mile range permitted by the LCCs.
The Atlas was destroyed by lightning shortly after launch, and observers (including NASA personnel) later said that conditions were clearly not suitable for launch.¹⁰ In the view of the Committee, LCCs are designed to permit launch; they should not be allowed to force a launch. Experienced judgment must continue to be exercised. But it would be useful in this regard if LCCs were more accurate and more comprehensive in their definition of allowable limits; in that case they would not be so subject to waiver.

¹⁰ NASA: Report of the Atlas Centaur 67/FLTSATCOM F-6 Investigation Board, 15 July 1987.

We note the U.S. Air Force system for indicating the criticality of flight equipment by a "red cross" (a mandatory NO-GO), "red diagonal" (system not fully operational, but safe to fly), and "red dash" (some inspection not done). A comparable prioritization would be appropriate for NASA's LCCs. Loss of an STS may be much more costly in dollars and lives than loss of any USAF system, and any means of focusing judgment should be welcome. There must be room for experienced judgment; but there must also be inviolable rules that prevent errors in judgment being made under pressure of time on certain critical LCCs. We recognize the objections of launch directors to inviolable criteria; but in our view the best launch director is one who is willing to be conservative and to live with a conservative system.

The Committee welcomes the present review of LCC waiver policy. We believe that the presence of the newly appointed NSTS Deputy Director (Operations) will also help to ensure the application of experienced judgment and knowledge whenever LCC waiver decisions are being made.

Recommendation (9a): The Committee recommends that NASA establish a list of mandatory LCCs which may NOT be waived by anyone. This should comprise the bulk of the LCCs. A limited number of criteria would be separately listed, for special cases, together with a discussion of the circumstances under which they may be waived and who may make the waiver decision.

5.9.2 Human Factors as a Contributor to Risk

Human factors, which are considered in some of the STS hazard analyses, do not appear to be taken into account as the cause of failure modes in the FMEAs.
Since the FMEA is one of the principal safety tools used in the evaluation of the STS design, the Committee believes that the STS design process should explicitly consider and minimize the potential contribution of humans to the initiation of the defined failure modes.

NASA's risk assessment and risk management process for the STS focuses primarily on failure of hardware, and secondarily on software faults and errors. Human error, which can be a major contributing factor in accidents, is accorded relatively little attention in the present risk management system, although it is considered in some of the hazard analyses. While procedural aspects of STS operations are regularly relied upon to justify the retention of critical items, human factors do not appear to be taken into account as a source of failure modes in the preparation of the FMEAs.

Human error can affect both flight operations (through crew operations and flight controller procedures) and ground operations (testing, certification, maintenance, assembly, etc.). Hazard analyses can consider human error in both types of operations activities; but the Committee has not found that hazard analysis is regularly used to assess this element of risk.

Procedures utilized in both ground and flight operations are controlled by formal Configuration Control Boards. Personnel are, of course, trained and certified for the operations that they will carry out. Procedures are verified by a variety of methods, including trainers, simulators, mockups, engineering models, and analysis tools.

The Committee initially had some concerns regarding the lack of involvement of flight operations personnel in engineering redesign decisions and safety reviews, but through discussions with NASA personnel these concerns were largely resolved. However, we remain troubled by aspects of ground operations, with respect to their human error potential.
We note that two of the three fatal spacecraft accidents in the U.S. manned space program to date occurred on the ground, of which one was caused by procedural errors on the part of the ground crew.¹¹ Removal and replacement of parts, test, repair, and all the various ground operations provide enormous potential for error that can lead to serious problems. The potential may be exacerbated by the fact that, at KSC, ground personnel are relied upon to report any errors they make which could induce damage; there is little incentive for self-reporting.

¹¹ Two Shuttle processing workers were asphyxiated and killed in late 1986 during a test involving nitrogen gas. (The Apollo fire in 1967 was not caused by human error, but by a shorted wire which initiated a fire in the pure oxygen atmosphere.)

A draft NASA Handbook on Systems Assurance, recently prepared by the Safety Risk Management Program Office of the Headquarters SRM&QA Safety Division, places new emphasis on human error in risk assessment. In a proposed risk assessment model (Figure 5-9), sensitivity to human error is presented as one factor that contributes to the likelihood of a failure mode occurring. This is a positive sign, but it is as yet far from being implemented in the fabric of NASA system design and safety assurance.

Recommendation (9b): The Committee recommends that the NASA FMEA include human factors among the recognized sources of potential causes of failure modes. This step would provide another valid link between the FMEA and the hazard analysis, which are both, in our view, too tenuously connected.

5.9.3 Cannibalization of Spare Parts

By the time of the Challenger accident, "cannibalization," the removal of parts at the Kennedy Space Center (KSC) from one operational STS element to fulfill spares requirements in another, had become a prevalent feature of STS logistics, thus introducing a variety of failure potentials associated with human error. Cannibalization is not evaluated as a producer of potential failure in either the hazard analysis (where it would be most appropriate) or the FMEA.

NASA initiated a spares program in 1981, as Shuttle test flights began. Early flights were supported with spare parts produced on order, a source of trouble since parts were often not available in a timely fashion. After other Shuttles came on line and as the flight rate increased, parts shortages became increasingly severe. Cannibalization was often the only answer to meet the flight-rate demand. As the President of Rockwell International STS Division said to the Committee, "In the last year of flight, cannibalization was the name of the game. We were robbing Peter to pay Paul all throughout the system." With budgetary constraints and cost overruns a chronic reality, NASA apparently decided
to emphasize STS fabrication and launchings over purchasing adequate spare units; the result was logistics problems.

From a safety standpoint, cannibalization raises many problems. First, having workers enter one vehicle and remove a part presents the danger that they will inadvertently (and perhaps unknowingly) damage an adjacent part of the vehicle. Second, there is the risk that the part itself will be damaged upon removal and transport. Third, there is the chance that the part will be improperly replaced in the vehicle for which it was cannibalized, as well as in the original vehicle when the part is returned or replaced. The latter two possibilities are theoretically covered by post-installation checkout and inspection, but the risk of error increases as the incidence goes up. Workers are required to report any possible damage they cause, but the "honor system" may not be 100% reliable. Finally, cannibalization per se is not explicitly evaluated within the hazard analysis process.

Figure 5-10 shows the incidence of cannibalization over approximately the last year before the accident. It can be seen that at least one-third of the Orbiter Line Replaceable Units (LRUs) flown on some missions were obtained through cannibalization. A NASA official at KSC told the Committee that the problem of spares had become so acute that, if Shuttle flights had continued uninterrupted, KSC would not have been able to sustain operations.

The flight hiatus has given NASA time to improve the spares inventory and to make some needed changes in logistics management. Responsibility for Orbiter logistics has been assigned to KSC. The spares budget has been increased. Furthermore, there has been a sharp drop in planned flight rate, which should reduce the requirement for cannibalization. Also, stricter management controls have been placed on cannibalization, making it unlikely that personnel will readily resort to this practice.
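The kind of supportability trend monitoring discussed here amounts to tracking a simple ratio per processing flow. The sketch below is purely illustrative: the mission labels and LRU counts are hypothetical, and the one-third review threshold merely echoes the level observed on some missions in Figure 5-10, not any NASA policy.

```python
# Illustrative sketch of supportability trend tracking (hypothetical data):
# compute, per processing flow, the fraction of installed Line Replaceable
# Units (LRUs) obtained by cannibalizing another vehicle, and flag flows
# that meet or exceed a review threshold.

def cannibalization_rate(installed, cannibalized):
    """Fraction of installed LRUs that were cannibalized from another vehicle."""
    return cannibalized / installed if installed else 0.0

def flag_flows(flows, threshold=1.0 / 3.0):
    """Return names of flows whose cannibalization rate meets the threshold."""
    return [name for name, installed, cannibalized in flows
            if cannibalization_rate(installed, cannibalized) >= threshold]

# Hypothetical flows: (mission, LRUs installed, LRUs obtained by cannibalization)
flows = [
    ("Mission A", 300, 40),    # ~13% -- below threshold
    ("Mission B", 280, 95),    # ~34% -- flagged for review
    ("Mission C", 310, 120),   # ~39% -- flagged for review
]
print(flag_flows(flows))
```

A real logistics system would, as the new System Integrity Assurance Program requires, collect such data continuously and report trends to management rather than apply a single threshold.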
The program hopes to achieve a level of support in which lack of spares would delay processing no more than 5 percent of the time (the aerospace industry standard). The new NSTS System Integrity Assurance Program specifically prohibits cannibalization except by approval of the chairman of the PRCB, and requires the collection and analysis of supportability trend data in support of logistics management.

Reducing the repair time for spare parts is the fastest way to improve the inventory and reduce cannibalization. The repair processing time is currently too long, but a gradual reduction in flow time is expected to occur.

[Figure 5-9: Proposed risk assessment model, showing sensitivity to human error as one factor contributing to the likelihood of a failure mode occurring. (Source: NASA)]

[Figure 5-10: Incidence of cannibalization of Orbiter Line Replaceable Units over approximately the last year before the accident. (Source: NASA)]

Recommendations (9c): The Committee recommends that NASA maintain its current intense attention toward reducing cannibalization of parts to an acceptable level. We further recommend that adequate funds for the procurement and repair of spare parts be made available by NASA to ensure that cannibalization is a rare requirement. Finally, we recommend that NASA include cannibalization, with its attendant removal and replacement operations, as a potential producer of failure in the integrated risk assessment recommended earlier (Section 5.~).

5.10 OTHER WEAKNESSES IN RISK ASSESSMENT AND MANAGEMENT

5.10.1 The Apparent Reliance on Boards and Panels for Decision Making

The multilayered system of boards and panels in every aspect of the STS may lead individuals to defer to the anonymity of the process and not focus closely enough on their individual responsibilities in the decision chain. The sheer number of STS-related boards and panels seems to produce a mindset of "collective responsibility."

The NSTS Program is a large organization whose mission involves the development, deployment, and operation of a complex space vehicle in a wide range of missions. Associated with each milestone in the development of any NASA space system and its constituent parts, or in the preparation for a space mission, are one or more reviews. These reviews may be made from the standpoint of requirements, engineering design, development status, safety, flight readiness, or resource requirements. Conducting each review is a team, panel, or board, which may or may not be permanently empaneled. As described in Section 3.2.2, in the NSTS Program there are review groups at every level of management, including the contractor organizations. Figure 5-11 depicts the review groups associated with the NSTS FMEA/CIL and hazard analysis processes alone.
There are also boards to review design requirements and certification, software, the Operations and Maintenance Requirements and Specifications Document (OMRSD) and the Operations and Maintenance Instructions (OMI), the Launch Commit Criteria, and mission rules. There are flight readiness reviews at each stage of preparation, with a Launch System Evaluation Advisory Team to assess launch conditions and a Mission Management Team to oversee the actual mission.

The Committee developed a concern about a possible attitudinal problem regarding the decision process on the part of the NASA personnel engaged in it. Given the pervasive reliance on teams and boards to consider the key questions affecting safety, "group democracy" can easily prevail, with the result that individual responsibility is diluted and obscured. Even though presumably the chairman of each group has official responsibility for the decision, most decisions appear to be highly participatory in nature. In a CCB review audited by the Committee, for example, there were 25-35 people present and the role of the chairman was not especially distinct. Each action appeared to be a consensus action by the board.

It is possible that this is a factor in the problem identified by the Rogers Commission: "... a NASA management structure that permitted internal flight safety problems to bypass key Shuttle managers" (Vol. I, p. 82). For example, the Level II PRCB conducts daily and weekly meetings, usually via teleconference, in which as many as 30 people participate. It is certainly conceivable that individuals might be reluctant to express their views or objections fully under such circumstances. Also, passing decisions upward through the ranks of review boards may reduce each chairman's sense that his decisions are crucial.
As a case in point, it is clear from the report of the Rogers Commission, and from statements made to the Committee by NASA personnel involved, that the lines of authority and responsibility in the flight readiness review decision-making chain had become vague by the time of mission 51-L.

In discussing this issue, NASA's Associate Administrator for SRM&QA pointed to the SR&QA directors at the field centers as the individuals with primary responsibility for the safety of the Shuttle system. They are said to have full "responsibility, authority, and accountability." Nevertheless, these individuals do make inputs to larger and higher boards, so that in the end all decisions become collective ones, lacking the crucial mindset of individual accountability.

[Figure 5-11: Review groups associated with the NSTS FMEA/CIL and hazard analysis processes.]

It is possible that a semantic problem is partly at fault here, in that NASA managers often refer to "the board" as being synonymous with its chairman, with respect to decision authority. Nevertheless, a mindset is thereby established in which it is not clear whether these are individual or group decisions. The Committee contrasted the NSTS system with that of the U.S. Air Force, in which the board (including its chairman) makes recommendations to the decision maker. One positive point in favor of NASA's system is that, there, the chairman (who is the decision maker) is required to listen "in public" to all dissenting views.

The Committee recognizes the important role played by the many panels and boards in the NSTS program in providing coordination, resolving problems and technical conflicts, and reviewing and recommending actions. These entities allow the different interests and skill groups to bring forward their inputs, contribute their knowledge, and thus minimize the risk that a proposed action will negatively affect some aspect of the STS.

Recommendation (10a): The Committee recommends that the Administrator of NASA periodically remind all NASA personnel that boards and panels are advisory in nature. He should specify the individuals in NASA, by name and position, who are responsible for making final decisions while considering the advice of each panel and board. NASA management should also see to it that each individual involved in the NSTS Program is completely aware of his/her responsibilities and authority for decision making.

5.10.2 Adequacy of Orbiter Structural Safety Margins

The primary structure of the STS has been excluded, by definition, from the FMEA/CIL process, based on the belief that there is an adequate positive margin of safety. However, the Committee questions whether operating structural safety margins have actually been proven adequate.
Completion of the Model 6.0 loads study and the reevaluation of margins of safety based on these loads will significantly improve NASA's grasp of actual operating margins of safety.

NASA ground rules exclude primary structure from the FMEA/CIL process. NASA has apparently assumed that the structural reliability of the STS (including the Orbiter, External Tank, and Solid Rocket Boosters) is close to 1.00, because the operating loads are believed to be less than the proof load to which the vehicle has been subjected.

It is true that some structures have reliability approaching 1.00; examples include bridges, buildings, and even commercial airliners. But there is a considerable difference between the Shuttle, a first-of-its-kind vehicle operated under unique conditions and challenging environments, and a commercial airliner, which is designed and tested to loads and conditions that are well understood. In addition, in the case of a commercial airliner the certifying agency (FAA) and operator organizations act as independent rule makers and auditors. No such independent check and balance exists for the STS, where NASA controls all functions in-house (including requirements, analysis methods, testing, and certification), primarily within the NSTS program.

The original development plans for the Orbiter, the most complex and vulnerable element and the only manned one, included a conventional structural test program for certification of structural integrity. A complete, full-scale structural test article (an Orbiter vehicle) was to be included, which was to be loaded to 1.4 times the operating limit load in the most critical conditions. (This compares to the conventional value of 1.5 used by the military and the FAA.) Due to budget problems, NASA decided to eliminate one of the planned flight vehicles and convert the static test article (#099, Challenger) to a flight vehicle after a series of proof tests to only 1.20 times the limit load.
Some loading conditions actually did not exceed 1.15 times the limit load. Therefore, the tests did not even verify a 1.4 strength margin over limit loads. Subsequent flight test data and calculations show that in some areas the maximum operating loads are actually 15% to 20% higher than those originally postulated, so that the static proof loading tests demonstrated only approximate limit conditions. Thus, today there is no demonstrated verification of safety margins for critical elements of the Orbiter.

The model of loads and stresses on the Orbiter used in its original design has been revised once. By 1983 even these data had become suspect, and another complete revision of loads using the latest test and analysis data was begun. Calculated strength margins from this study (called Model 6.0) are expected to be available by November 1987.

The Committee believes that the margin of actual strength over maximum expected limit load for critical areas of the Orbiter structure is not well known. Partly this is because loading conditions are complex and unprecedented, and partly it is because very little (if any) of the flight structure was actually tested to failure. The Committee agrees with the decision not to use the FMEA/CIL process on STS structures. However, we remain concerned about the uncertainty in the actual strength margins of safety. The Model 6.0 loads calculation now nearing completion should correct the known discrepancies in external loads. Verification of the Model 6.0 loads by data routinely gathered from an instrumented and calibrated flight vehicle, beginning with the next flight, can help verify the model and establish the margins of safety more definitively. This knowledge will greatly improve NASA's ability to keep Shuttle operations within a safe envelope of structural loads.

Implicit in the safe operation of any such structure is a monitoring system to assure that deterioration of structural integrity does not occur. An effort now underway could add materially to NASA's ability to operate the Orbiter's structure safely over its service life. People with airline experience, working under Rockwell International, are developing a maintenance and inspection plan for the structure. A well-planned periodic inspection of this sort is essential, and is the best preventive for unpleasant occurrences due to structural deterioration or other causes.
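The Committee's concern can be restated as simple arithmetic, using only the factors quoted in this section: certification originally sought a 1.4 factor over limit load, the proof tests reached only 1.20 (and in some conditions 1.15) times the original limit load, and flight loads later proved 15% to 20% higher than originally postulated. The sketch below is our own illustration, not a NASA analysis.

```python
# Sketch of the safety-margin arithmetic. All factors are expressed as
# multiples of the originally postulated limit load.

REQUIRED_FACTOR = 1.4  # structural test factor the program originally intended


def demonstrated_factor(proof_factor, load_growth):
    """Re-express the proof-test factor against the updated limit load.

    proof_factor: multiple of the original limit load reached in static test
    load_growth:  ratio of actual maximum operating load to original estimate
    """
    return proof_factor / load_growth


for proof in (1.20, 1.15):          # proof levels actually achieved
    for growth in (1.15, 1.20):     # observed growth in operating loads
        f = demonstrated_factor(proof, growth)
        print(f"proof {proof:.2f} x limit, loads up {growth:.2f}x: "
              f"demonstrated factor {f:.2f}")
```

Even the best case (1.20 / 1.15, about 1.04) demonstrates little more than limit-load capability, which is why the text concludes that the proof tests verified "only approximate limit conditions" rather than a 1.4 margin.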
Recommendations (10b): The Committee recommends that NASA place a high priority on completion of the Model 6.0 loads, the reevaluation of safety margins for these loads, and the early verification and continued monitoring of the Model 6.0 loads by permanently instrumenting and calibrating at least the next full-scale STS vehicle to fly. We further recommend that NASA complete and implement a comprehensive plan for conducting periodic inspection and maintenance of the structure of the Orbiters throughout the service life of each vehicle.

5.10.3 Software Issues

NASA FMEAs do not assess software as a possible cause of failure modes. There is little involvement of JSC Safety, Reliability and Quality Assurance in software reviews, resulting in little independent quality assurance for software. A large amount of data, much of it flight-specific, must be loaded for each Shuttle mission, but it is not subjected to validation as rigorous as that for the software.

The Shuttle onboard data processing system consists of five general purpose computers (GPCs) with their input and output devices, and memory units. Four of the five GPCs contain the primary software system, known as the Primary Avionics System Software (PASS); the fifth is a redundant computer which contains the Backup Flight System (BFS). The PASS is developed by IBM, and the BFS is built by Rockwell. In addition to flight software code, there are also flight software initialization data, called "I-loads", which are mission-unique parameter values. The basic code is reconfigured for specific missions, with about two such "reconfigured flight loads" per flight.

After the software requirements are approved, three levels of development tests are performed leading to the First Article Configuration Inspection, or FACI. At the FACI milestone, the software package is handed off to the contractor's
verification organization for independent testing, called Independent Validation and Verification (IV&V), which leads to the Configuration Inspection (CI) and delivery to NASA. (The degree of independence of the IV&V was discussed in Section 5.8.) Following mission-specific reconfiguration and testing in the SAIL and other JSC laboratories, the package is ready for Flight Readiness Review.

A Shuttle Avionics System Control Board (SASCB) is the Level II flight software control board, to which the Program Requirements Control Board has delegated responsibility for software configuration control. The Manager of the NSTS Engineering Integration Office chairs this board and signs the flight readiness statement on software; thus he is the focus of configuration control and management authority for software. At Level III there is a Software Control Board, corresponding to the Configuration Control Board for hardware issues.

The testing, control, and performance of STS software seem quite good. Out of some half-million lines of code in the Shuttle flight software, typically an average of one error is discovered beyond the CI. With the emphasis placed on early detection of errors, error rates are quite low throughout the total 10 million-line Shuttle software system. Only once has a software problem disrupted a mission (on STS-7, uncertainty about the effect of installed software code on a particular abort scenario caused a launch scrub). Both the developers and the "independent" certifiers perform their own inspections of the code. Special "code audits" are also carried out to reinspect targeted aspects of the code on a one-time basis, based on criticality, complexity, Discrepancy Reports (DRs), and other considerations. Software quality control includes weekly tracking of DRs through the Configuration Management database (which tracks all faults, their causes and effects, and their disposition); trends of DRs are reported quarterly.

Although generally impressed with the Shuttle software development and testing process, the Committee made a number of specific findings. First, we note that software is not a FMEA/CIL item. NASA personnel state that all software is considered to be Criticality 1, with each problem being fixed as soon as it is detected through testing and simulation. The Committee believes that identification and prediction of software faults or error modes may be feasible by dividing the software into functional modules and then considering the various possible failures (e.g., improper constants, discretes or algorithms, missing or superfluous symbols).

There is little involvement of the JSC SR&QA organization in software reviews, due to the limitations on staff.
As a result, there is little independent quality assurance for software.

Finally, we note that a large amount of data, much of it flight-specific, must be loaded for each Shuttle mission. However, the data and its entry are not validated with the same rigor as in the IV&V of the software.

Recommendations (10c): The Committee recommends that NASA: explore the feasibility of performing FMEAs on software, including the efficacy of identifying and predicting fault and error modes; request JSC SR&QA to provide periodic review and oversight of software from a quality assurance point of view; and provide for validation of input data in a manner similar to software validation and verification.

5.10.4 Differences in Procedures Among NASA Centers

Differences in the procedures being used by the main NASA centers involved in the NSTS Program may reflect an imbalance between the authority of the centers and that of the NSTS Program Office. The Committee is concerned that such an imbalance can lead to serious problems in large programs where two or more centers have major roles in what must be a tightly integrated program, such as the NSTS and Space Station. Without strong, central program direction and integration, the success and safety of these complex programs can be placed in jeopardy.

In March 1986, the NASA Associate Administrator for Space Flight and the Manager of the Level II NSTS Program issued memoranda setting forth NASA's strategy for returning the Space Shuttle safely to flight status. Their orders rescinded all Criticality 1, 1R, and 1S waivers and required that they be resubmitted for approval. The process also required the reevaluation of all FMEA/CILs and retention rationales, as well as hazard analyses. Other instructions required that a contractor be selected for each STS element (that contractor not otherwise being involved in work on the element) to conduct an independent FMEA/CIL.
No specific guidelines were issued by the NSTS Office for the conduct of the independent evaluations; the methods to be used were determined by the NASA centers concerned. Also, the FMEA/CIL reevaluations were initiated using pre-51-L FMEA/CIL instructions, in which there were differences in ground rules between JSC and MSFC. (In October 1986, the NSTS Program Office issued new uniform instructions, NSTS 22206, for the preparation of FMEA/CILs, but it took several months for revised directions to reach the STS contractors.) Thus, some differences emerged in the nature and results of the reevaluation conducted by different contractors.

These differences are especially noticeable with respect to the FMEA/CIL reevaluation procedures. The Committee found that, at MSFC, all contractors had been instructed to conduct a new FMEA, "from scratch." At JSC, the independent contractors were told to prepare a new FMEA, but the prime contractors were instructed to reevaluate the existing FMEA. At KSC, where FMEAs are conducted only on ground support equipment, a single group (not the original designer) was reevaluating each category of FMEA, working with the existing FMEA.

Procedures with respect to the independent reviews also differed. At MSFC, the independent contractor first performed its FMEA and developed any necessary retention rationales; it then compared those results with the FMEAs and retention rationales prepared by the prime contractor and wrote specific Review Item Discrepancies (RIDs) on points of difference or disagreement. At JSC, no RIDs were written and no retention rationales were prepared by the independent contractor. Furthermore, some Orbiter subsystems were initially excluded from the review.

Initially, the Committee was concerned that these differences in procedure might reduce the validity and effectiveness of the FMEA/CIL reevaluation process. However, an audit by the Committee of the documentation and review process used by JSC in the case of the Orbiter indicated that it is a reasonable alternative to the RID process employed by MSFC. Nevertheless, the Committee suggested in its second interim report to NASA (see Appendix C) that the NSTS Program Office "review the FMEA/CIL reevaluation processes as implemented for each STS element to assure itself that any differences will not compromise the quality and completeness of the overall STS FMEA/CIL effort."

This more specific concern for procedural differences led, moreover, to a broader concern over the nature of management control within NASA.
Differences in procedures used by the NASA centers in this context and others (e.g., with respect to the independence of STS certification, as discussed in Section 5.8) lead the Committee to suspect that an imbalance may exist between the authority of the centers and that of the NSTS Program Office. The Committee is concerned that such an imbalance can lead to serious problems in large programs where two or more centers have major roles in what must be a tightly integrated program, such as the NSTS and Space Station. Without strong, central program direction and integration, the success and safety of these complex programs can be placed in jeopardy.

Recommendation (10d): The Administrator should ensure that strong, central program direction and integration of all aspects of the STS are maintained via the NSTS Program Office.

5.10.5 Use of Non-Destructive Evaluation Techniques

Non-destructive evaluation (NDE) tests on the Solid Rocket Motor (SRM) are performed at the manufacturing plant. Subsequent transportation and assembly introduce a risk of debonding and other damage which may not be apparent upon visual inspection. No NDE is done on the SRMs in the "stacked" configuration at the launch facility. New NDE techniques now being developed have potential applicability to the STS.

Problems have been detected by NASA and its contractor on the STS Solid Rocket Motor (SRM) with debonding between the propellant, liner, insulation, and case. In April 1986, a USAF Titan 34D (comparable in design to the SRM) experienced a destructive failure shortly after launch, due to debonding. No such severe consequences have been seen from SRM debonding, but bond line problems are nevertheless viewed as critical failure modes, especially given the redesign of the SRM joints. Voids within the propellant mass are also of concern. Destructive inspection of the SRM (e.g., cutting and probing) is not feasible, so non-destructive methods must be used.
On the SRM, most of these tests are performed at the manufacturing plant; later transportation and assembly introduce a risk of debonding and other damage which may be more difficult to detect at the launch site.

There are essentially two issues here: the techniques employed and the location where inspection is done. Shuttle SRM NDE assessment to date has employed a combination of visual, ultrasonic, and radiographic techniques. The range of NDE techniques considered by NASA (but not necessarily tested) as of January 1987 is shown in Table 5-1. According to NASA's Aerospace Safety Advisory Panel,[12] acoustic and thermographic techniques are

TABLE 5-1 Non-Destructive Evaluation Methods Considered by NASA

Method                 | Looks For                                                                | Remarks
Ultrasonics            | Unbonds: case/insulation, inhibitor/propellant, and propellant/liner     | Propellant/liner to be confirmed
Radial radiography     | Propellant voids/inclusions                                              |
Tangential radiography | Gapped unbonds: propellant/liner, flap bonds, and flap bulb configuration |
Thermography           | Unbonds: case/insulation, inhibitor/propellant, and propellant/liner     | Limited experience base; propellant/liner to be confirmed
Mechanical             | Unbonds: near joint end case/insulation                                  | Complex insulation geometry
Oblique-light video    | Gapped edge unbonds: case/insulation and inhibitor/propellant            | Magnifies and automates visual unbond inspection
Computed tomography    | Gapped unbonds: all intersecting interfaces; propellant voids/inclusions | Long term
Holography             | Unbonds: near joint end case/insulation                                  | Excitation and scale concerns
Acoustic emission      | Unbonds: case/insulation                                                 | Long term

(Source: NASA MSFC)

thought to be those with the greatest near-term potential for improving NDE capabilities with respect to the SRM. Another promising group of techniques is based on X-ray technology. The USAF, in its Titan recovery program, has emphasized NDE techniques including ultrasonic, thermographic, and X-ray.[13] Similar efforts are being pursued in the Navy's Trident program.[14]

With respect to the issue of location, NASA has determined that the "stacked" configuration of the SRM is not amenable to NDE of critical areas using available methods. However, NASA engineers believe that the assembly, rollout, and pad hold-down loads on the SRM will not cause debonding. Therefore, inspections are conducted at key processing points in the plant and at critical SRM segment locations before stacking at Kennedy Space Center. Nevertheless, the Committee remains concerned about the possibility of damage resulting from transportation, assembly, and rollout.
We recognize that NASA is (and has been) paying serious attention to the NDE issue. However, we believe that the technologies are developing rapidly enough that continued close attention is warranted.

Recommendation (10e): The Committee recommends that NASA apply all practicable NDE techniques to the SRM at the launch facility, at the highest possible level of assembly (e.g., SRMs in the "stacked" configuration), and emphasize development of improved NDE methods.

[12] NASA Aerospace Safety Advisory Panel, Annual Report for 1986 (February 1987).
[13] Lt. Col. Frank Gayer, USAF Space Division, personal communication.
[14] Dale Kenemuth, SP-273, Dept. of the Navy, personal communication.

5.11 FOCUS ON RISK MANAGEMENT

The current safety assessment processes used by NASA do not establish objectively the levels of the various risks associated with the failure modes and hazards. It is not reasonable to expect that NASA management or its panels and boards can provide their own detailed assessments of the risks associated with failure modes and hazards presented to them for acceptance. Validation and certification test programs are not planned or evaluated as quantitative inputs to safety risk assessments. Neither are operating conditions and environmental constraints which may control the safety risks adequately defined and evaluated.

In the Committee's view, the lack of objective, measurable assessments in the above areas hinders the implementation of an effective risk management program, including the reduction or elimination of risks.

Throughout its audit the Committee was shown an extensive amount of information related to program flow charts, organizations, review panels and boards, information transmission, and reports. But the Committee did not become aware of an organization and safety-engineering methodology that could effectively provide an objective assessment of risk, as described in Section 4.
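The point about test programs serving as quantitative inputs to risk assessment can be made concrete with a standard zero-failure demonstration argument. The sketch below is illustrative only and is not drawn from the report; the function name is our own. It computes the one-sided upper confidence bound on a per-trial failure probability that a run of consecutive successes, with no failures, actually demonstrates.

```python
def upper_bound_failure_prob(n_successes: int, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on the per-trial failure
    probability after n_successes consecutive successes and no failures.

    Solves (1 - p)^n = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_successes)

# Illustrative: 24 consecutive successful flights with no failures bound the
# per-flight failure probability, at 95% confidence, only to below about 0.12.
bound = upper_bound_failure_prob(24, 0.95)
```

Read in reverse, the same relation tells a certification planner how many failure-free qualification tests are needed to demonstrate a required risk level, which is what planning a test program "as a quantitative input to safety risk assessment" amounts to.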
Throughout the flow of NASA reports and approvals, both

before the 51-L mission and after, judgments are made and statements of assurance given by persons at every level which are based on data and assertions having a wide range of validity. The Committee believes that it is not reasonable to expect program management or NASA Level I management to provide its own in-depth evaluation of presented hazard risks. Nor will other panels or boards be able to do so without the necessary professional staff work being done. That work, in turn, cannot be performed without methods for assessing risk and controlling hazards. The methods must include the establishment of criteria for design margins which are consistent with the acceptable levels of risk.

The Associate Administrator for SRM&QA, in his new plan for management of NASA's SR&QA activities, stipulates that the SR&QA directors of the NASA centers are responsible for assuring the safety of their Center's products and services. However, we conclude that unless the safety organizations at the centers have (1) the appropriate methodology and tools (both analysis programs and personnel), and (2) the authority to establish criteria for safety margins, specific requirements on verification test programs, environmental constraints on operations, and total flight configuration validation, they cannot be held responsible for assuring an acceptable level of safety of flight systems. (In fact, they can never "assure safety," but only assure that the risks have been assessed objectively by approved methodologies, and that they are being controlled to the levels accepted by the appropriate NASA authorities.)

Figure 5-12 shows that even in the current post-51-L planning, the final result of the hazard analysis and safety assessment process is a NASA Space Shuttle Hazards Data Base. Having an approved list of accepted, identified hazards and a sophisticated closed-loop accounting and review system (the SNAP) may be useful.
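One way to see how design-margin criteria can be tied to an accepted level of risk is classical stress-strength interference. The following sketch is ours, under simplifying assumptions (independent, normally distributed load and strength), and is not a method prescribed in the report; the function name and numbers are hypothetical. A margin expressed in combined standard deviations maps directly to a failure probability:

```python
import math

def interference_failure_prob(mean_strength: float, sd_strength: float,
                              mean_load: float, sd_load: float) -> float:
    """P(load > strength) for independent normal load and strength.

    z is the safety margin in standard deviations of (strength - load);
    the standard normal CDF is evaluated via the complementary error function.
    """
    z = (mean_strength - mean_load) / math.sqrt(sd_strength**2 + sd_load**2)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical numbers: a margin of about 3.8 combined standard deviations
# gives a failure probability below 1 in 10,000 per load application.
pf = interference_failure_prob(100.0, 5.0, 70.0, 6.0)
```

Fixing the acceptable failure probability then fixes the required margin, which is exactly the kind of quantitative design-margin criterion the Committee argues the safety organizations must have the methodology and authority to establish.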
However, nearly every catastrophic accident since the beginning of the missile and space programs was caused by some already-identified hazard related to potential failure modes. The essence of safety-risk management, in the Committee's view, is not just the identification and acceptance of potential hazards, nor even the performance of a risk assessment for each failure mode and hazard; it is getting control of the conditions which turn potential into real. The FMEAs, CILs, hazard reports, and safety assessments identify risks, summarize information, reference data, provide status, etc. They do not analyze or establish the risk levels. Neither do they assess quantitatively the validity of the test programs in establishing failure margins, or define the operating conditions or environmental constraints which affect the risk levels.

We believe that the key requirements and concepts contained in various relevant NASA documents (see Section 3, for example) provide a good overall framework within which a comprehensive systems safety and risk management program could be defined and implemented. It is the opinion of the Committee that such a program would require bringing together appropriate activities into a focused "Systems Safety Engineering" (SSE) function at both Headquarters and the centers. This SSE function would apply across the entire set of design, development, qualification and certification, and operations activities of the NSTS. These activities would be an integral engineering element of the NSTS Program. They would involve more than just the preparation of reviews, reports, or data packages. Instead, systems safety engineering would combine the functions of reliability and systems safety analysis. It should be responsible for defining the requirements and procedures, and performing or managing, as appropriate, at least the following functions which comprise the basis of a risk assessment and risk management system:
1. Identification of failure modes and effects
2. Establishment of design criteria for redundancy
3. Identification of hazards and their potential consequences
4. Identification of critical items
5. Evaluation of the probability of occurrence of causes and consequences of failure modes and hazards
6. Establishment of safety-risk level criteria for design margins and hazard controls
7. Design of qualification and certification test programs
8. Objective assessment of safety risks
9. Development of acceptance rationale for retained hazards and hazard reports
10. Specification of environmental and operating constraints at all levels (parts, subsystem,

element, and system) to assure that validated margins are not violated
11. Quantitative evaluation of flight data to update safety margin validations
12. Oversight of quality assurance functions to control safety risks
13. Overall system safety risk assessment and definition of the potential to reduce the level of risk.

[Figure 5-12: NASA NSTS safety analysis, hazard reports, and safety assessment process in 1987, tracing the flow from basic safety policy and hazard analysis methodology documents, through conduct of hazard analyses and FMEA/CILs and review by the safety panels and boards, into the NASA Space Shuttle Hazards Data Base. (Source: NASA JSC SR&QA)]

All of the above systems safety engineering functions (elaborated upon in Appendix F) are necessary both for achieving credible risk assessment and for

defining the risk controls required to justify acceptance of critical failure modes and other hazards. During design and development, the quantitative evaluation of relative risks for each design against acceptable criteria for levels of risk should be considered as an integral part of the systems engineering activity. These activities also would provide a definitive basis for establishing the design margins and operational constraints needed to reduce the overall risk to the accepted level and subsequently control the risk.

Function 13 above (definition of the potential to reduce the level of risk) is an essential input to risk management. The Committee has the impression that changes to the STS often are considered only if they will improve its performance or reduce risks to that level which has previously been accepted in the program. The Committee believes that such risks, accepted in the past, logical as that may have appeared to be at the time, should not continue to be accepted without a concentrated effort to plan and implement a program to remove or reduce these risks.

The magnitude of the preceding tasks points to the need for a large number of highly qualified professional systems safety engineers (i.e., systems engineers with a safety orientation) at NASA and at its major contractors. We were disturbed to learn from the Director of the Safety Division at Headquarters SRM&QA that, as of April 25, 1987, he had only one professional systems safety engineer in his division, and that he expects to add only two more in the near term and four additional ones in the long term. It is troubling to the Committee that this important and extremely complex systems engineering function should be so severely constrained by staff limitations, in light of the cost of the Shuttle and the risk to its crew. Taken together, the tasks listed above have the highest leverage on overall risk assessment and the control of the causes of hazard.
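Function 13, the overall system safety risk assessment, can be sketched in miniature. Assuming independent critical failure modes with the hypothetical per-mission probabilities below (the mode names and numbers are invented for illustration and are not NASA data), the aggregate loss probability and a ranking of contributors follow directly; the ranking is what identifies the "potential to reduce the level of risk":

```python
def mission_loss_probability(mode_probs):
    """Probability that at least one of several assumed-independent
    critical failure modes occurs on a single mission."""
    survive = 1.0
    for p in mode_probs:
        survive *= (1.0 - p)
    return 1.0 - survive

# Hypothetical, illustrative per-mission probabilities for four modes.
modes = {
    "SRM bond line debond": 1e-3,
    "SSME turbopump failure": 2e-3,
    "TPS debond": 5e-4,
    "APU fire": 1e-4,
}

total = mission_loss_probability(modes.values())
# Sorting contributors by probability shows where risk-reduction effort
# has the most leverage; in this invented example the turbopump dominates.
ranked = sorted(modes.items(), key=lambda kv: kv[1], reverse=True)
```

Even a toy aggregation like this goes beyond what the FMEA/CIL and hazard-report paperwork does: it establishes a risk level, and it points at the specific conditions whose control would reduce it.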
Only professionally dedicated systems safety engineers working together can develop the expertise and motivation to carry out these functions properly. They can perform their control of validation and certification programs in an objective way (if not functionally assigned to program organizations). The need for independent entities to perform certification and software IV&V to provide substantiation and confidence was discussed in Section 5.8. This risk-managed approach to the validation and certification functions, including the feedback of flight data, should not be done by those responsible for design and development. They are performance oriented; they generally do not design hardware configurations to facilitate margin validation, and their proposed certification programs usually are not oriented to the demonstration of failure margins.

Finally, it seems to the Committee that it is not managerially reasonable to make an organization responsible for holding system safety to an agreed level of risk without according it responsibility and authority over all of the above functions, which actually control the risks.

Another major element of an overall risk management program is the quality assurance (QA) function. Quality assurance certifies that the hardware and software have been produced to the exact designs which describe the validated and qualified system. The "configuration" includes all aspects of the hardware and software, including the environments which in any way influence the properties of materials, stress margins, or temporal behavior of parts, subsystems, and elements. In 1986, responsibility for policy and oversight of the quality assurance function was assigned to the new office of the Associate Administrator for SRM&QA. This is appropriate, because overall risk management and total systems safety are dependent on the quality assurance function throughout NASA.
The QA function should be performed separately from the systems safety engineering functions (although there is certainly a strong oversight interaction between the two). Quality assurance should be a responsibility of each NASA center (and, of course, each contractor). Its purpose is not to design but to control and assure. As part of this function it should control the entire set of final released engineering documents describing the complete configuration of the system. As the Committee understands it, that is precisely NASA's current practice.

Recommendations (11):

The Committee recommends that NASA consider establishing a focused agency-wide Systems Safety Engineering (SSE) function, at both Headquarters and the centers, which would:

be structured so as to be integrally involved in the entire set of design, development, validation, qualification, and certification activities;

provide a full systems approach to the continuous

identification of safety risks (not just failure modes and hazards) and the objective (quantitative) evaluation of such safety risks;

provide the output of this function to the NASA Program Directors in support of their risk management;

support the Program Directors by providing assurance that their systems are ready for final safety certification to the risk levels established by the NASA Administrator.

The Committee also recommends that the STS risk management program, based in part on the definition of the potential to reduce the level of risk developed by the system safety risk assessment, include a concerted effort to remove or reduce the risks.

After the Challenger accident on January 28, 1986, President Reagan established the Presidential Commission on the Space Shuttle Challenger Accident to investigate the accident and make recommendations for the safe recovery of the Space Transportation System (STS). Among its recommendations, the Commission called upon NASA to review certain aspects of its STS risk assessment effort and to identify items that needed improvement. It further recommended that an audit panel be appointed by the National Research Council to verify the adequacy of the effort.

Post-Challenger Evaluation of Space Shuttle Risk Assessment and Management is the compilation of the conclusions and recommendations of the Committee on Shuttle Criticality Review and Hazard Analysis Audit. It is intended to assist NASA in taking the prudent additional steps which will provide a reasonable and responsible level of flight safety for the Space Shuttle, but has broader applications for other space programs.
