6 Lessons Learned

Although this report and its recommendations are directed to the NSTS Program, they are of broader applicability. It would be wise to consider the lessons learned by the Committee when structuring a risk assessment and management system for other programs with similar characteristics, such as the Space Station Program. These characteristics would include large size, use of highly complex technology, and major participation by several NASA centers and prime contractors.

The following are generalized conclusions derived from the preceding sections. Numbers in parentheses refer to the principal sections of the report from which the conclusions were derived.

6.1 ELEMENTS OF AND RESPONSIBILITIES FOR RISK ASSESSMENT AND RISK MANAGEMENT

In the Committee's view, any large, complex, multi-center program should entail an overall risk assessment and risk management process which includes the following basic elements:

Risk assessment:
-- A comprehensive method for identifying potential failure modes and hazards associated with the system.
-- A specific, quantitative methodology for identifying and assessing (or estimating) the safety risks of the system.

Risk management:
-- A management process by which the safety risks can be brought to levels or values that are acceptable to the final approval authority. Risk management includes establishment of acceptable risk levels; the institution of changes in system design or operational methods to achieve such risk levels; system validation and certification; and system quality assurance. (4.1)

The Committee believes that risk management must be the responsibility of line management (i.e., the program manager and, ultimately, the Administrator of NASA). Only this program management, not the safety organizations, can make judicious use of the means available to achieve the operational goals while reducing the safety risks to acceptable levels.

The safety organizations at NASA centers and Headquarters are staff organizations; i.e., they can and should be responsible for providing the assessments of a system's risks. They should also be responsible for assuring that the activities associated with controlling the risks to the levels assessed have been carried out and documented. Safety organizations cannot, however, assure safe operation; they can only assure that the safety risks have been properly evaluated, and that the system configuration and operation are being controlled to those risk levels which have been accepted by top management. (4.1, 4.3)

In each such major program, the risk assessment and management processes should be supported by a focused agency-wide Systems Safety Engineering function, at both Headquarters and the centers involved in the program, which would:
-- be structured so as to be integrally involved in the entire set of design, development, validation, and qualification activities;
-- provide a full systems approach to the continuous identification of safety risks (not just failure modes and hazards) and the objective (quantitative) evaluation of such safety risks;
-- provide the output of this function to the program director in support of his risk management process;
-- support the program director by providing assurance that his system is ready for final safety certification to the risk levels established by the NASA Administrator. (5.11)

This focused systems safety engineering would combine the functions of reliability and systems safety analysis. It should be responsible for defining the requirements and procedures, and performing or managing, as appropriate, at least the following functions, which should comprise the basis of a risk assessment and risk management system:

1. Identification of failure modes and effects
2. Establishment of design criteria for redundancy
3. Identification of hazards and their potential consequences
4. Identification of critical items
5. Evaluation of the probability of occurrence of causes and consequences of failure modes and hazards
6. Establishment of safety-risk level criteria for design margins and hazard controls
7. Design of qualification and certification test programs
8. Objective assessment of safety risks
9. Development of acceptance rationale for retained hazards and hazard reports
10. Specification of environmental and operating constraints at all levels (parts, units, subsystem, element, and system) to assure that validated margins are not violated
11. Quantitative evaluation of flight data to update safety margin validations
12. Oversight of quality assurance functions to control safety risks
13. Overall system safety risk assessment and definition of the potential to reduce the level of risk.

All of these systems safety engineering functions (elaborated upon in Appendix F) are necessary both for achieving credible risk assessment and for defining the risk controls required to justify acceptance of critical failure modes and other hazards. During design and development, the quantitative evaluation of relative risks for each design against acceptable criteria for levels of risk should be considered as an integral part of the systems engineering activity. Finally, these activities would provide a definitive basis for establishing the design margins and operational constraints needed to reduce the overall risk to the accepted level and subsequently to control the risk. They also can provide a rational basis for decisions on which risks should be reduced through changes in design or procedures. (5.11)

In controlling risks, there must be a formal, continuing, and iterative linkage between the risk assessment and risk management processes, on the one hand, and the system's engineering change activities, on the other. (5.4)

As a program moves toward its operational phase, a system should be established for the rapid and effective feedback of inspection and test results, and repair and flight data, into the risk assessment, risk management, and decision making processes. In the case of flight programs, this should include ensuring that all mission anomalies detected in real time and from recorded events, as well as those detected during the near-term inspection of any recovered hardware, are promptly fed into the formal risk assessment and management processes for action prior to committing to the next flight; all such anomalies should be called to the immediate attention of launch decision makers. (5.5)

6.2 ESTABLISHMENT OF RESPONSIBILITY FOR PROGRAM DIRECTION AND INTEGRATION

An imbalance between the authority of the NASA centers and that of the Program Office could lead to serious problems in a large program where two or more centers have major roles in what must be a tightly integrated program, such as the STS and Space Station. Without strong, central direction and integration, the success and safety of these complex programs can be placed in jeopardy.
The Administrator of NASA should ensure that strong direction and integration of all aspects of such a program are maintained at Level I via the Program Office. (5.10.4) There also must be clear and unambiguous direction of the program at all levels.
Those responsible for decisions should be designated and known to all. Boards and panels should be advisory to these persons and not decision making bodies in themselves. (5.10.1)

6.3 THE NEED FOR QUANTITATIVE MEASURES OF RELATIVE RISK

Top management and program attention should be focused on those items with the greatest risk to the safety of a system by means of a prioritization of all contributors to the overall risk. (5.2) Acceptable levels of risk in each program should be set by the Administrator of NASA. However, suitable quantitative measures of risk, such as probabilistic risk assessment, are required to objectively define the acceptable levels, track progress toward achieving these levels, and evaluate alternate courses of action to reduce risk. (5.6, 5.11)

6.4 THE NEED FOR INTEGRATED REVIEW AND OVERVIEW IN THE ASSESSMENT OF RISK, AND IN INDEPENDENT EVALUATION OF RETENTION RATIONALES

There should be an integrated review process which provides a comprehensive, overall assessment of risk (including an independent evaluation, constantly updated, of retention rationales) upon which to base any decisions to grant waivers which permit operating with items that appear on the Critical Items List. (5.1, 5.3, 5.[illegible]) A balance is needed between "bottom-up" assessment tools (e.g., FMEA/CIL) and "top-down" analyses (e.g., hazard analyses). In particular, the "top-down" analysis processes must encompass an integrated system-wide engineering analysis, including a system safety analysis. (5.7)

6.5 INDEPENDENCE OF THE CERTIFICATION OF FLIGHT HARDWARE AND OF SOFTWARE VALIDATION AND VERIFICATION

Responsibility for approval of hardware certification and software Independent Validation and Verification (IV&V) should be vested in entities separate from the program management structure and the centers directly involved in the program's development and operation.
However, the latter organizations should continue to conduct activities supporting certification and IV&V. (5.8)

6.6 SAFETY MARGINS FOR FLIGHT STRUCTURES

Safety margins for flight structures should be established which are in consonance with the accepted levels of safety risk for the program. However, great care is needed to properly verify that the margins have been achieved and are maintained in the flight structures. Verification can include the use of analytical models, but should be supported by static tests before flight and, in the case of reusable flight hardware, by continued monitoring in flight through permanently instrumenting, calibrating, and analyzing data from a representative flight system. Also, in the case of reusable hardware and man-rated systems destined to remain in orbit for long periods of time, comprehensive plans should be developed and implemented for conducting periodic inspection and maintenance of the structure of each system throughout the service life of each vehicle or platform. (5.10.2)

6.7 OTHER

There are other important factors in risk assessment and management which have been discussed in this report with respect to the STS as it existed following the Challenger accident. However, they are items which are considered to be less important than those enumerated above or not generally applicable to several other programs. Where applicable, they certainly should be given serious consideration in structuring the risk assessment and management program. These other factors are listed here by title and section reference:

-- Operational Issues (5.9)
   -- Launch Commit Criteria Waiver Policy (5.9.1)
   -- Human Factors as a Contributor to Risk (5.9.2)
   -- Cannibalization of Spare Parts (5.9.3)
-- Other Weaknesses in Risk Assessment and Management (5.10)
   -- Software Issues (5.10.3)
   -- Use of Non-Destructive Evaluation (NDE) Techniques (5.10.5)

For any new program, such as the Space Station, there is the opportunity to structure an optimum risk assessment and management program at the outset which builds on the experience gained in the NSTS Program and assembles those techniques which will be most effective in establishing, monitoring, and controlling risks to accepted levels.
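
The prioritization of risk contributors and the tracking of overall risk against an acceptable level, as called for in Section 6.3, might be illustrated along the following lines. This is a minimal sketch and not a method taken from the report. It assumes that each contributor has been assigned a per-mission failure probability and that contributors fail independently, which a real probabilistic risk assessment would not take for granted. The item names, probabilities, and acceptable level are hypothetical placeholders.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Contributor:
    name: str          # hypothetical item label, not taken from the report
    p_failure: float   # assumed per-mission probability of a critical failure


def overall_risk(items):
    """Probability that at least one contributor fails.

    Assumes the contributors fail independently (a simplifying assumption).
    """
    return 1.0 - prod(1.0 - c.p_failure for c in items)


def prioritize(items):
    """Return the contributors ordered from highest to lowest assumed probability."""
    return sorted(items, key=lambda c: c.p_failure, reverse=True)


def meets_acceptable_level(items, acceptable):
    """True if the combined risk does not exceed the acceptable level."""
    return overall_risk(items) <= acceptable


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    items = [
        Contributor("item A", 0.002),
        Contributor("item B", 0.010),
        Contributor("item C", 0.0005),
    ]
    acceptable = 0.01  # hypothetical level set by the approval authority

    for c in prioritize(items):
        print(f"{c.name}: {c.p_failure:.4f}")
    print(f"overall risk: {overall_risk(items):.4f}")
    print(f"within acceptable level: {meets_acceptable_level(items, acceptable)}")
```

Ranking by assumed probability is the simplest possible ordering. A fuller treatment would also weigh the severity of consequences and the uncertainty in each estimate, and would re-run the evaluation as flight and test data are fed back into the assessment.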