Copyright © 2009. National Academy of Sciences. All rights reserved.
4 Risk Assessment and Risk Management: The Committee's View

4.1 GENERAL CONCEPT
Almost lost in the strong public reaction to the Challenger failure was
the inescapable fact that major advances in mankind's capability to
explore and operate in space (indeed, even in routine atmospheric
flight) will only be accomplished in the face of risk. The risks of
space flight must be accepted by those who are asked to participate in
each flight as well as by those who are responsible for the program. The
Committee believes that the basis for NASA's acceptance of those risks
should stem as much as possible from rationally derived criteria. This
acceptance also should depend very heavily on the quality of the
methodology and the degree of objectivity by which the risks are
determined, as well as the rigor by which the risks are controlled
(i.e., managed).

The Committee began its audit activities by focusing specifically on the
FMEA, the CIL, and the hazard analysis process. However, very early in
the data gathering phase it became clear that NASA's processes for
analyzing failure modes, effects, and hazards could only be understood
and evaluated intelligently when viewed as elements of an overall
program of risk assessment and risk management. In the Committee's view,
any such program should include the following basic elements:

1. A comprehensive method for identifying potential failure modes and
   hazards associated with the system.
2. A specific, quantitative methodology for identifying and assessing
   (or estimating) the safety risks of the system.
3. A risk management process by which the safety risks can be brought
   to levels or values that are acceptable to the final approval
   authority. Risk management includes:
   • establishment of acceptable risk levels;
   • institution of changes in system design or operational methods to
     achieve such risk levels;
   • system validation and certification; and
   • system quality assurance.

In this usage, we define a "safety risk" as the probability (likelihood
or chance) of suffering a particular consequence of a failure mode,
mishap, or hazard. For a large, complex system such as the STS, there is
a set of system risks each of which is comprised of many contributing
risks. Thus, we use the plural "safety risks" of the system, since one
may choose to manage these risks to different levels.
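
The relationship between one system risk and its contributing risks can
be illustrated with a minimal sketch. It assumes that the contributing
risks are statistically independent and that each is expressed as a
per-mission probability; neither assumption comes from this report, and
the function name system_risk and all numeric values are hypothetical.

```python
from math import prod


def system_risk(contributing_probabilities):
    """Probability that at least one contributor produces the consequence.

    Assumes the contributors are statistically independent. This is an
    illustrative simplification, not a method described in the report.
    """
    for p in contributing_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # P(at least one occurs) = 1 - P(none occurs)
    return 1.0 - prod(1.0 - p for p in contributing_probabilities)


# Hypothetical per-mission probabilities for three contributors to a
# single system risk (for example, loss of vehicle). Values are invented.
contributors = [0.001, 0.002, 0.0005]
print(f"system risk: {system_risk(contributors):.6f}")
```

Under these assumptions each system risk (loss of life, loss of vehicle,
loss of equipment) would get its own list of contributors, which is why
the risks can be managed to different levels.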

There are actually two major functions present in the listing above.
Risk assessment is comprised of the first two elements, identification
and assessment of both the failure modes and hazards, and the safety
risks associated with them. Risk assessment is or should be a staff
function, the results of which are provided as input to management. Risk
management, on the other hand (the third element above), must primarily
be a line management function. Within NASA, SRM&QA at Headquarters and
SR&QA at the centers are staff organizations. The Associate
Administrator for SRM&QA reports to the NASA Administrator. Line
management authority for NSTS extends from the Administrator to the
Level I Associate Administrator for Space Flight to the NSTS Program
Director and thence through the Level II Program Office to the Level III
project managers.

The concept of risk assessment and risk management is employed very
explicitly within some private industries and public enterprises engaged
in the engineering development of complex systems. The nuclear power
industry is one such, and the commercial aerospace industry is another.
Within the USAF Systems Command (including the Space Division, which
develops military launch vehicles and spacecraft), risk assessment
consists of a wide range of qualitative and quantitative tools,
including the FMEA and hazard analysis. Risk management is viewed as a
formal process involving the establishment, assessment, and control of
risk to predetermined acceptable levels.

Figure 4-1 illustrates a generic type of program planning and tracking
chart that is used in risk management by the USAF. Levels of risk in the
system, as evaluated by a specific risk assessment methodology, are
plotted against time (and the cost) to correct the problems contributing
to risk. In this generic example, actual risk lags and exceeds the
planned levels of risk for each category of risk, and throughout most of
the program. The planned risk presents a target toward which the system
risk is actively managed. The risk levels assessed at the conceptual
design stage must eventually be evolved, through engineering, down to
levels acceptable to the approval authority (i.e., high level, program
line management). This is accomplished through a "systems safety
engineering" function that is an integral part of the engineering design
and development process from its inception.
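
The comparison that such a tracking chart supports can be sketched in a
few lines. This is only an illustration of the bookkeeping, assuming that
risk is expressed on a single numeric scale; the Milestone class, the
risk_gaps function, the milestone names, and every value are hypothetical
and are not taken from the USAF chart.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    planned_risk: float   # target risk level at this point in the program
    assessed_risk: float  # level produced by the risk assessment method


def risk_gaps(milestones, acceptable_level):
    """Return (name, gap to plan, meets acceptable level) per milestone."""
    report = []
    for m in milestones:
        gap = m.assessed_risk - m.planned_risk  # positive: behind plan
        report.append((m.name, gap, m.assessed_risk <= acceptable_level))
    return report


# Hypothetical values on an arbitrary 0-to-1 risk scale.
program = [
    Milestone("conceptual design", 0.75, 1.0),
    Milestone("preliminary design", 0.5, 0.75),
    Milestone("first flight", 0.125, 0.25),
]

for name, gap, ok in risk_gaps(program, acceptable_level=0.25):
    print(f"{name}: gap to plan {gap:+.3f}, acceptable: {ok}")
```

As in the generic example, the assessed risk here stays above the planned
risk at every milestone, and the gap is the quantity that line management
would act on.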

4.2 NASA'S PROCESS: OVERALL COMMENTS
The fundamental view of risk assessment and management discussed above
took shape over the first few months of the Committee's activities. It
formed a framework within which the Committee could conduct the
subsequent stages of the audit and more confidently evaluate NASA's STS
safety program of which the FMEAs, CILs, and hazard analyses are only a
few important parts. Much of the remainder of this report reflects the
results of our inquiry into specific aspects of the ways in which NASA
assesses and manages risks in the NSTS program. But we believe it is
important, before plunging into specifics, to provide a sense of the
"big picture" within which the Committee conducted its audit, and to
give a general assessment of how NASA's current process (as described in
Section 3) relates to that picture.

4.2.1 NASA Risk Assessment
NASA defines risk as: "the chance (qualitative) of loss of personnel
capability, loss of system, or damage to or loss of equipment or
property." [NHB 5300.4 (1D-2), p. A-4]

To identify potential failure modes and hazards, NASA uses input from
many different sources: analyses, data gathering processes, design
reviews, etc. Figure 4-2, obtained from the SR&QA Office at JSC, lists
most of these sources for the NSTS. (However, the Committee is not aware
of any FMEAs or hazard analyses being conducted on software.) If
employed rigorously, these tools provide a good basis for achieving
element 1 of the three specified in Section 4.1. However, this list of
sources might more appropriately be titled "Identify Potential Failures
and Hazards," because most of the activities listed do not deal with
risk. For example, the failure modes analysis identifies possible
hardware failure modes, but usually says little about the risk
associated with each of them. When the effects analysis is added in,
then part of the input needed to establish risk has been gained, but
still nothing is inferred about the probability of occurrence of either
the failure itself or the various possible effects that might result. A
similar situation occurs in the identification of hazards.

One can categorize failure modes on the basis of the consequences of
their worst-case effects, as is done in a very rough way in the Critical
Items List, for failure modes whose worst-case effects lead (for
example) to loss of life or vehicle. Such a categorization is useful for
calling urgent attention to certain failure modes and their attendant
hazards. Nevertheless, the listing of such items does not establish
their contribution to the various risks of the system. In the NASA
safety process, each item on the CIL has a retention rationale written
for it. These retention rationale statements usually contain information
which could, if used properly, contribute to a process for estimating
the associated risk. However, the rationales appear to be used strictly
as arguments for a waiver of the NSTS requirement that no single-point
Criticality 1 or 1R failure modes be present when a mission is launched
(see Sections 3.4.1 and 5.1).

[FIGURE 4-1 (image not recoverable): system risk levels plotted against
time.]

HAZARD ANALYSES
DESIGN & ENGINEERING STUDIES
DEVELOPMENT & ACCEPTANCE TESTING
SAFETY STUDIES AND ANALYSES
FMEAs, CILs, & LILA
CERTIFICATION TEST AND ANALYSIS
SNEAK CIRCUIT ANALYSES
MILESTONE REVIEWS
FAILURE INVESTIGATIONS
WAIVERS AND DEVIATIONS
WALK-DOWN INSPECTIONS
MISSION PLANNING ACTIVITIES
SOFTWARE REVIEWS
ASTRONAUT DEBRIEFINGS AND CONCERNS
OMRSD/OMI
FLIGHT ANOMALIES
FLIGHT RULES DEVELOPMENT
AEROSPACE SAFETY ADVISORY PANEL
LESSONS LEARNED - OTHER PROGRAMS
ALERTS
CRITICAL FUNCTIONS ASSESSMENT
INDIVIDUAL CONCERNS
HOT LINE
PANEL MEETINGS
SOFTWARE HAZARD ANALYSIS
FAULT TREE ANALYSIS
INSPECTIONS
CHANGE EVALUATION
REVIEW OF MANUFACTURING PROCESS
HUMAN FACTORS ANALYSIS
SIMULATIONS
PAYLOAD HAZARD REPORTS
REAL TIME OPERATION
PAYLOAD INTERFACES
FIGURE 4-2 Techniques for the identification of potential sources of
risk in the NSTS Program (after NASA JSC SR&QA).

Similarly, in NASA's hazard analysis process, hazards are categorized as
to level and status. Hazards are defined as either critical or
catastrophic, depending on whether or not there is time for any possible
emergency action to be taken. Each "closed" hazard is categorized as
being eliminated, controlled, or an "accepted risk." Rationales are
written to justify accepting the uncontrolled hazards; many times the
same rationale is employed that was used for retaining the critical
failure modes (see Section 5.3 for elaboration). However, as in the case
of the CILs, these justifications do not establish the risk levels of
the hazards. Thus, although the term "risk assessment" is used in many
different ways and places in NASA documents and presentations, the
Committee found that nowhere was the total activity described that is
needed to accomplish element 2 in Section 4.1 above (i.e., a
quantitative methodology for assessing safety risks).

In NASA's definition of risk (above), the word "chance" is used as the
measure (or basis of comparison) of the risk. The definition clearly
implies evaluation of a set of risks based on the chance of occurrence
of each of the various consequences described. However, NASA
acknowledges, and our reviews have confirmed, that these "chances" are
not formally or specifically estimated; nor are they documented. Rather,
STS risks are assessed based on subjective judgments and the approval of
qualitative rationales by various board and panel chairmen, and Level II
and I authorities, as described in Section 3. However, many quantitative
engineering analyses and test data relevant to risk assessment are
available and often are used in arriving at what are finally qualitative
subjective judgments. With such a non-specific (i.e., non-value based)
risk acceptance process there is little basis for making objective
comparisons of the several major risk categories associated with the
STS, nor for carrying out risk evaluations by independent agencies.
Neither can one systematically evaluate the results of efforts to reduce
the risk of the various possible losses. Without more objective,
quantifiable measures of relative risk it is not clear how NASA can
expect to implement a truly effective risk management program.

4.2.2 NASA Risk Management
The various NASA documents identified in Sections 3.1 and 3.4, with some
of their key provisions noted, basically describe a framework within
which to operate an effective risk management program. At the core of
such a program is the idea of risk management through the control of
hazards. Residual hazards (risks) that cannot be designed away would be
controlled at least to levels consistent with program objectives and
cost constraints. The definition and analysis of hazards and levels of
risk associated with a system and its operation was to be performed
within a system safety function. Since the effective level of hazard
control was not always expected to be perfect, a "residual hazard risk
analysis" would be performed to provide the retention rationale for
accepting such hazards and for continuing to operate (perhaps with
constraints).

In parallel with and providing inputs to this system safety function is
a reliability activity. This function was to be basically concerned with
establishing a data base for selection of components which would meet
allocated failure probability requirements; performing failure mode and
effects analyses; establishing redundancy criteria and configuration
definitions, maintainability criteria, and life limits; and preparing
critical items lists containing items with single-point failure modes
which could cause catastrophic results.

A third element in the overall safety and risk management program is
quality assurance. This function, as defined by NASA, would be
responsible for assuring that the hardware and software produced for the
system was produced in a controlled way and met all requirements of the
quality control criteria documents. This assurance role also includes
supervision of personnel certification and establishment of
non-destructive testing methods to detect flaws in components and
non-conforming materials.

These functions provide the basic staff capability which line management
can bring to bear on the management of risk in the NSTS Program. NASA's
own explicit view of risk management for the NSTS was described to the
Committee at JSC. It is conceived to be a synthesis of activities in
four broad categories:

• Programmatic
• Engineering/development
• Mission operations
• Product assurance

As depicted in Figure 4-3, activities in all categories are conducted
throughout all phases of the NSTS Program, from concept definition to
flight operations. The risk management process is said to be
characterized by top-down direction and control, with "bottom-up"
response and accountability from the staff organizations and line
management at the NASA centers. The process of risk assessment and
management is described as one of "independent but integrated
participation" by Program management, design/development (project
engineering), operations (Astronaut Office and Mission Operations
Directorate), and SR&QA. These terms are key: the degree of independence
and integration of organizations and functions within the overall
process comprise a major, recurring theme of the discussion presented in
the following Section 5.

4.3 SUMMARY
The basic organizational elements are in place within NASA for assessing
and managing risk; however, there is a need for a change in the scope of
functions and the way that they are carried out. Certain shortcomings in
process and methodology exist which are discussed in the following
section. In particular, there is a fundamental problem in the nature of
and the methods used to develop the overall assessments on which NASA
line management bases its decisions about how to reduce and control risk
in the STS. Also, it appears to the Committee that there is no clear,
formal, and rigorous view among NASA line managers, at least on any
consistent basis, of the nature and goals of risk management.

To reiterate what was said earlier, the Committee believes that risk
management for any system involving complex engineering must be the
responsibility of line management, i.e., (in the case of the NSTS) the
system Program Manager, the Associate Administrator for Space Flight
and, ultimately, the Administrator of NASA. Only this program
management, not the safety organizations, can make judicious use of the
means available to achieve the operational goals while evolving the
safety risks down to acceptable levels, as described earlier. The safety
organizations at NASA centers and Headquarters are staff organizations,
i.e., they can and should be responsible for providing the assessments
of the system's risks. They should also be responsible for assuring that
the activities associated with controlling the risks to the levels
assessed have been carried out and documented. Safety organizations
cannot, however, assure safe operation; they can only assure that the
safety risks have been evaluated by approved, proper, rigorous,
quantitative, and objective methods, and that the system configuration
and its operation are being controlled to those risk levels.

[FIGURE 4-3 (image not recoverable): risk management activities in each
category across the phases of the NSTS Program.]