| Copyright © 2012. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
5 National Space Transportation
System Risk Assessment and Risk
Management: Discussion and
Recommendations
5.1 CRITICAL ITEMS LIST RETENTION
RATIONALE REVIEW AND WAIVER
PROCESS
The Committee views the NASA critical
items list (CIL) waiver decision-making process
as being subjective, with little in the way of
formal and consistent criteria for approval or
rejection of waivers. Waiver decisions appear
to be driven almost exclusively by the
design-based FMEA/CIL retention rationale, rather
than being based on an integrated assessment
of all inputs to risk management. The retention
rationales appear biased toward proving that
the design is "safe," sometimes ignoring
significant evidence to the contrary.
Although the Safety, Reliability, and Quality
Assurance (SR&QA) organizations of NASA
collect, verify, and transmit all data related to
FMEA/CIL and hazard analysis results, the
Committee has not found an independent,
detailed analysis or assessment of the CIL
retention rationale which considers all inputs
to the risk assessment process.
As set forth in the NASA documents identified
in Section 3.1, both the performance of the Failure
Modes and Effects Analysis (FMEA) and the
identification of critical items are intended to be carried
out under the aegis of the reliability function. In
principle, the FMEA should be both a design tool
to provide an impetus for design change, and a
tool for the evaluation of the final configuration in
order to define the necessary control points on the
hardware. The identified critical items would
require supporting retention rationale and waivers
as appropriate in order to be included in the overall
as-flown system configuration. How this retention
rationale was to be generated, who developed it,
and who evaluated it against what safety criteria
became crucial questions for the Committee's
review of the whole process.
According to prescribed procedures, the hazard
analyses being performed by the safety function of
SR&QA, and the FMEA and CIL identification
performed by the reliability function, were to come
together in the generation of Mission Safety
Assessment (MSA) reports which would contain
analyses and justification of the retention rationale
for the critical items and their associated "hazards,"
as well as a safety-risk assessment of the resulting
units, subsystems, and systems. The hazard analysis
and Mission Safety Assessment parts of this overall
safety and risk assessment process, as it was
supposed to be done prior to 1986, are shown in Figure
5-1, obtained from JSC's SR&QA.
As Figure 5-1 indicates, according to specified
NASA procedure the CIL retention rationale is to
be used as one of many inputs to the more
comprehensive hazard analysis. In reality, however, the
hazard analysis is often simply a derivative of the
CIL and its retention rationale, and is not used as
a major basis for waiver decisions. Examination
by the Committee showed that often these retention
rationales were simply discussions of the
hardware's specifications, design, and testing. They were
generated primarily by the functional development
engineers responsible for the design. They are
intended to be justifications, and do not, in our
view, provide a true assessment of the risk of the
hazards.
Sometimes the rationale appears to be simply a
collection of judgments that a design should be
safe, emphasizing positive evidence at the expense
of the negative, and thus does not give a balanced
picture of the risk involved. For example, the CIL
retention rationale of December 1982 for the Solid
Rocket Motor (SRM) indicated in support of
retention that: there had been no failures in three
qualification, five development, and ten flight
motors; there had been no leakage in eight static firings
and five STS flights; 1076 Titan III joints
(presumably of similar design) were tested successfully; etc.
Missing from the retention rationale was, among
other points, any discussion of the dissimilarities
between the SRM and Titan III (e.g., insulation
design and combustion pressure on the O-ring);
the O-ring erosion observed in the Titan III program
and on the second STS flight; a failure during an
SRM burst test; and, since the rationale was not
updated, all of the O-ring anomalies seen after
December 1982. Furthermore, in many cases we
reviewed:
· No specific methodology or criteria are
established against which these justifications can
be measured.
· The true margins against the failure modes
often are not defined or explicitly validated.
· The probability of the failure mode is never
established quantitatively.
· Design "fixes" are accepted without being
analyzed and compared with the configuration
they are replacing on the basis of relative risk.
The point is worth reiterating: The retention
rationale is used to justify accepting the design "as
is"; Committee audits of the review process
discovered little emphasis on creative ways to
eliminate potential failure modes.
Since 51-L, there has been a major increase in
the attention and resources given to STS SR&QA
and risk assessment and management functions at
all levels of NASA and its contractors. In 1986,
NASA appointed an Associate Administrator at
Headquarters for Safety, Reliability,
Maintainability, and Quality Assurance (SRM&QA) and charged
him with establishing a NASA-wide safety and risk
management program. To implement this program,
policy directives are being developed relating to
various procedures and operational requirements.
Specific instructions and methodologies to be used
in the conduct of various analyses and assessments,
such as hazard analyses, are being developed.
Independent institutional assessments and audits
will be made of SR&QA activities and technical
effectiveness at each NASA center.
Some important elements of this revamped NASA
safety program, including hazard analysis and
mission safety assessment, are depicted in Figure
5-2, which was obtained from the JSC SR&QA
organization in May 1987. Several things shown
in the figure should be noted. First, there is now a
specific new set of NSTS instructions to all
contractors and NASA organizations for conducting
hazard analyses, and for preparing FMEAs and
CILs for the NSTS (these new instructions affect
the activities in the boxes in Figure 5-2 marked
with an asterisk). Second, it can be seen that the
FMEA/CIL documents are intended to be one of
many inputs into the hazard analysis and Hazard
Report, which in turn are shown as an input into
the Mission Safety Assessment.
However, since (as discussed in Section 4.2) the
Hazard Reports do not provide a comprehensive
risk assessment, nor are they even required to be
an independent evaluation of the retention rationale
stated in the CILs, the Committee believes that
NASA plans, at least for the near term, to
continue using the retention rationale of the CILs
directly and individually as the basis for Criticality
1 and 1R waiver justifications to Levels II and I.
We have indicated this by adding the Criticality 1
and 1R waiver path within the dashed lines on the
left side of Figure 5-2. The current plan is to take
the critical item waiver requests to the PRCB and
Level I via a data package prepared by JSC SR&QA.
It is our impression, however, that most of the
arguments in this data package will still basically
be those contained in the original CIL retention
rationale. Thus, we see too little in the way of an
independent detailed analysis, critique, or
assessment of the risk inherent in Engineering's rationale.
Since mid-1986, NASA and its contractors have
been performing a massive rework of all STS
program FMEAs, updating the resulting CILs, and
reviewing all prior HAs. This new FMEA/CIL effort
has had value in identifying new failure modes that
were missed earlier or introduced through past
changes, and those resulting from new changes
made mandatory before next flight. However, the
new NSTS instructions for preparing FMEA/CILs
(NSTS 22206) have also resulted in a large increase
in the number of Criticality 1 and 1R items. The
Committee believes this new complexity will pose
additional severe problems for both the mechanics
and credibility of the CIL and waiver processes.
FIGURE 5-2 NASA JSC safety analysis, hazard reports, and safety assessment process in 1987, as modified
by the Committee (adapted from NASA JSC SR&QA). [Figure not reproduced. Legend: dashed boxes added
by the Committee; * marks new procedures added since 51-L.]
The strong dependence on the CIL retention
rationales in waiver decisions makes it critical that
they be comprehensive and up to date. It is not
clear to the Committee whether, in the pre-51-L
environment, changes in the STS configuration or
the operational experience base led directly and
surely to review and appropriate updating of the
relevant CIL retention rationale. In the wake of
the 51-L accident, the NSTS program issued a
document (NSTS 22206) which is intended to
strengthen the process for updating the retention
rationale. Once a retention rationale has been
accepted and a waiver granted for a critical item,
any changes to the item itself, the FMEA, or the
CIL that could affect the retention rationale mean
that the CIL must be resubmitted to the Level II/I
PRCB for its approval (NSTS 22206, p. 2-7,
para. 2.2.6). Any change, whether it be to the test
environment, level, procedures, methods, or
frequency, is to be reflected in changes to the retention
rationale. If crew procedures are changed to reduce
risk, corresponding changes are also to be made in
the retention rationale.
The question is whether this updating is
conducted regularly and in a consistently rigorous
fashion. Although this policy is new and may not
yet have been fully imposed in all quarters, NASA
and contractor personnel interviewed by the
Committee seemed variously uncertain about or
unaware of these requirements and how they are met.
Updating the retention rationale seems to many to
be considered a routine bookkeeping chore, of
secondary importance, yet these rationales are the
primary basis for granting waivers.
During its audit the Committee developed a
concern that the FMEA and associated retention
rationale on a given critical item may sometimes
fail to provide data in various important categories
of information, such as the effects of environmental
parameters. The lack of data in a certain case may
or may not be significant with respect to the threat
that item represents. Yet the absence of such data,
even though it resulted in uncertainty, in the past
has sometimes had the effect of bolstering the
rationale for retention and providing unwarranted
confidence in readiness reviews. This problem was
especially in evidence with Mission 51-L. Data
suggesting that temperature was a factor in the
erosion of the O-rings did exist, but (according to
the Rogers Commission) the relevant analyses
apparently were considered to be inconclusive by
those responsible, and these data did not appear
in the retention rationale. Thus, the rationale
implied that there were no data to suggest that
temperature was a problem. Strengthening and
closing the problem reporting loop since the
accident may well reduce the likelihood of similar
future occurrences. Still, we note that the "negative
answer" indicates uncertainty about the issue at
hand. If the uncertainty is crucial to the decision
process, then it implies the need for more
experiments, tests, or analyses to reduce the uncertainty.
(Appendix E includes an analysis of the O-ring
temperature effect and the uncertainty implied by
extrapolation to low temperatures.)
Thus, the Committee's central concerns here are
the reliance on and quality of the retention
rationale, and the fact that we can perceive no
documented, objective criteria for approving or rejecting
proposed waivers. CIL waiver decision making
appears to be subjective, with no consistent, formal
basis for approval or rejection of waivers. All items
are considered and discussed at length during the
CCB and PRCB reviews. It appears that, if no
action item is generated as a result of the review,
the critical item waiver is approved. There was no
formal "approved or disapproved" step in meetings
audited by the Committee, although we are
informed that such approvals do appear in the
minutes of the meetings. NASA managers
emphasize that Level III engineers and their "Level IV"
contractors are accorded a high level of
responsibility and accountability throughout the program,
and that their opinions and analyses are the real
bases for making retention decisions; these
engineers bear the burden of proving that the rationale
is strong enough to justify retention and waiver of
the item.
However, the Committee believes that engineer-
ing judgment on these matters is not enough. Such
judgment is crucial, but it is often too susceptible
to vagaries of attention, knowledge, opinion, and
extraneous pressures to be the sole foundation for
decision making. We are concerned that, for all
the reasons discussed above, without professional,
detailed evaluation against specific criteria for re-
ducing risk (not just review by panels and boards),
the retention rationales can be misleading or even
incorrect regarding the true causes and probabilities
of the failure modes for which retention waivers
are being requested (see discussion of probabilistic
risk assessment in Section 5.6).
Recommendations (1):
The Committee recommends that NASA
establish an integrated review process which provides a
comprehensive risk assessment and an independent
evaluation of the rationale justifying the retention
of Criticality 1/1R and 2/2R items. This integrated
review should include detailed consideration of the
results of hazard analyses and all other inputs to
the risk assessment process, in addition to the
FMEA/CIL retention rationale. Further, the review
process should assure that the waivers and
supporting analyses fully reflect current data and
designs. Finally, NASA should develop formal,
objective criteria for approving or rejecting critical
item waivers.
5.2 CRITICAL ITEMS LIST
PRIORITIZATION AND DISPOSITION
At present, in NASA instructions all
Criticality 1 and 1R items are formally treated
equally, even though many differ substantially
from each other in terms of the probability of
failure or malperformance, and in terms of the
potential for the worst-case effects postulated
in the FMEA to be seen if the particular failure
occurs.
The large number of Criticality 1 and 1R
items at the time of the 51-L accident has since
been substantially increased due to changes in
ground rules for classification and the complete
reevaluation of the entire STS.
The Committee believes that giving equal
management attention to all Criticality 1 and
1R potential failures could be detrimental to
safety if, as is the case, some are extremely
unlikely to occur, or if the probability is very
low that the postulated worst-case
consequences of the failures will result. Treating all
such items equally will necessarily detract from
the attention senior management can give to
the most likely and most threatening failure
modes.
Critical items in the Shuttle system are
categorized according to the consequences of worst-case
failure of that item. However, it has been the case
that within each criticality category no further
ranking is formally made. In practice, managers
do sometimes discriminate within a category, e.g.,
in their decisions regarding those STS items which
should be fixed prior to next flight. Prior to the
51-L accident there were already 2369 Criticality
1 and 1R items (the most critical) present in the
Shuttle system. There has been a substantial
increase in the number of such items, now estimated
by NASA to be 4686, of which 2148 have been
approved by the PRCB (Director, JSC/SR&QA,
personal communication, November 10, 1987).
This increase resulted from the reevaluation of the
entire Space Shuttle system and the new ground
rules specified for the preparation of FMEAs, e.g.,
the carrying of analyses down to the individual
component level (even where multiple, identical
components are involved) and the inclusion of
pressure vessels which were formerly excluded (see
Section 3.5.2). To take just one example, the
number of Criticality 1 and 1R items in the SSME
turbomachinery rose from 8 to 67 under the new
ground rules. In view of this problem, NASA is
now taking steps to prioritize the most critical
items and will reevaluate the current scheme for
defining levels of criticality.
Initially, the reassessment process seemed to the
Committee to be too heavily focused on Level I.
The presence of a very large number of Criticality
1 and 1R items (even admitting that many are
clustered with identical items) obviously places a
heavy demand on the time and attention of key
NASA decision makers and could prevent their
penetrating deeply enough into the analyses
surrounding each item to make a valid decision on all
of them. We were concerned not only about the
workload placed on Level I management, but also
about the danger that crucial technical details might
be lost or obscured as the rationale for retention
was presented at successively higher levels.
Although the same information is presented at the
Level II and I PRCBs, it seemed entirely possible
that technical debates occurring at lower levels
might not be adequately relayed to Level I.
A post-51-L organizational change that shifted
the Level II NSTS Program Director at JSC to Level
I at Headquarters has alleviated these concerns to
some extent. NASA recognized that the waiver
decision-making flow was not ideal, especially
from Level II to Level I. Consequently, the Level
I NSTS Director (who also chairs the Level I PRCB)
now participates in the Level II reviews as a basis
for sign-off at Level I. Thus, there is now a more
direct "hand-off" of concerns and rationales from
Level III to Level I, via Level II. Nevertheless, the
process still places a heavy workload on Level I,
and there is still a danger that important technical
information might be lost in transmission.
The organizational change streamlined the waiver
decision-making process, but it did not help in
handling the large number of Criticality 1 and 1R
items. Many of these items differ substantially from
each other in terms of the probability of failure or
malperformance, and in terms of the possibility
that the worst-case effects postulated in the FMEA
will be seen in the event the particular failure does
occur. (In this connection it might be noted that,
prior to 51-L, 56 Criticality 1 failures occurred on
the Orbiter during flight without any of the
postulated worst-case effects resulting.) Thus, the items
vary considerably in their potential impact on
Shuttle operational safety, i.e., on risk.
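The distinction between a failure occurring and its postulated worst-case effect following can be illustrated numerically. The sketch below is not the Committee's method. It assumes, purely for illustration, that the 56 in-flight Criticality 1 failures cited above can be treated as independent trials sharing one conditional probability of the worst-case effect, which real hardware failures need not satisfy. The function name is hypothetical.

```python
def upper_bound_zero_events(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on a probability p when the event
    was observed 0 times in n_trials independent trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


# 56 failures with no worst-case outcome (figure taken from the text);
# the independence assumption is ours, for illustration only.
bound = upper_bound_zero_events(56)
print(f"95% upper bound on P(worst case | failure): {bound:.3f}")
```

Under these assumptions the bound comes out at roughly 5 percent: the flight record is consistent with a small conditional probability of the worst case, but it cannot by itself show that probability to be negligible.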
Early in its audit the Committee began urging
NASA to find a way to prioritize the Criticality 1
and 1R items (see Appendix C, first interim report).
NASA managers tended to assert that, since all
Criticality 1 and 1R items are (by definition) equally
catastrophic in their consequences, all should be
treated equally and, indeed, we saw evidence in
our audits that they were handled with equal
attention. But it is the position of the Committee
that giving equal management attention to all such
items could be detrimental to safety if (as is the
case) some are extremely unlikely to fail, or the
probability is very low that the postulated
worst-case consequences of the failures will result. The
most likely and most threatening failure modes
merit the most attention. It is illogical to dissociate
the probability of an event or its consequences
from decisions about the management of risk.
For example, in the development of a probabil-
istic risk assessment for a modern nuclear power
plant, fault tree and event tree analyses typically
identify several million potential sequences of events
(including multiple independent failures and cas-
cading failures) that can lead to core melt-down.
However, only 20 to 50 of these sequences con-
tribute significantly to the risk, with five to ten of
them contributing 90% of the risk. These particular
sequences are exhaustively analyzed to identify
ways to substantially reduce the overall risk.
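The idea that a handful of sequences dominates the total risk can be sketched as a simple ranking. The following is a minimal illustration, not a description of any actual probabilistic risk assessment tool. The function name, the sequence labels, and the risk values are all hypothetical and were invented for this example.

```python
from typing import Dict, List, Tuple


def dominant_contributors(
    risk_by_sequence: Dict[str, float], share: float = 0.90
) -> List[Tuple[str, float]]:
    """Return the smallest set of top-ranked sequences whose combined
    risk reaches the given share of the total risk."""
    total = sum(risk_by_sequence.values())
    if total <= 0:
        raise ValueError("total risk must be positive")
    # Rank sequences from largest to smallest risk contribution.
    ranked = sorted(risk_by_sequence.items(), key=lambda kv: kv[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    running = 0.0
    for name, risk in ranked:
        selected.append((name, risk))
        running += risk
        if running >= share * total:
            break
    return selected


# Hypothetical per-sequence risk contributions in arbitrary units.
example = {"seq_a": 50.0, "seq_b": 30.0, "seq_c": 15.0, "seq_d": 3.0, "seq_e": 2.0}
top = dominant_contributors(example)
print([name for name, _ in top])
```

With these made-up numbers, three of the five sequences account for at least 90% of the total, and those are the ones that would merit exhaustive analysis.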
A secondary consideration of the Committee was
the possible impact of the disclosure that, as the
resumption of Shuttle operations nears, there are
more Criticality 1 and 1R items (with all of them
being waived) than there were before the accident.
That perception would not be justified by, and
would not fairly reflect, the real strides in system
safety that have been made since 51-L.
Responding to suggestions on the part of the
Committee, NASA developed and tested a number
of techniques that could be used to prioritize the
CIL on the basis of the relative risk each item
represents. One such scheme, termed the Critical
Item Risk Assessment (CIRA) procedure, was
selected and instructions for its implementation have
now been promulgated throughout the NSTS
program (NSTS 22491, June 19, 1987).
The CIRA procedure is currently qualitative in
nature. Although it employs reliability and test
data to some extent, it is based primarily on
judgments about the degree of threat inherent in
different risk factors. The Committee is concerned
about the potential negative impact on the CIRA
of ambiguous measures of risk and probability.
However, the technique does lend itself to the
incorporation of more rigorous quantitative
measures of risk and probability of occurrence as these
measures are developed for use within NASA. (See
Appendix E for a discussion of CIRA and one
approach to quantitative measures suggested by
the Committee.)
Current plans for the implementation of CIRA,
spelled out by the NSTS Deputy Director (Program)
in a memorandum dated July 21, 1987, are for
STS project managers to prioritize the Criticality
1, 1R, and 1S items in each project after completing
the FMEA/CIL reevaluation and presenting the CIL
at the Level III CCB. By two weeks before Design
Certification Review, each project manager will
provide the NSTS Deputy Director (Program) with
a list of "the 20 items in his project that represent
the greatest risk to the program." The Deputy
Director will then compile and distribute a report.
This assessment effort will run parallel to, and may
not actually affect, the preparations for STS-26
(the next scheduled Shuttle flight). However, "an
alternate course of action" may be chosen for
subsequent missions. The Committee views this
implementation procedure with concern. It does
not appear to reflect a serious concern on the part
of the NSTS Program for the need to prioritize the
CIL by assessing relative risks.
Recommendations (2):
The Committee recommends that the formal
criteria for approving waivers include the
probability of occurrence and probability that the
worst-case failures will result. We further recommend
that NASA establish priorities now among
Criticality 1 and 1R items, taking care not to use
ambiguous measures of risk and probability. NASA
should also modify the definitions of criticality in
terms of the probability of failure and probability
of worst-case effects. Finally, we recommend that
NASA Level I management pay special attention
to those items identified as being of highest priority,
along with the rationale that produced the priority
rating. Responsibility for attending to
lower-priority items within the present Criticality 1 and 1R
categories, when reclassified, should be distributed
to Levels II and III for detailed evaluation and
decision.
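One way to picture the prioritization recommended above is to rank items by the product of the two probabilities named in the recommendation and to reserve Level I attention for the highest-ranked items. The sketch below is only an illustration under that assumption. It is not the CIRA procedure or any NASA scoring rule, and the class, function, item names, and probability values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CriticalItem:
    name: str
    p_failure: float                   # assumed probability that the failure mode occurs
    p_worst_case_given_failure: float  # assumed probability the worst-case effect follows

    @property
    def risk_score(self) -> float:
        # Illustrative score: joint probability of failure and worst-case effect.
        return self.p_failure * self.p_worst_case_given_failure


def assign_review_levels(
    items: List[CriticalItem], level_1_slots: int
) -> Dict[str, List[str]]:
    """Send the highest-scoring items to Level I; distribute the rest lower."""
    if level_1_slots < 0:
        raise ValueError("level_1_slots must be non-negative")
    ranked = sorted(items, key=lambda item: item.risk_score, reverse=True)
    return {
        "Level I": [item.name for item in ranked[:level_1_slots]],
        "Levels II/III": [item.name for item in ranked[level_1_slots:]],
    }


# Hypothetical items and probabilities, invented for illustration.
items = [
    CriticalItem("item_a", 1e-3, 0.5),
    CriticalItem("item_b", 1e-2, 0.01),
    CriticalItem("item_c", 1e-5, 1.0),
]
levels = assign_review_levels(items, level_1_slots=1)
print(levels)
```

Note that in this made-up data the item most likely to fail (item_b) is not the one ranked highest, because its worst-case effect rarely follows. That is the kind of distinction a single criticality category cannot express.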
5.3 HAZARD ANALYSIS AND MISSION
SAFETY ASSESSMENT
NASA hazard analyses currently do not
address the relative probabilities of a particular
hazardous condition arising from failure modes,
human errors, or external situations.
The hazard analysis and the mission safety
assessment do not: address the relative
probabilities of the various consequences which
may result from hazardous conditions; provide
an independent evaluation of the retention
rationales stated in the input CILs; or provide
an overall risk assessment on which to base
the acceptance and control of residual hazards.
Hazard analysis (HA) is intended to be a key
part of NASA's safety and risk management
process. Because it considers hazardous conditions,
whatever their source, it is a top-down analysis
that should encompass the FMEA and other
bottom-up analyses and cover the safety gaps that
these other analyses might leave. In reality,
however, the HA has not played the central role it was
designed to play. Instead, the main focus has been
on the FMEA and its corresponding CIL retention
rationale. These are design-based analyses,
prepared by the project engineering staff. (See Section
5.1.)
The Committee's audit of the FMEA/CIL
reevaluation and hazard analysis review produced,
at first, a somewhat confusing and contradictory
set of perceptions about the relationships between
these safety analyses and the nature of the overall
risk assessment and management process of which
they are a part. Gradually, it became clear that
there were differences between the officially
prescribed process and the real process, as well as
differences in the way the process is perceived by
various NASA personnel, depending on their
function and point of view. Beyond that, there were
also differences among the NASA centers in the
implementation at the detail level.
Figure 5-1 (shown earlier), which was prepared
by the Safety Division at JSC, depicts fairly
accurately the process, as the Committee has come to
understand it, that was prescribed by NASA policy
at the time of the Challenger accident. Here, the
HA is clearly an important element, buttressed by
a number of complementary analyses including the
FMEA/CIL. The ultimate product of the safety
analysis is the Mission Safety Assessment (MSA),
feeding into the deliberations of the various
engineering and readiness review boards. Figure 5-3,
also prepared by the Safety Division at JSC, shows
the process from the perspective of that Division,
focusing on the HA as the central activity. Note
that the FMEA/CIL is listed as one of many inputs
to the hazard analysis. The actual process appears
to be quite different from the one suggested by the
preceding two figures.
During the latter part of 1986 and the first few
months of 1987, our audit led to the impression
that, although some of the FMEA/CILs were inputs
into the HA function, the real risk acceptance
process within NASA operated essentially as shown
in Figure 5-4 (obtained from JSC). One can see
from the diagram that the "Hazard Analysis As
Required" is a dead-end box, with inputs but no
output with respect to waiver approval decisions.
Our impression was supported by subsystem
project managers, engineers, and their functional
management at JSC. Many of them believed that the
CIL path shown in Figure 5-4 was the actual
approval route for retention of designs with
Criticality 1 and 1R failure modes.
A key problem, in our view, is that the risk
assessment shown in the box entitled "Retention
Rationale and Risk Assessment" was not really an
independent assessment of the risk levels by
professional system safety engineers; such individuals
(and they are few in number within NASA) were
"left out of the loop." Neither did the assessment
contain an evaluation of how system hazards
resulting from critical item failure modes would be
controlled. In practice, in most cases reviewed by
the Committee, the retention rationales written on
the CIL forms were simply transferred to the hazard
analysis reports and became the basis for final
acceptance of residual hazards, and for decision
making at Flight Readiness Reviews (FRRs).
NASA does not use the HAs and (in turn) the
MSAs as the basis for the Criticality 1 and 1R
waivers. In fact, HAs for some important
subsystems were not updated for years at a time even
though design changes had occurred or dangerous
failures were experienced in subsystem hardware.
(An example is the 17-inch disconnect valves
between the ET and Orbiter.) The Committee's
audit showed that standards and detailed
instructions for the conduct of HAs were not
consistent throughout the STS program; NSTS
22254 was issued to correct that problem.
In summary, the Committee found in its review
of the HA process that:
1. HAs were done for only the largest subsystems
of the STS; they addressed certain overlays
of hazards but were not traceable to all
failures in units within the subsystems.
2. HAs were not done routinely for each major
subsystem.
3. The HA assumed worst-case consequences
and simply categorized hazard levels (catastrophic
or critical) based on whether there
was time for counter-actions.
4. The HA process called for an independent
evaluation of the HA results. Analyses of
catastrophic and critical hazards were to be
verified using risk assessment techniques.
However, the HAs did not address the relative
probability of occurrence of various failures,
based on actual flight and test information,
nor did they evaluate the validity of the CIL
retention rationale against any formal set of
criteria.
We found that many engineering personnel,
functional managers, and some subsystem managers
were unaware of what tasks must be done
to complete the hazard analysis, did not know
whether they had actually been done, and did not
contribute to them. Some, in fact, believed that
HAs were just an exercise done by reliability and/or
safety people and that they were redundant to
the FMEA/CILs. Their belief appears to be justified,
in that these HA activities did not seem to be
authoritatively in-line as part of a true hazard
control and risk management process. It appears
they were carried out in a relatively sterile environment
outside the mainstream of engineering.
The safety personnel did use the HAs along with
the FMEA/CILs to create Mission Safety Assessments
for the major elements of the STS and for
the overall missions. These MSAs were to provide
"a formal, comprehensive safety report on the final
design of a system." However, in practice, the MSA
reports essentially served as process assurance reports.
They listed the hazards and stated whether
they were eliminated or controlled; compared hardware
parameters with safety specifications; specified
precautions, procedures, training or other safety
requirements; and generally documented compliance
with the various reliability and safety tasks.
They did not provide in-depth quantitative risk
assessments, and relied almost exclusively on the
CILs and HA reports for justification of acceptable
risks.
New design changes and/or flight data were
"examined" and "judged" for safety by various
personnel and boards at NASA Levels III, II, and
I; the vehicles for the approval of changes appear
to have been the FRRs and various special reviews.
The HA and MSA reports were not viewed as
controlling documents on a specific system configuration
which was judged to be safe by the safety
organizations. The initial waivers to fly Criticality
1 and 1R items were not always redone in a timely
way after new data were obtained. Thus, our audit
supports the impression that the hazard analysis is
not used to its fullest advantage and that overall
system safety assessments, based on test and flight
data and on quantitative analyses, are not a part
of the process of accepting critical failure modes
and hazards.
Since the Hazard Report does not provide a
comprehensive risk assessment, or even an independent
evaluation of the retention rationale stated
in the input CILs, we believe the overall process
shown in Figure 5-2, representing NASA's current
plans, has serious shortcomings. The isolation of
the hazard analysis within NASA's risk assessment
and management process to date can be seen as
reflecting the past weakness of the entire safety
organization. For that reason, this issue of the role
of hazard analysis drives to the heart of our most
sweeping conclusion, which is that the information
flow, task descriptions, and functional responsibilities
implied by Figure 5-2 must be modified if
NASA is to achieve a truly effective risk management
process. The reordering of functions which
the Committee recommends is described in detail
in Section 5.11.
OCR for page 68
rently too long, but a gradual reduction in flow
time is expected to occur.
Recommendations (9c):
The Committee recommends that NASA main-
tain its current intense attention toward reducing
cannibalization of parts to an acceptable level. We
further recommend that adequate funds for the
procurement and repair of spare parts be made
available by NASA to ensure that cannibalization
is a rare requirement. Finally, we recommend that
NASA include cannibalization, with its attendant
removal and replacement operations, as a potential
producer of failure in the integrated risk assessment
recommended earlier (Section 5.~).
5.10. OTHER WEAKNESSES IN RISK
ASSESSMENT AND MANAGEMENT
5.10.1 The Apparent Reliance on Boards and Panels
for Decision Making
The multilayered system of boards and panels
in every aspect of the STS may lead individuals
to defer to the anonymity of the process and
not focus closely enough on their individual
responsibilities in the decision chain. The sheer
number of STS-related boards and panels seems
to produce a mindset of "collective responsibility."
The NSTS Program is a large organization whose
mission involves the development, deployment, and
operation of a complex space vehicle in a wide
range of missions. Associated with each milestone
in the development of any NASA space system and
its constituent parts, or in the preparation for a
space mission, are one or more reviews. These
reviews may be made from the standpoint of
requirements, engineering design, development status,
safety, flight readiness, or resource requirements.
Conducting each review is a team, panel,
or board, which may or may not be permanently
empaneled. As described in Section 3.2.2, in the
NSTS Program there are review groups at every
level of management, including the contractor
organizations.
Figure 5-11 depicts the review groups associated
with the NSTS FMEA/CIL and hazard analysis
processes alone. There are also boards to review
design requirements and certification, software, the
Operations and Maintenance Requirements and
Specifications Document (OMRSD) and the Operations
and Maintenance Instructions (OMI), the
Launch Commit Criteria, and mission rules. There
are flight readiness reviews at each stage of preparation,
with a Launch System Evaluation Advisory
Team to assess launch conditions and a Mission
Management Team to oversee the actual mission.
The Committee developed a concern about a
possible attitudinal problem regarding the decision
process on the part of the NASA personnel engaged
in it. Given the pervasive reliance on teams and
boards to consider the key questions affecting
safety, "group democracy" can easily prevail, with
the result that individual responsibility is diluted
and obscured. Even though presumably the chairman
of each group has official responsibility for
the decision, most decisions appear to be highly
participatory in nature. In a CCB review audited
by the Committee, for example, there were 25-35
people present and the role of the chairman was
not especially distinct. Each action appeared to be
a consensus action by the board.
It is possible that this is a factor in the problem
identified by the Rogers Commission: ". . . a NASA
management structure that permitted internal flight
safety problems to bypass key Shuttle managers"
(Vol. I, p. 82). For example, the Level II PRCB
conducts daily and weekly meetings, usually via
teleconference, in which as many as 30 people
participate. It is certainly conceivable that individuals
might be reluctant to express their views or
objections fully under such circumstances. Also,
passing decisions upward through the ranks of
review boards may reduce each chairman's sense
that his decisions are crucial. As a case in point, it
is clear from the report of the Rogers Commission,
and from statements made to the Committee by
NASA personnel involved, that the lines of authority
and responsibility in the flight readiness
review decision-making chain had become vague
by the time of mission 51-L.
In discussing this issue, NASA's Associate Administrator
for SRM&QA pointed to the SR&QA
directors at the field centers as the individuals with
primary responsibility for the safety of the Shuttle
system. They are said to have full "responsibility,
authority, and accountability." Nevertheless, these
individuals do make inputs to larger and higher
boards, so that in the end all decisions become
collective ones, lacking the crucial mindset of
individual accountability.
It is possible that a semantic problem is partly
at fault here, in that NASA managers often refer
to "the board" as being synonymous with its
chairman, with respect to decision authority.
Nevertheless, a mindset is thereby established in
which it is not clear whether these are individual
or group decisions.
The Committee contrasted the NSTS system with
that of the U.S. Air Force, in which the board
(including its chairman) makes recommendations
to the decision maker. One positive point in favor
of NASA's system is that, there, the chairman (who
is the decision maker) is required to listen "in
public" to all dissenting views.
The Committee recognizes the important role
played by the many panels and boards in the NSTS
program in providing coordination, resolving prob-
lems and technical conflicts, and reviewing and
recommending actions. These entities allow the
different interests and skill groups to bring forward
their inputs, contribute their knowledge, and thus
minimize the risk that a proposed action will
negatively affect some aspect of the STS.
Recommendation (10a):
The Committee recommends that the Administrator
of NASA periodically remind all NASA
personnel that boards and panels are advisory in
nature. He should specify the individuals in NASA,
by name and position, who are responsible for
making final decisions while considering the advice
of each panel and board. NASA management
should also see to it that each individual involved
in the NSTS Program is completely aware of his/her
responsibilities and authority for decision making.
5.10.2 Adequacy of Orbiter Structural Safety Margins
The primary structure of the STS has been
excluded, by definition, from the FMEA/CIL
process, based on the belief that there is an
adequate positive margin of safety. However,
the Committee questions whether operating
structural safety margins have actually been
proven adequate.
Completion of the Model 6.0 loads study
and the reevaluation of margins of safety based
on these loads will significantly improve
NASA's grasp of actual operating margins of
safety.
NASA groundrules exclude primary structure
from the FMEA/CIL process. NASA has apparently
assumed that the structural reliability of the STS
(including the Orbiter, External Tank, and Solid
Rocket Boosters) is close to 1.00, because the
operating loads are believed to be less than the
proof load to which the vehicle has been subjected.
It is true that some structures have reliability
approaching 1.00; examples include bridges, buildings,
and even commercial airliners. But there is a
considerable difference between the Shuttle, a
first-of-its-kind vehicle operated under unique conditions
and challenging environments, and a commercial
airliner, which is designed and tested to
loads and conditions that are well understood. In
addition, in the case of a commercial airliner the
certifying agency (FAA) and operator organizations
act as independent rule makers and auditors. No
such independent check and balance exists for the
STS, where NASA controls all functions in-house
(including requirements, analysis methods, testing,
and certification), primarily within the NSTS program.
The original development plans for the Orbiter,
the most complex and vulnerable element, and the
only manned element, included a conventional
structural test program for certification of the
structural integrity. A complete, full-scale structural
test article (an Orbiter vehicle) was to be included,
which was to be loaded to 1.4 times the operating
limit load in the most critical conditions. (This
compares to the conventional value of 1.5 used by
the military and the FAA.) Due to budget problems
NASA decided to eliminate one of the planned
flight vehicles and convert the static test article
(#099, Challenger) to a flight vehicle after a series
of proof tests to only 1.20 times the limit load.
Some loading conditions actually did not exceed
1.15 times the limit load. Therefore, the tests did
not even verify a 1.4 strength margin over limit
loads. Subsequent flight test data and calculations
show that in some areas the maximum operating
loads are actually 15% to 20% higher than those
originally postulated, so that the static proof loading
tests demonstrated only approximate limit
conditions. Thus, today there is no demonstrated
verification of safety margins for critical elements
of the Orbiter.
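The arithmetic behind "only approximate limit conditions" can be made explicit. The following is a minimal illustrative sketch, not a NASA method: it simply divides the proof-test factors quoted above (1.20 and 1.15 times the originally postulated limit load) by the reported growth in operating loads (15% to 20%). The function name is an assumption introduced for illustration.

```python
# Illustrative arithmetic only; the factors are the ones quoted in the text.
def demonstrated_factor(proof_factor, load_growth):
    """Proof-test load as a multiple of the revised operating limit load.

    proof_factor: proof load / originally postulated limit load
    load_growth:  fractional increase of actual loads over postulated ones
    """
    return proof_factor / (1.0 + load_growth)


if __name__ == "__main__":
    for proof in (1.20, 1.15):
        for growth in (0.15, 0.20):
            factor = demonstrated_factor(proof, growth)
            print(f"proof {proof:.2f} x original limit, loads +{growth:.0%}: "
                  f"{factor:.2f} x revised limit")
```

Under these figures the demonstrated factor falls between about 0.96 and 1.04 times the revised limit load, well short of the 1.4 originally planned.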
The model of loads and stresses on the Orbiter
used in its original design has been revised once.
By 1983 even these data had become suspect, and
another complete revision of loads using the latest
test and analysis data was begun. Calculated strength
margins from this study (called Model 6.0) are
expected to be available by November 1987.
The Committee believes that the margin of actual
strength over maximum expected limit load for
critical areas of the Orbiter structure is not well
known. Partly this is because loading conditions
are complex and unprecedented, and partly it is
because very little (if any) of the flight structure
was actually tested to failure. The Committee agrees
with the decision not to use the FMEA/CIL process
on STS structures. However, we remain concerned
about the uncertainty in the actual strength margins
of safety. The Model 6.0 loads calculation now
nearing completion should correct the known
discrepancies in external loads. Verification of the
Model 6.0 loads by data routinely gathered from
an instrumented and calibrated flight vehicle,
beginning with the next flight, can help verify the
model and establish the margins of safety more
definitively. This knowledge will greatly improve
NASA's ability to keep Shuttle operations within
a safe envelope of structural loads.
Implicit in the safe operation of any such structure
is a monitoring system to assure that deterioration
of structural integrity does not occur. An
effort now underway could add materially to
NASA's ability to operate the Orbiter's structure
safely over its service life. People with airline
experience, working under Rockwell International,
are developing a maintenance and inspection plan
for the structure. A well-planned periodic inspection
of this sort is essential, and is the best preventive
for unpleasant occurrences due to structural
deterioration or other causes.
Recommendations (10b):
The Committee recommends that NASA place a
high priority on completion of the Model 6.0 loads,
the reevaluation of safety margins for these loads,
and the early verification and continued monitoring
of the Model 6.0 loads by permanently instrumenting
and calibrating at least the next full scale
STS vehicle to fly. We further recommend that
NASA complete and implement a comprehensive
plan for conducting periodic inspection and maintenance
of the structure of the Orbiters throughout
the service life of each vehicle.
5.10.3 Software Issues
NASA FMEAs do not assess software as a
possible cause of failure modes.
There is little involvement of JSC Safety,
Reliability and Quality Assurance in software
reviews, resulting in little independent quality
assurance for software.
A large amount of data, much of it flight
specific, must be loaded for each Shuttle mission,
but it is not subjected to validation as
rigorous as that for the software.
The Shuttle onboard data processing system
consists of five general purpose computers (GPCs)
with their input and output devices, and memory
units. Four of the five GPCs contain the primary
software system, known as the Primary Avionics
System Software (PASS); the fifth is a redundant
computer which contains the Backup Flight System
(BFS). The PASS is developed by IBM, and the BFS
is built by Rockwell.
In addition to flight software code, there are also
flight software initialization data, called "I-loads",
which are mission-unique parameter values. The
basic code is reconfigured for specific missions,
with about two such "reconfigured flight loads"
per flight. After the software requirements are
approved, three levels of development tests are
performed leading to the First Article Configuration
Inspection, or FACI. At the FACI milestone, the
software package is handed off to the contractor's
verification organization for independent testing,
called Independent Validation and Verification
(IV&V), which leads to the Configuration Inspec-
tion (CI) and delivery to NASA. (The degree of
independence of the IV&V was discussed in Section
5.8.) Following mission-specific reconfiguration and
testing in the SAIL and other JSC laboratories, the
package is ready for Flight Readiness Review.
A Shuttle Avionics System Control Board (SASCB)
is the Level II flight software control board, to
which the Program Requirements Control Board
has delegated responsibility for software configu-
ration control. The Manager of the NSTS Engi-
neering Integration Office chairs this board and
signs the flight readiness statement on software;
thus he is the focus of configuration control and
management authority for software. At Level III
there is a Software Control Board, corresponding
to the Configuration Control Board for hardware
issues.
The testing, control, and performance of STS
software seem quite good. Out of some half-million
lines of code in the Shuttle flight software, typically
an average of one error is discovered beyond the
CI. With the emphasis placed on early detection
of errors, error rates are quite low throughout the
total 10 million-line Shuttle software system. Only
once has a software problem disrupted a mission
(on STS-7, uncertainty about the effect of installed
software code on a particular abort scenario caused
a launch scrub). Both the developers and the
"independent" certifiers perform their own inspections
of the code. Special "code audits" are also
carried out to reinspect targeted aspects of the code
on a one-time basis, based on criticality, complexity,
Discrepancy Reports (DRs), and other considerations.
Software quality control includes weekly
tracking of DRs through the Configuration Management
database (which tracks all faults, their
causes and effects, and their disposition); trends of
DRs are reported quarterly.
Although generally impressed with the Shuttle
software development and testing process, the
Committee made a number of specific findings.
First, we note that software is not a FMEA/CIL
item. NASA personnel state that all software is
considered to be Criticality 1, with each problem
being fixed as soon as it is detected through testing
and simulation. The Committee believes that
identification and prediction of software faults or error
modes may be feasible by dividing the software
into functional modules and then considering the
various possible failures (e.g., improper constants,
discretes or algorithms, missing or superfluous
symbols).
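As a rough illustration of the module-by-module approach suggested above, the sketch below builds a blank worksheet with one row per functional module and candidate failure mode. It is only a sketch under stated assumptions: the module names, the field names, and the function are hypothetical and are not taken from any NASA or contractor procedure; the failure modes are the examples listed in the text.

```python
from itertools import product

# Failure modes named in the text as examples.
FAILURE_MODES = [
    "improper constant",
    "improper discrete",
    "improper algorithm",
    "missing symbol",
    "superfluous symbol",
]

# Hypothetical functional modules, for illustration only.
MODULES = ["guidance", "navigation", "flight_control"]


def build_worksheet(modules, failure_modes):
    """Return one blank worksheet row per (module, failure mode) pair."""
    return [
        {"module": m, "failure_mode": f, "effect": None, "criticality": None}
        for m, f in product(modules, failure_modes)
    ]


worksheet = build_worksheet(MODULES, FAILURE_MODES)
```

Each row would then be filled in by an analyst with the effect of that failure and its criticality, in the same way a hardware FMEA row is completed.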
There is little involvement of the JSC SR&QA
organization in software reviews, due to the limitations
on staff. As a result, there is little independent
quality assurance for software.
Finally, we note that a large amount of data,
much of it flight specific, must be loaded for each
Shuttle mission. However, the data and its entry
are not validated with the same rigor as in the
IV&V of the software.
Recommendations (10c):
The Committee recommends that NASA: explore
the feasibility of performing FMEAs on software,
including the efficacy of identifying and predicting
fault and error modes; request JSC SR&QA to
provide periodic review and oversight of software
from a quality assurance point of view; provide
for validation of input data in a manner similar to
software validation and verification.
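One simple form that validation of mission input data could take is an automated check of each entered value against independently approved limits. The sketch below is an assumption-laden illustration, not a description of any NASA tool: the parameter names, limits, and values are hypothetical, and a real check would cover far more than range and completeness.

```python
def validate_load(values, limits):
    """Check mission data against declared limits.

    values: {parameter name: value} as entered for a mission
    limits: {parameter name: (low, high)} from an approved requirements source
    Returns a list of (parameter, problem) tuples; empty means no findings.
    """
    findings = []
    for name, (low, high) in limits.items():
        if name not in values:
            findings.append((name, "missing"))
        elif not low <= values[name] <= high:
            findings.append((name, "out of range"))
    for name in values:
        if name not in limits:
            findings.append((name, "not in requirements"))
    return findings


# Hypothetical parameters and limits, for illustration only.
limits = {"target_altitude_km": (200.0, 600.0), "inclination_deg": (28.45, 57.0)}
values = {"target_altitude_km": 296.0, "inclination_deg": 75.0, "spare": 1}
findings = validate_load(values, limits)
print(findings)
```

The point of the design is that the limits come from a source separate from the person entering the data, which mirrors the independence sought in software verification.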
5.10.4 Differences in Procedures Among NASA
Centers
Differences in the procedures being used by
the main NASA centers involved in the NSTS
Program may reflect an imbalance between
the authority of the centers and that of the
NSTS Program Office. The Committee is concerned
that such an imbalance can lead to
serious problems in large programs where two
or more centers have major roles in what must
be a tightly integrated program, such as the
NSTS and Space Station. Without strong,
central program direction and integration, the
success and safety of these complex programs
can be placed in jeopardy.
In March 1986, the NASA Associate Administrator
for Space Flight and the Manager of the
Level II NSTS Program issued memoranda setting
forth NASA's strategy for returning the Space
Shuttle safely to flight status. Their orders rescinded
all Criticality 1, 1R, and 1S waivers and required
that they be resubmitted for approval. The process
also required the reevaluation of all FMEA/CILs
and retention rationales, as well as hazard analyses.
Other instructions required that a contractor be
selected for each STS element (that contractor not
otherwise being involved in work on the element)
to conduct an independent FMEA/CIL. No specific
guidelines were issued by the NSTS Office for the
conduct of the independent evaluations; the methods
to be used were determined by the NASA
centers concerned. Also, the FMEA/CIL reevaluations
were initiated using pre-51-L FMEA/CIL
instructions, in which there were differences in ground
rules between JSC and MSFC. (In October 1986,
the NSTS Program Office issued new uniform
instructions, NSTS 22206, for the preparation of
FMEA/CILs, but it took several months for revised
directions to reach the STS contractors.) Thus,
some differences emerged in the nature and results
of the reevaluation conducted by different contractors.
These differences are especially noticeable with
respect to the FMEA/CIL reevaluation procedures.
The Committee found that, at MSFC, all contractors
had been instructed to conduct a new FMEA,
"from scratch." At JSC, the independent contractors
were told to prepare a new FMEA, but the
prime contractors were instructed to reevaluate the
existing FMEA. At KSC, where FMEAs are conducted
only on ground support equipment, a single
group (not the original designer) was reevaluating
each category of FMEA, working with the existing
FMEA. Procedures with respect to the independent
reviews also differed. At MSFC, the independent
contractor first performed its FMEA and developed
any necessary retention rationales; it then compared
those results with the FMEAs and retention
rationales prepared by the prime contractor and
wrote specific Review Item Discrepancies (RIDs)
on points of difference or disagreement. At JSC,
no RIDs were written and no retention rationales
were prepared by the independent contractor.
Furthermore, some Orbiter subsystems were initially
excluded from the review.
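The MSFC-style comparison described above, in which an independent FMEA is set against the prime contractor's and each point of difference is written up, can be pictured as a simple record-by-record comparison. The sketch below is illustrative only; the data layout, item names, and function are hypothetical and do not represent the actual RID format or process.

```python
def find_discrepancies(independent, prime):
    """Compare two FMEA results keyed by (item, failure mode).

    Each argument maps (item, failure_mode) -> criticality string.
    Returns sorted records describing every point of difference.
    """
    discrepancies = []
    for key in sorted(set(independent) | set(prime)):
        ours, theirs = independent.get(key), prime.get(key)
        if ours != theirs:
            discrepancies.append(
                {"item": key[0], "failure_mode": key[1],
                 "independent": ours, "prime": theirs}
            )
    return discrepancies


# Hypothetical entries, for illustration only.
independent = {("valve_A", "fails closed"): "1", ("valve_A", "leaks"): "1R"}
prime = {("valve_A", "fails closed"): "1R", ("pump_B", "seizes"): "2"}
rids = find_discrepancies(independent, prime)
```

A failure mode found by only one of the two analyses shows up with a missing value on the other side, which is the kind of difference a written discrepancy would need to resolve.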
Initially, the Committee was concerned that these
differences in procedure might reduce the validity
and effectiveness of the FMEA/CIL reevaluation
process. However, an audit by the Committee of
the documentation and review process used by JSC
in the case of the Orbiter indicated that it is a
reasonable alternative to the RID process employed
by MSFC. Nevertheless, the Committee suggested
in its second interim report to NASA (see Appendix
C) that the NSTS Program Office "review the
FMEA/CIL reevaluation processes as implemented
for each STS element to assure itself that any
differences will not compromise the quality and
completeness of the overall STS FMEA/CIL effort."
This more specific concern for procedural dif-
ferences led, moreover, to a broader concern over
the nature of management control within NASA.
Differences in procedures used by the NASA centers
in this context and others (e.g., with respect to the
independence of STS certification, as discussed in
Section 5.8) lead the Committee to suspect that an
imbalance may exist between the authority of the
centers and that of the NSTS Program Office. The
Committee is concerned that such an imbalance
can lead to serious problems in large programs
where two or more centers have major roles in
what must be a tightly integrated program, such
as the NSTS and Space Station. Without strong,
central program direction and integration, the suc-
cess and safety of these complex programs can be
placed in jeopardy.
Recommendation (10d):
The Administrator should ensure that strong,
central program direction and integration of all
aspects of the STS are maintained via the NSTS
Program Office.
5.10.5 Use of Non-Destructive Evaluation Techniques
Non-destructive evaluation (NDE) tests on
the Solid Rocket Motor (SRM) are performed
at the manufacturing plant. Subsequent transportation
and assembly introduce a risk of
debonding and other damage which may not
be apparent upon visual inspection. No NDE
is done on the SRMs in the "stacked" config-
uration at the launch facility.
New NDE techniques now being developed
have potential applicability to the STS.
Problems have been detected by NASA and its
contractor on the STS Solid Rocket Motor (SRM)
with debonding between the propellant, liner, insulation,
and case. In April 1986, a USAF Titan
34D (comparable in design to the SRM) experienced
a destructive failure shortly after launch, due
to debonding. No such severe consequences have
been seen from SRM debonding, but bond line
problems are nevertheless viewed as critical failure
modes, especially given the redesign of the SRM
joints. Voids within the propellant mass are also
of concern. Destructive inspection of the SRM (e.g.,
cutting and probing) is not feasible, so non-destructive
methods must be used. On the SRM, most of
these tests are performed at the manufacturing
plant; later transportation and assembly introduce
a risk of debonding and other damage which may
be more difficult to detect at the launch site.
There are essentially two issues here: the techniques
employed and the location where inspection
is done. Shuttle SRM NDE assessment to date has
employed a combination of visual, ultrasonic, and
radiographic techniques. The range of NDE techniques
considered by NASA (but not necessarily
tested) as of January 1987 is shown in Table 5-1.
According to NASA's Aerospace Safety Advisory
Panel, acoustic and thermographic techniques are
thought to be those with the greatest near-term
potential for improving NDE capabilities with
respect to the SRM.12 Another promising group of
techniques is based on X-ray technology. The
USAF, in its Titan recovery program, has emphasized
NDE techniques including ultrasonic, thermographic,
and X-ray.13 Similar efforts are being
pursued in the Navy's Trident program.14

TABLE 5-1 Non-Destructive Evaluation Methods Considered By NASA

Method                  Looks For                                 Remarks
Ultrasonics             Unbonds: case/insulation,                 Propellant/liner to be
                        inhibitor/propellant, and                 confirmed
                        propellant/liner
Radial radiography      Propellant voids/inclusions
Tangential radiography  Gapped unbonds: propellant/liner,
                        flap bonds, and flap bulb
                        configuration
Thermography            Unbonds: case/insulation,                 Limited experience base;
                        inhibitor/propellant, and                 prop./liner to be
                        propellant/liner                          confirmed
Mechanical              Unbonds: near joint end                   Complex insulation
                        case/insulation                           geometry
Oblique-light video     Gapped edge unbonds:                      Magnifies and automates
                        case/insulation and                       visual unbond inspection
                        inhibitor/propellant
Computed tomography     Gapped unbonds: all intersecting          Long term
                        interfaces, propellant
                        voids/inclusions
Holography              Unbonds: near joint end                   Excitation and scale
                        case/insulation                           concerns
Acoustic emission       Unbonds: case/insulation                  Long term

(Source: NASA MSFC)
With respect to the issue of location, NASA has
determined that the "stacked" configuration of the
SRM is not amenable to NDE of critical areas
using available methods. However, NASA engineers
believe that the assembly, rollout, and pad
hold-down loads on the SRM will not cause debonding.
Therefore, inspections are conducted at
key processing points in the plant and at critical
SRM segment locations before stacking at Kennedy
Space Center. Nevertheless, the Committee remains
concerned about the possibility of damage resulting
from transportation, assembly, and rollout.
We recognize that NASA is (and has been) paying
serious attention to the NDE issue. However, we
believe that the technologies are developing rapidly
enough that continued close attention is warranted.
Recommendation (10e):
The Committee recommends that NASA apply
all practicable NDE techniques to the SRM at the
launch facility, at the highest possible level of
assembly (e.g., SRMs in the "stacked" configuration),
and emphasize development of improved
NDE methods.

12 NASA: Aerospace Safety Advisory Panel, Annual Report for 1986
(February 1987).
13 Lt. Col. Frank Gayer, USAF Space Division, personal communication.
14 Dale Kenemuth, SP-273, Dept. of the Navy, personal communication.
5.11 FOCUS ON RISK MANAGEMENT
The current safety assessment processes used
by NASA do not establish objectively the levels
of the various risks associated with the failure
modes and hazards.
It is not reasonable to expect that NASA
management or its panels and boards can
provide their own detailed assessments of the
risks associated with failure modes and hazards
presented to them for acceptance.
Validation and certification test programs
are not planned or evaluated as quantitative
inputs to safety risk assessments. Neither are
operating conditions and environmental constraints
which may control the safety risks
adequately defined and evaluated.
In the Committee's view, the lack of objective,
measurable assessments in the above areas
hinders the implementation of an effective risk
management program, including the reduction
or elimination of risks.
Throughout its audit the Committee was shown
an extensive amount of information related to
program flow charts, organizations, review panels
and boards, information transmission, and reports.
But the Committee did not become aware of an
organization and safety-engineering methodology
that could effectively provide an objective assess-
ment of risk, as described in Section 4. Throughout
the flow of NASA reports and approvals, both
before the 51-L mission and after, judgments are
made and statements of assurance given by persons
at every level which are based on data and assertions
having a wide range of validity. The Committee
believes that it is not reasonable to expect program
management or NASA Level I management to
provide its own in-depth evaluation of presented
hazard risks. Nor will other panels or boards be
able to do so without the necessary professional
staff work being done. That work, in turn, cannot
be performed without methods for assessing risk
and controlling hazards. The methods must include
the establishment of criteria for design margins
which are consistent with the acceptable levels of
risk.
The Associate Administrator for SRM&QA, in
his new plan for management of NASA's SR&QA
activities, stipulates that the SR&QA directors of
the NASA centers are responsible for assuring the
safety of their Center's products and services.
However, we conclude that unless the safety
organizations at the centers have (1) the appropriate
methodology and tools (both analysis programs
and personnel), and (2) the authority to establish
criteria for safety margins, specific requirements on
verification test programs, environmental constraints
on operations, and total flight configuration
validation, they cannot be held responsible for
assuring an acceptable level of safety of flight
systems. (In fact, they can never "assure safety,"
but only assure that the risks have been assessed
objectively by approved methodologies, and that
they are being controlled to the levels accepted by
the appropriate NASA authorities.)
Figure 5-12 shows that even in the current post-
51-L planning, the final result of the hazard analysis
and safety assessment process is a NASA Space
Shuttle Hazards Data Base. Having an approved
list of accepted, identified hazards and a
sophisticated closed-loop accounting and review system
(the SNAP) may be useful. However, nearly every
catastrophic accident since the beginning of the
missile and space programs was caused by some
already-identified hazard related to potential failure
modes. The essence of safety-risk management, in
the Committee's view, is not just the identification
and acceptance of potential hazards, nor even the
performance of a risk assessment for each failure
mode and hazard; it is getting control of the
conditions which turn potential into real. The
FMEAs, CILs, hazard reports, and safety assessments
identify risks, summarize information,
reference data, provide status, etc. They do not analyze
or establish the risk levels. Neither do they assess
quantitatively the validity of the test programs in
establishing failure margins, or define the operating
conditions or environmental constraints which
affect the risk levels.
We believe that the key requirements and concepts
contained in various relevant NASA documents
(see Section 3, for example) provide a good
overall framework within which a comprehensive
systems safety and risk management program could
be defined and implemented. It is the opinion of
the Committee that such a program would require
bringing together appropriate activities into a
focused "Systems Safety Engineering" (SSE) function
at both Headquarters and the centers. This SSE
function would apply across the entire set of design,
development, qualification and certification, and
operations activities of the NSTS. These activities
would be an integral engineering element of the
NSTS Program. They would involve more than just
the preparation of reviews, reports, or data packages.
Instead, systems safety engineering would
combine the functions of reliability and systems
safety analysis. It should be responsible for defining
the requirements and procedures, and performing
or managing, as appropriate, at least the following
functions which comprise the basis of a risk
assessment and risk management system:
1. Identification of failure modes and effects
2. Establishment of design criteria for redundancy
3. Identification of hazards and their potential
consequences
4. Identification of critical items
5. Evaluation of the probability of occurrence
of causes and consequences of failure modes
and hazards
6. Establishment of safety-risk level criteria for
design margins and hazard controls
7. Design of qualification and certification test
programs
8. Objective assessment of safety risks
9. Development of acceptance rationale for
retained hazards and hazard reports
10. Specification of environmental and operating
constraints at all levels (parts, subsystem,
element, and system) to assure that validated
margins are not violated
11. Quantitative evaluation of flight data to
update safety margin validations
12. Oversight of quality assurance functions to
control safety risks
13. Overall system safety risk assessment and
definition of the potential to reduce the level
of risk.
FIGURE 5-12 NASA NSTS safety analysis, Hazard Reports, and safety assessment process in 1987 (NASA
JSC SR&QA). [Flow chart not reproduced. In the original, an asterisk marks new procedures
added since 51-L, and the flow ends in the NASA Space Shuttle Hazards Data Base.]
All of the above systems safety engineering functions
(elaborated upon in Appendix F) are necessary
both for achieving credible risk assessment and for
defining the risk controls required to justify
acceptance of critical failure modes and other hazards.
During design and development, the quantitative
evaluation of relative risks for each design against
acceptable criteria for levels of risk should be
considered as an integral part of the systems
engineering activity. These activities also would
provide a definitive basis for establishing the design
margins and operational constraints needed to
reduce the overall risk to the accepted level and
subsequently control the risk.
Function 13 above (definition of the potential to
reduce the level of risk) is an essential input to risk
management. The Committee has the impression
that changes to the STS often are considered only
if they will improve its performance or reduce risks
to that level which has previously been accepted
in the program. The Committee believes that such
risks, accepted in the past, logical as that may have
appeared to be at the time, should not continue to
be accepted without a concentrated effort to plan
and implement a program to remove or reduce
these risks.
The magnitude of the preceding tasks points to
the need for a large number of highly qualified
professional systems safety engineers (i.e., systems
engineers with a safety orientation) at NASA and
at its major contractors. We were disturbed to
learn from the Director of the Safety Division at
Headquarters SRM&QA that, as of April 25, 1987,
he had only one professional systems safety engineer
in his division, and that he expects to add
only two more in the near term and four additional
ones in the long term. It is troubling to the
Committee that this important and extremely complex
systems engineering function should be so
severely constrained by staff limitations, in light of
the cost of the Shuttle and the risk to its crew.
Taken together, the tasks listed above have the
highest leverage on overall risk assessment and the
control of the causes of hazard. Only professionally
dedicated systems safety engineers working together
can develop the expertise and motivation to
carry out these functions properly. They can perform
their control of validation and certification
programs in an objective way (if not functionally
assigned to program organizations). The need for
independent entities to perform certification and
software IV&V to provide substantiation and
confidence was discussed in Section 5.8. This
risk-managed approach to the validation and
certification functions, including the feedback of flight
data, should not be done by those responsible for
design and development. They are performance
oriented; they generally do not design hardware
configurations to facilitate margin validation, and
their proposed certification programs usually are
not oriented to the demonstration of failure margins.
Finally, it seems to the Committee that it is not
managerially reasonable to make an organization
responsible for holding system safety to an agreed
level of risk without according it responsibility and
authority over all of the above functions, which
actually control the risks.
Another major element of an overall risk
management program is the quality assurance (QA)
function. Quality assurance certifies that the
hardware and software have been produced to the exact
designs which describe the validated and qualified
system. The "configuration" includes all aspects of
the hardware and software, including the environments
which in any way influence the properties
of materials, stress margins, or temporal behavior
of parts, subsystems, and elements.
In 1986, responsibility for policy and oversight
of the quality assurance function was assigned to
the new office of the Associate Administrator for
SRM&QA. This is appropriate, because overall
risk management and total systems safety are
dependent on the quality assurance function
throughout NASA. The QA function should be
performed separately from the systems safety
engineering functions (although there is certainly a
strong oversight interaction between the two).
Quality assurance should be a responsibility of
each NASA center (and, of course, each contractor).
Its purpose is not to design but to control and
assure. As part of this function it should control
the entire set of final released engineering documents
describing the complete configuration of the
system. As the Committee understands it, that is
precisely NASA's current practice.
Recommendations (11):
The Committee recommends that NASA consider
establishing a focused agency-wide Systems
Safety Engineering (SSE) function, at both
Headquarters and the centers, which would:
· be structured so as to be integrally involved in
the entire set of design, development, validation,
qualification, and certification activities;
· provide a full systems approach to the continuous
identification of safety risks (not just failure
modes and hazards) and the objective (quantitative)
evaluation of such safety risks;
· provide the output of this function to the NASA
Program Directors in support of their risk
management;
· support the Program Directors by providing
assurance that their systems are ready for final
safety certification to the risk levels established
by the NASA Administrator.
The Committee also recommends that the STS
risk management program, based in part on the
definition of the potential to reduce the level of
risk developed by the system safety risk assessment,
include a concerted effort to remove or reduce the
risks.