Examples of Combining Information
Prior information is critical in planning and designing efficient operational tests and in the evaluation of system performance when used in combination with information from tests. In this chapter, first we illustrate the importance of exploiting prior knowledge in the test design phase of the operational evaluation process in an example closely related to the Stryker operational test. We then discuss its use more generally in planning the test, selecting the experimental design, and selecting sample sizes for testing. Following this, we explore a variety of techniques in which prior information can be used in combination with test data to provide assessments of system performance.
COMBINING INFORMATION TO IMPROVE TEST DESIGN
In our example, a slightly simplified version of the current operational test plan for Stryker would compare the baseline and Stryker systems across a range of scenarios involving four factors, each at two levels: mission (raid vs. perimeter defense), intensity (medium vs. high), terrain (urban vs. rural), and company pair (A vs. B). A complete factorial design involving all four factors requires testing both the baseline and Stryker systems at 2^4 = 16 combinations, for a total of 32 test cases. While this allows for estimation of the main effects and interactions of all orders, it may be infeasible depending on the availability of resources (number of test replications). Prior information about the nature and direction of the interactions would allow use
of fractional factorial designs to reduce the number of test combinations. Box, Hunter, and Hunter (1978:375) observe that "there tends to be a redundancy in [full factorial designs], redundancy in terms of an excess number of interactions that can be estimated and sometimes in an excess number of variables [components] that are studied. Fractional factorial designs exploit this redundancy."
In the example presented here, prior knowledge that the third-order interaction mission × intensity × terrain is not likely to be large and that company pair is not likely to have a strong interaction with any of the other factors would permit use of a fractional factorial experiment with eight runs (for each system) to test all of the relevant combinations. This would be a 2^(4-1) Resolution IV design in which the factor company pair is aliased¹ with the third-order interaction mission × intensity × terrain. As a consequence, the following sets of two-factor interactions are aliased with each other:
· mission × intensity with terrain × company pair
· mission × terrain with intensity × company pair
· terrain × intensity with mission × company pair
Since prior knowledge suggests that company pair is not likely to interact with any of the factors, the eight-run fractional factorial design presented in Table 2-1 can safely be used to estimate the three two-factor interactions of interest: mission × intensity, mission × terrain, and terrain × intensity. This achieves a reduction of the total number of possible test combinations by half, saving costs and time during the operational testing phase.
Another way of using prior information to reduce the number of test replications is to use knowledge of where changes in the levels of test factors result in more substantial changes in the response under study (e.g., in the current context, the performance of a defense system). By adapting the factor levels accordingly, one can reduce the number of test replications because the response of interest is (approximately) maximized (assuming the information used is correct).
¹The term "aliased" means that the linked effects are not individually estimable given the reduced set of test events, and so one estimates the sum of their joint effects. Given the assumption of company pair not interacting with the other factors, all but one of the aliased effects are assumed to equal zero, thereby permitting the estimation of the remaining effect.
TABLE 2-1 2^(4-1) Resolution IV Fractional Factorial Design

Run  Intensity  Mission  Terrain  Company Pair
1    Medium     Raid     Rural    A
2    Medium     Raid     Urban    B
3    Medium     PD       Rural    B
4    Medium     PD       Urban    A
5    High       Raid     Rural    B
6    High       Raid     Urban    A
7    High       PD       Rural    A
8    High       PD       Urban    B

NOTE: PD represents perimeter defense.
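The construction of Table 2-1 can be sketched in a few lines of code: a full 2^3 factorial is run in intensity, mission, and terrain, and the company-pair column is generated from the defining relation company pair = intensity × mission × terrain. The ±1 coding and the function name below are illustrative choices, not part of the original analysis.

```python
from itertools import product

# Assumed +1/-1 coding for each two-level factor (an illustrative choice).
labels = {
    "Intensity":    {+1: "Medium", -1: "High"},
    "Mission":      {+1: "Raid",   -1: "PD"},
    "Terrain":      {+1: "Rural",  -1: "Urban"},
    "Company Pair": {+1: "A",      -1: "B"},
}

def fractional_factorial_2_4_1():
    """Build the 2^(4-1) Resolution IV design of Table 2-1.

    A full 2^3 factorial is run in intensity, mission, and terrain;
    the fourth column is set by the generator CP = I * M * T, which
    aliases company pair with the three-factor interaction.
    """
    runs = []
    for i, m, t in product([+1, -1], repeat=3):
        cp = i * m * t  # the defining generator
        runs.append({
            "Intensity": labels["Intensity"][i],
            "Mission": labels["Mission"][m],
            "Terrain": labels["Terrain"][t],
            "Company Pair": labels["Company Pair"][cp],
        })
    return runs

for number, run in enumerate(fractional_factorial_2_4_1(), start=1):
    print(number, run)
```

With this coding, the eight generated runs reproduce the rows of Table 2-1 in order.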
Test Planning
Operational testing and evaluation of military systems involve substantial resources and time, and the decisions to be made have important consequences for national security. Given the high stakes, it is critical that operational testing be planned and executed carefully and systematically and that as much relevant prior information as possible be taken into account in designing efficient test plans. It is difficult, and in some cases impossible, to generate useful information from a poorly designed test plan.
Effective test design relies on the crucial prior step of test planning. Within the statistical community, much more attention has been paid to the development of efficient techniques for the design of experiments than to the planning process that precedes it. Hahn (1993) notes:
Experimental design is both an art and a science. The science deals with the mathematics and formalities of developing experimental plans. This is what most of the literature, including numerous articles in this journal, is about. The art of experimental design provides the framework for an effective test program that is maximally responsive . . . to the questions that the investigators wish to answer. It deals with important but seemingly nonstatistical topics such as defining the goals of the [test] program, establishing the proper response and control variables, assuring proper scope and breadth, understanding the various sources of experimental error, appreciating what can and cannot be randomized, and so forth.
Related studies, subject-matter expertise, modeling and simulation, results of developmental testing, and pilot studies all play a major role in this planning process.
Many industrial organizations have recently instituted systematic processes for planning and executing large-scale experiments based on quality management principles such as six sigma. A key component of this process is the use of templates for systematic elicitation and incorporation of prior information. The process involves, for example, developing consensus in identifying key response variables, target values, and ideal functions (i.e., functions that specify the relationship between signals and responses); and documenting subject-matter knowledge and relevant background from past studies. Factors that affect the response variables are similarly identified and classified into control factors and noise variables. Subject-matter expertise or past studies are used to determine the range of factor values and their predicted impact on the response variables, identify constraints such as costs and the feasibility of varying the factors during experiments, and develop strategies for measuring noise variables or for introducing and systematically varying them in the experiment. Some industrial organizations make use of predesign master guide sheets (see, e.g., Coleman and Montgomery, 1993) that query the test designers to specify the objectives of the test, any relevant background issues, response variables, control variables, factors to be held constant, nuisance factors, strong interactions, any further restrictions on test inputs, design preferences, analysis and presentation techniques, and responsibility for coordination.
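As a rough illustration of how such a guide sheet might be represented in software, the sketch below defines a minimal template with sections drawn from the list above. The class and field names are invented for this example; they are not the published Coleman-Montgomery format.

```python
from dataclasses import dataclass, field

@dataclass
class MasterGuideSheet:
    """A minimal, hypothetical pre-design guide sheet: each field is a
    list of free-text entries the test designers are asked to supply."""
    objectives: list = field(default_factory=list)
    background: list = field(default_factory=list)
    response_variables: list = field(default_factory=list)
    control_variables: list = field(default_factory=list)
    held_constant: list = field(default_factory=list)
    nuisance_factors: list = field(default_factory=list)
    expected_interactions: list = field(default_factory=list)
    design_preferences: list = field(default_factory=list)

    def unanswered(self):
        """List the sections not yet filled in, so planning gaps are visible."""
        return [name for name, value in vars(self).items() if not value]

sheet = MasterGuideSheet(
    objectives=["Compare baseline and new system across mission scenarios"])
print(sheet.unanswered())
```

A planning review would then work through the unanswered sections before any design is chosen.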
The systematic processes and the use of prior knowledge are also needed in selecting the design factors to be studied, their levels, and possible interactions. All of these decisions need to be made before selecting an appropriate experimental design.
Selecting the Experimental Design
There are many approaches to designing experiments. For the applications considered in this report, by far the most useful of these are factorial and fractional factorial designs (for details, see Box and Hunter, 1961). This class of experimental designs has very good statistical properties, including balance and robustness, in a broad range of situations. Full factorial designs, however, involve testing all possible combinations, which can lead to an excessive number of test scenarios when the number of factors, or levels per factor, is large. For that reason, fractional factorial designs that examine a carefully selected subset of all possible combinations of design factors are much more cost efficient. There is an extensive literature on this topic (Box, Hunter, and Hunter, 1978; Wu and Hamada, 2000). However,
as mentioned above, prior information about which higher-order interactions are sufficiently small must be used when selecting appropriate fractions of the full factorial designs. Sequential follow-up strategies can verify the validity of these assumptions, although they may not be as useful in the operational test context, given the various constraints on the use of military personnel, test ranges, and other resources.
There is also a large literature on so-called optimal designs. In this approach, the assumption is that the response model is known up to some parameters, and the goal is to estimate either the unknown parameters or the response surface at some design point. An illustrative example is the linear model with explanatory variables X1 and X2:

Y = β0 + β1X1 + β2X2 + ε

The goal of optimal design in this example is to collect various observations of Y at specific design points (X1, X2) that are chosen optimally to maximize either the precision in estimating the regression coefficients (the β's) or the expected response at selected values of X1 and X2, assuming that the linear model is correctly specified. Other optimal designs that correspond to the maximization or minimization of other criteria of interest require prior information about the form of the model, such as the above linear model with no interaction term. In the case of a linear model, the optimal design for estimating the regression coefficients requires testing only at the extremes of the design space. While this leads to good precision if the linear model is a close approximation to the truth, the design is very nonrobust to violations of this assumption. This property of nonrobustness, more generally, is why optimal designs are not used extensively, except in cases where one is very confident about prior information. Related discussions of Bayesian optimal designs examine formal incorporation of prior information about model parameters (Chaloner, 1984).
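The point about testing at the extremes can be illustrated numerically. For a one-variable straight-line model, the determinant of the information matrix X'X measures the joint precision of the coefficient estimates. The sketch below (with illustrative design points, not from the report) shows that placing four runs at the ends of a coded [-1, 1] design space yields a larger determinant than placing them in the interior.

```python
import numpy as np

def det_information(xs):
    """det(X'X) for the straight-line model y = b0 + b1*x at design
    points xs; a larger value means more precise coefficient estimates."""
    X = np.column_stack([np.ones(len(xs)), np.asarray(xs, dtype=float)])
    return np.linalg.det(X.T @ X)

# Four runs on a coded design space [-1, 1]:
extreme = det_information([-1, -1, 1, 1])          # D-optimal for this model
interior = det_information([-0.5, -0.25, 0.25, 0.5])
print(extreme, interior)
```

The extreme design wins decisively here, but it provides no information about curvature, which is exactly the nonrobustness described above.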
Selecting Sample Sizes
Selection of sample sizes is dependent on the objective of the operational test. Is the objective to estimate system performance for specific types of environments of use, or to estimate the average performance across environments of use? Larger samples are needed for the former test objective. If a confirmatory hypothesis test is to be used as a basis for a decision on system promotion, the statistical power of the test against important alternative hypotheses concerning system performance (such as modestly failing to meet a requirement) needs to be computed and related to the costs and benefits of making incorrect decisions regarding promotion. The statistical power will be a function of the significance level of the hypothesis test in question, but, more importantly, it will be a function of the variance of the test statistic (e.g., average failure rate). The variance of the test statistic is not directly measured prior to carrying out the operational test; however, it can often be indirectly estimated through use of development test information, pilot studies, or variances estimated for similar systems and adjusted through the use of engineering judgment. Such indirect estimates are valuable in judging, prior to an operational test, whether the test size will be adequate to support significance testing used for this confirmatory purpose. When such an analysis suggests that test sizes sufficient for this purpose are not likely to be feasible given costs, models for combining information should be examined as a method for reducing variances.
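As a simple illustration of the power computation described above, the sketch below uses the usual normal approximation for a one-sided test on a mean failure rate. All numbers (the requirement, the alternative, the standard deviations, and the test size) are hypothetical, and the smaller standard deviation stands in for the variance reduction that combining information might provide.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Approximate power of a one-sided z-test of H0: mu = mu0 against
    the alternative mu = mu1 > mu0, with test-statistic sd sigma and n runs."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    shift = (mu1 - mu0) * sqrt(n) / sigma
    return 1 - NormalDist().cdf(z_alpha - shift)

# Hypothetical numbers: requirement mu0 = 0.20 failures/hour, alternative 0.30.
for sigma in (0.25, 0.15):  # combining information might shrink sigma
    print(sigma, round(power_one_sided(0.20, 0.30, sigma, n=16), 3))
```

With the test size held fixed at 16 runs, shrinking the standard deviation raises the power substantially, which is the motivation for the combining-information models discussed next.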
COMBINING INFORMATION TO IMPROVE ESTIMATION
Combining Information by Pooling
It is difficult to draw useful conclusions from data sets with small sample sizes because the signal contained in the data (e.g., the difference in performance between two defense systems) is fixed, while the variability of the signal estimate is relatively high for small data sets (but decreases as the sample size increases). To address this difficulty, much ingenuity has been applied to developing methods for borrowing strength from several small samples by combining or pooling them. The methods include pooling K samples (where K is some number larger than one), pooling K samples with different means and common variances, pooling using linear or quadratic regression, and various generalizations of pooling with regression, including various nonparametric fitting algorithms and hierarchical and random effects models.
Before discussing some of these methods, we first point out that even viewing a collection of numbers as a simple random sample represents a form of combining information. The random sample model, viewing a collection of data as coming from a common distribution, is so commonly applied that it is usually not considered as relying on any assumptions, but this is not the case. The consideration of a data sample (say, a group of times to first failure) as generated from a common distribution represents a form of combining information, in that individual data values are grouped into one collective, and this combining requires justification, which could include consideration of whether the data were obtained through the use of sufficiently similar processes. In addition, it would be necessary to argue that the individual data values were independently generated (or at least exchangeable). Through the empirical distribution function, such a sample provides a much better description of the underlying distribution and associated features (including the mean of the underlying distribution) than any one of the numbers by itself would be able to provide.
"Pooling samples" is most often understood to mean that one has two
or more samples (in this discussion referred to as having Ksamples), typi
cally of small sizes, and there are reasons to believe that these samples come
from populations having the same distribution function. For example, one
might have collected times to first failure for several systems in develop
mental testing and for an additional, smaller number of systems in opera
tional testing. If all samples are pooled into one large sample regardless of
where they came from, the required assumption is that the origin of each
sample has no impact on the distribution of sample values. Diagnostic
checks should be run to show that the samples do not contradict this un
derlying assumption. Unfortunately, when diagnostic checks are based on
small samples, they tend to be somewhat forgiving; i.e., even moderate
differences in the sampled populations are not easily discernible. From a
pragmatic point of view, these moderate differences in the generating dis
tributions often do not matter, but this inability to discriminate needs to be
analyzed and if necessary addressed through the use of nonparametric tech
niques.
Diagnostic checks can include many possibilities, ranging from informal graphical box plots or pairwise quantile-quantile plots to formal parametric or nonparametric hypothesis tests. In an example of the parametric approach, we assume that the individual samples come from normal populations, so that the decision to pool depends only on whether the sample means and variances are homogeneous. This could be tested using the classical F-test for homogeneity of means and Bartlett's test for the homogeneity of variances. The assumption of normality, in addition to the assumption of the homogeneity of the first two moments, requires a check of the normality of the individual samples. In small samples such a check would reveal only gross violations.
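Using standard library routines, this parametric screening might look as follows. The data below are simulated stand-ins for small failure-time samples (here deliberately drawn from one common population), and the 0.05 cutoffs are conventional choices rather than recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three small "times to first failure" samples, assumed roughly normal
# after transformation; all are drawn here from the same population.
samples = [rng.normal(loc=100.0, scale=15.0, size=n) for n in (6, 8, 7)]

# Homogeneity of variances (Bartlett) and of means (one-way ANOVA F-test):
_, p_var = stats.bartlett(*samples)
_, p_mean = stats.f_oneway(*samples)

pool = (p_var > 0.05) and (p_mean > 0.05)  # an informal decision rule
print(f"Bartlett p = {p_var:.2f}, F-test p = {p_mean:.2f}, pool = {pool}")
```

As the text cautions, with samples this small such checks will only catch gross violations, so a failure to reject is weak evidence that pooling is safe.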
Nonparametric tests for the homogeneity of multiple samples avoid the assumptions of normality or of other specific distributions. Examples of such tests include the Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests as generalized to multiple samples by computing appropriate discrepancy measures that compare the empirical distribution functions of the individual samples with that of the pooled sample (see Kiefer, 1959, and Scholz and Stephens, 1987, for details). Such tests are rank tests and are sensitive to a wide range of differences in the individual empirical distribution functions, in contrast to the analysis of variance F-test for equality of means (assuming common variances and normality) and the Kruskal-Wallis rank test, which are sensitive to differences in means but can be quite weak otherwise.
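A sketch with SciPy's k-sample Anderson-Darling implementation illustrates the point: the two simulated samples below have similar means and variances but very different shapes, the kind of difference a mean-focused test can miss. The data are invented for illustration.

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(7)

# Similar mean and spread, different shapes: a rank test that compares
# whole empirical distribution functions can still separate these.
unimodal = rng.normal(loc=0.0, scale=1.25, size=40)
bimodal = np.concatenate([rng.normal(-1.2, 0.3, 20),
                          rng.normal(1.2, 0.3, 20)])

res = anderson_ksamp([unimodal, bimodal])
print(res.statistic, res.significance_level)
```

Note that SciPy caps the reported significance level at the interval [0.001, 0.25], since it is interpolated from tabulated critical values.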
In pooling K normal samples (often transformations can be used to produce data that more closely approximate a normal distribution) that have shown strong evidence of having different means, the Bartlett test can be used to check whether the samples share a common variance. If there is good evidence of homogeneous variances, one can pool them to obtain a much more accurate assessment of the common variance. This in turn has beneficial consequences for confidence intervals for the means, which, if based on the pooled variance estimate, would be narrower on average. The benefit can be substantial when the sizes of the individual samples are small.
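The narrowing of confidence intervals from a pooled variance can be sketched directly. In the toy example below, three five-observation samples share the same spread; pooling raises the error degrees of freedom from 4 to 12, which shrinks the t multiplier and hence the interval half-width. The data and helper names are illustrative.

```python
import numpy as np
from scipy import stats

def pooled_ci(samples, which=0, conf=0.95):
    """t interval for the mean of samples[which], using the variance
    pooled across all K samples (which assumes a common variance)."""
    k = len(samples)
    ns = [len(s) for s in samples]
    df = sum(ns) - k
    s2 = sum((n - 1) * np.var(s, ddof=1) for n, s in zip(ns, samples)) / df
    m = np.mean(samples[which])
    half = stats.t.ppf(0.5 + conf / 2, df) * np.sqrt(s2 / ns[which])
    return m - half, m + half

def single_ci(sample, conf=0.95):
    """Ordinary one-sample t interval using only that sample."""
    n = len(sample)
    half = stats.t.ppf(0.5 + conf / 2, n - 1) * np.sqrt(np.var(sample, ddof=1) / n)
    return np.mean(sample) - half, np.mean(sample) + half

# Three tiny samples with different means but identical spread:
samples = [np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + shift for shift in (0, 10, 20)]
lo_p, hi_p = pooled_ci(samples, which=0)
lo_s, hi_s = single_ci(samples[0])
print(hi_p - lo_p, hi_s - lo_s)  # the pooled interval is narrower
```

In a single real data set the pooled estimate of variance can of course come out larger than one sample's own estimate; the narrowing claimed in the text is an on-average property.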
Sometimes the means of underlying samples vary according to functions of covariates that were observed in conjunction with each sample value. For example, the failure rate of a system might be a simple function of some measure of stress to which the systems have been exposed. Absent a model linking the various samples, one could view the sample values with common covariate values as a collection of single samples and proceed accordingly. Of course, the sample sizes at individual covariate values are likely to be extremely small. However, when a useful model can be identified, a stronger form of pooling, using multiple regression, can be exploited if one can closely approximate the means of the response of interest as a linear function of the known covariates. (The assessment of the validity of regression models has been well studied; see, e.g., Belsley et al., 1980. In particular, the residuals are useful for assessing conformity with assumptions of linearity, homogeneous variances, and the existence of outliers.) Such a model would be determined by a small number of parameters, which can be estimated using all sample values simultaneously (by the method of least squares, for example). The influence of all sample values is therefore pooled, i.e., used jointly, in estimating these few parameters. The accuracy of such estimates of the conditional means provided by the fitted values from the regression model is much greater than that afforded by just using the mean
of all sample values for data collected at the covariates of interest, if they were even available. The pooling here therefore has the additional benefit of providing estimates for covariates for which no sample values were available.
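A minimal sketch of pooling through regression, with invented data: twelve observations at four stress levels jointly estimate just two parameters by least squares, and the fitted line then predicts at a stress level where no data were collected.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: failure rate rises linearly with a stress covariate.
stress = np.repeat([1.0, 2.0, 3.0, 4.0], 3)  # three tiny samples per level
rate = 0.5 + 0.3 * stress + rng.normal(0, 0.05, stress.size)

# Pooled fit: all 12 observations jointly estimate the two parameters.
X = np.column_stack([np.ones_like(stress), stress])
(beta0, beta1), *_ = np.linalg.lstsq(X, rate, rcond=None)

# The fitted line also predicts at stress levels never tested, e.g. 2.5:
pred = beta0 + beta1 * 2.5
print(beta0, beta1, pred)
```

Contrast this with treating each stress level as its own three-observation sample: the per-level means would be noisier, and the untested level 2.5 would have no estimate at all.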
In addition to this pooled (structural) model for estimating the mean function, there is the option of assuming constant variances of the sample values across all covariates. This extension of the pooling idea estimates a pooled variance from all the residuals and thus increases the degrees of freedom in the pooled variance estimate, in turn improving the accuracy assessment of the mean estimates as it is reflected in the confidence intervals. This pooling, as usual, depends on the validity of the various assumptions, and diagnostic checks, including residual analyses, need to be made before building on them.
Pooling using regression is a special case of a more general approach, including generalized linear models and various nonparametric fitting techniques, which can be applied to normal, count, and other forms of data. Although many textbooks on regression do not emphasize the interpretation of regression as pooling, the pooling perspective provides a strong underlying theme in discussions of regression. The pooling occurs through the use of structural models that are characterized by a few unknown parameters and that allow analysis, using covariates, of pooled data collected under various conditions. All the data simultaneously influence the model fit, and as a result more accurate estimates of the conditional means can be obtained.
Bayesian Inference with Binary Data
Dichotomous measures are relatively typical in defense testing. Success or failure of an offensive system is, for example, generally measured using assessments of the number of hits in a given number of trials. (We do not address here the point that the measure of distance from a target often may have advantages over the dichotomous measure.) Use of a Bayesian approach for dichotomous measures can be illustrated as follows: An operational test of a defense system includes 20 trials with dichotomous (success/failure) outcomes, with interest in estimating the probability of failure, p. The probability of failure has been presumed to be small, so that the number of failures in 20 trials is not likely to be large. For example, if the number of failures were k = 2, the maximum likelihood estimate of p would be 0.10, but the associated standard error would be around 0.07, leading to
a very weak inferential conclusion. The option of running more test trials is assumed to be impossible due to logistical or budgetary constraints (e.g., the system is being tested under a number of scenarios, and therefore the number of replications for a given scenario is limited; or the system is sufficiently costly that testing until there were a large number of failures would be wasteful). In such a situation it might be useful and appropriate to include other information in the analysis of operational test results.
The previous discussion of pooling identifies several ways in which other information might be incorporated, if there are previous trials of a sufficiently similar system or if a statistical model (perhaps regression) could be used to render trials of other systems comparable. The current example assumes that pooling is not possible and instead considers the possibility of combining expert opinion with the results of the field trial. The example also assumes that a check with system experts suggests a consensus assessment that p is approximately 0.05, with reasonable confidence that p is no higher than 0.25 (see below for a discussion of methods that should be used to obtain such assessments).
A statistical approach for combining prior information with the test results is possible if the prior information is expressed in the form of a prior probability distribution for the unknown p. In the current example, the expert opinion (mean of .05, high percentile of .25) is consistent with a Beta(2,38) distribution, which has mean .05 and almost all of its probability concentrated between 0 and .25. The prior distribution is presented as the continuous curve in Figure 2-1.
Given this prior distribution and a statistical model for the data, Bayes' Theorem produces the posterior distribution that represents the subjective probabilities for different values of p based on both the observed data and the prior information.² In this case it is natural to assume for the statistical model that the observed number of failures y is distributed as a binomial random variable with 20 trials, each having failure probability p. The resulting posterior distribution can provide an estimate of p and a probabilistic upper bound, or any other summary of uncertainty about p, based on the data and prior information.

²The posterior distribution is subjective even though it can be represented as a mixture of empirical frequencies.
FIGURE 2-1 Prior distribution and posterior distribution (given k = 0).
To illustrate the approach, Table 2-2 presents, for several possible outcomes of the operational test, the conclusions one might draw by combining information. One posterior distribution is shown as the dotted line in Figure 2-1, corresponding to the case where one observes k = 0 failures in 20 trials. The table gives a point estimate for the median and the 95th percentile of the posterior distribution. For purposes of comparison, the table also shows the uncombined maximum likelihood point estimate for p and upper confidence limits based on the binomial model and operational test data alone.
The results illustrate the benefits of combining information. Particularly if the number of failures is small, as expected, then combining information yields sharper conclusions regarding the upper limit for the failure probability p, especially the 95 percent upper limit. In the special case where no failures are observed, the Bayesian approach yields a much more sensible point estimate as well, because an estimate of p = 0 is not reasonable in this
context.³ If the observed data are not consistent with the prior information, then the conclusions regarding p will be intermediate between the two information sources.
In the current example, when there are 10 failures in 20 trials, the results from combining information suggest much lower values of p than the observed data. These results reflect the relatively strong influence of expert opinion (the experts were nearly certain that the failure probability was below .25) and emphasize both the importance of considering the sensitivity of conclusions to a range of plausible interpretations of the prior information and the danger of using prior information that is not well founded. In this situation, the prior information seems to have been inappropriate, and the process by which it was generated should be examined.
This short example demonstrates a way to quantify and combine expert opinion with observed data in a relatively simple setting. Evaluations of complex systems would require combination of data from a number of subsystems using a similar approach, as discussed below.
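For readers who wish to reproduce the calculation, the conjugate beta-binomial update behind this example can be sketched as follows: under a Beta(a, b) prior, observing k failures in n trials gives a Beta(a + k, b + n - k) posterior. The printed summaries are the quantities tabulated in Table 2-2, up to rounding.

```python
from scipy.stats import beta

# Expert opinion encoded as a Beta(2, 38) prior: mean 0.05, with almost
# all of its probability below 0.25, as described in the text.
a0, b0 = 2, 38
n = 20  # operational test trials

for k in (0, 2, 10):  # possible numbers of observed failures
    posterior = beta(a0 + k, b0 + n - k)  # conjugate beta-binomial update
    mle = k / n
    print(k, round(mle, 3),
          round(posterior.median(), 3), round(posterior.ppf(0.95), 3))
```

The k = 0 case gives a small but nonzero posterior median, and the k = 10 case shows the posterior sitting well below the maximum likelihood estimate of 0.5, reflecting the strong pull of the prior discussed above.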
Combining Information for Assessing Reliability:
Sensitivity Analysis Versus Probabilistic Treatment of Uncertainty in
Estimating the Reliability of a Bearing Cage
In this section, we discuss different methods for combining information in estimating the reliability of a bearing cage. Abernethy et al. (1983) present field data on a bearing cage, a component in a jet engine. A population of 1,703 units had been introduced into service over time, and there had been 6 failures. The reliability goal for the bearing cage was fewer than 10 percent failing in 8,000 hours of service (in engineering notation, this means that B10 life, the time at which 10 percent fail, is greater than 8,000 hours). For display purposes, units surviving for various lengths of time were grouped into intervals of 100 hours' length. Figure 2-2 is an event plot showing the structure of the available multiply censored data, in which failures are indicated by a row ending in an asterisk (*). Figure 2-2 shows, in row 1, that 288 units were in service for about 100 hours and none
³In many applications, the upper confidence bound on failure probability is more important, and in this situation it would be relatively well estimated without the use of prior information.
FIGURE 2-2 Event plot showing the multiply censored bearing cage failure data. SOURCE: Abernethy et al. (1983).
experienced a failure. Proceeding to row 2, there were 148 units in service
for about 200 hours and none experienced a failure. In row 3, there was a
failure at around 300 hours, indicated by the asterisk. In row 4, there were
125 units in service for around 300 hours.
Figure 2-3 presents a Weibull probability plot of the same bearing cage data, showing the maximum likelihood estimate of the fraction failing, the reliability target, and approximate confidence limits. The plotted points are based on nonparametric estimates (i.e., estimates computed without making any assumption about the underlying failure-time distribution) of the failure rate at each point in time. The points fall along a roughly straight line, indicating that the Weibull distribution provides a reasonable description of the failure process. The straight line through the points is the Weibull maximum likelihood estimate of the fraction failing as a function of hours in service, assuming the Weibull model is correct. The pointwise approximate 95 percent confidence limits indicate the large amount of statistical uncertainty in the estimate, owing to the small amount of information from the few failures that were observed and the extrapolation in time.
FIGURE 2-3 Weibull probability plot of the bearing cage failure data (maximum likelihood estimates η = 11,792 hours, β = 2.035) showing the Weibull maximum likelihood estimate of the fraction failing, the reliability target, and approximate pointwise 95 percent confidence limits. In the figure, the dots represent the bearing cage observed data, straight line (a) represents the maximum likelihood estimate of fraction failing, intersection (b) represents the reliability target, and curved lines (c) and (d) represent the 95 percent upper and lower pointwise confidence limits. The point where the horizontal and vertical lines meet is the reliability target.
Since the maximum likelihood estimate of B10 life is 3,900 hours, and an approximate 95 percent confidence interval for B10 is between 2,100 and 22,100 hours, there was a concern that the B10 design life specification of 8,000 hours was not being met. On the other hand, because of the limited information in the data, it might be argued from the upper bound of the confidence interval that B10 could be as large as 22,100 hours.
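The B10 point estimate can be checked from the Weibull quantile formula t_p = eta * (-ln(1 - p))^(1/beta), using the parameter estimates printed on the probability plot; the helper name below is ours, not from the report.

```python
import math

def weibull_quantile(p, eta, shape):
    """Time by which a fraction p of units fail (the "B-life") for a
    Weibull distribution with scale eta and shape parameter beta."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / shape)

# Parameter estimates shown on the probability plot: eta = 11,792 hours,
# beta = 2.035.  B10 is the 0.10 quantile of the fitted distribution.
b10 = weibull_quantile(0.10, 11792.0, 2.035)
print(round(b10))  # close to the 3,900-hour estimate quoted in the text
```

The same helper evaluated at other values of the shape parameter (as in the sensitivity analysis discussed below) shows how strongly B10 depends on beta.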
Figure 2-4 is a contour plot of the Weibull relative likelihood function (a function that is proportional to the probability of the data) as a function of B10 and the Weibull shape parameter β. The maximum likelihood estimator is shown at the intersection of the horizontal and vertical lines. The probability of the data at the maximum likelihood estimate is, for example, 5 times higher than at points on the .2 contour. This function shows clearly why the upper endpoint of the B10 confidence bound is so large: small uncertainties in β are associated with a wide variety of values of B10.
IMPROVED OPERATIONAL TESTING AND EVALUATION

[Figure 2-4 (page 32): contour plot. Vertical axis, Weibull shape parameter β (1.0 to 4.0); horizontal axis, B10 Hours (0 to 20,000); outermost plotted contour at relative likelihood 0.1.]
FIGURE 2-4 Weibull distribution relative likelihood for the bearing cage failure data.
SOURCE: Abernethy et al. (1983).
Abernethy et al. (1983) show that using historical or other information to fix the value of the Weibull shape parameter β reduces by a large factor the amount of statistical uncertainty in estimating B-lives (quantiles) outside the range of the data. Nelson (1985) also suggests using given values for the Weibull shape parameter β when there are few failures in censored life data, but strongly encourages using sensitivity analysis to assess the effect of the uncertainty in the Weibull shape parameter, because the value is never in practice known with certainty. The range of evaluation can be determined from past experience with the same failure mode in similar materials or components. A fatigue failure mechanism, because of its wearout-type behavior, would have a shape parameter greater than 1, and previous experience might suggest, for example, that β should be in the range of 1.5 to 3. Appendix A contains probability plots that are similar to Figure 2-3, but with the Weibull shape parameter β fixed at 1.5, 2, and 3.
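The appendix plots themselves are not reproduced here, but the mechanics of such a fixed-shape sensitivity analysis are easy to sketch. With the shape β held fixed, the maximum likelihood estimate of the Weibull scale η for censored life data has a well-known closed form: η̂ = (Σ tᵢ^β / r)^(1/β), where the sum runs over all units (failed and censored) and r is the number of failures. The sketch below applies this to synthetic stand-in data, since the actual bearing cage failure times are not given in this chapter:

```python
import math
import random

def eta_mle_fixed_beta(times, failed, beta):
    """Closed-form MLE of the Weibull scale eta when the shape beta is
    fixed.  `times` holds hours in service for every unit (failure time
    or censoring time); `failed` flags which units actually failed.
    All units contribute to the sum; only failures count toward r."""
    r = sum(failed)
    return (sum(t ** beta for t in times) / r) ** (1.0 / beta)

def b10(eta, beta):
    """0.10 quantile (B10 life) of the fitted Weibull distribution."""
    return eta * (-math.log(0.9)) ** (1.0 / beta)

# Synthetic stand-in for the bearing cage data: six failures among
# 100 units, the rest censored while still running (illustrative only).
random.seed(1)
times = [random.uniform(200.0, 2000.0) for _ in range(100)]
failed = [False] * 100
for i in (5, 17, 42, 63, 88, 91):
    failed[i] = True

# Sensitivity analysis: repeat the fit for several plausible shapes.
for beta in (1.5, 2.0, 3.0):
    eta = eta_mle_fixed_beta(times, failed, beta)
    print(f"beta={beta}: eta={eta:,.0f}, B10={b10(eta, beta):,.0f}")
```

Comparing the resulting B10 values against the 8,000-hour specification across the plausible range of β is exactly the kind of check the appendix figures perform graphically.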
The overall conclusion suggested by these figures is that the bearing
cage is, most likely, not meeting its reliability goal.
An alternative to the sensitivity analysis procedure is to use a prior distribution to describe engineering knowledge of the Weibull parameters. (For details, see chapter 14 of Meeker and Escobar, 1998, who use the simple graphical and simulation-based approach for Bayesian analysis suggested in Smith and Gelfand, 1992.) This alternative can be illustrated by the following situation. In this example the engineers responsible for the reliability of the bearing cage have useful prior information on the Weibull shape parameter, which they quantify with a lognormal distribution with lower and upper 99 percent limits (1.5, 3). For the B10 parameter itself there is little prior information, so a diffuse prior distribution is used by specifying a log-uniform distribution with lower and upper limits (500, 20,000). The Bayes' rule computation of the posterior distribution involves multiplying the sampling density of the data and the prior, and the computation can be considered a combination of the contours of the prior and the sample points. All inferences are based on samples generated from the posterior distribution, such as the posterior median. Figure 2-5 is a plot of the marginal posterior distribution of B10.
[Figure 2-5: marginal posterior density curve; horizontal axis, B10 (0 to 25,000).]
FIGURE 2-5 Weibull marginal posterior distribution for B10 of bearing cage life and 95 percent credibility intervals.
Contrasting Figure 2-5 with Figure 2-4 shows that combining the information that β > 1.5 with the data allows a much more precise assessment of B10. If the prior information is reliable, the impact on the inference can be substantial and important to exploit.
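The simulation-based Bayesian approach cited above (Smith and Gelfand, 1992) can be read as sampling-importance-resampling: draw parameter values from the prior, weight each draw by the likelihood of the data, and resample in proportion to those weights. The sketch below uses the priors described in the text (lognormal on β with 99 percent limits 1.5 and 3; log-uniform on B10 over 500 to 20,000) but synthetic stand-in failure data, since the original observations are not reproduced here:

```python
import math
import random

random.seed(0)

Z99 = 2.326  # standard normal 99th percentile

# Lognormal prior on the Weibull shape beta with 99% limits (1.5, 3):
# solve mu +/- Z99*sigma = log(1.5), log(3).
MU = (math.log(1.5) + math.log(3.0)) / 2.0
SIGMA = (math.log(3.0) - math.log(1.5)) / (2.0 * Z99)

def draw_prior():
    beta = math.exp(random.gauss(MU, SIGMA))
    # Diffuse log-uniform prior on B10 over (500, 20,000).
    b10 = math.exp(random.uniform(math.log(500.0), math.log(20000.0)))
    eta = b10 / (-math.log(0.9)) ** (1.0 / beta)  # convert B10 to scale
    return b10, beta, eta

def loglik(eta, beta, times, failed):
    """Weibull log-likelihood for right-censored life data."""
    ll = 0.0
    for t, f in zip(times, failed):
        z = (t / eta) ** beta
        if f:   # observed failure: log density
            ll += math.log(beta / eta) + (beta - 1.0) * math.log(t / eta) - z
        else:   # still running: log survival probability
            ll += -z
    return ll

# Synthetic stand-in data: 6 failures among 100 units (illustrative).
times = [random.uniform(200.0, 2000.0) for _ in range(100)]
failed = [i % 17 == 0 for i in range(100)]

draws = [draw_prior() for _ in range(5000)]
logw = [loglik(eta, beta, times, failed) for _, beta, eta in draws]
m = max(logw)
weights = [math.exp(lw - m) for lw in logw]

# Importance resampling yields approximate posterior draws of B10.
posterior = random.choices(draws, weights=weights, k=2000)
b10_draws = sorted(d[0] for d in posterior)
lo, med, hi = (b10_draws[int(q * len(b10_draws))] for q in (0.025, 0.5, 0.975))
print(f"posterior B10 median {med:,.0f}, 95% interval ({lo:,.0f}, {hi:,.0f})")
```

The posterior median and credibility interval printed at the end play the role of the summaries shown in Figure 2-5, with the informative prior on β doing the work of narrowing the interval.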
Combining Data from Multiple Sources
Here we consider an example of a more general situation in which the aim is to estimate the reliability of a motor component as a function of time. The example includes the following two assumptions: (1) the true reliability as a function of time can be represented as a member of a family of cumulative distribution functions indexed by a single parameter θ, which is the mean time to failure for each member of this family of distribution functions; and (2) we have useful information about θ from two experts on this component, three computer simulations, and five sets of data from physical experiments. How might these three disparate sources of information be combined to provide the analyst with both an estimate of θ and estimates of the uncertainty of our estimate?
Expert A believes that θ follows a normal distribution with mean 80.0 and standard deviation 4.0, while expert B believes that it follows a normal distribution but with mean 73.0 and standard deviation 4.0. Three computer simulations have been used to simulate the functioning of the motor component. The first simulation shows that estimates of θ are centered at 78.0 with standard deviation of 6.3, the second shows estimates of θ centered at 69.0 with standard deviation of 10.8, and the third shows estimates of θ centered at 67.0 with standard deviation of 6.5. Five types of developmental testing have been carried out on five sets of motors. For each set of motors, the means and standard deviations of the failure times were observed as follows:
        Mean    Standard Deviation
Test 1  87.0    5.0
Test 2  83.0    3.5
Test 3  67.0    3.0
Test 4  77.0    4.0
Test 5  70.0    5.0
Classically, these various sources of information would be joined using a linear combination of the separate estimates weighted inversely proportionally to their variances (i.e., the squares of their standard deviations). (There is a further complication if the estimates are not independent.) In this approach, the computer simulations would be considered subject to between-simulation variance, as well as the within-simulation variance indicated above, which would be estimated and then added to each simulation's sampling variance in calculating the optimal linear combination.
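The inverse-variance weighting just described can be illustrated with the five developmental tests from the table above. For simplicity, this sketch treats the reported standard deviations as the standard errors of the five estimates and ignores the correlation and between-source variance complications noted in the text:

```python
def inverse_variance_combine(estimates):
    """Combine (mean, sd) pairs with weights proportional to 1/variance.
    Assumes the estimates are independent, which the text notes
    may not hold in practice."""
    weights = [1.0 / sd ** 2 for _, sd in estimates]
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, estimates)) / total
    sd = (1.0 / total) ** 0.5  # standard deviation of the combined estimate
    return mean, sd

# Means and standard deviations of the five developmental tests.
tests = [(87.0, 5.0), (83.0, 3.5), (67.0, 3.0), (77.0, 4.0), (70.0, 5.0)]
mean, sd = inverse_variance_combine(tests)
print(f"combined estimate {mean:.1f} +/- {sd:.1f}")
```

Under the (unrealistic) independence assumption this yields roughly 75.5 with a combined standard deviation under 2; the correlated analysis described later in the text arrives at a somewhat different center.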
An alternative way of combining this information is through the use of Bayesian prediction. Using Bayes' theorem, prior probabilities are updated to posterior probabilities through use of the likelihood function, as in the above example on dichotomous outcomes, where the likelihood was modeled using the binomial distribution. The prior is determined using the three simulations and the two experts, and the likelihood is based on the results from the five experiments. To determine the prior, as in the classical framework, one could use a linear combination of the five subjective information sources. One might give each expert and simulation a weight that varies inversely with its supplied variance, though a number of other approaches are also possible. One might also downweight estimates based on their distance from the estimated center of the five estimates (this could be iterated until convergence).
To build the likelihood from the experiments, we assume that the failure times have mean θ and standard deviations that we will estimate using the data (though combining-information approaches to determine the standard deviations could also be used if there were relevant prior information). Using the assumption (based on expert judgment from previous experiments) that the estimates for θ from the five experiments have nonzero correlations ranging from 0.19 to 0.90, the five experiments support the model that individual failure times are normally distributed with mean 78.4 and standard deviation of 1.9.

The prior and the likelihood, using Bayes' theorem, can then be used to produce the posterior distribution, which would now reflect the information from the experts, the simulations, and the developmental test results.
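Because every source here is summarized by a normal mean and standard deviation, the final step reduces to the standard normal-normal conjugate update: precisions (inverse variances) add, and the posterior mean is the precision-weighted average of the prior mean and the likelihood center. The sketch below builds an illustrative prior by pooling the two experts and three simulations by inverse variance (just one of the weighting options the text mentions, omitting the between-simulation variance adjustment) and combines it with the likelihood summary of 78.4 with standard deviation 1.9:

```python
def precision_combine(sources):
    """Inverse-variance combination of (mean, sd) pairs; also performs
    the normal-normal conjugate update when given [prior, likelihood]."""
    prec = [1.0 / sd ** 2 for _, sd in sources]
    mean = sum(p * m for p, (m, _) in zip(prec, sources)) / sum(prec)
    return mean, (1.0 / sum(prec)) ** 0.5

# Illustrative prior: pool the two experts and the three simulations.
prior = precision_combine([(80.0, 4.0), (73.0, 4.0),
                           (78.0, 6.3), (69.0, 10.8), (67.0, 6.5)])

# Likelihood summary from the five experiments, as reported in the text.
like = (78.4, 1.9)

# Conjugate update: the posterior is again normal.
post_mean, post_sd = precision_combine([prior, like])
print(f"prior {prior[0]:.1f} +/- {prior[1]:.1f}; "
      f"posterior {post_mean:.1f} +/- {post_sd:.1f}")
```

As expected, the posterior mean lands between the prior center and the likelihood center, and the posterior standard deviation is smaller than either input's.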
A number of assumptions were made to arrive at the final result, and at each stage sensitivity analyses should be used to assess the impact of divergences from these assumptions. For example, the assumption of normality is unlikely to be satisfied for failure times, but this discrepancy can be addressed by a number of modifications to the above procedure, such as transforming the data to enhance the fit to normality. Any assumptions not supported by the data, and to which the final estimates were determined to be overly sensitive, should be investigated.
A Treatment of Separate Failure Modes
Information from developmental testing can be used to make operational test evaluation more efficient when there are separate failure modes with varying failure characteristics. ATEC combines information from engineering judgment, analysis of data from developmental and other tests, training exercises, modeling and simulation, knowledge of redesign activities that occur after developmental testing, and other sources to create an operational test that will expose these failure modes. Through analysis of this information, situations can be identified, with associated prior probabilities, that indicate which of the failure modes in the developmental test remain active in the operational test. (In a less simplistic situation, one would, of course, be concerned with failure modes appearing in operational testing that did not appear in the developmental test.) The operational test data can then be used to update the estimated probabilities of these situations. This method is particularly helpful when trying to assess the properties of a large number of failure modes that either are statistically dependent or have, individually, low failure rates.

In principle, the computation is straightforward. In practice, however, a considerable level of expertise is needed to develop suitable prior information and combine it appropriately with experimental data. The following simplistic example demonstrates one approach.
During developmental testing a vehicle has exhibited two critical failure modes, mode 1 and mode 2. Both involve components with relatively mature designs, so infant mortality is not present. The vehicles have experienced relatively low usage, so wearout is not likely. For these reasons, or perhaps because the failures are due to external stressors exceeding a certain limit, it is assumed that each mode exhibits exponentially distributed times to failure. However, the failure rates, λ1 and λ2, are not known. Therefore we need to make statistically supportable statements about three performance measures when the system enters operational testing after modifications based on developmental testing:

· λi = vehicle (total) failure rate per mile due to mode i in operational use,
· MDTFi = mean distance to failure due to mode i, and
· Rel(m) = m-mile reliability = probability that a vehicle will survive m miles without failure.
Using engineering judgment and the results of developmental testing, it is assumed that we are able to consider four possible different situations:

· S0 = no failure modes remain after developmental testing is concluded,
· S1 = only failure mode 1 remains after developmental testing,
· S2 = only failure mode 2 remains after developmental testing, and
· S3 = both failure modes 1 and 2 remain after developmental testing.

We assume that we are comfortable in assessing a priori probabilities p0, p1, p2, and p3, respectively, for these situations, and our uncertainty about the failure modes and the associated MDTF can be expressed by assessing the expected value E(MDTFi) and standard deviation STDEV(MDTFi) for each mode.
Note that under this framework, the mean distance to failure is random, since it is unknown. We can update a prior distribution about the mean distance to failure, using operational test data, to arrive at a posterior distribution. This posterior distribution will itself have a mean, the expected mean distance to failure, and a standard deviation.

Now suppose that, after an exposure of t total vehicle miles in the operational test, n1 failures of type 1 and n2 failures of type 2 are observed (where n1 and n2 can be 0). Appendix B shows the development and specific equations that allow calculation of the three performance measures, as well as their uncertainty, expressed by their posterior standard deviations.
For example, suppose expert information based on developmental testing and other activities provides us with the following prior values:

E(MDTF1) = 2,500; STDEV(MDTF1) = 2,000;
E(MDTF2) = 3,000; STDEV(MDTF2) = 3,500;

and

E(MDTF0) = 100,000; STDEV(MDTF0) = 0;

where the 100,000-mile (certain) MDTF value reflects a practical assessment of the situation "no failure modes remaining." Using scenario probabilities p0 = .005, p1 = .10, p2 = .15, p3 = .745, Table 2-3 shows various performance measures for three potential values of (n1, n2) failures in t = 20,000 total exposure miles.
TABLE 2-3 Three Potential Values for the Number of Failures of Two Types Observed in 20,000 Miles, and the Resulting Impact on Reliability Estimates and Their Uncertainty

(n1, n2)                       (0,0)     (0,1)     (1,1)
Posterior E(λ) × 1,000         .145      .187      .321
Posterior STDEV(λ) × 1,000     .102      .105      .111
Posterior E(MDTF)              17,073    7,744     3,543
Posterior STDEV(MDTF)          26,360    6,417     1,412
Posterior E(Rel(1,000))        .868      .834      .730
Posterior STDEV(Rel(1,000))    .085      .083      .079
This example reflects the use of weak prior information, in that STDEV(MDTF1) and STDEV(MDTF2) are about as large as their respective mean values. Therefore, the reported performance measures are relatively objective, in that they depend mostly upon the operational test results. It is also possible to compute posterior probabilities for the four situations (see Appendix B) that show the same relative insensitivity to prior assessments.
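The specific equations are left to Appendix B, but for a single active failure mode with exponentially distributed times, one standard construction behind this kind of update is a gamma prior on the failure rate, moment-matched so that MDTF = 1/rate has the assessed mean and standard deviation; the posterior then follows by simply adding observed failures and exposure miles. The sketch below applies this to mode 1 alone; it illustrates the mechanics only and does not attempt to reproduce the scenario-mixture numbers in Table 2-3:

```python
def gamma_prior_from_mdtf(mean, sd):
    """Gamma(shape a, rate b) prior on the failure rate lambda, chosen
    so that MDTF = 1/lambda is inverse-gamma with the assessed mean
    and sd.  Matching moments: cv^2 = 1/(a - 2), so a = 2 + (mean/sd)^2
    (always > 2), and b = mean * (a - 1)."""
    a = 2.0 + (mean / sd) ** 2
    b = mean * (a - 1.0)
    return a, b

def posterior_mdtf(a, b, n_failures, miles):
    """Conjugate update for exponential failure times: the posterior
    rate is Gamma(a + n, b + t), so the posterior MDTF is inverse-gamma
    with mean (b + t)/(a + n - 1) and sd = mean / sqrt(a + n - 2)."""
    a2, b2 = a + n_failures, b + miles
    mean = b2 / (a2 - 1.0)
    sd = mean / (a2 - 2.0) ** 0.5
    return mean, sd

# Mode 1 prior from the text: E(MDTF1) = 2,500, STDEV(MDTF1) = 2,000.
a, b = gamma_prior_from_mdtf(2500.0, 2000.0)

for n in (0, 1, 2):  # failures of mode 1 in 20,000 vehicle miles
    mean, sd = posterior_mdtf(a, b, n, 20000.0)
    print(f"n={n}: posterior E(MDTF1)={mean:,.0f}, STDEV={sd:,.0f}")
```

As in Table 2-3, seeing no failures over 20,000 miles pushes the expected MDTF well above its prior value, while each observed failure pulls it back down and tightens the posterior.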
This general approach can be extended to account for more complex situations, as in the following example. A system has 40 type A vehicles and 30 of type B. A developmental test has been run with miles of operation per vehicle ranging from 1,000 to 15,000 miles, with 10 failure modes discovered at various mileages. An operational test is then run with 24 vehicles, all of type A, with miles of operation now ranging from 500 to 2,000 miles. Four of the original 10 failure modes are observed, occurring at varying mileages but with a higher rate than in the developmental test. In addition, a failure mode is seen that was not present in the developmental test. The operational test is set in three different environments of use, and the developmental test has been conducted exclusively in a fourth environment of use, a test track.
Although this approach to combining information from developmental and operational testing is a tempting means to increase the efficiency of operational test results, a number of potential difficulties remain. To the extent that an analyst must speculate about possible situations that have not been realized, an assessment of their probabilities may be more vulnerable to cognitive biases than the better understood assessment of distributions of more intrinsically engineering or physically based parameters. In addition, the analysis necessary for the more complex combinations of failure modes implied in the more realistic example above will require expertise not necessarily resident at the test agency. Because the methodology is not inherently suitable for encapsulation in manuals or training courses, it would require nonstandard certification for each use.

On the other hand, sensitivity analysis with respect to prior assessments can be readily performed using simple spreadsheet software models. Moreover, inferences made about performance measures are couched in language appropriate for decision making.
In summary, inferences about the number of failure modes that have been fixed prior to OT, the number of new failure modes that OT has introduced, and related problems can be addressed using combining-information techniques. These techniques are strongly dependent on assumptions, and therefore their proper application requires the use of sensitivity analyses to determine dependence on various assumptions.