2. Examples of Combining Information

Prior information is critical in planning and designing efficient operational tests and in the evaluation of system performance when used in combination with information from tests. In this chapter, we first illustrate the importance of exploiting prior knowledge in the test design phase of the operational evaluation process with an example closely related to the Stryker operational test. We then discuss its use more generally in planning the test, selecting the experimental design, and selecting sample sizes for testing. Following this, we explore a variety of techniques in which prior information can be used in combination with test data to provide assessments of system performance.

COMBINING INFORMATION TO IMPROVE TEST DESIGN

In our example, a slightly simplified version of the current operational test plan for Stryker would compare the baseline and Stryker systems across a range of scenarios involving four factors, each at two levels: mission (raid vs. perimeter defense), intensity (medium vs. high), terrain (urban vs. rural), and company pair (A vs. B). A complete factorial design involving all four factors requires testing both the baseline and Stryker systems at 2^4 = 16 combinations, for a total of 32 test cases. While this allows estimation of the main effects and interactions of all orders, it may be infeasible, depending on the availability of resources (number of test replications). Prior information about the nature and direction of the interactions would allow use of fractional factorial designs to reduce the number of test combinations.

Box, Hunter, and Hunter (1978:375) observe that "there tends to be a redundancy in [full factorial designs]: redundancy in terms of an excess number of interactions that can be estimated and sometimes in an excess number of variables [components] that are studied. Fractional factorial designs exploit this redundancy."

In the example presented here, prior knowledge that the third-order interaction mission × intensity × terrain is not likely to be large, and that company pair is not likely to have a strong interaction with any of the other factors, would permit use of a fractional factorial experiment with eight runs (for each system) to test all of the relevant combinations. This would be a 2^(4-1) Resolution IV design in which the factor company pair is aliased¹ with the third-order interaction mission × intensity × terrain. As a consequence, the following sets of two-factor interactions are aliased with each other:

· mission × intensity with terrain × company pair
· mission × terrain with intensity × company pair
· terrain × intensity with mission × company pair

Since prior knowledge suggests that company pair is not likely to interact with any of the factors, the 8-run fractional factorial design presented in Table 2-1 can be used to safely estimate the three two-factor interactions of interest: mission × intensity, mission × terrain, and terrain × intensity. This halves the total number of possible test combinations, saving costs and time during the operational testing phase.

Another way of using prior information to reduce the number of test replications is to use knowledge of where changes in the levels of test factors result in more substantial changes in the response under study (e.g., in the current context, the performance of a defense system). By adapting the factor levels accordingly, one can reduce the number of test replications because the response of interest is (approximately) maximized (assuming the information used is correct).

¹The term "aliased" means that the linked effects are not individually estimable given the reduced set of test events, and so one estimates the sum of their joint effects. Given the assumption that company pair does not interact with the other factors, all but one of the aliased effects are assumed to equal zero, thereby permitting the estimation of the remaining effect.
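To make the construction concrete, here is a minimal sketch (in Python; the -1/+1 factor coding and the run ordering are illustrative choices, not from the report) that generates this half fraction from the defining relation D = ABC, so that company pair is confounded with mission × intensity × terrain. It reproduces Table 2-1 below.

```python
from itertools import product

# Map the -1/+1 coding of each factor to its two levels.
levels = {
    "intensity": {-1: "Medium", +1: "High"},
    "mission":   {-1: "Raid",   +1: "PD"},     # PD = perimeter defense
    "terrain":   {-1: "Rural",  +1: "Urban"},
    "company":   {-1: "A",      +1: "B"},
}

# Full 2^3 factorial in intensity (a), mission (b), terrain (c); company
# pair (d) is set by the generator D = ABC, giving the 2^(4-1) Resolution IV
# half fraction in which d is aliased with the a*b*c interaction.
for run, (a, b, c) in enumerate(product((-1, +1), repeat=3), start=1):
    d = a * b * c
    print(run, levels["intensity"][a], levels["mission"][b],
          levels["terrain"][c], levels["company"][d])
```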

TABLE 2-1  2^(4-1) Resolution IV Fractional Factorial Design

Run   Intensity   Mission   Terrain   Company Pair
1     Medium      Raid      Rural     A
2     Medium      Raid      Urban     B
3     Medium      PD        Rural     B
4     Medium      PD        Urban     A
5     High        Raid      Rural     B
6     High        Raid      Urban     A
7     High        PD        Rural     A
8     High        PD        Urban     B

NOTE: PD represents perimeter defense.

Test Planning

Operational testing and evaluation of military systems involve substantial resources and time, and the decisions to be made have important consequences for national security. Given the high stakes, it is critical that operational testing be planned and executed carefully and systematically and that as much relevant prior information as possible be taken into account in designing efficient test plans. It is difficult, and in some cases impossible, to generate useful information from a poorly designed test plan. Effective test design relies on the crucial prior step of test planning. Within the statistical community, much more attention has been paid to the development of efficient techniques for the design of experiments than to the planning process that precedes it. Hahn (1993) notes:

    Experimental design is both an art and a science. The science deals with the mathematics and formalities of developing experimental plans. This is what most of the literature, including numerous articles in this journal, is about. The art of experimental design provides the framework for an effective test program that is maximally responsive . . . to the questions that the investigators wish to answer. It deals with important but seemingly non-statistical topics such as defining the goals of the [test] program, establishing the proper response and control variables, assuring proper scope and breadth, understanding the various sources of experimental error, appreciating what can and cannot be randomized, and so forth.

Related studies, subject-matter expertise, modeling and simulation, results of developmental testing, and pilot studies all play a major role in this planning process.

Many industrial organizations have recently instituted systematic processes for planning and executing large-scale experiments based on quality management principles such as six sigma. A key component of this process is the use of templates for systematic elicitation and incorporation of prior information. The process involves, for example, developing consensus in identifying key response variables, target values, and ideal functions (i.e., functions that specify the relationship between signals and responses), and documenting subject-matter knowledge and relevant background from past studies. Factors that affect the response variables are similarly identified and classified into control factors and noise variables. Subject-matter expertise or past studies are used to determine the range of values and their predicted impact on the response variables, identify constraints such as costs and the feasibility of varying the factors during experiments, and develop strategies for measuring noise variables or for introducing and systematically varying them in the experiment. Some industrial organizations make use of predesign master guide sheets (see, e.g., Coleman and Montgomery, 1993) that query the test designers to specify the objectives of the test, any relevant background issues, response variables, control variables, factors to be held constant, nuisance factors, strong interactions, any further restrictions on test inputs, design preferences, analysis and presentation techniques, and responsibility for coordination.

Systematic processes and the use of prior knowledge are also needed in selecting the design factors to be studied, their levels, and possible interactions. All of these decisions need to be made before selecting an appropriate experimental design.

Selecting the Experimental Design

There are many approaches to designing experiments. For the applications considered in this report, by far the most useful are factorial and fractional factorial designs (for details, see Box and Hunter, 1961). This class of experimental designs has very good statistical properties, including balance and robustness, in a broad range of situations. Full factorial designs, however, involve testing all possible combinations, which can lead to an excessive number of test scenarios when the number of factors, or levels per factor, is large. For that reason, fractional factorial designs that examine a carefully selected subset of all possible combinations of design factors are much more cost efficient. There is an extensive literature on this topic (Box, Hunter, and Hunter, 1978; Wu and Hamada, 2000).

However, as mentioned above, prior information about which higher-order interactions are sufficiently small must be used when selecting appropriate fractions of the full factorial designs. Sequential follow-up strategies can verify the validity of these assumptions, although they may not be as useful in the operational test context, given the various constraints on the use of military personnel, test ranges, and other resources.

There is also a large literature on so-called optimal designs. In this approach, the assumption is that the response model is known up to some parameters, and the goal is to estimate either the unknown parameters or the response surface at some design point. An illustrative example is the linear model with explanatory variables X1 and X2:

    Y = β0 + β1X1 + β2X2 + ε

The goal of optimal design in this example is to collect various observations of Y at specific design points (X1, X2) that are chosen optimally to maximize either the precision in estimating the regression coefficients (the β's) or the expected response at selected values of X1 and X2, assuming that the linear model is correctly specified. Other optimal designs, corresponding to the maximization or minimization of other criteria of interest, require prior information about the form of the model, such as the above linear model with no interaction term. In the case of a linear model, the optimal design for estimating the regression coefficients requires testing only at the extremes of the design space. While this leads to good precision if the linear model is a close approximation to the truth, the design is very nonrobust to violations of this assumption. This nonrobustness, more generally, is why optimal designs are not used extensively, except in cases where one is very confident about prior information. Related discussions of Bayesian optimal designs examine formal incorporation of prior information about model parameters (Chaloner, 1984).
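The efficiency-versus-robustness trade-off can be seen in a small numerical sketch (Python; the two eight-run designs and the unit error variance are illustrative, not from the report). For a straight-line model, the variance of the estimated slope is σ²/Σ(xi - x̄)², so concentrating half the runs at each end of the design region minimizes that variance, while spreading the runs uniformly pays a variance penalty in exchange for the ability to detect curvature.

```python
import numpy as np

def slope_variance(x, sigma=1.0):
    # Var(beta1_hat) = sigma^2 / sum((x - xbar)^2) in simple linear regression.
    x = np.asarray(x, dtype=float)
    return sigma**2 / np.sum((x - x.mean()) ** 2)

n = 8
extremes = [0.0] * (n // 2) + [1.0] * (n // 2)  # optimal if the line model is true
spread = np.linspace(0.0, 1.0, n)               # less efficient, can reveal curvature

print("extreme design:", slope_variance(extremes))  # 0.50
print("uniform design:", slope_variance(spread))    # ~1.17
```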

Selecting Sample Sizes

Selection of sample sizes depends on the objective of the operational test. Is the objective to estimate system performance for specific types of environments of use, or to estimate the average performance across environments of use? Larger samples are needed for the former objective. If a confirmatory hypothesis test is to be used as a basis for a decision on system promotion, the statistical power of the test against important alternative hypotheses concerning system performance (such as modestly failing to meet a requirement) needs to be computed and related to the costs and benefits of making incorrect decisions regarding promotion. The statistical power will be a function of the significance level of the hypothesis test in question but, more importantly, of the variance of the test statistic (e.g., average failure rate). The variance of the test statistic is not directly measured prior to carrying out the operational test; however, it can often be indirectly estimated through use of developmental test information, pilot studies, or variances estimated for similar systems and adjusted through the use of engineering judgment. Such indirect estimates are valuable in judging, prior to an operational test, whether the test size will be adequate to support significance testing used for this confirmatory purpose. When such an analysis suggests that test sizes sufficient for this purpose are not likely to be feasible given costs, models for combining information should be examined as a method for reducing variances.
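As a minimal sketch of such a power calculation (Python; the 0.10 requirement, the 0.15 alternative, and the trial counts are illustrative stand-ins), suppose a system fails the confirmatory test when the number of failures in n trials is improbably large under the requirement. The power against a system that modestly misses the requirement stays low until n is large, which is exactly the situation in which combining information becomes attractive.

```python
from scipy.stats import binom

def power_of_test(n, p0=0.10, p1=0.15, alpha=0.05):
    # Smallest critical value c with P(K >= c | p0) <= alpha: the system
    # fails the confirmatory test when c or more failures occur in n trials.
    c = next(k for k in range(n + 2) if binom.sf(k - 1, n, p0) <= alpha)
    # Power: the probability of rejecting when the true failure rate is p1.
    return c, binom.sf(c - 1, n, p1)

for n in (20, 50, 100, 300):
    c, power = power_of_test(n)
    print(f"n = {n:3d}   reject at {c:2d} failures   power = {power:.2f}")
```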

COMBINING INFORMATION TO IMPROVE ESTIMATION

Combining Information by Pooling

It is difficult to draw useful conclusions from data sets with small sample sizes because the signal contained in the data (e.g., the difference in performance between two defense systems) is fixed, while the variability of the signal estimate is relatively high for small data sets (but decreases as the sample size increases). To address this difficulty, much ingenuity has been applied to developing methods for borrowing strength from several small samples by combining or pooling them. The methods include pooling K samples (where K is some number larger than one), pooling K samples with different means and common variances, pooling using linear or quadratic regression, and various generalizations of pooling with regression, including various nonparametric fitting algorithms and hierarchical and random effects models.

Before discussing some of these methods, we first point out that even viewing a collection of numbers as a simple random sample represents a form of combining information. The random sample model, viewing a collection of data as coming from a common distribution, is so commonly applied that it is usually not considered as relying on any assumptions, but this is not the case. The consideration of a data sample (say, a group of times to first failure) as generated from a common distribution represents a form of combining information, in that individual data values are grouped into one collective, and this combining requires justification, which could include consideration of whether the data were obtained through the use of sufficiently similar processes. In addition, it would be necessary to argue that the individual data values were independently generated (or at least exchangeable). Through the empirical distribution function, such a sample provides a much better description of the underlying distribution, and of associated features including its mean, than any one of the numbers by itself would be able to provide.

"Pooling samples" is most often understood to mean that one has two or more samples (in this discussion referred to as K samples), typically of small sizes, and there are reasons to believe that these samples come from populations having the same distribution function. For example, one might have collected times to first failure for several systems in developmental testing and for an additional, smaller number of systems in operational testing. If all samples are pooled into one large sample regardless of where they came from, the required assumption is that the origin of each sample has no impact on the distribution of sample values. Diagnostic checks should be run to show that the samples do not contradict this underlying assumption. Unfortunately, when diagnostic checks are based on small samples, they tend to be somewhat forgiving; i.e., even moderate differences in the sampled populations are not easily discernible. From a pragmatic point of view, these moderate differences in the generating distributions often do not matter, but this inability to discriminate needs to be analyzed and, if necessary, addressed through the use of nonparametric techniques.

Diagnostic checks can include many possibilities, ranging from informal graphical box plots or pairwise quantile-quantile plots to formal parametric or nonparametric hypothesis tests. In an example of the parametric approach, we assume that the individual samples come from normal populations, and so the decision to pool depends only on whether the sample means and variances are homogeneous. This could be tested using the classical F-test for homogeneity of means and Bartlett's test for the homogeneity of variances. The assumption of normality, in addition to the assumption of the homogeneity of the first two moments, requires a check of the normality of the individual samples. In small samples such a check would reveal only gross violations.

Nonparametric tests for the homogeneity of multiple samples avoid the assumptions of normality or of other specific distributions.

Examples of such tests include the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests as generalized to multiple samples by computing appropriate discrepancy measures that compare the empirical distribution functions of the individual samples with that of the pooled sample (see Kiefer, 1959, and Scholz and Stephens, 1987, for details). Such tests are rank tests and are sensitive to a wide range of differences in the individual empirical distribution functions, in contrast to the analysis of variance F-test for equality of means (assuming common variances and normality) and the Kruskal-Wallis rank test, which are sensitive to differences in means but can be quite weak otherwise.

In pooling K normal samples (often transformations can be used to produce data that more closely approximate a normal distribution) that have shown strong evidence of having different means, the Bartlett test can be used to check whether the samples share a common variance. If there is good evidence of homogeneous variances, one can pool them to obtain a much more accurate assessment of the common variance. This in turn has beneficial consequences for confidence intervals for the means, which, if based on the pooled variance estimate, would be narrower on average. The benefit can be substantial when the sizes of the individual samples are small.
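A minimal sketch of such diagnostic checks and of the pooled variance estimate (Python, with simulated stand-in data; scipy's anderson_ksamp implements the k-sample Anderson-Darling test of Scholz and Stephens, 1987):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in data: three small samples of times to first failure (hours).
samples = [100 * rng.weibull(2.0, size=8) for _ in range(3)]

# k-sample Anderson-Darling test (Scholz and Stephens, 1987): sensitive to
# a wide range of differences among the empirical distribution functions.
ad = stats.anderson_ksamp(samples)
print("Anderson-Darling statistic:", round(ad.statistic, 2))

# Bartlett's test for homogeneity of variances (assumes normality).
print("Bartlett p-value:", round(stats.bartlett(*samples).pvalue, 3))

# If pooling the variances is judged defensible, the pooled estimate has more
# degrees of freedom than any single sample's, narrowing confidence intervals.
k, n = len(samples), sum(len(s) for s in samples)
pooled_var = sum((len(s) - 1) * np.var(s, ddof=1) for s in samples) / (n - k)
print(f"pooled variance: {pooled_var:.1f} on {n - k} degrees of freedom")
```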

Sometimes the means of underlying samples vary according to functions of covariates that were observed in conjunction with each sample value. For example, the failure rate of a system might be a simple function of some measure of stress to which the systems have been exposed. Absent a model linking the various samples, one could view the sample values with common covariate values as a collection of single samples and proceed accordingly. Of course, the sample sizes at individual covariate values are likely to be extremely small. However, when a useful model can be identified, a stronger form of pooling, using multiple regression, can be exploited if one can closely approximate the means of the response of interest as a linear function of the known covariates. (The assessment of the validity of regression models has been well studied; see, e.g., Belsley et al., 1980. In particular, the residuals are useful to examine to assess conformity with assumptions of linearity and homogeneous variances and to detect outliers.) Such a model would be determined by a small number of parameters, which can be estimated using all sample values simultaneously (by the method of least squares, for example). The influence of all sample values is therefore pooled, i.e., used jointly, in estimating these few parameters. The accuracy of such estimates of the conditional means, provided by the fitted values from the regression model, is much greater than that afforded by just using the mean of all sample values for data collected at the covariates of interest, if they were even available. The pooling here therefore has the additional benefit of providing estimates for covariates for which no sample values were available.

In addition to this pooled (structural) model for estimating the mean function, there is the option of assuming constant variances of the sample values across all covariates. This extension of the pooling idea estimates a pooled variance from all the residuals and thus increases the degrees of freedom of the pooled variance estimate, in turn improving the accuracy assessment of the mean estimates as reflected in the confidence intervals. This pooling, as usual, depends on the validity of the various assumptions, and diagnostic checks, including residual analyses, need to be made before building on them.

Pooling using regression is a special case of a more general approach, including generalized linear models and various nonparametric fitting techniques, which can be applied to normal, count, and other forms of data. Although many textbooks on regression do not emphasize the interpretation of regression as pooling, the pooling perspective provides a strong underlying theme in discussions of regression. The pooling occurs through the use of structural models that are characterized by a few unknown parameters and that allow analysis, using covariates, of pooled data collected under various conditions. All the data simultaneously influence the model fit, and as a result more accurate estimates of the conditional means can be obtained.

Bayesian Inference with Binary Data

Dichotomous measures are relatively typical in defense testing. Success or failure of an offensive system is, for example, generally measured using assessments of the number of hits in a given number of trials. (We do not address here the point that the measure of distance from a target often may have advantages over the dichotomous measure.) Use of a Bayesian approach for dichotomous measures can be illustrated as follows: An operational test of a defense system includes 20 trials with dichotomous (success/failure) outcomes, with interest in estimating the probability of failure, p. The probability of failure has been presumed to be small, so that the number of failures in 20 trials is not likely to be large. For example, if the number of failures were k = 2, the maximum likelihood estimate of p would be 0.10, but the associated standard error would be around 0.07, leading to a very weak inferential conclusion.

The option of running more test trials is assumed to be impossible due to logistical or budgetary constraints (e.g., the system is being tested under a number of scenarios, and therefore the number of replications for a given scenario is limited; or the system is sufficiently costly that testing until there were a large number of failures would be wasteful). In such a situation it might be useful and appropriate to include other information in the analysis of operational test results.

The previous discussion of pooling identifies several ways in which other information might be incorporated, if there are previous trials of a sufficiently similar system or if a statistical model (perhaps regression) could be used to render trials of other systems comparable. The current example assumes that pooling is not possible and instead considers the possibility of combining expert opinion with the results of the field trial. The example also assumes that a check with system experts suggests a consensus assessment that p is approximately 0.05, with reasonable confidence that p is no higher than 0.25 (see below for a discussion of methods that should be used to obtain such assessments).

A statistical approach for combining prior information with the test results is possible if the prior information is expressed in the form of a prior probability distribution for the unknown p. In the current example, the expert opinion (mean of .05, high percentile of .25) is consistent with a Beta(2,38) distribution, which has mean .05 and almost all of its probability concentrated between 0 and .25. The prior distribution is presented as the continuous curve in Figure 2-1.

Given this prior distribution and a statistical model for the data, Bayes' Theorem produces the posterior distribution that represents the subjective probabilities for different values of p based on both the observed data and the prior information.² In this case it is natural to assume for the statistical model that the observed number of failures y is distributed as a binomial random variable with 20 trials, each having failure probability p. The resulting posterior distribution can provide an estimate of p and a probabilistic upper bound, or any other summary of uncertainty about p, based on the data and prior information.

²The posterior distribution is subjective even though it can be represented as a mixture of empirical frequencies.
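A minimal sketch of this conjugate update (Python): with a Beta(2,38) prior and a binomial likelihood, the posterior after observing k failures in n = 20 trials is Beta(2 + k, 38 + 20 - k), and its quantiles give the kinds of summaries reported in Table 2-2. The exact table entries are the panel's; the computations printed here are analogous, not a reproduction.

```python
from scipy import stats

a, b = 2, 38      # Beta prior: mean a/(a+b) = 0.05, mass concentrated below 0.25
n = 20            # operational test trials

for k in (0, 2, 5, 10):                   # hypothetical numbers of failures
    post = stats.beta(a + k, b + n - k)   # conjugate posterior given k failures
    print(f"k={k:2d}  posterior median={post.median():.3f}  "
          f"posterior 95th pct={post.ppf(0.95):.3f}  MLE={k / n:.2f}")
```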

[FIGURE 2-1 Prior distribution and posterior distribution of the failure probability, given k = 0 failures in 20 trials.]

To illustrate the approach, Table 2-2 presents, for several possible outcomes of the operational test, the conclusions one might draw by combining information. One posterior distribution is shown as the dotted line in Figure 2-1, corresponding to the case in which one observes k = 0 failures in 20 trials. The table gives a point estimate for the median and the 95th percentile of the posterior distribution. For purposes of comparison, the table also shows the uncombined maximum likelihood point estimate for p and upper confidence limits based on the binomial model and operational test data alone.

The results illustrate the benefits of combining information. Particularly if the number of failures is small, as expected, combining information yields sharper conclusions regarding the upper limit for the failure probability p, especially the 95 percent upper limit. In the special case where no failures are observed, the Bayesian approach yields a much more sensible point estimate as well, because an estimate of p = 0 is not reasonable in this context.³

³In many applications, the upper confidence bound on failure probability is more important, and in this situation it would be relatively well estimated without the use of prior information.

[TABLE 2-2 Comparison of combined summaries (posterior median and 95th percentile of the failure probability p) with uncombined summaries (maximum likelihood estimate and binomial upper confidence limits) for several possible numbers of failures in 20 trials; table entries illegible in source.]

If the observed data are not consistent with the prior information, then the conclusions regarding p will be intermediate between the two information sources. In the current example, when there are 10 failures in 20 trials, the results from combining information suggest much lower values of p than the observed data. These results reflect the relatively strong influence of expert opinion (the experts were nearly certain that the failure probability was below .25) and emphasize both the importance of considering the sensitivity of conclusions to a range of plausible interpretations of the prior information and the danger of using prior information that is not well founded. In this situation, the prior information seems to have been inappropriate, and the process by which it was generated should be examined.

This short example demonstrates a way to quantify and combine expert opinion with observed data in a relatively simple setting. Evaluations of complex systems would require combination of data from a number of subsystems using a similar approach, as discussed below.

Combining Information for Assessing Reliability: Sensitivity Analysis Versus Probabilistic Treatment of Uncertainty in Estimating the Reliability of a Bearing Cage

In this section, we discuss different methods for combining information in estimating the reliability of a bearing cage. Abernethy et al. (1983) present field data on a bearing cage, a component in a jet engine. A population of 1,703 units had been introduced into service over time, and there had been 6 failures. The reliability goal for the bearing cage was fewer than 10 percent failing in 8,000 hours of service (in engineering notation, that means B10 life, the time at which 10 percent fail, is greater than 8,000 hours). For display purposes, units surviving for various lengths of time were grouped into intervals of 100 hours' length. Figure 2-2 is an event plot showing the structure of the available multiply-censored data, in which failures are indicated by a row ending in an asterisk (*).

[FIGURE 2-2 Event plot showing the multiply-censored bearing cage failure data: 25 rows of service lives from roughly 100 to 2,000 hours, with unit counts per row ranging from 288 down to 2. SOURCE: Abernethy et al. (1983).]

Figure 2-2 shows, in row 1, that 288 units were in service for about 100 hours and none experienced a failure. Proceeding to row 2, there were 148 units in service for about 200 hours and none experienced a failure. In row 3, there was a failure at around 300 hours, indicated by the asterisk. In row 4, there were 125 units in service for around 300 hours.

Figure 2-3 presents a Weibull probability plot of the same bearing cage data, showing the maximum likelihood estimate of the fraction failing, the reliability target, and approximate confidence limits. The plotted points are based on nonparametric estimates (i.e., estimates computed without making any assumption about the underlying failure-time distribution) of the failure rate at each point in time. The points fall along a roughly straight line, indicating that the Weibull distribution provides a reasonable description of the failure process. The straight line through the points is the Weibull maximum likelihood estimate of the fraction failing as a function of hours in service, assuming the Weibull model is correct. The pointwise approximate 95 percent confidence limits indicate the large amount of statistical uncertainty in the estimate, owing to the small amount of information from the few failures that were observed and the extrapolation in time.
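A minimal sketch of the underlying computation (Python): maximum likelihood fitting of a Weibull distribution to multiply right-censored data, with B10 derived from the fitted parameters. The data here are illustrative stand-ins, not the Abernethy et al. (1983) values, so the printed estimates will differ from those quoted in the text.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative multiply-censored data: a few observed failures plus many
# units still in service (right censored) at various hours.
t = np.concatenate([[230., 334., 423., 990., 1009., 1510.],           # failures
                    np.repeat([100., 300., 600., 1000., 1500., 2000.], 50)])
d = np.concatenate([np.ones(6), np.zeros(300)])   # 1 = failure, 0 = censored

def neg_loglik(theta):
    eta, beta = np.exp(theta)          # optimize on the log scale for stability
    z = (t / eta) ** beta
    # Failures contribute log f(t); censored units contribute log S(t) = -z.
    return -(np.sum(d * (np.log(beta / eta) + (beta - 1) * np.log(t / eta)))
             - np.sum(z))

fit = minimize(neg_loglik, x0=np.log([3000.0, 1.0]), method="Nelder-Mead")
eta_hat, beta_hat = np.exp(fit.x)
b10 = eta_hat * (-np.log(0.9)) ** (1 / beta_hat)  # time at which 10 percent fail
print(f"eta = {eta_hat:.0f}, beta = {beta_hat:.2f}, B10 = {b10:.0f} hours")
```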

[FIGURE 2-3 Weibull probability plot of the bearing cage failure data showing the Weibull maximum likelihood estimate of the fraction failing (η̂ = 11,792, β̂ = 2.035), the reliability target, and approximate pointwise 95 percent confidence limits. In the figure, the dots represent the bearing cage observed data, straight line (a) represents the maximum likelihood estimate of the fraction failing, intersection (b) represents the reliability target, and curved lines (c) and (d) represent the 95 percent upper and lower pointwise confidence limits. The point where the horizontal and vertical lines meet is the reliability target.]

Since the maximum likelihood estimate of B10 life is 3,900 hours, and an approximate 95 percent confidence interval for B10 is between 2,100 and 22,100 hours, there was a concern that the B10 design life specification of 8,000 hours was not being met. On the other hand, because of the limited information in the data, it might be argued from the upper bound of the confidence interval that B10 could be as large as 22,100 hours.

Figure 2-4 is a contour plot of the Weibull relative likelihood function (a function that is proportional to the probability of the data) as a function of B10 and the Weibull shape parameter β. The maximum likelihood estimator is shown at the intersection of the horizontal and vertical lines. The probability of the data at the maximum likelihood estimate is, for example, 5 times higher than at points on the .2 contour. This function shows clearly why the upper endpoint of the B10 confidence bound is so large: small uncertainties in β are associated with a wide range of values of B10.

[FIGURE 2-4 Weibull distribution relative likelihood for the bearing cage failure data, plotted as contours over B10 (hours) and the shape parameter β. SOURCE: Abernethy et al. (1983).]

Abernethy et al. (1983) show that using historical or other information to fix the value of the Weibull shape parameter β reduces by a large factor the amount of statistical uncertainty in estimating B-lives (quantiles) outside the range of the data. Nelson (1985) also suggests using given values for the Weibull shape parameter β when there are few failures in censored life data, but strongly encourages using sensitivity analysis to assess the effect of the uncertainty in the Weibull shape parameter, because its value is never known with certainty in practice. The range of evaluation can be determined from past experience with the same failure mode in similar materials or components. A fatigue failure mechanism, because of its wearout-type behavior, would have a shape parameter greater than 1, and previous experience might suggest, for example, that β should be in the range of 1.5 to 3. Appendix A contains probability plots that are similar to Figure 2-3, but with the Weibull shape parameter β fixed at 1.5, 2, and 3.
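A minimal sketch of that sensitivity analysis (Python, reusing the illustrative t and d arrays from the earlier sketch): with β held fixed, the Weibull scale has a closed-form maximum likelihood estimate, so the implied B10 can be tabulated across the plausible range of β.

```python
import numpy as np

def b10_fixed_beta(t, d, beta):
    # With beta fixed, the MLE of the scale is closed form:
    # eta_hat = (sum(t_i^beta) / r)^(1/beta), where r is the number of failures.
    eta = (np.sum(t ** beta) / d.sum()) ** (1.0 / beta)
    return eta * (-np.log(0.9)) ** (1.0 / beta)

for beta in (1.5, 2.0, 3.0):              # range suggested by fatigue experience
    print(f"beta = {beta}:  B10 = {b10_fixed_beta(t, d, beta):.0f} hours")
```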

The overall conclusion suggested by these figures is that the bearing cage is, most likely, not meeting its reliability goal.

An alternative to the sensitivity analysis procedure is to use a prior distribution to describe engineering knowledge of the Weibull parameters. (For details, see chapter 14 of Meeker and Escobar, 1998, who use the simple graphical and simulation-based approach for Bayesian analysis suggested in Smith and Gelfand, 1992.) This alternative can be illustrated by the following situation. In this example the engineers responsible for the reliability of the bearing cage have useful prior information on the Weibull shape parameter, which they quantify with a lognormal distribution with lower and upper 99 percent limits (1.5, 3). For the B10 parameter itself there is little prior information, so a diffuse prior distribution is used by specifying a loguniform distribution with lower and upper limits (500, 20,000). The Bayes rule computation of the posterior distribution involves multiplying the sampling function (the likelihood) and the prior, and all inferences are based on samples generated from the posterior distribution, such as the posterior median. Figure 2-5 is a plot of the marginal posterior distribution of B10.

[FIGURE 2-5 Weibull marginal posterior distribution for B10 of bearing cage life (hours) and 95 percent credibility intervals.]
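A minimal sketch of the Smith and Gelfand (1992) simulation approach (Python): draw parameters from the priors just described, weight each draw by the likelihood of the censored data, and resample in proportion to those weights to obtain approximate posterior draws of B10. The data are again the illustrative t and d arrays from the earlier sketch, so the resulting interval is not the one shown in Figure 2-5.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 10_000

# Priors from the text: log(beta) normal with 1st/99th percentiles at (1.5, 3);
# B10 log-uniform on (500, 20,000).
z99 = 2.3263                                   # standard normal 99th percentile
mu = 0.5 * (np.log(1.5) + np.log(3.0))
sd = (np.log(3.0) - np.log(1.5)) / (2 * z99)
beta = np.exp(rng.normal(mu, sd, N))
b10 = np.exp(rng.uniform(np.log(500.0), np.log(20_000.0), N))
eta = b10 / (-np.log(0.9)) ** (1 / beta)       # scale implied by (B10, beta)

# Log-likelihood of the multiply-censored data for every prior draw.
tf = t[d == 1]                                 # failure times only
loglik = np.sum(np.log(beta / eta)[:, None]
                + (beta[:, None] - 1) * np.log(tf[None, :] / eta[:, None]), axis=1)
loglik -= np.sum((t[None, :] / eta[:, None]) ** beta[:, None], axis=1)

# Sampling/importance resampling: keep draws with probability proportional
# to their likelihood; the kept draws approximate the posterior.
w = np.exp(loglik - loglik.max())
keep = rng.choice(N, size=1_000, p=w / w.sum())
lo, med, hi = np.percentile(b10[keep], [2.5, 50, 97.5])
print(f"posterior median B10 ~ {med:.0f} hours, 95% interval ({lo:.0f}, {hi:.0f})")
```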

Contrasting Figure 2-5 with Figure 2-4 shows that combining the information that β > 1.5 with the data allows a much more precise assessment of B10. If the prior information is reliable, the impact on the inference can be substantial and important to exploit.

Combining Data from Multiple Sources

Here we consider an example of a more general situation in which the aim is to estimate the reliability of a motor component as a function of time. The example includes the following two assumptions: (1) the true reliability as a function of time can be represented as a member of a family of cumulative distribution functions indexed by a single parameter θ, which is the mean time to failure for each member of this family of distribution functions; and (2) we have useful information about θ from two experts on this component, three computer simulations, and five sets of data from physical experiments. How might these three disparate sources of information be combined to provide the analyst with both an estimate of θ and estimates of the uncertainty of that estimate?

Expert A believes that θ follows a normal distribution with mean 80.0 and standard deviation 4.0, while expert B believes that it follows a normal distribution but with mean 73.0 and standard deviation 4.0. Three computer simulations have been used to simulate the functioning of the motor component. The first simulation shows estimates of θ centered at 78.0 with standard deviation 6.3, the second shows estimates of θ centered at 69.0 with standard deviation 10.8, and the third shows estimates of θ centered at 67.0 with standard deviation 6.5. Five types of developmental testing have been carried out on five sets of motors. For each set of motors, the means and standard deviations of the failure times were observed as follows:

          Mean    Standard Deviation
Test 1    87.0    5.0
Test 2    83.0    3.5
Test 3    67.0    3.0
Test 4    77.0    4.0
Test 5    70.0    5.0

Classically, these various sources of information would be joined using a linear combination of the separate estimates weighted inversely proportionally to their variances (i.e., the squares of their standard deviations). (There is a further complication if the estimates are not independent.) In this approach, the computer simulations would be considered subject to between-simulation variance, as well as the within-simulation variance indicated above, which would be estimated and then added to each simulation's sampling variance in calculating the optimal linear combination.

An alternative way of combining this information is through use of Bayesian prediction. Using Bayes' Theorem, prior probabilities are updated to posterior probabilities through use of the likelihood function, as in the above example on dichotomous outcomes, where the likelihood was modeled using the binomial distribution. The prior is determined using the three simulations and the two experts, and the likelihood is based on the results from the five experiments. To determine the prior, as in the classical framework, one could use a linear combination of the five subjective information sources. One might give each expert and simulation a weight that varies inversely with its supplied variance, though a number of other approaches are also possible. One might also downweight estimates based on their distance from the estimated center of the five estimates (this could be iterated until convergence).

To build the likelihood from the experiments, we assume that the failure times have mean θ and standard deviations that we will estimate using the data (though combining-information approaches to determine the standard deviations could also be used if there were relevant prior information). Using the assumption (based on expert judgment from previous experiments) that the estimates for θ from the five experiments have non-zero correlations ranging from 0.19 to 0.90, the five experiments support the model that individual failure times are normally distributed with mean 78.4 and standard deviation 1.9.

The prior and the likelihood, using Bayes' Theorem, can then be used to produce the posterior distribution, which would now reflect the information from the experts, the simulations, and the developmental test results. A number of assumptions were made to arrive at the final result, and at each stage sensitivity analyses should be used to assess the impact of divergences from these assumptions. For example, the assumption of normality is unlikely to be satisfied for failure times, but this discrepancy can be addressed by a number of modifications to the above procedure, such as transforming the data to enhance the fit to normality. Any assumptions not supported by the data, and to which the final estimates are determined to be overly sensitive, should be investigated.
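A minimal sketch of both combinations (Python): it treats the five subjective sources as independent, ignores the between-simulation variance component, and takes the 78.4/1.9 experimental summary reported above as given, so the printed numbers are illustrative rather than a reproduction of the panel's analysis.

```python
import numpy as np

# (estimate, standard deviation) of theta from experts A and B and the
# three computer simulations.
subjective = [(80.0, 4.0), (73.0, 4.0), (78.0, 6.3), (69.0, 10.8), (67.0, 6.5)]

def inverse_variance_combine(sources):
    m = np.array([mean for mean, sd in sources])
    v = np.array([sd ** 2 for mean, sd in sources])
    w = (1 / v) / np.sum(1 / v)             # weights proportional to precision
    return np.sum(w * m), 1 / np.sum(1 / v)

prior_mean, prior_var = inverse_variance_combine(subjective)

# Likelihood summary from the five experiments, as reported in the text.
data_mean, data_var = 78.4, 1.9 ** 2

# Normal-normal conjugate update: precisions add; the posterior mean is a
# precision-weighted average of the prior mean and the data mean.
post_var = 1 / (1 / prior_var + 1 / data_var)
post_mean = post_var * (prior_mean / prior_var + data_mean / data_var)
print(f"prior:     {prior_mean:.1f} (sd {prior_var ** 0.5:.2f})")
print(f"posterior: {post_mean:.1f} (sd {post_var ** 0.5:.2f})")
```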

A Treatment of Separate Failure Modes

Information from developmental testing can be used to make operational test evaluation more efficient when there are separate failure modes with varying failure characteristics. ATEC combines information from engineering judgment, analysis of data from developmental and other tests, training exercises, modeling and simulation, knowledge of redesign activities that occur after developmental testing, and other sources to create an operational test that will expose these failure modes. Through analysis of this information, situations can be identified, with associated prior probabilities, that indicate which of the failure modes in the developmental test remain active in the operational test. (In a less simplistic situation, one would, of course, also be concerned with failure modes appearing in operational testing that did not appear in the developmental test.) The operational test data can then be used to update the estimated probabilities of these situations. This method is particularly helpful when trying to assess the properties of a large number of failure modes that either are statistically dependent or have, individually, low failure rates.

In principle, the computation is straightforward. In practice, however, a considerable level of expertise is needed to develop suitable prior information and combine it appropriately with experimental data. The following simplistic example demonstrates one approach.

During developmental testing a vehicle has exhibited two critical failure modes, mode 1 and mode 2. Both involve components with relatively mature designs, so infant mortality is not present. The vehicles have experienced relatively low usage, so wearout is not likely. For these reasons, or perhaps because the failures are due to external stressors exceeding a certain limit, it is assumed that each mode exhibits exponentially distributed times to failure. However, the failure rates, λ1 and λ2, are not known. Therefore we need to make statistically supportable statements about three performance measures when the system enters operational testing after modifications based on developmental testing:

· λi = vehicle (total) failure rate per mile due to mode i in operational use,
· MDTFi = mean distance to failure due to mode i, and

· Rel(m) = m-mile reliability = the probability that a vehicle will survive m miles without failure.

Using engineering judgment and the results of developmental testing, it is assumed that we are able to consider four possible situations:

· S0 = no failure modes remain after developmental testing is concluded,
· S1 = only failure mode 1 remains after developmental testing,
· S2 = only failure mode 2 remains after developmental testing, and
· S3 = both failure modes 1 and 2 remain after developmental testing.

We assume that we are comfortable assessing a priori probabilities p0, p1, p2, and p3, respectively, for these situations, and that our uncertainty about the failure modes and the associated MDTF can be expressed by assessing the expected value E(MDTFi) and standard deviation STDEV(MDTFi) for each mode.

Note that under this framework, the mean distance to failure is random, since it is unknown. We can update a prior distribution about the mean distance to failure, using operational test data, to arrive at a posterior distribution. This posterior distribution will itself have a mean, the expected mean distance to failure, and a standard deviation.

Now suppose that, after an exposure of t total vehicle miles in the operational test, n1 failures of type 1 and n2 failures of type 2 are observed (where n1 and n2 can be 0). Appendix B shows the development and specific equations that allow calculation of the three performance measures, as well as their uncertainty, expressed by their posterior standard deviations. For example, suppose expert information based on developmental testing and other activities provides us with the following prior values:

E(MDTF1) = 2,500; STDEV(MDTF1) = 2,000;
E(MDTF2) = 3,000; STDEV(MDTF2) = 3,500; and
E(MDTF0) = 100,000; STDEV(MDTF0) = 0;

where the 100,000-mile (certain) MDTF value reflects a practical assessment of the situation "no failure modes remaining." Using scenario probabilities p0 = .005, p1 = .10, p2 = .15, p3 = .745, Table 2-3 shows various performance measures for three potential values of (n1, n2) failures in t = 20,000 total exposure miles.

TABLE 2-3 Three Potential Values for the Number of Failures of Two Types Observed in 20,000 Miles, and the Resulting Impact on Reliability Estimates and Their Uncertainty

(n1, n2)                         (0,0)     (0,1)     (1,1)
Posterior E(λ) × 1,000           .145      .187      .321
Posterior STDEV(λ) × 1,000       .102      .105      .111
Posterior E(MDTF)                17,073    7,744     3,543
Posterior STDEV(MDTF)            26,360    6,417     1,412
Posterior E(Rel(1,000))          .868      .834      .730
Posterior STDEV(Rel(1,000))      .085      .083      .079

This example reflects the use of weak prior information, in that STDEV(MDTF1) and STDEV(MDTF2) are about as large as their respective mean values. Therefore, the reported performance measures are relatively objective in that they depend mostly on the operational test results. It is also possible to compute posterior probabilities for the four situations (see Appendix B) that show the same relative insensitivity to prior assessments.

This general approach can be extended to account for more complex situations, as in the following example. A system has 40 type A vehicles and 30 of type B. A developmental test has been run with miles of operation per vehicle ranging from 1,000 to 15,000 miles, with 10 failure modes discovered at various mileages. An operational test is then run with 24 vehicles, all of type A, with miles of operation now ranging from 500 to 2,000 miles. Four of the original 10 failure modes are observed, occurring at varying mileages but with a higher rate than in the developmental test. In addition, a failure mode is seen that was not present in the developmental test. The operational test is set in three different environments of use, and the developmental test has been conducted exclusively in a fourth environment of use, a test track.
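The equations of Appendix B are not reproduced here, but the updating of the situation probabilities can be sketched by Monte Carlo (Python) under stated assumptions: lognormal priors on each MDTFi moment-matched to the means and standard deviations above (an assumption; Appendix B may use a different prior family), Poisson failure counts over the t = 20,000 exposure miles, and the convention that an eliminated failure mode cannot produce failures.

```python
import math
import numpy as np

rng = np.random.default_rng(11)
M = 100_000
t = 20_000                  # total exposure miles
n1, n2 = 0, 1               # observed failures by mode (one of Table 2-3's cases)

def rate_draws(mean_mdtf, sd_mdtf):
    # Assumption: prior uncertainty about MDTF expressed as a lognormal
    # moment-matched to the stated mean and SD; the failure rate is 1/MDTF.
    s2 = math.log(1 + (sd_mdtf / mean_mdtf) ** 2)
    mu = math.log(mean_mdtf) - s2 / 2
    return 1.0 / rng.lognormal(mu, math.sqrt(s2), M)

def marginal(lam, n):
    # Monte Carlo estimate of E[P(N = n | lambda)] under the prior on lambda.
    return np.mean(np.exp(-lam * t) * (lam * t) ** n) / math.factorial(n)

lam1 = rate_draws(2_500, 2_000)
lam2 = rate_draws(3_000, 3_500)

prior = {"S0": .005, "S1": .10, "S2": .15, "S3": .745}
like = {"S0": float(n1 == 0) * float(n2 == 0),   # eliminated modes cannot fail
        "S1": marginal(lam1, n1) * float(n2 == 0),
        "S2": float(n1 == 0) * marginal(lam2, n2),
        "S3": marginal(lam1, n1) * marginal(lam2, n2)}

norm = sum(prior[s] * like[s] for s in prior)
for s in prior:
    print(s, round(prior[s] * like[s] / norm, 3))
```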

Although this approach to combining information from developmental and operational testing is a tempting means of increasing the efficiency of operational test results, a number of potential difficulties remain. To the extent that an analyst must speculate about possible situations that have not been realized, an assessment of their probabilities may be more vulnerable to cognitive biases than the better understood assessment of distributions of more intrinsically engineering- or physically based parameters. In addition, the analysis necessary for the more complex combinations of failure modes implied in the more realistic example above will require expertise not necessarily resident at the test agency. Because the methodology is not inherently suitable for encapsulation in manuals or training courses, it would require nonstandard certification for each use.

On the other hand, sensitivity analysis with respect to prior assessments can be readily performed using simple spreadsheet software models. Moreover, inferences made about performance measures are couched in language appropriate for decision making.

In summary, inferences about the number of failure modes that have been fixed prior to operational testing, the number of new failure modes that operational testing has revealed, and related problems can be addressed using combining-information techniques. These techniques are strongly dependent on assumptions, and their proper application therefore requires the use of sensitivity analyses to determine dependence on those assumptions.
