The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
1 Introduction to Combining Information

Combining information is a term that incorporates a wide range of methods and activities. It can include formal and informal methods, involve the use of qualitative or quantitative variables, and apply to both the design of data collection and analysis of the collected data. This section outlines the range of methods for combining information that can be useful when testing and evaluating a defense system and discusses their benefits and requirements.

The combining of information exemplifies the adage, "Necessity is the mother of invention." Much statistical activity has resulted from the necessity of drawing conclusions when information from a single source is not sufficient. Information combining entails more than simply viewing a collection of numbers in a common context. If all the data sets resulting in all the information to be combined were available in their most detailed versions, one could try to view the combined data set within an appropriately wider context. (This is one purpose of regression analysis, making predictions in different contexts comparable through use of covariates.) However, original data sets often are not available, or are available only in the form of derived summary statistics, or contain only an informal collection of qualitative observations, and so statisticians cannot always consider an information-combining problem as simply an exercise in estimation, especially when the data-generating mechanism is particularly complicated.

The informal use of prior information collected from various sources is a hallmark of scientific study design. Information from previous studies is
often used to suggest suitable levels for study factors where changes in a response are expected to be the most pronounced; to determine suitable sample sizes to support significance testing, where previous studies may suggest an estimate of the variability that is needed to set a test's operating characteristics; and to help select the type of statistical method for data analysis.

Formal statistical models can be used to combine data from more than one study. This is common in industry, where, for example, manufacturers combine information about a vehicle part's lifetime from a variety of vehicle model years, as long as the difference in vehicle model year does not imply substantial systematic differences in stress experienced by that part. Along the same lines, experimentally determining the reliability of large systems composed of many subsystems by testing the entire system is often difficult or impossible, either practically or economically. However, data collected on the subsystems may be combined, using formal mathematical and statistical models and assumptions, to make reliability assessments of the full system.

Informal combination of information is typically carried out in an ad hoc manner by reviewing what has been learned previously and synthesizing this information for use in a current situation. Formal combination of information, on the other hand, generally involves the use of statistical models that require a number of assumptions. If the underlying assumptions are not found to be seriously violated, then formal combination of data usually builds a stronger inference than would be possible otherwise.

The most straightforward statistical approach to information combination is the pooling of information from two or more comparable studies. Such an approach may be relevant if, for example, two or more studies involve the failure rate for similar devices.
Then the number of failures and operating hours from all studies may be combined to provide a single estimate, which is improved because it is based on an increased sample size. Not only is the estimate likely to be better, but its uncertainty will also be estimated more precisely.

It is often difficult to judge whether data collected from different studies are sufficiently comparable to allow them to be effectively combined by pooling. Statistical tests can be used to judge whether or not to pool data (though rejection of the null hypothesis of consistency with distributional assumptions at typical significance levels is not necessarily where the line should be drawn about whether or not to pool). For example, the assumption of normality supports concentration on two statistical tests, the equality of means and the equality of variances, to decide whether to pool several data sets thought to obey the normal distribution. Avoiding such a strong assumption about the distribution generating the data, one may instead perform a nonparametric test, though the omnibus nature of such tests makes them somewhat less effective against individual distributional forms. Of course, it is possible that a problem may not lend itself to any method for combining information because it is not possible to identify ways of linking the information between studies.

Between the extremes of being able to pool data and not finding any methods for linking studies lies the possibility of using statistical methods to combine, for disparate data sources, the appropriate parts of the available information. For example, in the case where tests of a common mean among many studies indicate that direct pooling is not appropriate, pooling of variance estimates may still be appropriate if a statistical model can be used that allows the individual studies to have different means but the same variance. Such a model yields tighter confidence intervals, on the average, than would be possible from use of the variances from each study individually. This gain is strongest in situations where the sample sizes for the individual data sets are very small.

Another example is the case where in the analysis of several sets of reliability data it is assumed, on the basis of appropriate diagnostic tests, that the data sets are distributed according to a Weibull failure time model with the same shape parameter as a previously analyzed set of data but with characteristic life parameters that vary, perhaps in a manner related to study covariates, between the data sets. In this case, information would be combined using a parameter derived from earlier, comparable studies.
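This kind of borrowed-parameter analysis can be sketched in a few lines. The following is a hypothetical illustration, not an example from the report: the failure-time data and the borrowed shape value are invented, and SciPy's ability to fix distribution parameters during fitting does the work of holding the shape at its prior value while estimating only the characteristic life.

```python
# Hypothetical sketch: fit a Weibull failure-time model whose shape
# parameter is borrowed from earlier, comparable studies, then check
# sensitivity over a spectrum of plausible shapes.  All data invented.
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
new_data = weibull_min.rvs(c=1.8, scale=500.0, size=12, random_state=rng)

borrowed_shape = 1.8  # shape estimate carried over from earlier studies

# Fit only the characteristic life (scale), holding shape and location fixed.
_, _, scale_hat = weibull_min.fit(new_data, fc=borrowed_shape, floc=0.0)
print(f"characteristic life under borrowed shape: {scale_hat:.1f}")

# Sensitivity analysis: refit the characteristic life at other shape values
# to see how strongly conclusions depend on the borrowed parameter.
for c in (1.4, 1.6, 1.8, 2.0, 2.2):
    _, _, s = weibull_min.fit(new_data, fc=c, floc=0.0)
    print(f"  shape={c:.1f} -> characteristic life={s:.1f}")
```

The sensitivity loop is the point: if the fitted characteristic life swings widely across reasonable shape values, the borrowed-parameter assumption is doing too much work.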
This use of prior information should be accompanied by alternative analyses using a spectrum of shape parameters to determine the sensitivity of the analysis results to the assumption of a common shape parameter. (As we will emphasize throughout this report, one should only make assumptions of comparability with good physical or historical justification.)

Hierarchical or random effects models represent another form of compromise between complete pooling of data and no combination, offering the potential for data-determined degrees of combining information. Consider the Weibull example above, where the choice is between a common shape parameter for all the data sets or different shape parameters for each data set. Under a random effects model one would assume that the shape parameters were a realized sample from a population of possible shape parameters. If that population has a small variance, then the shape parameters will be essentially equivalent, which is the first of the two extremes. If the population has a large variance, then the shape parameters may differ substantially, which is the second extreme. The shape parameter variance can be estimated from the data, allowing the data to determine the degree to which the different data sets reinforce one another. Hierarchical or random effects models can be used in a variety of ways; recently it has become popular to apply them (or very similar models) using a Bayesian approach (see, e.g., Gelman et al., 1995; Carlin and Louis, 1996) based on advances that facilitate computation of Bayesian estimates and the development of associated Bayesian infrastructure.

The main methodological and practical requirement for combining information is that explicit judgments or assumptions be supported. The (possibly informal) judgment might be that information from earlier studies is relevant to the design of an upcoming study; a formal mathematical or statistical assumption may be required to combine two data sources in a particular way. In either case, there are many caveats. Assuming that the value of a parameter, such as a standard deviation, is known based on earlier experiments or experience can be problematic, especially when the knowledge is based on data collected by anecdotal accounts that rely on memory. Although apparently minor violations of assumptions made to combine two data sources may, from a purist point of view, result in improper inference, one may sometimes choose a more pragmatic approach. For example, combining data sets that have slightly different means to estimate an assumed common location parameter has the effect of translating any differences in location between the two data sets into an inflated estimate of variability.
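That inflation is easy to exhibit numerically. The sketch below uses invented data for two studies whose means differ slightly: pooling inflates the estimated standard deviation, so the confidence interval for the assumed "common" mean is pulled narrower by the larger sample size but wider by the larger spread, and which force wins depends on the sample sizes and the size of the mean difference.

```python
# Hypothetical numeric sketch (invented data) of the pooling trade-off:
# a larger n narrows the confidence interval while the inflated standard
# deviation widens it.
import numpy as np
from scipy import stats

a = np.array([9.4, 10.1, 10.6, 9.8, 10.3, 9.9, 10.5, 9.7])  # first study
b = a + 0.5                        # second study, slightly shifted mean

def ci_halfwidth(x, conf=0.95):
    """Half-width of the t confidence interval for a sample mean."""
    n = len(x)
    return stats.t.ppf(0.5 + conf / 2, n - 1) * x.std(ddof=1) / n**0.5

pooled = np.concatenate([a, b])
print("sd, first study :", round(a.std(ddof=1), 3))
print("sd, pooled      :", round(pooled.std(ddof=1), 3))   # inflated by the shift
print("CI half-width, first study:", round(ci_halfwidth(a), 3))
print("CI half-width, pooled     :", round(ci_halfwidth(pooled), 3))
```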
In that case, an increase in effective sample size is gained at the expense of increasing the variance of some estimated parameters. Thus the estimate of the "common" mean may be compromised while accompanying confidence bounds are both, on the one hand, narrower due to the increased sample size, and, on the other hand, wider because of the increased standard deviation. The cost of such minor differences can be large when they are magnified by extrapolation. A trade-off study of this phenomenon may well be of general interest in the context of combining information.

The defense testing environment presents an opportunity to effectively and appropriately combine information. Experience with the Stryker/SBCT test and evaluation shows that operational testing (OT) alone often does not collect enough data to permit definitive conclusions. It is therefore necessary to also use data from developmental testing,1 training, and field experience of the given system and of related systems. Combining operational test data with developmental test or other data is possible and potentially useful and effective, but it requires careful consideration of the relationships among the data sets.

There is no evidence in the Test and Evaluation Master Plan (TEMP) or any other documents or information made available to the panel that the Army Test and Evaluation Command (ATEC) intends to use formal techniques for combining information in the final Stryker operational test evaluation. This report argues for the greater use of combining information methods, including the use of subjective expert opinion, as an important part of the operational assessment of complex defense systems in development, including Stryker. As pointed out in NRC (1998) and repeated here, without the use of these methods operational tests will typically fail to provide sufficient statistical power to support the confirmatory role of significance testing in judging the extent to which requirements of defense systems have been satisfied, and consequently whether the systems should be promoted to full-rate production. To address the disconnect between the role of significance testing in operational evaluation and the inherent limitations of significance testing due to the necessarily limited information that can be collected in operational tests, the panel recommends greater use of combining information in both test design and operational evaluation of defense systems. As will be detailed and reinforced in the following chapters, this strong advocacy of the use of these methods calls for diligence and expertise in verifying that the underlying assumptions hold to an acceptable degree in order to prevent their misapplication.
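The power limitation behind this recommendation can be made concrete with a back-of-the-envelope calculation. The sketch below is hypothetical (the effect size, noise scale, and sample sizes are invented): a one-sided z-test with an operational-test-sized sample has little power to confirm an improvement, while an effective sample enlarged by comparable developmental-test data has considerably more.

```python
# Hypothetical illustration of the low power of small operational tests:
# power of a one-sided z-test as the effective sample size grows.
from scipy.stats import norm

def power_one_sided(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true shift of size delta."""
    z_crit = norm.ppf(1 - alpha)
    return 1 - norm.cdf(z_crit - delta * n**0.5 / sigma)

delta, sigma = 0.75, 2.0   # assumed true improvement and noise scale
for n in (8, 40):          # OT sample alone vs. OT plus comparable DT data
    print(f"n={n:2d}  power={power_one_sided(delta, sigma, n):.2f}")
```

The calculation assumes the developmental-test observations are fully comparable to the operational ones; as the surrounding text stresses, that assumption is exactly what must be verified before such a gain can be claimed.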
Since the defense acquisition process involves a number of organizations motivated by different and often competing incentives, we also stress the need to use assessment methods that help to ensure unbiased expert opinions.

In arguing for this fundamental change to the operational evaluation of defense systems, the panel is aware of its broader nonstatistical implications, which champions of these methods in the defense test and evaluation community will have to consider during implementation. Use of developmental test data, expert opinion, data from training exercises, and data on similar systems as part of operational evaluation will blur the boundaries of developmental and operational testing, and will clearly have potential impacts on the defense acquisition milestone system.

In addition, information-combining techniques are sensitive to various assumptions, so that model validation is a crucial part of their proper application (see, e.g., Gelman et al., 1995). In developing models, analysts will need to represent the implications of any problems or unusual events that arose during system development or developmental testing. Therefore, we strongly urge that those involved in the application of the techniques described collaborate closely with those who have in-depth knowledge of the development of the system in question.

Furthermore, the combining information methods recommended in this report are more susceptible to misapplication than the techniques currently used by ATEC. For that reason, there is an important requirement that all steps in the development of these models and in the estimation of their parameters be fully documented so that they can be formally reviewed. Although the ultimate costs and potential shortfalls of such organizational changes must be considered, the panel is pleased to see evidence that these organizational changes are already under consideration.

The remainder of this report is structured as follows. Chapter 2 provides simple examples of methods for combining information within the weapons systems test and evaluation context to suggest approaches, explain considerations, and identify potential advantages. Chapter 3 presents more realistic examples of how modeling for combining information can be applied to Army operational test and evaluation, considering the Stryker system at times as a specific application, and discusses implementation issues relating to combining information methods in the context of weapons system testing and evaluation. Chapter 4 identifies the resources, tools, and capabilities required to support the development of models for combining information in the context of defense test and evaluation. Chapter 5 discusses combining information for the operational test and evaluation of the Future Combat System (FCS)/Future Brigade Combat Team (FBCT).

We direct interested readers to the National Research Council report Combining Information: Statistical Issues and Opportunities for Research (NRC, 1992), a valuable resource that provides additional technical details and useful references for methods of combining information. In addition, for other related research see Samaniego et al. (2001), Samaniego and Vestrup (1999), Arcones et al. (2002), and Gaver et al. (1997).

____________
1 Developmental testing is typically carried out both by DoD (government) and by contractors. Because government developmental testing is usually expected to be more fully reported (and objectively summarized) than that done by contractors, the primary intent in this report is the use of government developmental testing for use in combining developmental and operational test data. When contractor testing is fully reported, the arguments provided here apply there as well.