Suggested Citation:"3. Combining Information in Practice." National Research Council. 2004. Improved Operational Testing and Evaluation and Methods of Combining Test Information for the Stryker Family of Vehicles and Related Army Systems: Phase II Report. Washington, DC: The National Academies Press. doi: 10.17226/10871.

3 Combining Information in Practice

The previous chapter presented a number of examples of the use of techniques to combine information. In this chapter we discuss some considerations when implementing these techniques and the complications that often accompany analyses of operational test data in defense and related industrial applications.

The panel notes that, while the operational evaluation of the Stryker/SBCT is a large and extremely complex problem, this degree of complexity is not unique within the DoD or other government agencies such as the Department of Energy (DOE). Los Alamos National Laboratory (LANL), for example, must evaluate the weapons in the aging nuclear stockpile and certify their safety, reliability, and performance even though the live test data that have traditionally been used for this evaluation can no longer be collected. For its evaluation of the nuclear weapons stockpile, the Department of Energy is developing approaches that employ formal methods for using expertise and combining information. Although live, full-system test data are no longer available, there is a great deal of relevant information, including results from computer simulations, historical test data, subsystem tests, and expert judgment available through a large and multidisciplinary community that includes engineers, physicists, materials scientists, statisticians, and computer scientists. Traditional reliability demonstrations would be very difficult, and traditional statistical methods must be significantly expanded to include the representational methods discussed above and the information-combining methods discussed here and in Chapter 2. An example of how these methods might be applied to a large, complex system is given in Appendix C.

COMBINING INFORMATION TO ASSESS SUITABILITY, SURVIVABILITY, AND EFFECTIVENESS

The operational test for Stryker is intended to assess a large number of performance criteria. In the system evaluation plan (SEP) for Stryker the measures of performance and effectiveness (MOPs/MOEs) are grouped into three areas: suitability, effectiveness, and survivability. Suitability encompasses issues such as transportability, maintainability, availability, and supportability. Measures under this broad heading are often not situation dependent, and so combining information from the operational test with that from training, developmental tests, and perhaps testing and field use of similar systems can often be relatively straightforward. For example, all instances in which Stryker is found to be transportable on a C-130 aircraft, whether from a training exercise or in developmental or operational testing, provide valid information about transportability. The various methods described above for combining information for use in assessing reliability (and other related methods) can be effectively applied in this area.

Measures of survivability and effectiveness, on the other hand, are typically situation dependent. Information from operational training missions (such as raids and perimeter defense) is not easily combined with information from operational test missions because of the many differences between training and test operational situations. The approach used most often to combine information about survivability and effectiveness is the combination of information from operational tests, conducted by ATEC, and modeling and simulation efforts, such as those obtained by the U.S. Army Training and Doctrine Command's (TRADOC) Analysis Command.
Methods of combining information useful for modeling measures of system survivability and effectiveness are likely to require relatively specialized models of system performance, which are typically achieved through modeling and simulation.

The combination of information from tests and simulations is already standard DoD procedure. Modeling and simulation results play a part in designing operational tests, and the results of operational tests are used to refine and improve modeling and simulation programs through a model-test-model approach. This existing DoD activity is an example of the wide range of methods subsumed under the rubric of combining information.

The Stryker operational test will provide quantitative information that can be used in subsequent modeling and simulation efforts (though such efforts will likely not be used for the operational evaluation of Stryker). This information includes detailed performance measures such as detection times, detection probabilities, time between rounds fired, and probability of surviving direct hits, which can be used as direct inputs to detailed simulations. The operational test can also provide data on sample attrition rates that can be used as input to aggregated models. In either case, the simulations and models could then be used to augment the limited number of situations considered in the operational test by simulating other operational situations to provide a larger base of information for evaluating the survivability and effectiveness of Stryker and the SBCT.

There is relatively new, relevant statistical research on combining information from experimental systems with that from computer models (see, e.g., Reese et al., 2000). One important, and challenging, step in carrying out this type of information-combining is to assess the variability and uncertainty in the output of the computer models that result from poor or insufficient inputs. The uniqueness of each application and the fact that the research is still evolving prevent our making any general statements about approaches that ATEC should take along these lines.

ISSUES IN COMBINING INFORMATION FOR RELIABILITY ASSESSMENT

Reliability is typically defined in textbooks as the probability of survival (or operation without failure) for a given mission time and under specified conditions.
A more practical definition would identify and carefully characterize encountered conditions, recognizing that most systems have to operate in a complicated, dynamic environment.

Customers generally desire information or assurance about the reliability of a system or product before they decide whether to purchase it and for what price. Manufacturers, for their part, need to assess a product's reliability before it is released in order to reduce the risk of serious field reliability problems and warranty costs. A purely empirical reliability demonstration typically follows the significance testing framework described in the NRC's 1998 report (pp. 88-91) and exemplified in DoD documents such as MIL-STD-690C (Failure Rate Sampling Plans and Procedures), MIL-STD-781C (Reliability Design Qualification and Production Acceptance Tests: Exponential Distribution), and MIL-HDBK-108 (Sampling Procedures and Tables for Life and Reliability Testing Based on Exponential Distribution).

The fundamental ideas behind reliability demonstration testing are straightforward; an example in this instance is the specification that mean time to failure (MTTF) for a Stryker vehicle should be at least 10,000 miles. In order to demonstrate that this specification has been met, it is necessary to have a test that results in a lower confidence bound on MTTF that exceeds the specification. A minimum sample-size plan to make such a demonstration may have appeal, but to have a reasonable probability of successful demonstration, the actual MTTF would have to be much larger than 10,000 miles. Thus, under the simplifying assumption of an exponential failure time distribution having only one unknown parameter, a demonstration at the 95 percent level of confidence would require testing three units for 10,000 miles and having no failures (see, for example, equation (10.1) in Meeker and Escobar, 1998). If the true MTTF is 15,000 miles, the probability of a successful demonstration (i.e., no failures) is only exp(-3/1.5) = 0.135. If the true MTTF is 30,000 miles, the probability of successful demonstration increases to exp(-3/3) = 0.368, which is still not very high. Although larger sample sizes can provide higher probabilities of success by allowing for a small number of failures during the test, these sample sizes can increase dramatically when one must estimate two parameters (e.g., fitting a more realistic Weibull distribution with an unknown shape parameter).
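The zero-failure demonstration arithmetic above can be sketched in a few lines (a minimal illustration assuming exponentially distributed failure miles; the function name is ours):

```python
import math

def zero_failure_success_prob(n_units, miles_per_unit, true_mttf):
    """Probability that a zero-failure demonstration test passes,
    under an exponential failure-mile model: the chance of observing
    no failures over the total exposure n_units * miles_per_unit."""
    total_exposure = n_units * miles_per_unit
    return math.exp(-total_exposure / true_mttf)

# Three vehicles, each driven 10,000 miles (30,000 total miles):
p_15k = zero_failure_success_prob(3, 10_000, 15_000)  # exp(-2), about 0.135
p_30k = zero_failure_success_prob(3, 10_000, 30_000)  # exp(-1), about 0.368
```

Even a true MTTF three times the specification leaves the demonstration more likely to fail than to succeed, which is the point of the illustration.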
Thus, although these methods of reliability demonstration are useful for testing materials or components, unless the actual reliability is very much greater than the specification, they are generally impractical for large, expensive systems, because large sample sizes or unrealistically long tests are required.

The previous illustration should make it clear that unless the true reliability of a system is overwhelmingly high, one will need very large amounts of reliability data to achieve the desired goals of reliability demonstration with some confidence. A number of information and data sources for both quantitative and qualitative information are available for such an evaluation of the Stryker/SBCT. The major sources include operational testing, developmental and technical testing, contractor testing, data from previous tests of similar systems, training exercises, experience of foreign armies with variants of the Stryker (though these systems are not very similar, which would severely limit the value of this information), engineering judgment, military judgment, and modeling and simulation.

The goal is an assessment, referred to as a reliability assurance, that is not as rigorous a confirmation as a reliability demonstration but that can still provide sufficient information on which to base a decision on promotion to full-rate production. In this approach, data are combined from a variety of sources, and the inference, as a result, is more model-based than in a reliability demonstration.

The following discussion addresses the use of these sources and considers specific formal methods.

Use of Military Judgment

It is always encouraging when statistical analysis of data harmonizes with the judgment obtained from insight, intuition, and experience. Of course, one should also consider how each may influence the other. Does the data analysis trigger the harmonizing after the fact? Would other results have led to other harmonies? It is much more convincing if evaluators and those providing other information write down their analysis results or insights and intuitions before comparing them for validation. Unfortunately, even in this case, minor differences will often be explained away if there is pressure for a certain interpretation of the results.

Combining Test Data

Operational testing for the Stryker will involve many vehicles over relatively short exposure periods. Unless one is analyzing failure modes with lifetimes that are reasonably described by an exponential distribution, the summary experience over these many short exposure periods is not equivalent to the summary experience of a few vehicles over long exposure periods. This is the case even when the total exposure time for both sets of vehicles is the same. Data from such longer exposure periods may well be available from developmental testing but only for a few vehicles.
Combining Test Data: Exponential Models

The assumption that individual components and replaceable units (not repairable systems) have lifetimes that follow an exponential waiting time distribution may be reasonable in situations where the failures are mostly due to external stressors exceeding a certain limit. Such a limit characterizes the vulnerability of the fleet of vehicles. However, before employing an exponential lifetime analysis, it should be confirmed that this vulnerability is not affected by aging. Such a confirmation almost always will involve the expert judgment of those who perform postmortem analyses of component and replaceable unit failures. The judgment to use an exponential distribution is an implicit form of combining information, since one is using expert opinion to stipulate a specific distributional form, in this case that the shape parameter in a Weibull model is equal to 1.

When an exponential failure time model is appropriate, the combining of data from two or more sources is fairly straightforward, provided the failure rates are roughly the same. The number of failures is combined into one overall count N and the exposure times into one overall total exposure time T, and the analysis is performed using these two entities, with N/T being the maximum likelihood estimate of the failure rate. Here the two or more data sources can be operational, developmental, training, or other exposure tests or exercises, or the data may be obtained from subsystem experiences. In the latter case, the analysis is performed as though failures from all of these subsystems can be treated alike, as a common failure mode.

It is essential to also compute individual failure rates together with their uncertainties to judge the assumption of homogeneity. Such a judgment can be informal (e.g., using a graphical technique) or formal (e.g., using significance tests). When applied to small data sets, such judgments tend to be liberal in that homogeneity will not be easily rejected unless the differences are sufficiently large.
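The pooled N/T estimate and an informal homogeneity check can be sketched as follows (a minimal illustration; the data and function names are hypothetical):

```python
def pooled_failure_rate(sources):
    """Pool exponential failure data from several sources, each given
    as a (failure_count, exposure) pair; return the pooled maximum
    likelihood estimate N/T along with the per-source rates."""
    N = sum(n for n, t in sources)
    T = sum(t for n, t in sources)
    return N / T, [n / t for n, t in sources]

def dispersion_stat(sources, pooled_rate):
    """Poisson dispersion statistic for a rough homogeneity check;
    compare with chi-square quantiles on len(sources) - 1 degrees of
    freedom (large values argue against pooling)."""
    return sum((n - t * pooled_rate) ** 2 / (t * pooled_rate)
               for n, t in sources)

# Hypothetical counts: operational test, developmental test, training.
sources = [(4, 20_000.0), (6, 60_000.0), (2, 15_000.0)]
rate, individual = pooled_failure_rate(sources)  # N/T = 12/95,000
chi2 = dispersion_stat(sources, rate)
```

Small samples make such checks liberal, as the text notes: with only a handful of failures per source, the dispersion statistic rarely exceeds the rejection threshold unless the rates differ substantially.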
Such liberal judgments will lead to pooling of data with minor differences, and the mixed populations will exhibit somewhat higher variability characteristics than each contributing population. Such pooling of inhomogeneous exponential data gives the impression that the underlying failure phenomenon has a decreasing failure rate as opposed to the constant rate characterizing the exponential model (see Proschan, 1963). The result will be a better understanding of a mixed population instead of a more vague perception of many individual populations.

When the failure rates under different exposure regimes (e.g., the operational test and developmental test) or for different categories of subsystems show significant variations, it may still be possible to determine whether those variations are due primarily to a single factor. For example, failure rates during developmental testing may differ from the rates under operational testing, but for a particular group of failure modes the ratios of failure rates under the operational test to those under the developmental test might be roughly constant. (This is the approach taken in Samaniego et al., 2001.) If this constant were, for example, 2, it would mean that operational test failures occur at roughly twice the rate of developmental test failures. An explanation might be that the external stressors (e.g., rugged terrain, wet weather, or rougher driving styles) in the operational test exceed the vulnerability limits approximately twice as often. For example, ball bearings can be damaged by sufficient shocks caused by rough terrain or unskilled driving (e.g., hitting a curb with the wheel). Even though bearings eventually wear out, a postmortem analysis of failures may be able to distinguish (e.g., by comparing the defective bearing with other bearings on the same vehicle) between the strong shock casualties and those that come from normal wear. This is another example of combining information obtained from engineering judgment used in conjunction with actual data. (Note that although this example is presented, for ease of explication, in the context of exponential lifetime analysis, it applies as well to other lifetime models.)

If data from several previous systems are available during the developmental and operational tests, and if one finds that for specific components a failure rate during the operational test is roughly a certain multiple of the corresponding failure rate under the developmental test, then such a factor could be used to analyze the data for a current system for the same type of component in a combined fashion. The broader the prior experience over which this factor appears to be constant, the more confidence one can have in the use of such a factor for the situation at hand.

This kind of analysis requires the foresight to have collected and archived data for easy retrieval.
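One simple way to estimate such a common factor from archived data is to take the ratio of the pooled operational-test and developmental-test rates (a sketch only; the data are hypothetical, and more careful estimators such as those in Samaniego et al., 2001, would be used in practice):

```python
def ot_dt_factor(systems):
    """Estimate a common operational-test/developmental-test
    failure-rate ratio from data on several earlier systems.
    Each entry is ((n_ot, t_ot), (n_dt, t_dt)); the estimate is
    the ratio of the pooled OT rate to the pooled DT rate."""
    n_ot = sum(ot[0] for ot, dt in systems)
    t_ot = sum(ot[1] for ot, dt in systems)
    n_dt = sum(dt[0] for ot, dt in systems)
    t_dt = sum(dt[1] for ot, dt in systems)
    return (n_ot / t_ot) / (n_dt / t_dt)

# Hypothetical archived data from two earlier systems:
systems = [((8, 10_000.0), (10, 25_000.0)),
           ((6, 7_500.0), (9, 22_500.0))]
k = ot_dt_factor(systems)  # OT failures occur at k times the DT rate
```

In this invented example the pooled OT rate is twice the pooled DT rate, matching the illustrative factor of 2 discussed above.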
Unfortunately that is usually not the case in industry or in defense acquisition, because it is hard to convince the financial decision makers to spend money on projects that are not immediately useful and may pay off only in the future, for a different program, after several such systems have been built and tested. The utility of establishing and maintaining a data archive is discussed in Chapter 4.

The common factor approach can be extended to more complex and flexible regression models where (often the logarithm of) the failure rate is modeled as a linear combination of known factors that may influence the failure rate in some form. Such factors could identify the environmental exposure conditions or different mission scenarios during which failures occurred. As mentioned above, the exponential distribution is appropriate when failures occur due to random external shocks. Such regression models, when they do not involve too many independent parameters, can lead to strong pooling of information, i.e., to a great reduction in estimation uncertainty when compared to separate analyses based on data for each factor combination.

If individual failure rates appear to be sufficiently different and combining data is not an option, this finding in itself is a form of combining information. Namely, more is learned from the collective of individual pieces of information than from each piece by itself; in this case it is learned that they are different, and the source of that difference can be investigated. This comment applies not just in the exponential lifetime context but in all others as well.

Even in this situation different failure rates can be treated as random effects. By estimating the variability of these rates from the individual sources, pronouncements can be made about the collective of such rates if they can be reasonably viewed as a random collection from some population. Here there is a trade-off between a larger data collective and a somewhat more uncertainly defined population, i.e., between a relatively large variance for the random effects and a relatively small variance.

Combining Test Data: Weibull Models

A popular extension of the exponential model is the Weibull model, which not only describes the lifetimes of components and replaceable units that fail due to external causes, but also provides a framework for lifetimes that arise from wear-out failures or infant mortality. Wear-out failures are quite common for mechanical systems, gears, axles, bearings, clutches, and brakes. Infant mortality failures arise in some electronic components and subsystems.

These two kinds of failure can be effectively represented with a Weibull distribution, which is intrinsically identified by two parameters, the characteristic life η (acting as a scale parameter) and the shape parameter β, governing the skewness of the distribution.
Symbolically, the distribution function is

    F(t) = 1 - exp[-(t/η)^β],  t > 0.

On a logarithmic scale for the lifetimes this distribution becomes a location-scale family with location parameter μ = log(η) and scale parameter σ = 1/β. When β = 1, the Weibull distribution yields the exponential distribution as a special case. Situations with β > 1 are appropriate for describing wear-out and other phenomena (and β < 1 for infant mortality).

As mentioned previously, estimating both Weibull parameters η and β entails an additional uncertainty in the estimation process and therefore has more stringent data requirements. Here the case for combining information becomes even stronger than in the exponential situation. If the shape parameter β is approximately known from previous experience, the individual Weibull lifetime values Xi can be transformed via Xi → Xi^β into exponentially distributed data, and all the methods discussed above carry over. If working with a known shape parameter is problematic, several values can be used in a sensitivity analysis, and, depending on the application, one of these can be used as a conservative choice. For example, when it is clear that the system is subject to wear-out, β = 1 can be used as a lower bound on β. For some situations this will yield conservative results (see, for example, Section 10.6 in Meeker and Escobar, 1998).

When β must be estimated as well, data can be combined using the assumption that the two sets have the same shape parameter but possibly different η's (the assumption of common shape parameter should be checked formally through tests or informally through graphical tools). In this fashion the uncertainty in estimating β will be greatly reduced. Considering the log transform of Weibull lifetime data, this is essentially analogous to pooling variances, as discussed earlier.

Further methods for combining Weibull data are similar to those described for the exponential model, culminating in a linear regression model that treats log(η) as a linear function of various known factors that vary across all lifetime data that are intended to be used in the combination effort.
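The known-shape-parameter device just described can be sketched as follows (a minimal illustration for complete, uncensored samples; the data and function names are ours):

```python
def weibull_to_exponential(lifetimes, beta):
    """With a known Weibull shape parameter beta, the transform
    x -> x**beta turns Weibull lifetimes into exponentially
    distributed values (with mean eta**beta), so the exponential
    pooling methods above apply to the transformed data."""
    return [x ** beta for x in lifetimes]

def mle_eta(lifetimes, beta):
    """Maximum likelihood estimate of the characteristic life eta
    for a complete sample when beta is treated as known."""
    n = len(lifetimes)
    return (sum(x ** beta for x in lifetimes) / n) ** (1.0 / beta)

# Hypothetical miles-to-failure data with an assumed wear-out shape:
data = [9_000.0, 12_500.0, 15_000.0]
transformed = weibull_to_exponential(data, 2.0)
eta_hat = mle_eta(data, 2.0)
```

For censored data or for a sensitivity analysis over several candidate values of beta, the same transform is applied before whatever exponential-data method is used.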
Here again, the underlying assumption that only η varies and not β must be assessed.

For a sequence of failures of repairable systems, the distribution of the times between failures of a particular system component often depends not only on the nature of the repair or component replacement but also on the general state of the system, which, in turn, may also involve the specifics of maintenance actions carried out over time. Even so, it may be possible to model component lifetime distributions as a function of related explanatory variables.

An alternative method for modeling reliability data from repairable systems is to use a stochastic process model for events in time. Such a process can be characterized by representing the failure intensity as a function of variables such as the age of the system, the environment in which the system operates, and other changes as they occur over that system life. Such models are especially useful when modeling system reliability and availability and when tracking costs of repair and operation. An extensive treatment of the relevant issues can be found in Ascher and Feingold (1984), Meeker and Escobar (1998, Chapter 16), and Nelson (2003).

Industrial Experience and Stress Testing for Reliability Assurance

Increased market competition has resulted in widespread cost cutting, which increases the likelihood of reliability problems by reducing the ability to build in traditionally large factors of safety. These issues have driven some manufacturers to use new methods of manufacturing and reliability modeling, assessment, and improvement, taking advantage of new technologies. Examples include monolithic (as opposed to built-up) structures, accelerated testing, robust design, computer modeling, importance sampling in fault tree analyses, increasing reliability through redundant system design, probabilistic design, and structured programs for design for reliability, such as design for six sigma.

Reliability practices and procedures differ from industry to industry and from company to company within an industry, and often remain proprietary, especially with respect to the development of models that can be used to more effectively predict reliability without having to do expensive physical testing.

In a reliability assurance program, the overall goal is system reliability, generally determined by past product experience and benchmarking against best-in-the-industry competitors or by a marketing need to have a warranty period of a certain length of time. Metrics used include percent of returns within the warranty period or average warranty costs per unit sold.
Failure modes and effects analysis (FMEA) and reliability block diagrams are used to quantify the relationships between the system, subsystems, components, interfaces, and potential environmental effects; these quantified relationships are referred to as the reliability model. To meet the overall reliability goal, a reliability budget is developed to allocate reliability goals to different subsystems.

For example, in the aircraft industry a 10^-9 risk for a critical subsystem failure is often used as the targeted goal to maintain the industry "standard" of one critical aircraft failure in about 10^6 to 10^7 flights and the assumption that there are about 100 such subsystems to monitor. However, such 10^-9 risk goals are usually established through modeling, since real experience on this order is not attainable. Furthermore, such risk levels are often not accompanied by confidence bounds that reflect the uncertainty of any data utilized in such an analysis. This is due partly to the difficulty of achieving even an estimated 10^-9 risk goal and also to the problem of reconciling two such disparate risks, namely 10^-9 and the 5 percent chance of missing the target with the confidence bound. Even if the reliability for aircraft as high as 1 - 10^-6 or 1 - 10^-7 per flight is the currently tolerated level, there are industrywide efforts under way to significantly increase this reliability level because of the anticipated growth in airline travel. At a constant accident rate the public acceptance of the resultant growth in the number of accidents is not a given. Each industry has its own considerations and sensitivities in budgeting such subsystem reliabilities; for instance, major recalls in the automobile manufacturing industry are not uncommon and can be very costly.

Inputs to reliability models, including associated uncertainties, need to be determined. Assuming the same or similar environmental conditions, previous experience with particular materials and components can be used directly; examples include experiences codified in MIL-HDBK-5 and MIL-HDBK-17 (handbooks for metals and composite materials) through A- and B-allowables, with 95 percent lower confidence bounds on the 1 percent and 10 percent points of the strength distribution for a given material. Because of the wide acceptance of allocating reliability as a concept in structural design, these allowables have found use in nonstructural arenas as well. Computer modeling, along with appropriate physical testing to verify the accuracy of the model, can often be used to provide needed information on component reliability.
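The budgeting arithmetic above is easy to check directly (a sketch assuming independent critical subsystems in series; the numbers are the illustrative ones from the text):

```python
def series_system_risk(subsystem_risks):
    """Per-flight failure probability of a series system of
    independent subsystems: one minus the product of the
    subsystem survival probabilities 1 - p_i."""
    survive = 1.0
    for p in subsystem_risks:
        survive *= (1.0 - p)
    return 1.0 - survive

# 100 critical subsystems, each budgeted at a 1e-9 per-flight risk:
p_system = series_system_risk([1e-9] * 100)
# roughly 1e-7, i.e., about one critical failure in 10 million flights
```

The series assumption is the conservative one used in such budgets: any critical subsystem failure is counted as a system failure.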
The multitude of factors involved and the occasionally high cost of simulation runs have led to an entire subfield of design and analysis of computer experiments.

Adjustments are made to critical components in each subsystem in order to meet subsystem reliability goals. Testing of a small number of prototype subsystem units at higher than typical use conditions can be done in order to discover weaknesses. These tests represent a kind of accelerated testing, which can take various forms, some of which are described in McLean (2000). When new failure modes or weaknesses are discovered, design changes should be considered, albeit with the understanding that failure modes generated in the test might never occur in actual operation and that money spent on design changes might therefore be wasted. Another risk is that some failure modes revealed by the accelerated testing could mask other failure modes that might not appear during the accelerated testing and thus remain undetected and uncorrected.

COMBINING INFORMATION IN PRACTICE 51

After the complete system is assembled, it may be necessary to conduct durability tests for certain parts of it. In some cases, this is done economically by testing a small number of systems or nearly complete systems using continuous-use testing or rapid cycling, as appropriate. Separate tests may have to be conducted to excite different failure modes; for example, in automobile engine testing there is a standard test using a continuous-run protocol and another that uses a start-stop-start protocol. While it is feasible and effective to use up-front testing of components and subsystems to assess their reliability characteristics, the same is not usually true for major systems whose reliability goals and costs are very high.

Methods of strenuous testing of early production units are often employed to discover reliability problems before large quantities of product have been shipped. For example, manufacturers of washing machines may have an arrangement with laundromats, and automobile manufacturers may track fleets of early production units with friendly customers. In both cases, the manufacturers track warranty returns to learn as early as possible about problems so that they can be corrected.

Another example is the staggered entry into service of new aircraft, for which the timing and location of first fatigue cracks or corrosion are carefully recorded so that succeeding aircraft of the same type can be examined and maintained more aggressively; thus past experience is used to indicate which areas to monitor for cracks and corrosion. For such an approach to be effective, proper maintenance schedules must be followed, incorporating any knowledge of cracks and corrosion or other wear of materials, while also allowing for the probability of nondetection during an inspection.
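The allowance for nondetection can be made concrete: if each inspection independently finds an existing crack with probability p, the chance that the crack survives k inspections undetected is (1 − p)^k. A sketch with a hypothetical per-inspection detection probability:

```python
import math

def prob_missed(p_detect, k):
    """Probability that a crack present from the outset is missed by
    all of k independent inspections, each detecting it with
    probability p_detect."""
    return (1.0 - p_detect) ** k

def inspections_needed(p_detect, target):
    """Smallest number of inspections that drives the miss probability
    below `target`."""
    return math.ceil(math.log(target) / math.log(1.0 - p_detect))

miss_after_3 = prob_missed(0.90, 3)          # (0.1)**3 = 0.001
n_needed = inspections_needed(0.90, 5e-4)    # 4 inspections at 90% each
```

This independence assumption is optimistic in practice (the same crack geometry that defeats one inspection tends to defeat the next), so maintenance planning often uses more conservative detection curves.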
When sufficient information is not available from other sources, physical testing (e.g., accelerated life or durability tests) may have to be conducted. If adequate physical testing cannot be done, then uncertainties may be addressed through the use of design safety factors, although this practice lacks scientific rigor. Usually such tests involve samples whose size is constrained by costs, and the possible variability underlying the test (because of the small sample size) is absorbed or accounted for by increasing the reliability by a factor (derived mainly from engineering experience) that is considered acceptable.

While use of design safety factors is an example of combining information (test results with engineering judgment or industry culture), such factors are difficult to rely on since they have no probabilistic interpretation. Sometimes they are intended to implicitly account for the "unknown unknowns" (or UNKUNK) and appear to offer insurance for unforeseen contingencies. However, the degree to which this is true has not been examined empirically, which does not support this use or interpretation.

Field experience typically validates the use of design safety factors, albeit conservatively, because systems designed according to safety factors often satisfy their reliability requirements. Furthermore, this design process has also had additional benefits. For example, there have been incidents where aircraft were stressed far beyond design loads and survived with just the wings bent out of shape, and in one case this led to improved aerodynamic wing properties. Such success stories have led to a strong resistance to change among some members of the engineering design community.

But while safety factors may be cheap during design, they often increase both the purchase cost and the costs that accrue during the lifetime of the product. In the aircraft industry, limited checks on the possibility of overdesign are carried out when a new wing design is statically loaded until it breaks, the aim being that the strength of the wing not exceed the design value by more than is necessary. Similar cyclic dynamic tests examine a new aircraft frame for fatigue failures. Because such factors typically have no known associated reliability, a major analysis of them based on analytical probabilistic design models and experience in the field could have long-range benefits.
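The kind of analytical probabilistic design model suggested here can be illustrated with a simple stress-strength interference calculation, which shows why a bare safety factor carries no fixed reliability: the same factor implies very different reliabilities depending on the scatter in load and strength. The normal distributions and coefficients of variation below are hypothetical:

```python
import math
from statistics import NormalDist

def reliability_from_safety_factor(sf, cov_strength, cov_load):
    """P(strength > load) when strength and load are independent normals,
    mean strength = sf * mean load, and cov_* are coefficients of
    variation (sigma / mean)."""
    mu_load = 1.0                  # normalize mean load to 1
    mu_strength = sf * mu_load
    sigma = math.hypot(mu_strength * cov_strength, mu_load * cov_load)
    return NormalDist().cdf((mu_strength - mu_load) / sigma)

# The same safety factor of 1.5 under tight vs. loose scatter:
r_tight = reliability_from_safety_factor(1.5, 0.05, 0.05)  # ~0.99999998
r_loose = reliability_from_safety_factor(1.5, 0.20, 0.20)  # ~0.92
```

A factor of 1.5 is therefore nearly risk-free for well-characterized materials and loads but leaves a failure probability of several percent when both vary widely, which is precisely the point that an unanalyzed safety factor obscures.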

