**Suggested Citation:** "Analysis of Variance Methods for Sensitivity Analysis and External Validation." National Research Council. 1991.

*Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers*. Washington, DC: The National Academies Press. doi: 10.17226/1853.

A VALIDATION EXPERIMENT WITH TRIM2

Analysis of Variance Methods for Sensitivity Analysis and External Validation

The variability observed for these projections has a distinct structure, which can be analyzed by using analysis of variance, since the 16 observations for every response follow a 4×2×2 factorial design with one replication per cell. The dependent variable was the projection from each version of TRIM2 minus the adjusted or unadjusted 1983 baseline estimate (depending on whether that version of TRIM2 used adjusted or unadjusted data), which gives an estimate of change. In addition, to standardize the analysis, we subtracted the comparison value for estimates of change from each dependent variable (i.e., we subtracted the difference between the 1987 and 1983 IQCS values). The independent variables were the main effects of aging (four types), adjustment (on or off), and months (current or old), and all two-way interactions were included in the model; the three-way interactions were implicitly assumed to be equal to zero. The objective was to test for the significance of the main effects, to measure the relative size of the main effects, and to test for the significance of the interactions.

Notationally, the model was as follows:

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + \varepsilon_{ijk}$$

for $i=1,2$; $j=1,2$; $k=1,\ldots,4$; where $y_{ijk}$ is the dependent variable and $\varepsilon_{ijk}$ is the error term; where $\mu$ denotes an intercept term; $\alpha_i$ is the main effect due to use of either adjusted or unadjusted data, $\beta_j$ is the main effect due to use of either MONTHS or old MONTHS, and $\gamma_k$ is the main effect due to use of one of the four aging routines; where $(\alpha\beta)_{ij}$, $(\alpha\gamma)_{ik}$, and $(\beta\gamma)_{jk}$ are interaction effects; and where it is assumed that $\sum_i \alpha_i = \sum_j \beta_j = \sum_k \gamma_k = 0$, and $\sum_j (\alpha\beta)_{ij} = 0$ for all $i$, $\sum_i (\alpha\beta)_{ij} = 0$ for all $j$, and similarly for the remaining interaction effects.

Since the comparison value was subtracted out, effects that bring the model closer to zero are acting to "improve" it. Therefore, we examined which main effects had this property.
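As an illustration of the model just described (an intercept, three main effects, and all two-way interactions under sum-to-zero constraints), the following is a minimal Python sketch of how the effects could be estimated for a balanced 2×2×4 design with one observation per cell. The function name `fit_anova`, the index ordering, and the responses are hypothetical; the data are invented additive values, not TRIM2 output. In a balanced design the least-squares estimates reduce to differences of marginal means, which is what the sketch computes.

```python
from itertools import product

ADJ = (1, 2)          # i: adjusted vs. unadjusted data (which level is 1 is an assumption)
MONTHS = (1, 2)       # j: current vs. old MONTHS (ordering is an assumption)
AGING = (1, 2, 3, 4)  # k: the four aging routines


def avg(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def fit_anova(y):
    """Estimate effects for responses y[(i, j, k)] under sum-to-zero constraints."""
    mu = avg(y.values())
    # Main effects: marginal mean minus grand mean.
    alpha = {i: avg(y[i, j, k] for j in MONTHS for k in AGING) - mu for i in ADJ}
    beta = {j: avg(y[i, j, k] for i in ADJ for k in AGING) - mu for j in MONTHS}
    gamma = {k: avg(y[i, j, k] for i in ADJ for j in MONTHS) - mu for k in AGING}
    # Two-way interactions: two-way marginal mean minus grand mean and main effects.
    ab = {(i, j): avg(y[i, j, k] for k in AGING) - mu - alpha[i] - beta[j]
          for i in ADJ for j in MONTHS}
    ag = {(i, k): avg(y[i, j, k] for j in MONTHS) - mu - alpha[i] - gamma[k]
          for i in ADJ for k in AGING}
    bg = {(j, k): avg(y[i, j, k] for i in ADJ) - mu - beta[j] - gamma[k]
          for j in MONTHS for k in AGING}
    # Residuals: whatever the omitted three-way interaction would have absorbed.
    resid = {
        (i, j, k): y[i, j, k] - (mu + alpha[i] + beta[j] + gamma[k]
                                 + ab[i, j] + ag[i, k] + bg[j, k])
        for i, j, k in product(ADJ, MONTHS, AGING)
    }
    # Error degrees of freedom: 15 total - 5 (main effects) - 7 (two-way) = 3.
    mse = sum(r * r for r in resid.values()) / 3
    return {"mu": mu, "alpha": alpha, "beta": beta, "gamma": gamma,
            "ab": ab, "ag": ag, "bg": bg, "mse": mse}


# Hypothetical, purely additive responses (NOT TRIM2 output).
true_alpha = {1: -0.1, 2: 0.1}
true_beta = {1: 0.2, 2: -0.2}
true_gamma = {1: 0.3, 2: -0.9, 3: 0.2, 4: 0.4}
y = {(i, j, k): 2.0 + true_alpha[i] + true_beta[j] + true_gamma[k]
     for i, j, k in product(ADJ, MONTHS, AGING)}
fit = fit_anova(y)
```

Because the invented responses contain no interaction and no noise, the fitted interaction terms and the mean square error should be numerically zero, and the main effects should match the values used to build `y`. Significance tests would compare each effect's mean square with `mse` on 3 error degrees of freedom.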
The complete analysis, summarized in Table 4, thus has the joint benefits of both a sensitivity analysis and an external validation.

To better understand Table 4, let us discuss the results presented in the first row for "race of head: percent black non-Hispanic." The first column presents the estimated intercept term, which is the average of the dependent variable over all 16 versions. The value 2.16 indicates that the 16 estimates were, on average, about 2 percentage points higher than the comparison value (i.e., the 16 estimates showed an average decline of 0.68 percentage point compared with an "actual" decline of 2.84 percentage points [obtained from Table 3] in the percentage of black non-Hispanic heads in the AFDC caseload between 1983 and 1987). Next, the four main effects for aging are provided. For no aging, the 0.29

TABLE 4 Analysis of Variance for Estimates of Change

| Response variable | Overall mean | Aging (1, 2, 3, 4) | Adjustment | Months | Significant interactions | Closest to comparison value |
|---|---|---|---|---|---|---|
| Race of head (% black non-Hispanic) | 2.16 | {0.29, −0.86, 0.18, 0.40}** | −0.11** | 0.12** | None | Aging (2) |
| Earnings of adults (% none) | 2.41 | {−0.22, −0.22, −0.19, 0.62}** | −0.08* | −0.64** | None | None |
| Marital status of head (% no spouse) | 0.02 | {−2.04, −0.94, 1.33, 1.65}** | −0.27** | −0.01 | Adj × Aging | NA |
| Sex of head (% female) | 0.03 | {−1.70, −1.36, 1.34, 1.71}** | −0.27** | 0.05** | Adj × Aging | NA |
| Age of head (% <20) | 0.12 | {−0.34, −0.21, 0.24, 0.29}** | −0.08** | 0.03** | None | NA |
| Type of AFDC unit (% basic) | 0.42 | {−1.94, −1.02, 1.34, 1.62}** | −0.27** | −0.02 | None | Adjustment |
| No. of adults (% 2) | −0.81 | {1.82, 1.12, −1.26, −1.67}** | 0.24** | 0.03* | Adj × Aging | Aging (2) |
| No. of children (% >2) | 0.63 | {−0.86, 0.26, 0.35, 0.24}** | −0.11** | 0.08** | Adj × Aging | Aging (1) |
| Total no. in unit (% >3) | 0.30 | {0.08, 0.82, −0.29, −0.61}** | 0.02 | 0.08** | None | Aging (3) |
| Age of youngest child (% <5) | −1.62 | {0.90, −0.66, −0.20, −0.06}** | 0.28** | 0.09* | None | Aging (1) |
| Total participants (thousands) | −81.5 | {50.5, 214.3, −94.5, −170.3}** | 27.9** | −21.5** | Adj × Aging, Adj × Months | Aging (1) |
| Total participants with earnings (thousands) | −86.4 | {10.9, 23.4, −0.1, −34.1}** | 4.8** | 19.8** | None | None |
| Total benefits ($ million) | 38.5 | {188.5, 955.3, −427.3, −716.5}** | 105.3** | −106.9** | None | None |

NOTES: The symbols ** and * denote effects that are collectively significant at the 0.01 level and significant at the 0.05 level, respectively. NA, not applicable.

indicates that the four cases that used this aging method averaged an additional 0.29 percentage point higher than the overall mean (i.e., their average was 2.45 percentage points). The −0.11 for adjustment indicates that the eight responses using adjusted data averaged 0.11 percentage point lower than the overall mean. The 0.12 for months indicates that the eight responses using the current MONTHS module averaged 0.12 percentage point higher than the overall mean. (The main effects of using unadjusted data and old MONTHS are simply the opposite of the effects of using adjusted data and MONTHS, respectively.) The symbols ** and * in Table 4 denote effects that are collectively significant at the 0.01 level and significant at the 0.05 level, respectively. In this case all three types of main effects are significant at the 0.01 level. Interactions that are significant at the 0.01 level appear in the second-to-last column. For race of head of unit, there were no significant interactions.

The last column lists the effects that were significant at the 0.01 level and that would bring the overall mean at least 33 percent closer to zero because they are comparable in size and opposite in sign to the overall mean. This was examined only for overall means that were substantially different from zero. The hope was that a main effect associated with a superior alternative module would often push the overall mean toward zero and that, as a result, such a main effect would appear often in this column. For race of head of unit, demographic aging had an estimated main effect of −0.86, which partially counteracts the estimated intercept of 2.16. (In other words, the four cases that used this form of aging had a mean substantially closer to zero than the mean of all 16 versions.) More explicitly, the mean of the 16 versions minus the comparison value is 2.16.
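The last-column rule can be sketched in a few lines of Python. This is one reading of the rule as stated in the text, not the authors' procedure: the function name `closest_to_comparison`, the dictionary layout, and the level labels are hypothetical, and the separate screen for overall means "substantially different from zero" (the NA rows) is not implemented because the text gives no cutoff for it. The numbers are the first-row entries of Table 4.

```python
def closest_to_comparison(overall_mean, effects, significant, threshold=0.33):
    """Return (factor, level, fitted value) for the fitted value nearest zero.

    effects:     {factor: {level: estimated main effect}}
    significant: set of factors whose main effect is significant at the 0.01 level
    Returns None unless the winner is at least `threshold` closer to zero
    than the overall mean.
    """
    best = None
    for factor, levels in effects.items():
        if factor not in significant:
            continue  # only effects significant at the 0.01 level qualify
        for level, effect in levels.items():
            fitted = overall_mean + effect
            if best is None or abs(fitted) < abs(best[2]):
                best = (factor, level, fitted)
    if best is not None and abs(best[2]) <= (1 - threshold) * abs(overall_mean):
        return best
    return None


# Race of head (% black non-Hispanic), first row of Table 4.
race_effects = {
    "aging": {1: 0.29, 2: -0.86, 3: 0.18, 4: 0.40},
    "adjustment": {"adjusted": -0.11, "unadjusted": 0.11},
    "months": {"current": 0.12, "old": -0.12},
}
result = closest_to_comparison(2.16, race_effects,
                               {"aging", "adjustment", "months"})
```

For this row the sketch selects aging routine 2, whose fitted value of about 1.30 is the nearest to zero and falls below 67 percent of 2.16, in agreement with the "Aging (2)" entry in the table.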
Taking each main effect individually, we obtain the following fitted values using the four aging effects, the two adjustment effects, and the two months effects, respectively. Notationally, these fitted values are $\hat{\mu}+\hat{\gamma}_k$ ($k=1,\ldots,4$), $\hat{\mu}+\hat{\alpha}_i$ ($i=1,2$), and $\hat{\mu}+\hat{\beta}_j$ ($j=1,2$). The associated values are 2.45, 1.30, 2.34, 2.56, 2.05, 2.27, 2.28, and 2.04. The closest value to zero is 1.30, and it is more than 33 percent closer to zero than the original value, 2.16. In addition, the associated main effect, aging, is significant at the 0.01 level. Therefore the last column indicates that aging (2) is closest to the comparison value.

An examination of Table 4 provides some evidence for the following tentative conclusions. First, aging clearly has a significant impact on model output for the variety of responses studied. The decision whether to use adjusted data also makes an obvious difference and, to a lesser degree, so does the decision whether to use MONTHS, with its use having a significant impact on some of the variables and not on others. If this analysis, when repeated, gave similar results in comparable situations, it would be reasonable to direct attention to research on aging modules first, with adjustment second, and algorithms for allocating income on a monthly basis third. The only significant interaction

is that of aging with adjustment, and this is present only intermittently. (We note that for a similar analysis of level there was no evidence of interaction.) Therefore, the effects of changing individual modules can often be analyzed separately. In addition, when interaction effects are not significant, this frees up more degrees of freedom for estimating the mean square error, thus increasing the reliability of the inference for the main effects. Finally, there is no module change that generally brings the average over the models closer to the comparison values. There is some justification for concern about the advisability of using full aging since it never appears in the last column of Table 4. Similarly, there is no evidence that either the use of adjusted data or the use of MONTHS has any clear advantage.

It is important to distinguish between the various types of variability that arise in the model validation problem, and how they relate to errors in the analysis of variance model, since this has a bearing on the question of whether a model's estimates are significantly different from a set of comparison values. (For the present discussion we will assume the comparison values are fixed, though there is no complication if they are random as long as the variance of the comparison values is estimable.) Output from a model can be thought of as subject to intermodel variance (variance due to the choice of various alternatives for certain components) and intramodel variance (variance due to use of replicate sample input data sets or, more generally, replicate modeling situations). Intermodel variance is important to assess in order to understand how much variability in the outputs is due to the choice of various alternative components, which in turn directs the analyst to those areas of the model most in need of investigation.
Intramodel variance is important to assess in order to create hypothesis tests and confidence intervals for model output. While some analysts consider both intramodel and intermodel variance to be part of the overall variance of a model's estimates (for more discussion of this, see Cohen, Chapter 6 in this volume), many statisticians consider only intramodel variance as contributing to the variance of estimates.

In the analysis of variance model used above, the variance was decomposed into contributions from the main effects, interaction terms, and error. The mean square error resulted from a combination of lack of fit of the analysis-of-variance model and the sampling variability of the estimates from each model. In addition, since each model was run on the same input data set (before adjustment), the contribution of sampling variability is likely reduced due to correlations between the output from similar models. Therefore, the mean square error cannot be used to create confidence intervals for measuring agreement between estimates and comparison values. However, the size of the main effects, interactions, and error can be compared to the distances between model estimates and the comparison values to examine such questions as whether all important components were included in the sensitivity analysis; whether interactions were ignorable, with the result that the component alternatives can

be investigated individually; and which component alternatives influence the model toward the comparison values.

A more general model that permits the estimation of both types of variability would need replicate data sets or situations. The resulting model would be a two-way analysis of variance with mixed effects: fixed effects for the model versions and random effects for the replications. With just one model version but with replications, t tests can be used on the differences between estimates and comparison values to measure whether the differences are significantly different from zero, since replications provide a natural way of measuring the variability in each model's estimates. With many model versions, the variability due to replications (intramodel variance) and that due to use of alternative components (intermodel variance) can both be identified and used: intramodel variance to form confidence intervals of desired coverage probability for each model version's estimate, and intermodel variance to determine model fit, significance of interactions, etc.

Finally, even if we had been able to compute confidence intervals to identify model estimates that were not significantly different from comparison values, it was also of value to identify alternative modules that resulted in model estimates that, while significantly different, were closer to the comparison values than the grand mean. In other words, there is value in knowing which alternatives "are headed in the right direction." This is the objective of the last column of Table 4.

Since there are three degrees of freedom associated with the main effect for aging, the sum of squares for aging can be further subdivided into components that represent the sum of squares contributed by up to three orthogonal contrasts or any number of subdivisions from any number of nonorthogonal contrasts.
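To make the subdivision concrete, the sketch below splits the aging sum of squares into three orthogonal single-degree-of-freedom contrasts and confirms numerically that the pieces add up. It uses the four fitted aging values quoted in the text for race of head (2.45, 1.30, 2.34, 2.56), each the mean of four runs. The Helmert-style coefficients are one possible orthogonal set chosen for illustration, not necessarily the contrasts the authors examined, and the function name `contrast_ss` is hypothetical. Treating aging (1) as "no aging" follows the text's description of the 0.29 effect.

```python
def contrast_ss(level_means, coeffs, n_per_level):
    """Sum of squares (1 df) for a contrast among equally replicated level means."""
    estimate = sum(c * m for c, m in zip(coeffs, level_means))
    return n_per_level * estimate ** 2 / sum(c * c for c in coeffs)


aging_means = [2.45, 1.30, 2.34, 2.56]  # fitted values for aging routines 1-4
n = 4                                   # 16 runs divided among 4 aging routines

# Sum of squares for the aging main effect (3 df).
grand = sum(aging_means) / len(aging_means)
ss_aging = n * sum((m - grand) ** 2 for m in aging_means)

# One orthogonal set of contrasts (illustrative choice).
helmert = [
    (3, -1, -1, -1),  # no aging vs. the average of the three aging routines
    (0, 2, -1, -1),   # routine 2 vs. the average of routines 3 and 4
    (0, 0, 1, -1),    # routine 3 vs. routine 4
]
parts = [contrast_ss(aging_means, c, n) for c in helmert]
```

Each entry of `parts` would be divided by the mean square error to give an F statistic with one numerator degree of freedom. With nonorthogonal contrasts the individual sums of squares no longer add up to `ss_aging`.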
By examining the significance of these individual contrasts, more can be learned about individual variation from the use of various aging routines. Some contrasts that are readily interpreted are as follows:

$$\gamma_1-\gamma_2,\quad \gamma_1-\gamma_3,\quad \gamma_1-\gamma_4,\quad \gamma_2-\gamma_3,\quad \gamma_2-\gamma_4,\quad \gamma_3-\gamma_4,\quad \gamma_1-\tfrac{1}{3}(\gamma_2+\gamma_3+\gamma_4)$$

The first six contrasts help identify the source of a significant main effect for aging by examining the significance of pairwise differences. The seventh contrast can be interpreted as a global assessment of the difference between not aging and aging. Of course, the usual cautions about simultaneous inference must be observed since these contrasts are not all orthogonal. Various methods can be used to avoid this problem (e.g., Scheffé's method).

Implicit in some of the above analysis is the interpretation that the overall mean represents the central tendency of predictions for the current state of knowledge. (This is not the only interpretation of the overall mean, and much of the above analysis is independent of any interpretation of the overall mean. For instance, another interpretation of the overall mean is simply a generalized intercept term.) However, the current state of knowledge can be argued to be the current version of TRIM2, version 1 in the experiment, and not the mean of