9
Validation

As far as I can see, in all uses of models for policy purposes…there is no confidence or error band…But we know that no model is correct. It couldn't be; there are so many factors omitted, even under ideal circumstances…Whether or not these remarks are fair, they lead to the questions of validation. How do we validate these relationships?…Forecasting or hindcasting is a way of validating a whole system. One would like, really, to be able to validate individual relations as well because, if the whole system doesn't work well, it is necessary to know where the repair job is needed. This problem requires a methodological discussion which I have not yet seen.

Kenneth Arrow (1980:260,263)

The purpose of microsimulation modeling, as of any policy analysis tool, is to produce useful estimates for the legislative process. "Useful" connotes many things, but of prime importance in the panel's view is that the estimates be reasonably accurate. It seems obvious to say that better estimates are preferable to worse estimates, particularly when millions, often billions, of dollars ride on decisions to implement one or another program change. Yet agencies have typically underinvested in model validation, and microsimulation models have not been exempt from this pattern of underinvestment. Indeed, it is arguable that microsimulation models have received even less attention in terms of validation than have other model types—not so much due to deliberate oversight as to the particularly daunting nature of the task.



Unlike, for example, cell-based population projection models, which have relatively few components, microsimulation models are typically very large and complex, which greatly complicates the job of validation. Unlike macroeconomic models, which produce unconditional forecasts that can be and frequently are checked against reality, microsimulation models produce conditional estimates that are more difficult to measure against reality.

On the positive side, however, validation of microsimulation models is simplified because in many cases their focus is on estimating differences among policy alternatives. The typical application of microsimulation models is to estimate the total costs, caseloads, and caseload characteristics for the current program and for one or more alternative programs. Therefore, the focus is not on levels of costs or caseloads for an alternative program but on differences between the alternative and the current program. This focus means that validation exercises can perhaps safely ignore some of the factors on which the estimates would ordinarily be conditional: a misspecified module or parameter that wrongly affects both a current program and alternative programs in the same way may have little effect on the estimates of differences. Thus, fewer factors are probably "active" for any given problem, facilitating model validation.

Still, an analyst cannot assume that particular factors, such as the macroeconomic or demographic forecasts used in the simulations, have little or no impact on the estimates of differences between current and alternative programs. Such a determination has to be carefully considered. And microsimulation models are also asked to estimate costs and caseloads of entirely new programs. In these instances, the interest is not in a difference but in an estimated level, which will be affected by all of the operative factors in the simulation. Thus, in general, one should take estimates of levels from microsimulation models to be less certain than estimates of differences.

Because microsimulation models typically provide a variety of outputs, including aggregate estimates of costs and caseloads (in terms of both levels and percentage changes in comparison with current programs) and detailed estimates of the effects on population subgroups, the question of an appropriate loss function for an evaluation is a difficult one. For example, should one use as the evaluation criterion the minimization of error in estimates of levels of total costs or caseloads, percentage changes for a key population characteristic, or some other criterion?

In this chapter we first note the kinds of corroborative activities, such as assessing model outputs against the analyst's view of the world or against outputs from other similar models, that, to date, have constituted the bulk of the effort devoted to microsimulation model validation. Then, we briefly review three basic types of validation studies—external validation, sensitivity analysis, and variance estimation—and note some issues specific to their use in validating microsimulation models. We next discuss briefly the use of loss functions in conjunction with microsimulation output.

We then review the handful of studies that have attempted to carry out validations of microsimulation models and draw what limited conclusions we can from these studies about the quality of current models. Next, we describe a validation study of the TRIM2 model that the panel commissioned as part of its work. This study, our "validation experiment," started out as a sensitivity analysis of three components of the TRIM2 model: the procedures used for aging the database, the procedures used for generating monthly income and employment variables from the annual values on the March CPS, and the use of the standard TRIM2 March CPS database versus the use of a database corrected for undercoverage of the population. In addition, we turned the experiment into an external validation study by having the TRIM2 model use its 1983 database to simulate AFDC program costs and caseloads according to the rules in effect in 1987. This approach enabled us to compare the TRIM2 results with actual administrative data for 1987. Our experiment was very limited because we examined only three components of one model for only one time period. However, we learned something about elements of TRIM2 that merit further evaluation, and, more important, our experience offers an example of the kind of analysis that should be a regular feature of the microsimulation modeling enterprise.1 In the final section of this chapter we discuss strategies for microsimulation model validation and present our recommendations for policy analysis agencies.

1 We did not single out TRIM2 because of any belief that it was much worse—or better—than other models. The primary sponsor for our study, ASPE, expressed interest in such an evaluation of TRIM2 and made available contractor and computing resources for the work. We would have liked to include other models as well but lacked sufficient resources to overcome various problems, such as different initial databases.

CORROBORATION AS A STAND-IN FOR VALIDATION

We do not want to convey the impression that policy analysis agencies and modeling contractors have been insensitive to the need for validation of model output; indeed, in a number of areas, they have worked hard to ensure the accuracy of the models. In particular, they have typically devoted considerable resources to "debugging" activities designed to verify the accuracy of the computer code and a model's representation of the detailed accounting rules for the various programs within its purview. One debugging technique that is used is to identify a test sample of households, with complex structures or characteristics that relate to special programmatic features, and to check the records for these households—variable by variable—before and after simulations are run to ensure that the program specifications have been properly implemented. A related technique that could be used is to print the records for a relatively small number of cases that are outliers for benefits or some other characteristic of interest so that the computations for those cases can be checked.

Debugging activities are important to continue and to improve, when possible, given the large volume of complex code in microsimulation models and the correspondingly increased probability of coding errors.

Microsimulation model analysts also typically engage in other activities that are designed to flag errors and problems in the model estimates. These activities, which amount to seeking to corroborate model results rather than to validate them, include:

- assessing model output against the analyst's view of the world, which is often informed by many years of experience (as a simple example, a simulated increase in benefits that produced smaller rather than larger numbers of beneficiaries compared with current law would act as a red flag to the analyst for further investigation);

- comparing model output with the analyst's "back-of-the-envelope" estimates (e.g., comparing TRIM2 projections for mandating the unemployed-parent program with the result of using for all states a simple ratio of existing unemployed-parent to total AFDC caseloads; alternatively, TRIM2 projections might be compared with those from a simple time-series model relating AFDC caseloads to a few key variables, such as the unemployment rate); and

- comparing model output with output from other similar models or from the same model run by another agency.

The last corroboration activity is an important mechanism for identifying and correcting problems with model estimates during the course of policy debates. Typically, two or more agencies—such as ASPE and CBO—are preparing estimates of the cost and distributional effects of proposed changes. Communication channels among analysts in the various agencies are usually fairly open, and analysts will compare their estimates. If there are large discrepancies, there will be an effort to determine the source, which often stems from different assumptions but may also stem from problems with one or another of the models.

One example of the role of this type of corroborative activity from the history of the Family Support Act concerns the estimates for extending the unemployed-parent program to all 50 states (mentioned above). Analysts at CBO noted that the TRIM2 estimates used by ASPE for the states not previously covering unemployed parents also showed increases in the basic caseload in these states; the CBO estimates for the basic caseload, which were derived from an AFDC benefit-calculator model, did not show comparable increases. Investigation determined that the participation algorithm used in TRIM2 contains a parameter for the generosity of the state in which the eligible program unit resides. The existence of an unemployed-parent program serves as the proxy to distinguish between more and less generous states. Hence, mandating coverage of unemployed parents in all 50 states resulted in raising participation rates for other kinds of eligible units in the states not previously covering unemployed parents.

But this is not a plausible outcome if one assumes that mandated coverage does not indicate program generosity in the way that elective coverage does.

Useful as these kinds of activities are, they are not a substitute for rigorous external validation studies and sensitivity analyses of model outputs. Indeed, there are grave dangers in relying on corroboration alone, even when the policy analysts are very knowledgeable and careful in debugging their code and checking their results with others. It is possible that the collective wisdom of the entire policy analysis community is simply in error.

One example of an erroneous assumption related to AFDC policy concerns participation rates. The conventional wisdom for years held that the basic AFDC program, after experiencing enormous growth in the caseload in the 1960s and early 1970s, had saturated the eligible population. Overall participation rates simulated by the major income support program models regularly exceeded 90 percent. However, a marked drop in the simulated participation rate—to about 80 percent—occurred after 1980. Investigation determined that the primary cause was what appeared to be a minor change in the coding of family relationships, which the Census Bureau implemented beginning with the March 1981 CPS. The result was to add about 1 million subfamilies who were eligible for the AFDC program but exhibited lower-than-average participation rates (see Ruggles and Michel, 1987).2 As a result of this evaluation, analysts had to revise their views of the participation behavior of the population eligible for AFDC and hence their expectations of the "reasonable" participation rates that were simulated by the models. This is but one example of the need for rigorous validation, not just corroboration, of microsimulation models.

2 As another example (but with the opposite substantive effect), the use of more detailed data from SIPP on asset holdings and other variables to calculate food stamp participation rates has resulted in higher participation estimates compared with estimates based on the March CPS (see Doyle, 1990).

TECHNIQUES OF MODEL VALIDATION

External Validation

External validation of a model is a comparison of the estimates provided by the model against "the truth"—that is, against values furnished by administrative records or other sources that are considered to represent a standard for comparison. Several factors complicate, although they do not preclude, the task of externally validating the output of microsimulation models.
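In practice, the comparison itself is simple arithmetic once comparison values are in hand. The sketch below (in Python; the output names, the two-output structure, and all numbers are our own invention for illustration, not drawn from any actual model or administrative file) computes the error and percentage error of simulated aggregates against administrative values:

    # Hypothetical external validation: compare simulated aggregates
    # with administrative comparison values. All numbers are invented.
    admin = {"caseload": 3_784_000, "total_benefits": 16_300_000_000}
    simulated = {"caseload": 3_611_000, "total_benefits": 15_400_000_000}

    def validation_report(simulated, admin):
        """Report error and percentage error for each output that the
        model and the administrative source have in common."""
        for key in sorted(simulated.keys() & admin.keys()):
            err = simulated[key] - admin[key]
            pct = 100.0 * err / admin[key]
            print(f"{key}: simulated={simulated[key]:,} admin={admin[key]:,} "
                  f"error={err:,} ({pct:+.1f}%)")

    validation_report(simulated, admin)

The hard part, as the discussion below makes clear, is not this arithmetic but constructing model runs and comparison values that are actually comparable.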

First, as noted above, these models produce conditional estimates. If the social or policy environment changes and thus is different from the assumptions on which the estimates are based, it will not be surprising that the estimates differ from the reality. If a model simulated the actual program that was enacted but other factors, such as the economic environment, changed, it may require considerable effort to respecify an appropriate set of model runs that are conditional on the correct factors for comparison. More frequently, a model did not simulate the particular policy alternative that was actually enacted, and so direct comparisons are not possible. Instead, ex post forecasting or backcasting techniques must be used. Ex post forecasting is the use of an archived data set from several years ago with an archived (or current) model to predict the program in use a short time ago. In backcasting, one takes the current model and database and simulates program provisions that were operative at some period in the past.

Another factor complicating external validation studies of microsimulation models concerns the appropriate time period for comparison, which may not always be clear. Because most models produce estimates of direct effects, the comparison period needs to be after the program changes are fully implemented but before any feedback effects could be expected to show up.

Finally, the measures of "truth" that are used for the model comparisons are likely to present several problems. Like the microsimulation estimates themselves, they may be subject to both sampling and nonsampling errors. For example, detailed information on the AFDC caseload is based on a sample rather than on the complete set of administrative records. The available comparison values may also lack the cross-tabular detail needed for a full evaluation: such detail is critical given that a raison d'être for microsimulation is its ability to provide estimates of distributions as well as aggregates.

Despite these difficulties, it is crucial to carry out external validation of microsimulation model estimates on a regular, systematic basis. It remains the best method for measuring the total uncertainty in a model, especially when the modeling environment is fairly stable and one can conduct several studies assessing the performance of a model for different time periods and policies. Of course, the information so obtained will not alone suffice to guide policy makers or funding agencies. Because external validation is an ex post operation, by definition it cannot give policy makers the real-time information they need about the quality of the estimates being produced for a current legislative debate. However, backcasting techniques may help policy analysts assess the likely quality of current estimates, as may experience with a model that has a long track record of external validation (both forecasting and backcasting).

Another limitation of the results of external validation as a direct guide for policy makers is the fact that differences between model estimates and comparison values will include chance variation. Therefore, providing a complete picture of the model's performance requires generating a set of results that allows distinguishing among underlying sources of error. Further, because of likely practical limitations on the character and extent of external validation studies, they are unlikely to provide specific information about areas of needed improvement. For this type of information, one must turn to sensitivity analysis.

Sensitivity Analysis

Sensitivity analysis is a technique that measures the effects on model outputs of alternative choices about model structure by replacing one or more existing modules (or the entire model) with reasonable alternative modules. The variability of the resulting estimates helps gauge the model's susceptibility to bias3 in the estimates due to model misspecification—one of four sources of variability.4 The resulting observed variability is not directly useful as a variance estimate because there is no indication of the true value. However, if the various alternatives used are equally plausible (or nearly so, given the current state of knowledge), the resulting range of the output estimates will provide some information on the variability in the output that could be attributed to misspecification of that component. Importantly, if there is an indication that the model is performing poorly, a sensitivity analysis should help to identify those component alternatives that make a difference and therefore should help direct the search for components that need to be improved (possibly by being replaced by one of the tested alternatives). This is one way in which a feedback loop can be constructed to identify and remedy model defects.

3 The term bias is difficult to define precisely in this context: the general meaning here is the difference between the central tendency and the truth.

4 The four sources are sampling variability in the primary database; sampling variability in other input sources; errors in the primary database and other input sources; and errors from model misspecification. The concept of mean square error properly includes all four sources, but in practice almost always ignores the fourth source and often the third source as well (see the Appendix to Part I).

Two key requirements for the use of sensitivity analysis are that alternative specifications of components exist and that the algorithms and software to implement them be easy to obtain and use. The first requirement, the existence of alternative modules, is not a serious problem: within the framework of microsimulation, there appear to be many candidates for experimentation, such as alternative functional forms for participation equations or alternative methods of calibration. A more important problem is that many alternatives may be difficult to implement. The panel's validation experiment involved a simple sensitivity analysis of TRIM2. Of the three modules investigated—distributing incomes on a monthly basis, treating undercoverage, and making use of various procedures for aging—competing methodologies for the first two modules were easily programmed, but the third, alternative aging techniques, required a large investment of resources. This component of TRIM2 was not specifically designed to accommodate a sensitivity analysis—a situation not, of course, peculiar to TRIM2, or more generally to microsimulation models. In addition, the proprietary nature of some microsimulation models restricts the free exchange of model components.
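The mechanics of module substitution can be sketched compactly. In the fragment below (Python; the module names, functional forms, and numbers are hypothetical stand-ins of our own, not TRIM2 components), each interchangeable module is a function, every combination of alternatives is run, and the spread of the resulting estimates indicates how much the choice of modules matters:

    from itertools import product

    # Hypothetical alternative modules. Real modules would operate on
    # microdata records; these stand-ins only illustrate the bookkeeping.
    monthly_allocators = {"uniform":  lambda y: y / 12.0,
                          "seasonal": lambda y: y / 12.0 * 1.07}
    aging_methods = {"static":  lambda x: x * 1.02,
                     "dynamic": lambda x: x * 1.05}

    annual_total = 1_000_000.0  # invented input

    # Run all module combinations and record each output estimate.
    results = {}
    for (m_name, monthly), (a_name, aging) in product(
            monthly_allocators.items(), aging_methods.items()):
        results[(m_name, a_name)] = aging(monthly(annual_total))

    spread = max(results.values()) - min(results.values())
    for combo, estimate in sorted(results.items()):
        print(combo, f"{estimate:,.0f}")
    print(f"range attributable to module choice: {spread:,.0f}")

When the modules act roughly additively, a table of such runs is easy to read; the interactions discussed next are what make real analyses harder.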

Interpreting the results of a sensitivity analysis is clearly easier when the alternative components have little interaction. Sensitivity analysis is simply more feasible if the effect of joint modifications of the model's components on the outputs can be decomposed into essentially additive effects from changes to individual components. Because one is often interested in the effect of several factors on model estimates and it is expensive to investigate one component at a time, one usually modifies several factors simultaneously. If there is little interaction between components, it is easier to identify the factors that had the greatest impact on different output estimates. Whether or not it is possible to separate out different components depends, of course, on the characteristics of the underlying model. The panel's validation experiment found some interaction among the effects of adjusting the three modules of TRIM2 we investigated, which complicates the interpretation of the findings. It remains to be seen whether this degree of interaction is common to microsimulation models generally.

Sensitivity analysis and external validation have different advantages and disadvantages. An advantage of sensitivity analysis over external validation is that it can be done during model development and use, that is, during the decision-making process; external validation, by definition, must be conducted after the fact. However, a sensitivity analysis cannot by itself identify which components are working well, because there are no comparison values. On the other hand, external validation cannot usually identify specific weaknesses in individual model components because, when the entire model is tested against the comparison values, sufficient information on the components is difficult to generate.

Both sensitivity analysis and external validation can benefit from simultaneous application. By testing a variety of model alternatives through the use of various component alternatives, and by using comparison values to identify which model versions produce estimates closer to or further from those values, one can identify superior combinations of components. If the components have little interaction, it becomes easier to identify the combination of alternatives that is performing best (for that situation) and more feasible to study a larger set of factors in a limited number of model runs. If the components do interact, however, simultaneous modification is the only way of identifying the factors that need to be jointly improved, even though the identification process is complicated by the presence of interaction.

Another way of learning about deficiencies in a microsimulation model, related to sensitivity analysis, is to make use of completely different modeling approaches to the entire problem, rather than exchanging individual components. It is clear that, for specific outputs, analysts using macroeconomic models, cell-based models, or other approaches could produce estimates with error properties that are competitive with the estimates produced by microsimulation models. Many of these models would be relatively inexpensive to implement, and they could be very effective in helping to determine the likely variability in the estimates produced by microsimulation and in diagnosing areas of weakness.

When replacing an entire model, it is harder to determine precisely which component is causing a discrepancy; however, experienced analysts can often pinpoint the likely source.

Variance Estimation

Because the input data set is one sample out of a family of equally likely input data sets, the output of a microsimulation model, potentially estimated on each member of this family, has an associated variance. In simpler modeling contexts, such as multiple regression analysis, statistical theory has produced methods of directly estimating the variance in estimates arising from sampling. In the complicated world of microsimulation modeling, however, no such theory is available. In this discussion of the measurement of this variance, we focus on one specific technique, the bootstrap (see Efron, 1982). However, it is important to point out that most of the techniques in the general literature on sample reuse or nonparametric variance estimation are applicable; for a full discussion of these techniques, see Cohen (Chapter 6, in Volume II).

The idea behind the bootstrap is as follows. The variance of an estimator is a function of the average squared distance between an estimate (a function of the sample) and the true value (a function of the population). Because data are not available for the entire population, the relative frequencies of different estimates, needed to evaluate the average squared distance, are not known. However, if there is a relatively large sample, the observed relative frequencies of different estimates obtained by evaluating the estimator on pseudosamples drawn from the observed sample (treated as the population) will approximate the unknown relative frequencies obtained by sampling from the entire population and evaluating the estimator for those samples. Hence, a possibly good estimate, even for small samples, of the average squared distance between an estimate and the truth is the average squared distance between estimates derived from repeated sampling, with replacement, from the original sample and the estimate from the original sample, which then serves as a proxy for the truth.

The actual application of bootstrap techniques to microsimulation modeling is complicated and rests on many choices that are difficult to make in the abstract. The underlying theoretical and practical aspects are discussed in Cohen (Chapter 6, in Volume II). We are convinced at this stage that the bootstrap or another of the currently available sample reuse techniques can be used to estimate sampling variances for outputs of microsimulation models. In fact, Wolfson and Rowe (1990) provide an example in which they used the bootstrap to estimate the variance of estimates from the SPSD/M model.
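The mechanics are easy to sketch for a toy model. The following fragment (Python; the "model" and its input data are invented for illustration and are not an excerpt from any microsimulation system) bootstraps the sampling variance of a simulated aggregate by rerunning the estimator on pseudosamples drawn with replacement from the original records:

    import random

    random.seed(12345)

    # Invented input database: each record is (weight, annual income).
    records = [(1.0, random.lognormvariate(10, 0.75)) for _ in range(2000)]

    def simulate_total_benefits(sample):
        """Stand-in for a microsimulation run: benefits fill 40 percent
        of the gap between income and a hypothetical eligibility cutoff."""
        cutoff = 30_000.0
        return sum(w * max(0.0, cutoff - y) * 0.4 for w, y in sample)

    point_estimate = simulate_total_benefits(records)

    # Bootstrap: re-evaluate the estimator on pseudosamples drawn with
    # replacement from the original sample.
    B = 200
    replicates = []
    for _ in range(B):
        pseudo = [random.choice(records) for _ in range(len(records))]
        replicates.append(simulate_total_benefits(pseudo))

    mean_rep = sum(replicates) / B
    variance = sum((r - mean_rep) ** 2 for r in replicates) / (B - 1)
    print(f"estimate: {point_estimate:,.0f}")
    print(f"bootstrap standard error: {variance ** 0.5:,.0f}")

For a production model, each replicate would require a complete model run, which is exactly why cost figures so prominently in the discussion that follows.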

These bootstrap variance estimates, however, will directly measure only the first component of variability of model estimates, that is, the sampling variability resulting from using one rather than another database. This component may well be the least important source of error in model outputs (although the relative magnitude of the various sources of uncertainty is currently unknown). The concept of the bootstrap could be extended to measure the second component as well, that is, the variability in the model outputs resulting from sampling variability in the many other database inputs, such as imputations for variables not contained on the primary database, control totals from macroeconomic forecasts or population projection models, and behavioral response equations. Each of these inputs contains variability from being based on one rather than another database. To extend the bootstrap in this respect would be a complicated process. One would need to have pseudosamples for the primary database before modification or augmentation with other sources. One would also need to have distributions for each input—such as a series of child care expense imputation equations with varying coefficients—developed by applying the bootstrap (or possibly a less exacting procedure) to the Consumer Expenditure Survey or other data on which the equations are based.

It may be possible to use sensitivity analysis in conjunction with bootstrap resampling as a more feasible way to develop estimates of total model uncertainty. Roughly speaking, one could use a bootstrap to measure the variance and a sensitivity analysis to weakly measure bias. Thus, one could use sensitivity analysis alone or together with the bootstrap to evaluate the contribution of errors in the primary database or in other sources, such as undercoverage of the target population, inappropriate imputation methodologies to overcome nonresponse, and misreporting of key variables.

In Chapter 5 we recommend that originating agencies, such as the Census Bureau, play a more active role in preparing databases that are suitable for microsimulation modeling and other forms of policy analysis. This role should include more vigorous investigation and correction of data quality problems. In our view, it is not appropriate or cost-effective to ask microsimulation modelers to take on this burden. However, the modelers can make a contribution by using their expertise to identify the data quality problems that may have the most import for modeling and by conducting some limited evaluations to provide feedback to the originating agencies.

Loss Functions

The estimates produced by microsimulation models are in reality quite complicated, providing a wealth of detailed information about the distribution of outcomes from any policy alteration. This in itself presents challenges in evaluating the validity of any modeling effort. It is certain to be the case that a modeling effort—characterized by choices of modules, data sources, and the like—will do better in producing estimates for certain outcomes than for others. Moreover, alterations in a model to improve the estimates in one area, such as AFDC participation, could actually worsen the estimates in another area, such as the estimates of benefits transferred to different types of families.

The full analysis of models, and the feedback to further development, requires that the producers and users of estimates specify the relative importance of accuracy with respect to the different estimates produced in any given effort. This specification can be characterized by a "loss function." The loss function provides a quantitative measure of the importance of errors of different types. For example, if the model were invoked solely to estimate the total costs of an AFDC option and if policy makers were increasingly unhappy the farther the estimate was from the true cost, the loss function might be posed simply as the square of the distance between the estimate and the true answer. There are many possible alternative specifications of loss functions that could be employed. In the previous example, if the policy makers were more distressed by underestimates of the budget implications than by overestimates, a nonsymmetric loss function might be appropriate. Alternatively, other loss functions could be defined in terms of the deviations of the estimates from certain aspects of the overall distribution of program effects.
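To fix ideas, the sketch below (Python; the weights and dollar figures are arbitrary illustrations of our own) writes down the squared-error loss just described and one possible nonsymmetric alternative that charges more for underestimating costs than for overestimating them:

    def squared_loss(estimate, truth):
        # Over- and underestimates of equal size are equally bad.
        return (estimate - truth) ** 2

    def asymmetric_loss(estimate, truth, under_weight=3.0, over_weight=1.0):
        # Underestimating program costs is penalized three times as
        # heavily as overestimating them (the weights are arbitrary).
        gap = estimate - truth
        return -under_weight * gap if gap < 0 else over_weight * gap

    true_cost = 10.0  # billions of dollars; invented
    for est in (9.0, 11.0):
        print(est, squared_loss(est, true_cost), asymmetric_loss(est, true_cost))
    # Squared loss treats 9.0 and 11.0 identically; the asymmetric loss
    # charges 3.0 for the underestimate but only 1.0 for the overestimate.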

Past work has seldom employed explicit loss functions. This is unfortunate, because it leads to considerable ambiguity in evaluating model outcomes. In the context of formal estimation of the variance of model outputs, the lack of a well-defined loss function presents severe problems. Therefore, one aspect of future work on model validation should involve developing more explicit notions of an appropriate loss function. This work will require direct interaction between users and producers of policy analysis. An extra benefit of being explicit about the loss function is the considerable guidance thereby gained for any model development process. The loss function provides direct information for analysts about which aspects of a modeling endeavor are most in need of attention and possible modification.

REVIEW OF VALIDATION STUDIES

As part of our examination of the state of microsimulation model validation, we searched for and collected validation studies for two purposes: to obtain information on the performance of the models currently in use and to gather examples of the methods that others have used for microsimulation validation. Although the literature review that we commissioned was not fully comprehensive, we are reasonably certain that we have not missed any major validation studies of microsimulation models.5

5 The discussion here summarizes the material in Cohen (Chapter 7, in Volume II).

We found only 13 validation studies of microsimulation models. When one considers that microsimulation modeling techniques have been used for policy analysis for over 20 years, and that during this time at least 6 major

TABLE 9-4 χ2 Goodness-of-Fit Statistics for Distributions, from TRIM2 Validation Experiment

                                             Run Identification
Variable                              1    2    3    4    5    6    7    8
Total no. in unit                    26   37   28   28   27   36   28   27
No. of adults                        10   27   39   45   11   27   41   45
No. of children                       2    5    6    6    2    4    5    5
Age of youngest child                11   14   12   12   11   16   13   12
Gross income of unit                568  508  461  396  506  456  423  362
Earnings of adults                   15   19   18    8   14   26   23   17
Type of AFDC unit                    15    3   13   15   17    3   12   14
Race of head                          4    9    2    2    6   10    2    2
Sex of head                          28   19    1    0   31   21    1    0
Age of head                          30   47   45   48   29   49   45   46
Relationship of unit head to
  household head                    1.2  0.9  0.9  0.9  1.2  0.9  0.9  0.9
Marital status of head                1    1   15   20    1    2   16   19
Size of benefit                      85  114  101  106   87  116  108  109

Variable                              9   10   11   12   13   14   15   16  IQCS83  D.F.*
Total no. in unit                    48   19   40   40   49   49   42   41      73      4
No. of adults                        33    8   45   48   35   39   48   51     438      2
No. of children                       4    6    6    6    6    5    6    6      41      3
Age of youngest child                12   10    5    5   10    7    6    6      77      4
Gross income of unit                436  508  362  310  419  412  338  287     259      8
Earnings of adults                   16   38    8   13   20   24   20   29     426      8
Type of AFDC unit                     9    0    5    6   12    7    5    6      75      2
Race of head                          7   16    6    8    5    5    5    6     493      3
Sex of head                          14    3    0    0   20   19    0    0      48      1
Age of head                         115  101   40   40   32   38   39   40     272      7
Relationship of unit head to
  household head                    1.3  1.1  0.9  0.9  1.2  0.9  1.0  1.0       0      4
Marital status of head                1    4   20   23    0    1   21   24      57      1
Size of benefit                      50   85  120  119   99  127  121  119    2015      7

* D.F. indicates degrees of freedom. The χ2 values at the 99 percent confidence limit are as follows (a higher value in the table indicates that a model version differs from the 1987 IQCS by an amount greater than one could expect by chance):

D.F. = 1, χ2 = 6.635    D.F. = 5, χ2 = 15.086
D.F. = 2, χ2 = 9.210    D.F. = 6, χ2 = 16.812
D.F. = 3, χ2 = 11.341   D.F. = 7, χ2 = 18.475
D.F. = 4, χ2 = 13.277   D.F. = 8, χ2 = 20.090

Therefore, for these variables, model 1 was one of the more successful models for approximating the IQCS data. At the same time, model 1 has a relatively high χ2 for the variables gross income of unit, type of unit, and sex of head of household. This lack of general superiority or inferiority is true for all 16 model versions.

The IQCS83 column of Table 9-4 displays the results from using the 1983 IQCS, which assumes that the characteristics of the caseload in 1987 remained unchanged from those in 1983. Under some circumstances, the comparison values for the beginning of a period provide an interesting challenge to the model versions. If a study examines a situation in which a substantial policy change has occurred, the performance of the old IQCS data provides a standard that a reasonable model should exceed, namely, the model should do better than an estimate based on the assumption of no change over the period. Our experiment (as discussed further below) examined a situation in which the policy change was modest, but economic changes in the period were relatively large. Under these circumstances, the comparison is less important because the noise is in some sense too large a fraction of the signal. Nevertheless, the 1983 IQCS data set does not compete well with the 16 versions of TRIM2 in the analysis (shown in Table 9-4), but it does outperform many TRIM2 versions in estimating total participants and other aggregates (see Cohen et al., in Volume II). In many situations, this type of comparison is extremely informative in providing a naive estimate of how well one can do with a very simple model. Also, this comparison provides an estimate of how much variability is natural to the problem, which can be compared with the variability left unexplained by the model versions.
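The critical values in the note to Table 9-4 can be reproduced (to the rounding of the original) with standard statistical software; a brief sketch, assuming SciPy is available:

    from scipy.stats import chi2

    # 99th-percentile chi-square critical values, as in the note to
    # Table 9-4: a statistic above the threshold for its degrees of
    # freedom indicates a discrepancy larger than chance would produce.
    for df in range(1, 9):
        print(f"D.F. = {df}, chi-square = {chi2.ppf(0.99, df):.3f}")

    # Example: for "No. of adults" (D.F. = 2), run 1's statistic of 10
    # exceeds 9.210, while run 10's statistic of 8 does not.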

Limitations of the Experiment

Our experiment was designed both to illustrate the types of methods that can be used to validate a model's outputs and to provide some indication of the current performance of TRIM2. With respect to the second of these goals, the experiment is limited in a variety of ways.8

8 Indeed, the limitations of the experiment, including that only one time period was examined, constrained our analysis to emphasize primarily descriptive rather than inferential techniques and interpretations of the data. However, inferential analysis, including hypothesis testing, with the goal of identifying differences and patterns that are statistically significant versus those that are not (which is not often possible with the descriptive approaches that we used), has a great deal to offer when the number of replications increases. We certainly encourage the use of inferential techniques, such as nonparametric analysis of variance (see, e.g., Lehmann, 1975), and anticipate that expertise with respect to which models and techniques are most applicable will follow as experience is gained with these types of validation studies.

First, the experiment studied TRIM2 during only one time period. With respect to estimating error properties, even in a replicable situation, one replication is very limiting: a model can perform better than it would on average, or worse. Moreover, forecasting situations should not be considered replicates without further investigation. Different time periods will typically present different challenges to a model. In particular, any characteristics of either the 1984 or 1988 March CPS data, or of the period under study, that are peculiar to those data sets or to that time period reduce the opportunity for generalizing from our results.

With regard to peculiarities of the data, an analysis conducted as part of the experiment showed that simulations of 1987 law using the 1983 baseline file sometimes outperformed simulations of 1987 law using the 1987 baseline file. This finding triggered a more extensive analysis of the quality of the March 1988 and March 1984 CPS data (see Giannarelli, 1990). That analysis documented that the March CPS files typically include some states that have insufficient simulated units eligible for AFDC compared with administrative counts of participants—a phenomenon that complicates the calibration effort. It turned out that the March 1988 CPS had an unusually large number of such states (8), which made the calibration in that year less successful than usual. The number of simulated eligible units dropped by large percentages in some states from the previous year (e.g., by 29 percent in Connecticut, 22 percent in Michigan, and 32 percent in New Mexico). Although sampling variability in the CPS appears to explain some of these changes, Michigan's drop remains something of a mystery. It is important to point out that, although microsimulation modelers are aware of quality problems with the March CPS data, they do not regularly investigate changes in quality from year to year. Hence, we have an example of how validation can lead to more pointed validation and to identification of problems for further investigation and possible correction.

The period from 1983 to 1987 was also special because of the large drop in the unemployment rate, from 8 percent in 1983 to 5 percent in 1987, and the differential impact of the change in unemployment on different subpopulations.

We note that the changes in welfare program regulations between 1983 and 1987 were relatively minor, which limits our findings to periods when there are few changes in law. The limitation imposed by having only one time replication is somewhat offset by having outputs for several characteristics that focus on fairly distinct portions of the model and by having outputs for states that can serve as mini models (for which we have not done an extensive analysis). Clearly, though, there are circumstances that would cause a model to perform poorly for all states for a single time period but would not be indicative of the model's overall efficacy, due to the likely correlation among states for some response variables. The limitations resulting from examining TRIM2 at only one time help demonstrate that validation should not be an occasional examination of a model. Rather, model validation should be a continuous process that accumulates knowledge about potential model weaknesses, the size of errors, and situations in which a model is less reliable.

Another limitation of our experiment was that, in addition to simulating just one time period, we simulated just one change in law. Had time and resources permitted, we would have liked to simulate an alternative policy—for example, a mandated AFDC minimum benefit nationwide—that would have represented a major policy change. Of course, we could not have conducted an external validation of such an alternative policy because it was never enacted. However, we could have conducted sensitivity analyses of both policies and obtained information relevant to the question of whether, in fact, microsimulation model estimates of differences between two policies are less variable than estimates for a particular policy, because of sources of error affecting both policies to about the same extent.

Yet another limitation of our experiment (noted above) is that the surrogate for the truth that we used throughout may have weaknesses. First, the IQCS data are from a sample survey and are therefore subject to sampling error. They are also subject to bias from a number of sources. For example, different states have different collection procedures for their quality control data, which may lead to different kinds of biases across states. In our experiment, we ignored the problems raised by the use of an imperfect surrogate for the truth. Whenever feasible, analysts should search for the sources of major discrepancies between model estimates and the comparison values in both data systems. At the same time, analysts should ignore discrepancies between model estimates and the comparison values that are smaller than what would be explained by ordinary sampling variability.

Another limitation of the experiment was that we examined only one model, due to the time and resource constraints under which we, the sponsoring agencies, and the agencies' contractors operated. Ideally, it would have been desirable to expand the experiment to include other major models in use today, such as MATH, HITSM, DYNASIM2, PRISM, and MRPIS.

However, we should note that these models are not all directly comparable because they are often used for different purposes and, by design, cannot necessarily all produce estimates of the same variables. Therefore, adding the model class as another factor in an analysis of variance, for example, may not generally be feasible. However, in cases in which models are directly comparable, one could certainly expand our basic experiment in that direction. If one could overcome the problems of comparability, there are major advantages to be gained through comparing the effectiveness of different broad modeling strategies. For example, the question of the relative advantages of static versus dynamic aging could be addressed in this way. Comparing several models with the truth (or a reasonable surrogate), along with a comprehensive analysis of model differences exemplified by Haveman and Lacker (1984), will often yield great insight into the strengths and weaknesses of the various models.

Major Conclusions

The panel's experiment was successful in demonstrating that microsimulation model validation is feasible. With current methods, analysts can measure the degree of variability attributable to the use of alternative components. Such information helps indicate overall model uncertainty, as well as which components to examine further to make improvements to a model. Thus, sensitivity analysis methods, especially when augmented with comparison values in an external validation, provide a great deal of data with which to direct efforts at model development as well as to measure model uncertainty.

Our experiment demonstrated that there is considerable uncertainty due to changes in the three modules we studied in TRIM2. Therefore, the choice of which model version to use makes a difference. Yet it is not clear that any of the 16 versions has any advantage over the others. Certainly, for individual responses, particular versions fared better. However, given that the experiment is only one replication, it would be foolish to assume that our results provide confirmation of any real modeling advantage. Because the experiment did not attempt to measure the variance of any of the versions of TRIM2, we have no idea of the relative sizes of various sources of uncertainty in relation to variance. Therefore, it is difficult to assign a priority to development of variance estimates vis-à-vis use of sensitivity analysis. We do believe that it is important to investigate both.

We stress that our experiment was purely illustrative. The benefits from a continued process of validation are rarely evidenced through study of a single situation. There is an important question about the degree to which different studies of this sort of the same model in different modeling situations would represent replications in any sense. However, even if the studies are not replications, use of these methods will provide evidence of general trends in model performance.

Their use will generate a great deal of information as to the situations under which a model performs well and provides accurate information.

While we have made a convincing case for the feasibility of sensitivity analyses and external validation, the experiment was not cheap. The Urban Institute estimated staff costs (including overhead) to conduct the experiment of about $60,000 for 1,400 person-hours of effort or, roughly, 35 person-weeks. These estimates are probably low because it was difficult for the Urban Institute to separate activities that were needed for the experiment from its own day-to-day efforts; in addition, they do not take into account the time taken to specify the experiment and analyze the data. Moreover, we do not have estimates of computer costs. Overall, it is clear that the way in which TRIM2 (and most other microsimulation models) is currently configured can make a sensitivity analysis very costly. The costs were dramatically affected by our interest in trying out different forms of aging. The overall cost would have been substantially reduced, possibly by a factor of 2 or 3, had an easier module been selected for the experiment. But there were other modules that we did not investigate because the costs of working with them would have been higher still. It is obvious that, for model validation to become a routine part of the model development and policy analysis process, the structure of the next generation of models must facilitate the type of module substitution that is used in sensitivity analysis.

In summary, our experiment gave mixed signals on the effectiveness of TRIM2. That TRIM2 is sensitive to the inclusion or exclusion of various factors is apparent. Our results suggest that, in some instances, nothing was gained by implementing TRIM2 rather than using the available IQCS data. However, in other instances, TRIM2 indeed provided very valid estimates. Our main goal was to show how one might undertake sensitivity and validity studies for microsimulation models. It is quite reasonable to speculate that similar studies on other microsimulation models will produce comparably mixed results, namely, that the model under study will prove to be useful for some variables but not as good as had been believed for others. This knowledge can only be valuable to the analysts using models to inform policy makers, as well as to those involved in making improvements to the models. We have made a small start toward this end.

STRATEGIES FOR VALIDATING MICROSIMULATION MODELS: RECOMMENDATIONS

Our validation study of TRIM2 illustrates both the benefits and the costs of serious attempts to investigate the quality of estimates produced by microsimulation models. The benefits, even in our very limited study, seem clear to us. We determined that TRIM2 estimates are sensitive to alternative choices for model components.

We also observed weaknesses in the March CPS database for modeling income support programs. These weaknesses in the March CPS were more or less well known, but our findings underscore the need to investigate them further and to take corrective action of some kind.

The costs of validation are also evident from our experiment, in terms of both time and resources. Indeed, the kinds of external validation studies, sensitivity analyses, and variance estimation procedures that we outline for microsimulation models may well appear to involve a dismayingly high expenditure of staff and budget resources, particularly in light of the limited resources that have been allocated to these activities in the past. Clearly, in the context of an ongoing policy debate, there is no possibility of applying such evaluations to even a fraction of the estimates and proposals that are modeled. However, we believe that the cost-benefit ratio for microsimulation model validation can be improved substantially through several mechanisms.

First, implementation of the next generation of models with new computer technology, as recommended in Chapter 7, should dramatically reduce the costs and increase the scope of feasible validation studies, particularly if the modeling software is designed—as it should be—with validation in mind. Second, the improvements in model documentation and archiving that we recommend (see Chapter 10) should make it easier to carry out validation studies, particularly external validations. Third, academic researchers should find model validation questions of considerable interest, particularly when they are able to access the models directly through new technology, and their work on validation methodology and applications should prove fruitful. Finally, we believe that, as policy analysis agencies and their contractors gain more experience with validation, the task will become easier and more rewarding, particularly when validation results prove helpful in making decisions about priorities for investment in models.

Although the greatest improvements in microsimulation modelers' ability to carry out validation studies will come with the implementation of new technology, we believe that more validation can and should be accomplished in the short term with the current models. We outline a set of institutional arrangements that we believe will facilitate cost-effective model validation in the near term. We also recommend agency support of research on model validation methods and agency adoption of the "quality profile" concept as a way of communicating information about model strengths and weaknesses to a broad user community and a way of organizing a continuing program of model validation targeted to priority areas for improvement. We urge policy analysis agencies to allocate the necessary resources and make the needed commitment so that validation becomes a regular part of the microsimulation modeling enterprise.

Institutional Arrangements for Model Validation

In formulating our recommendations for model validation, we took cognizance of the severe time pressures for producing estimates that characterize the policy analysis process. We also took cognizance of the relatively limited capabilities of current models for cost-effective validation. Hence, we do not suggest that each and every set of policy estimates be evaluated, either in real time or after the fact—such a recommendation would deservedly be ignored. We present instead a plan for a reasonable approach to the validation task.

In our view, major contracts for the development, maintenance, and application of microsimulation models should target a percentage of funds sufficient to carry out validation studies that can provide useful information to analysts and policy makers who are engaged in shaping legislation on a real-time basis. For those agencies that maintain and apply their own models in-house, rather than contracting for these services, the agency should allocate its own modeling budget in this way.9 These contracts should also include an allocation of funds earmarked to implement model revisions on the basis of the results of validation studies conducted by the modeling contractor and others.

9 About 10-15 percent of funds might suffice for validation activities on an ongoing basis. However, given the relative lack of investment in microsimulation model validation to date, it may be that the percentage of funds earmarked in major contracts for validation purposes should be higher until sufficient experience is gained with validation techniques.

With regard to specific types of validation, the contractor would be expected to provide estimates of variability (such as a bootstrap estimate) and the results of sensitivity analyses for key sets of model estimates, where an example of a "key set" might be the first set of estimates prepared for the initially proposed version of the Family Support Act. The contractor would not hold up delivery of these estimates until the validation was finished but would endeavor to complete the validation as soon as possible. The results of the validation for one set of estimates would help interpret the quality of the estimates for alternative proposals (unless major new provisions were added). The sensitivity analysis would focus on those model components the analysts believe are most likely to have an impact on the particular set of estimates. In other words, the validation performed by the contractor would be "rough and ready," focused on helping to inform the policy debate. (See Chapter 3 for a discussion of the issues involved in communicating validation results to decision makers.)

In addition to the validation efforts performed by the modeling contractor, we believe it is essential for policy analysis agencies to commission independent validation studies that include external validation as well as sensitivity analysis. In principle, independent evaluation is preferable to evaluation performed by the developer and user of a model (just as academic journals appoint independent reviewers for articles submitted for publication). In practice, independent evaluation is preferable as well, given the pressures confronting a modeling contractor to respond to insistent and frequently changing policy demands for large volumes of estimates prepared within short time frames.

In addition to the validation efforts performed by the modeling contractor, we believe it is essential for policy analysis agencies to commission independent validation studies that include external validation as well as sensitivity analysis. In principle, independent evaluation is preferable to evaluation performed by the developer and user of a model (just as academic journals appoint independent reviewers for articles submitted for publication). In practice, independent evaluation is preferable as well, given the pressures confronting a modeling contractor to respond to insistent and frequently changing policy demands for large volumes of estimates prepared within short time frames. Agency staff could carry out independent validation studies, but they, too, are usually under severe pressure from the demands of current policy issues.

Hence, our recommendation is that, for every major microsimulation model development, maintenance, and application contract that policy analysis agencies let, they also let another contract, to a separate organization, to carry out longer term, more comprehensive validation studies of the particular model(s). The validation contractor would be expected to carry out external validation studies of selected model estimates and to conduct extensive sensitivity analyses in order to identify areas for needed model improvement or revision (a minimal sketch of one such analysis follows Recommendation 9-1 below).

Implementing a program of independent evaluations will clearly entail working out a number of practical matters. There will need to be ways to guard against conflicts of interest, such as a validation contractor deliberately downgrading a model in order to boost the chances of that contractor's own model winning the next bidding round. Most important, there will need to be cost-effective ways of providing validation contractors with access to the models they are evaluating and to modeling experts, without impairing the ability of the modeling contractors to respond to agency needs for real-time policy analysis. One possible approach is for a knowledgeable programmer to bring a second copy of the model to work on-site with the validation contractor; alternatively, staff of the validation contractor could work on-site with the modeling contractor. We are confident that workable arrangements can be devised. Looking ahead, we note that implementation of a new generation of models with computer technology that facilitates their use should make it much easier to deal with these problems and to expand the scope of the validation studies that are feasible to perform. Enhanced documentation will also facilitate independent validation of the type that we describe.

Recommendation 9-1. We recommend that policy analysis agencies commit sufficient resources and accord high priority to studies validating the outputs of microsimulation models. Specifically, we recommend:

Agencies, in letting major contracts for development, maintenance, and application of microsimulation models, should allocate a percentage of resources for model validation and for revisions based on validation results. The types of validation studies to be carried out by the modeling contractor should include estimates of variance and focused sensitivity analyses of key sets of model outputs. The goal of these efforts should be to provide timely, rough-and-ready assessments of selected estimates that are important for informing current policy debates.

In addition, agencies, when practical, should let separate microsimulation model validation contracts to independent organizations or in other ways arrange to carry out comprehensive, in-depth evaluations. The types of studies to be performed by a validation contractor should include external validation studies that compare model outputs with other values, as well as detailed sensitivity analyses. The goal of these longer range efforts should be to identify priority areas for model improvement.
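As an illustration of what even a focused sensitivity analysis involves, the sketch below varies a single behavioral assumption, a hypothetical benefit take-up rate, across a plausible range and records how the headline cost estimate responds. As before, `simulate_program_cost` and its inputs are invented stand-ins for a real model and survey file.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

def simulate_program_cost(incomes, weights, take_up=0.85):
    """Hypothetical model with one explicit behavioral assumption:
    the fraction of eligible households that actually claim benefits."""
    benefits = np.where(incomes < 10_000, 0.5 * (10_000 - incomes), 0.0)
    return float(take_up * np.sum(weights * benefits))

n = 5_000
incomes = rng.lognormal(mean=9.5, sigma=0.7, size=n)
weights = np.full(n, 30_000_000 / n)

baseline = simulate_program_cost(incomes, weights)

# One-at-a-time sensitivity: sweep the take-up assumption over a plausible
# range and report the proportional swing in the key estimate.
for take_up in (0.70, 0.80, 0.85, 0.90, 1.00):
    cost = simulate_program_cost(incomes, weights, take_up=take_up)
    print(f"take-up {take_up:.2f}: cost {cost:,.0f} "
          f"({100 * (cost - baseline) / baseline:+.1f}% vs. baseline)")
```

A validation contractor would repeat this exercise for each model component suspected of driving the estimates; that breadth is what distinguishes the "detailed" analyses recommended here from the modeling contractor's rough-and-ready version.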

Research on Model Validation Methods

In addition to letting the kinds of validation contracts that we describe above, it would be useful for policy analysis agencies to support research specifically designed to develop improved methods for microsimulation model validation. For example, a useful topic for investigation would be ways to increase the cost-effectiveness of techniques such as the bootstrap for estimating the variance in model outputs (one candidate, the random-groups estimator, is sketched below). For this kind of work, the agencies could let separate methodological research grants to academic researchers. The agencies could also take steps to interest the National Science Foundation and perhaps the National Institute of Standards and Technology in supporting this type of research, which could well have application to the validation of complex models in other fields.

Such work should be attractive to researchers, although the difficulty of providing them with access to current microsimulation models is an impediment. In the short term, an effective strategy might be to support fellowships for researchers to carry out methodological work on-site with the agencies' modeling contractors. The fellowships could be similar to those currently offered by several federal statistical agencies, including the Bureau of Labor Statistics, the Census Bureau, and the National Center for Education Statistics. (These programs are supported by a combination of National Science Foundation and agency funds and are administered through the American Statistical Association.) Over the longer term, on the assumption that the next generation of models is successfully implemented with new computer technology, the agencies should find it quite easy to attract academic interest in the kinds of methodological work needed to improve model validation methods. Indeed, academic researchers would also be able to conduct model validations themselves.

Recommendation 9-2. We recommend that policy analysis agencies provide support, through such mechanisms as grants and fellowships, for research on improved methods for validating microsimulation model output.
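To suggest the flavor of such methodological research, here is a minimal sketch of the random-groups variance estimator, a standard survey-sampling device that requires only k model runs rather than the hundreds a naive bootstrap consumes. Whether its precision is adequate for microsimulation outputs is exactly the kind of question the recommended research would address; the model function is again an invented stand-in.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

def simulate_program_cost(incomes, weights):
    """Hypothetical stand-in for a microsimulation model."""
    benefits = np.where(incomes < 10_000, 0.5 * (10_000 - incomes), 0.0)
    return float(np.sum(weights * benefits))

n = 5_000
incomes = rng.lognormal(mean=9.5, sigma=0.7, size=n)
weights = np.full(n, 30_000_000 / n)

# Random groups: partition the input file into k random subsamples, run the
# model once per group (scaling the weights by k so each group still
# represents the full population), and estimate the variance from the
# spread of the k group estimates.
k = 10
groups = np.array_split(rng.permutation(n), k)
estimates = np.array(
    [simulate_program_cost(incomes[g], weights[g] * k) for g in groups]
)

theta_bar = estimates.mean()
variance = np.sum((estimates - theta_bar) ** 2) / (k * (k - 1))
print(f"mean of group estimates: {theta_bar:,.0f}  "
      f"random-groups s.e.: {np.sqrt(variance):,.0f}")
```

Ten model runs instead of two hundred is the entire attraction; the open research question is how much precision in the variance estimate is given up in exchange.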

Quality Profiles

Finally, as a way of organizing an ongoing, comprehensive program of validating microsimulation models and of communicating the results of validation studies to users, we urge policy analysis agencies to adopt a concept that is gaining ground in statistical agencies, namely, that of developing "quality profiles." A quality profile is a document that brings together all of the available information about sources of error that may affect estimates from a survey or other data collection effort (see Bailar, 1983, and the discussion in Chapter 5). The profile identifies measures and procedures for monitoring errors; assembles what is currently known about each source of error and its impact on the estimates; provides comparisons with estimates from other data sources; and outlines the research and experimentation needed to gain a better understanding of the sources of error and to develop techniques for reducing their magnitude. (A sketch of how these elements might be organized appears after Recommendation 9-3 below.)

Clearly, developing a profile for a microsimulation model is a much bigger task than developing one for a single survey; however, the effort to conceptualize the sources of error and to bring together what is known about them in a single document can be very helpful. Analysts who make use of the model output can benefit from the knowledge and caveats provided in a quality profile; model developers can use a profile to guide methodological work on understanding and reducing sources of error and to build a cumulative body of knowledge about their models' strengths and weaknesses.

Recommendation 9-3. We recommend that policy analysis agencies support the development of quality profiles for the major microsimulation models that they use. The profiles should list and describe sources of uncertainty and identify priorities for validation work.
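The chapter defines a quality profile as a document rather than a data format, but the elements it enumerates map naturally onto a simple machine-readable structure, which would also make profiles easier to keep current. The sketch below is one possible rendering; all field names and the example entry are illustrative, not prescribed by the panel.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSource:
    """One entry in a model quality profile, mirroring the elements the
    chapter lists: monitoring, current knowledge, external comparisons,
    and needed research."""
    name: str                  # e.g., an input-data or modeling weakness
    component: str             # model component or input file affected
    monitoring: list[str] = field(default_factory=list)
    known_impact: str = ""     # what is known about the effect on estimates
    external_comparisons: list[str] = field(default_factory=list)
    needed_research: list[str] = field(default_factory=list)

@dataclass
class QualityProfile:
    model_name: str
    sources_of_error: list[ErrorSource] = field(default_factory=list)

    def priorities(self) -> list[ErrorSource]:
        """Entries with open research questions: candidates for validation work."""
        return [s for s in self.sources_of_error if s.needed_research]

# Illustrative usage with a single hypothetical entry.
profile = QualityProfile(
    model_name="hypothetical income-support model",
    sources_of_error=[
        ErrorSource(
            name="survey income underreporting",
            component="March CPS input file",
            monitoring=["compare weighted income totals with administrative benchmarks"],
            known_impact="eligibility and costs likely overstated for transfer programs",
            external_comparisons=["program administrative records"],
            needed_research=["evaluate alternative income-correction procedures"],
        )
    ],
)
for source in profile.priorities():
    print(source.name, "->", source.needed_research)
```

Even this skeletal form enforces the discipline the chapter asks for: every known weakness is recorded alongside what is being done to monitor it and what research would reduce it.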