Suggested Citation:"9 Validation." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume I, Review and Recommendations. Washington, DC: The National Academies Press. doi: 10.17226/1835.

9
Validation

As far as I can see, in all uses of models for policy purposes…there is no confidence or error band…But we know that no model is correct. It couldn't be; there are so many factors omitted, even under ideal circumstances…Whether or not these remarks are fair, they lead to the questions of validation. How do we validate these relationships?…Forecasting or hindcasting is a way of validating a whole system. One would like, really, to be able to validate individual relations as well because, if the whole system doesn't work well, it is necessary to know where the repair job is needed. This problem requires a methodological discussion which I have not yet seen.

Kenneth Arrow (1980:260,263)

The purpose of microsimulation modeling, as of any policy analysis tool, is to produce useful estimates for the legislative process. "Useful" connotes many things, but of prime importance in the panel's view is that the estimates be reasonably accurate. It seems obvious to say that better estimates are preferable to worse estimates, particularly when millions, often billions, of dollars ride on decisions to implement one or another program change. Yet agencies have typically underinvested in model validation, and microsimulation models have not been exempt from this pattern of underinvestment. Indeed, it is arguable that microsimulation models have received even less attention in terms of validation than have other model types, not so much because of deliberate oversight as because of the particularly daunting nature of the task. Unlike, for example, cell-based population projection models, which have relatively few components, microsimulation models are typically very large and complex, which greatly complicates the job of validation. Unlike macroeconomic models, which produce unconditional forecasts that can be and are frequently checked against reality, microsimulation models produce conditional estimates that are more difficult to measure against reality.

On the positive side, however, validation of microsimulation models is simplified because in many cases their focus is on estimating differences among policy alternatives. The typical application of microsimulation models is to estimate the total costs, caseloads, and caseload characteristics for the current program and for one or more alternative programs. Therefore, the focus is not on levels of costs or caseloads for an alternative program but on differences between the alternative and the current program. This focus means that validation exercises can perhaps safely ignore some of the factors on which the estimates would ordinarily be conditional: a misspecified module or parameter that wrongly affects both a current program and alternative programs in the same way may have little effect on the estimates of differences. Thus, fewer factors are probably "active" for any given problem, facilitating model validation.
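
The cancellation argument above can be illustrated with a toy calculation. The sketch below is not drawn from any actual model: `simulate_cost`, the participation rate, the benefit amounts, and the 10 percent overcount of eligible units are all hypothetical. It assumes that the misspecification enters multiplicatively and identically under both programs; in that case the estimated levels are off by 10 percent while the estimated percentage change is unaffected.

```python
def simulate_cost(eligible_units, participation_rate, avg_benefit):
    """Toy program cost: eligible units x participation rate x average benefit."""
    return eligible_units * participation_rate * avg_benefit


def levels_and_change(eligible_units):
    """Return (current cost, alternative cost, relative change) for a toy reform."""
    current = simulate_cost(eligible_units, 0.80, 300.0)      # hypothetical current law
    alternative = simulate_cost(eligible_units, 0.80, 330.0)  # hypothetical benefit increase
    return current, alternative, alternative / current - 1.0


# "True" world versus a run whose eligibility module overstates eligible
# units by 10 percent under both the current and the alternative program.
true_current, true_alternative, true_pct = levels_and_change(1_000_000)
mis_current, mis_alternative, mis_pct = levels_and_change(1_100_000)

print("relative error in current-program level:", mis_current / true_current - 1.0)
print("error in estimated percentage change:   ", mis_pct - true_pct)
```

Under these assumptions the dollar difference between the two programs still carries the 10 percent error; only the relative change is protected. A misspecification that affects the two programs differently would not cancel at all.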

Still, an analyst cannot assume that particular factors, such as the macroeconomic or demographic forecasts used in the simulations, have little or no impact on the estimates of differences between a current and alternative programs. Such a determination has to be carefully considered. And microsimulation models are also asked to estimate costs and caseloads of entirely new programs. In these instances, the interest is not in a difference but in an estimated level, which will be affected by all of the operative factors in the simulation. Thus, in general, one should take estimates of levels from microsimulation models to be less certain than estimates of differences.

Because microsimulation models typically provide a variety of outputs, including aggregate estimates of costs and caseloads in terms of both levels and percentage changes, in comparison with current programs, and detailed estimates of the effects on population subgroups, the question of an appropriate loss function for an evaluation is a difficult one. For example, should one use as the evaluation criterion the minimization of error in estimates of levels of total costs or caseloads, percentage changes for a key population characteristic, or some other criterion?

In this chapter we first note the kinds of corroborative activities, such as assessing model outputs against the analyst's view of the world or against outputs from other similar models, that, to date, have constituted the bulk of the effort devoted to microsimulation model validation. Then, we briefly review three basic types of validation studies (external validation, sensitivity analysis, and variance estimation) and note some issues specific to their use in validating microsimulation models. We next discuss briefly the use of loss functions in conjunction with microsimulation output. We then review the handful of studies that have attempted to carry out validations of microsimulation models and draw what limited conclusions we can from these studies about the quality of current models.

Next, we describe a validation study of the TRIM2 model that the panel commissioned as part of its work. This study, our "validation experiment," started out as a sensitivity analysis of three components of the TRIM2 model: the procedures used for aging the database, the procedures used for generating monthly income and employment variables from the annual values on the March CPS, and the use of the standard TRIM2 March CPS database versus the use of a database corrected for undercoverage of the population. In addition, we turned the experiment into an external validation study, by having the TRIM2 model use its 1983 database to simulate AFDC program costs and caseloads according to the rules in effect for these programs in 1987. This approach enabled us to compare the TRIM2 results with actual administrative data for 1987. Our experiment was very limited because we examined only three components of one model for only one time period. However, we learned something about elements of TRIM2 that merit further evaluation, and, more important, our experience offers an example of the kind of analysis that should be a regular feature of the microsimulation modeling enterprise.¹

In the final section of this chapter we discuss strategies for microsimulation model validation and present our recommendations for policy analysis agencies.

CORROBORATION AS A STAND-IN FOR VALIDATION

We do not want to convey the impression that policy analysis agencies and modeling contractors have been insensitive to the need for validation of model output; indeed, in a number of areas, they have worked hard to ensure the accuracy of the models. In particular, they have typically devoted considerable resources to "debugging" activities designed to verify the accuracy of the computer code and a model's representation of the detailed accounting rules for the various programs within its purview. One debugging technique is to identify a test sample of households, with complex structures or characteristics that relate to special programmatic features, and to check the records for these households, variable by variable, before and after simulations are run to ensure that the program specifications have been properly implemented. A related technique that could be used is to print the records for a relatively small number of cases that are outliers for benefits or some other characteristic of interest so that the computations for those cases can be checked. Debugging activities are important to continue and to improve, when possible, given the large volume of complex code in microsimulation models and the correspondingly increased probability of coding errors.

¹	We did not single out TRIM2 because of any belief that it was much worse, or better, than other models. The primary sponsor for our study, ASPE, expressed interest in such an evaluation of TRIM2 and made available contractor and computing resources for the work. We would have liked to include other models as well but lacked sufficient resources to overcome various problems, such as different initial databases.

Microsimulation model analysts also typically engage in other activities that are designed to flag errors and problems in the model estimates. These activities, which amount to seeking to corroborate model results rather than to validate them, include:

assessing model output against the analyst's view of the world, which is often informed by many years of experience (as a simple example, a simulated increase in benefits that produced smaller rather than larger numbers of beneficiaries compared with current law would act as a red flag to the analyst for further investigation);
comparing model output with the analyst's "back-of-the-envelope" estimates (e.g., comparing TRIM2 projections for mandating the unemployed-parent program with the result of using for all states a simple ratio of existing unemployed-parent to total AFDC caseloads; alternatively, TRIM2 projections might be compared with those from a simple time-series model relating AFDC caseloads to a few key variables such as the unemployment rate); and
comparing model output with output from other similar models or from the same model run by another agency.
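
The "back-of-the-envelope" comparison in the second item can be sketched as follows. This is only one reading of the simple-ratio approach: the function names, the state figures, and the 25 percent tolerance are hypothetical, and we assume the ratio of unemployed-parent to total caseloads observed in covered states would also hold in newly covered states.

```python
def ratio_based_up_caseload(covered_states, uncovered_states):
    """Project unemployed-parent (UP) caseloads for states without the program."""
    up = sum(s["up_caseload"] for s in covered_states)
    total = sum(s["total_caseload"] for s in covered_states)
    ratio = up / total  # UP share of the total caseload in covered states
    basic_uncovered = sum(s["basic_caseload"] for s in uncovered_states)
    # Solve up_new / (basic + up_new) = ratio for up_new.
    return basic_uncovered * ratio / (1.0 - ratio)


def needs_investigation(model_estimate, envelope_estimate, tolerance=0.25):
    """Red-flag a model estimate that is far from the rough estimate."""
    return abs(model_estimate - envelope_estimate) > tolerance * envelope_estimate


# Made-up caseload figures, for illustration only.
covered = [
    {"up_caseload": 5_000, "total_caseload": 100_000},
    {"up_caseload": 3_000, "total_caseload": 60_000},
]
uncovered = [{"basic_caseload": 40_000}, {"basic_caseload": 20_000}]

envelope = ratio_based_up_caseload(covered, uncovered)
print("rough UP caseload in newly covered states:", round(envelope))
print("flag a model estimate of 6,000?", needs_investigation(6_000, envelope))
```

A flag of this kind says only that the two figures disagree; it does not say which one is wrong.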

The last corroboration activity is an important mechanism for identifying and correcting problems with model estimates during the course of policy debates. Typically, two or more agencies—such as ASPE and CBO—are preparing estimates of the cost and distributional effects of proposed changes. Communication channels among analysts in the various agencies are usually fairly open, and analysts will compare their estimates. If there are large discrepancies, there will be an effort to determine the source, which often stems from different assumptions but may also stem from problems with one or another of the models.

One example of the role of this type of corroborative activity from the history of the Family Support Act concerns the estimates for extending the unemployed-parent program to all 50 states (mentioned above). Analysts at CBO noted that the TRIM2 estimates used by ASPE for the states not previously covering unemployed parents also showed increases in the basic caseload in these states; the CBO estimates for the basic caseload, which were derived from an AFDC benefit-calculator model, did not show comparable increases. Investigation determined that the participation algorithm used in TRIM2 contains a parameter for the generosity of the state in which the eligible program unit resides. The existence of an unemployed-parent program serves as the proxy to distinguish between more and less generous states. Hence, mandating coverage of unemployed parents in all 50 states resulted in raising participation rates for other kinds of eligible units in the states not previously covering unemployed parents. But this is not a plausible outcome if one assumes that mandated coverage does not indicate program generosity in the way that elective coverage does.

Useful as these kinds of activities are, they are not a substitute for rigorous external validation studies and sensitivity analyses of model outputs. Indeed, there are grave dangers in relying on corroboration alone, even when the policy analysts are very knowledgeable and careful in debugging their code and checking their results with others. It is possible that the collective wisdom of the entire policy analysis community is simply in error. One example of an erroneous assumption related to AFDC policy concerns participation rates. The conventional wisdom for years held that the basic AFDC program, after experiencing enormous growth in the caseload in the 1960s and early 1970s, had saturated the eligible population. Overall participation rates simulated by the major income support program models regularly exceeded 90 percent. However, a marked drop in the simulated participation rate—to about 80 percent—occurred after 1980. Investigation determined that the primary cause was what appeared to be a minor change in coding family relationships, which the Census Bureau implemented beginning with the March 1981 CPS. The result was to add about 1 million subfamilies who were eligible for the AFDC program but exhibited lower-than-average participation rates (see Ruggles and Michel, 1987).² As a result of this evaluation, analysts had to revise their views of the participation behavior of the population eligible for AFDC and hence their expectations of "reasonable" participation rates that were simulated by the models. This is but one example of the need for rigorous validation, not just corroboration, of microsimulation models.

TECHNIQUES OF MODEL VALIDATION

External Validation

External validation of a model is a comparison of the estimates provided by the model against "the truth"—that is, against values furnished by administrative records or other sources that are considered to represent a standard for comparison. Several factors complicate, although they do not preclude, the task of externally validating the output of microsimulation models.

First, as noted above, these models produce conditional estimates. If the social or policy environment changes and thus is different from the assumptions on which the estimates are based, it will not be surprising that the estimates differ from the reality. If a model simulated the actual program that was enacted, but other factors, such as the economic environment, changed, it may require considerable effort to respecify an appropriate set of model runs that are conditional on the correct factors for comparison. More frequently, a model did not simulate the particular policy alternative that was actually enacted, and so direct comparisons are not possible. Instead, ex post forecasting or backcasting techniques must be utilized. Ex post forecasting is the use of an archived data set from several years ago with an archived (or current) model to predict the program in use a short time ago. In backcasting, one takes the current model and database and simulates program provisions that were operative at some period in the past.

²	As another example (but with the opposite substantive effect), the use of more detailed data from SIPP on asset holdings and other variables to calculate food stamp participation rates has resulted in higher participation estimates compared with estimates based on the March CPS (see Doyle, 1990).

Another factor complicating external validation studies of microsimulation models concerns the appropriate time period for comparison, which may not always be clear. Because most models produce estimates of direct effects, the comparison period needs to be after the program changes are fully implemented but before any feedback effects could be expected to show up.

Finally, the measures of "truth" that are used for the model comparisons are likely to present several problems. Like the microsimulation estimates themselves, they may be subject to both sampling and nonsampling errors. For example, detailed information on the AFDC caseload is based on a sample rather than on the complete set of administrative records. The available comparison values may also lack the cross-tabular detail needed for a full evaluation: such detail is critical given that a raison d'etre for microsimulation is its ability to provide estimates of distributions as well as aggregates.

Despite these difficulties, external validation of microsimulation model estimates is crucial to carry out on a regular, systematic basis. It remains the best method for measuring the total uncertainty in a model, especially when the modeling environment is fairly stable and one can conduct several studies assessing the performance of a model for different time periods and policies. Of course, the information so obtained will not alone suffice to guide policy makers or funding agencies. Because external validation is an ex post operation, by definition it cannot give policy makers the real-time information they need about the quality of the estimates being produced for a current legislative debate. However, backcasting techniques may help policy analysts assess the likely quality of current estimates, as may experience with a model based on a long track record of external validation (both forecasting and backcasting). Another limitation of the results of external validation as a direct guide for policy makers is the fact that differences between model estimates and comparison values will include chance variation. Therefore, providing a complete picture of the model's performance requires generating a set of results that allows distinguishing among underlying sources of error. Further, because of likely practical limitations on the character and extent of external validation studies, they are unlikely to provide specific information about areas of needed improvement. For this type of information, one must turn to sensitivity analysis.
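
One simple way to keep chance variation in view when comparing a model estimate with a comparison value is to scale the difference by its standard error. The sketch below assumes that standard errors are available for both figures and that the two errors are independent; the function name and the numbers are illustrative and are not taken from the panel's experiment.

```python
import math


def compare_to_benchmark(model_estimate, model_se, benchmark, benchmark_se):
    """Summarize the gap between a model estimate and a comparison value."""
    difference = model_estimate - benchmark
    # Both figures are sample based, so both contribute sampling error.
    se_difference = math.sqrt(model_se ** 2 + benchmark_se ** 2)
    return {
        "difference": difference,
        "percent_difference": 100.0 * difference / benchmark,
        "z_score": difference / se_difference,
    }


# Hypothetical caseload figures, in thousands.
summary = compare_to_benchmark(
    model_estimate=110.0, model_se=3.0, benchmark=100.0, benchmark_se=4.0
)
print(summary)
```

A large scaled difference suggests that more than sampling error is at work, but it cannot say which of the other sources of error is responsible.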

Sensitivity Analysis

Sensitivity analysis is a technique that measures the effects on model outputs of alternative choices about model structure by replacing one or more existing modules (or the entire model) with reasonable alternative modules. The variability of the resulting estimates helps gauge the model's susceptibility to bias³ in the estimates due to model misspecification—one of four sources of variability.⁴ The resulting observed variability is not directly useful as a variance estimate because there is no indication of the true value. However, if the various alternatives used are equally plausible (or nearly so, given the current state of knowledge), the resulting range of the output estimates will provide some information on the variability in the output that could be attributed to misspecification of that component. Importantly, if there is an indication that the model is performing poorly, a sensitivity analysis should help to identify those component alternatives that make a difference and therefore should help direct the search for components that need to be improved (possibly by being replaced by one of the tested alternatives). This is one way in which a feedback loop can be constructed to identify and remedy model defects.
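
The mechanics of swapping one module for a reasonable alternative can be sketched in a few lines. Everything here is hypothetical: the two monthly income allocators, the $1,000 monthly income limit, and the three household records are stand-ins chosen for illustration and do not reproduce any procedure in TRIM2 or another model.

```python
def split_evenly(household):
    """Alternative 1: spread annual income evenly over 12 months."""
    return [household["annual_income"] / 12] * 12


def split_by_weeks_worked(household):
    """Alternative 2: concentrate income in the months implied by weeks worked."""
    months_worked = max(1, min(12, round(household["weeks_worked"] / 52 * 12)))
    monthly = household["annual_income"] / months_worked
    return [monthly] * months_worked + [0.0] * (12 - months_worked)


def eligible_months(household, allocate, monthly_limit=1000.0):
    """Count months in which allocated income falls below a toy income limit."""
    return sum(1 for income in allocate(household) if income < monthly_limit)


def run_sensitivity(households, alternatives):
    """Rerun the same toy simulation once per alternative module."""
    return {
        name: sum(eligible_months(h, allocate) for h in households)
        for name, allocate in alternatives.items()
    }


ALTERNATIVES = {"even": split_evenly, "weeks_worked": split_by_weeks_worked}
households = [
    {"annual_income": 18000.0, "weeks_worked": 52},
    {"annual_income": 9000.0, "weeks_worked": 26},
    {"annual_income": 6000.0, "weeks_worked": 52},
]

results = run_sensitivity(households, ALTERNATIVES)
spread = max(results.values()) - min(results.values())
print(results, "range across alternatives:", spread)
```

The range across alternatives indicates how much the output depends on this one module; without a comparison value it does not indicate which alternative is closer to the truth.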

Two key requirements for the use of sensitivity analysis are that alternative specifications of components exist and that the algorithms and software to implement them be easy to obtain and use. The first requirement of the existence of alternative modules is not a serious problem. Within the framework of microsimulation, there appear to be many candidates for experimentation, such as alternative functional forms for participation equations or alternative methods of calibration. A more important problem is that many alternatives may be difficult to implement. The panel's validation experiment involved a simple sensitivity analysis of TRIM2. Of the three modules investigated—distributing incomes on a monthly basis, treating undercoverage, and making use of various procedures for aging—competing methodologies for the first two modules were easily programmed, but the third, alternative aging techniques, required a large investment of resources. This component of TRIM2 was not specifically designed to accommodate a sensitivity analysis—a situation not, of course, peculiar to TRIM2, or more generally to microsimulation models. In addition, the proprietary nature of some microsimulation models restricts the free exchange of model components.

Interpreting the results of a sensitivity analysis is clearly easier when the alternative components have little interaction. Sensitivity analysis is simply more feasible if the effect of joint modifications of the model's components on the outputs can be decomposed into essentially additive effects from changes to individual components. Because one is often interested in the effect of several factors on model estimates and it is expensive to investigate one component at a time, one usually modifies several factors simultaneously. If there is little interaction between components, it is easier to identify the factors that had the greatest impact on different output estimates. Whether or not it is possible to separate out different components depends, of course, on the characteristics of the underlying model. The panel's validation experiment found some interaction among the effects of adjusting the three modules of TRIM2 we investigated, which complicates the interpretation of the findings. It remains to be seen whether this degree of interaction is common to microsimulation models generally.

³	The term bias is difficult to define precisely in this context: the general meaning here is the difference between the central tendency and the truth.
⁴	The four sources are sampling variability in the primary database; sampling variability in other input sources; errors in the primary database and other input sources; and errors from model misspecification. The concept of mean square error properly includes all four sources, but in practice almost always ignores the fourth source and often the third source as well (see the Appendix to Part I).

Sensitivity analysis and external validation have different advantages and disadvantages. An advantage of sensitivity analysis over external validation is that it can be done during model development and use, that is, during the decision-making process; external validation, by definition, must be conducted after the fact. However, a sensitivity analysis cannot by itself identify which components are working well, because there are no comparison values. On the other hand, external validation cannot usually identify specific weaknesses in individual model components because, when the entire model is tested against the comparison values, sufficient information on the components is difficult to generate.

Both sensitivity analysis and external validation can benefit from simultaneous application. By testing a variety of model alternatives through use of various component alternatives, and by making use of comparison values to identify models that produce estimates closer and further from comparison values, one can identify superior combinations of components. If the components have little interaction, it becomes easier to identify the combination of alternatives that is performing best (for that situation) and more feasible to study a larger set of factors in a limited number of model runs. If the components do interact, however, simultaneous modification is the only way of identifying the factors that need to be jointly improved, even though the identification process is complicated by the presence of interaction.
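
When several components are modified simultaneously, a full factorial set of runs allows main effects and two-way interactions to be separated. The sketch below uses a hypothetical response function, `toy_model`, in place of real model runs; the three factor names echo the modules examined in the panel's experiment, but the coefficients are invented and include a deliberate interaction between two of the factors.

```python
from itertools import product

FACTORS = ("aging", "monthly", "undercount")


def toy_model(aging, monthly, undercount):
    """Hypothetical output; each argument is 0 (standard) or 1 (alternative)."""
    return (100.0 + 4.0 * aging + 2.0 * monthly + 3.0 * undercount
            + 1.5 * aging * undercount)  # built-in interaction


def factorial_effects(run):
    """Main effects and two-way interactions from all 2**3 combinations."""
    outputs = {design: run(*design) for design in product((0, 1), repeat=3)}

    def mean(values):
        return sum(values) / len(values)

    main = {}
    for i, name in enumerate(FACTORS):
        high = [y for d, y in outputs.items() if d[i] == 1]
        low = [y for d, y in outputs.items() if d[i] == 0]
        main[name] = mean(high) - mean(low)

    interactions = {}
    for i in range(len(FACTORS)):
        for j in range(i + 1, len(FACTORS)):
            same = [y for d, y in outputs.items() if d[i] == d[j]]
            differ = [y for d, y in outputs.items() if d[i] != d[j]]
            interactions[(FACTORS[i], FACTORS[j])] = mean(same) - mean(differ)
    return main, interactions


main_effects, interaction_effects = factorial_effects(toy_model)
print(main_effects)
print(interaction_effects)
```

Interaction terms near zero would support reading the main effects one component at a time; a nonzero term, as for aging and undercount here, means the two components have to be assessed jointly.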

Another way of learning about deficiencies in a microsimulation model, related to sensitivity analysis, is to make use of completely different modeling approaches to the entire problem, rather than exchanging individual components. It is clear that, for specific outputs, analysts using macroeconomic models, cell-based models, or other approaches, could produce estimates with error properties that were competitive with the estimates produced by microsimulation models. Many of these models would be relatively inexpensive to implement, and they could be very effective for helping to determine the likely variability in the estimates produced by microsimulation and diagnosing areas of weakness.

When replacing an entire model, it is harder to determine precisely which component is causing a discrepancy; however, experienced analysts can often pinpoint the likely source.

Variance Estimation

Because the input data set is one sample out of a family of equally likely input data sets, the output of a microsimulation model, potentially estimated on each member of this family, has an associated variance. In simpler modeling contexts, such as multiple regression analysis, statistical theory has produced methods of directly estimating the variance in estimates arising from sampling. In the complicated world of microsimulation modeling, however, no such theory is available. In this discussion of the measurement of this variance, we focus on one specific technique, the bootstrap (see Efron, 1982). However, it is important to point out that most of the techniques in the general literature on sample reuse or nonparametric variance estimation are applicable; for a full discussion of these techniques, see Cohen (Chapter 6, in Volume II).

The idea behind the bootstrap is as follows. The variance of an estimator is a function of the average squared distance between an estimate (a function of the sample) and the true value (a function of the population). Because data are not available for the entire population, the relative frequencies of different estimates, needed to evaluate the average squared distance, are not known. However, if there is a relatively large sample, the observed relative frequencies of different estimates obtained by evaluating the estimator on pseudosamples drawn from the observed sample will approximate the unknown relative frequencies obtained by sampling from the entire population and evaluating the estimator for those samples. Hence, a possibly good estimate (even for small samples) of the average squared distance between an estimate and the truth is the average squared distance between two quantities: the estimates derived from repeated sampling, with replacement, from the original sample, and the estimate from the original sample itself, which then serves as a proxy for the truth.
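
The procedure just described can be written out directly. The sketch below is a minimal illustration, not the method of any particular model: `simulated_total_cost` is a hypothetical stand-in for a full model run, the household records are invented, and the resampling treats the data as a simple random sample, which ignores the complex design of surveys such as the March CPS.

```python
import random


def simulated_total_cost(sample):
    """Stand-in for a full model run: weighted sum of simulated benefits."""
    return sum(h["weight"] * h["benefit"] for h in sample)


def bootstrap_variance(sample, estimator, n_reps=500, seed=0):
    """Average squared distance between pseudosample estimates and the
    original-sample estimate, which serves as the proxy for the truth."""
    rng = random.Random(seed)
    original_estimate = estimator(sample)
    n = len(sample)
    squared_distances = []
    for _ in range(n_reps):
        # Draw n records with replacement from the original sample.
        pseudosample = [sample[rng.randrange(n)] for _ in range(n)]
        squared_distances.append((estimator(pseudosample) - original_estimate) ** 2)
    return sum(squared_distances) / n_reps


# Hypothetical records: survey weight and simulated annual benefit.
households = [
    {"weight": 1500.0, "benefit": 0.0},
    {"weight": 1200.0, "benefit": 3600.0},
    {"weight": 1800.0, "benefit": 2400.0},
    {"weight": 1000.0, "benefit": 0.0},
    {"weight": 1600.0, "benefit": 4800.0},
]

variance = bootstrap_variance(households, simulated_total_cost)
print("bootstrap standard error of total cost:", variance ** 0.5)
```

In a real application each replicate would require rerunning the model on a pseudosample, so the number of replicates is limited mainly by the cost of a model run.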

The actual application of bootstrap techniques to microsimulation modeling is complicated and rests on many choices that are difficult to make in the abstract. The underlying theoretical and practical aspects are discussed in Cohen (Chapter 6, in Volume II). We are convinced at this stage that the bootstrap or another of the currently available sample reuse techniques can be used to estimate sampling variances for outputs of microsimulation models. In fact, Wolfson and Rowe (1990) provide an example in which they used the bootstrap to estimate the variance of estimates from the SPSD/M model.

These bootstrap variance estimates, however, will only measure directly the first component of variability of model estimates, that is, the sampling variability resulting from using one rather than another database. This component may well be the least important source of error in model outputs (although the relative magnitude of the various sources of uncertainty is currently unknown). The concept of the bootstrap could be extended to measure the second component as well, that is, the variability in the model outputs resulting from sampling variability in the many other database inputs, such as imputations for variables not contained on the primary database, control totals from macroeconomic forecasts or population projection models, and behavioral response equations. Each of these inputs contains variability from being based on one rather than another database.

To extend the bootstrap in this respect would be a complicated process. One would need to have pseudosamples for the primary database before modification or augmentation with other sources. One would also need to have distributions for each input—such as a series of child care expense imputation equations with varying coefficients—developed by applying the bootstrap (or possibly a less exacting procedure) to the Consumer Expenditure Survey or other data on which the equations are based.

It may be possible to use sensitivity analysis in conjunction with bootstrap resampling as a more feasible way to develop estimates of total model uncertainty. Roughly speaking, one could use a bootstrap to measure the variance and a sensitivity analysis to weakly measure bias. Thus, one could use sensitivity analysis alone or together with the bootstrap to evaluate the contribution of errors in the primary database or in other sources, such as undercoverage of the target population, inappropriate imputation methodologies to overcome nonresponse, and misreporting of key variables. In Chapter 5 we recommend that originating agencies, such as the Census Bureau, play a more active role in preparing databases that are suitable for microsimulation modeling and other forms of policy analysis. This role should include more vigorous investigation and correction of data quality problems. In our view, it is not appropriate or cost-effective to ask microsimulation modelers to take on this burden. However, the modelers can make a contribution by using their expertise to identify the data quality problems that may have most import for modeling and conducting some limited evaluations to provide feedback to the originating agencies.
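
One rough way to combine the two pieces of information is sketched below. The combination rule is our own illustrative assumption and not an established procedure: it takes the largest deviation of any plausible alternative from the baseline estimate as a crude bias proxy and adds its square to the bootstrap variance, in the spirit of a mean square error. The function name and all numbers are hypothetical.

```python
import math


def rough_total_uncertainty(baseline_estimate, bootstrap_variance, alternative_estimates):
    """Combine sampling variance with a sensitivity-based bias proxy."""
    # Crude bias proxy (an assumption): the largest gap between the baseline
    # and any equally plausible alternative specification.
    bias_proxy = max(abs(alt - baseline_estimate) for alt in alternative_estimates)
    return {
        "sampling_sd": math.sqrt(bootstrap_variance),
        "bias_proxy": bias_proxy,
        "rough_rmse": math.sqrt(bootstrap_variance + bias_proxy ** 2),
    }


# Hypothetical cost estimates in millions of dollars.
uncertainty = rough_total_uncertainty(
    baseline_estimate=100.0,
    bootstrap_variance=9.0,
    alternative_estimates=[96.0, 104.0],
)
print(uncertainty)
```

Because the alternatives may all share the same flaw, the bias proxy can understate the true bias; it measures bias only weakly, as noted above.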

Loss Functions

The estimates produced by microsimulation models are in reality quite complicated, providing a wealth of detailed information about the distribution of outcomes from any policy alteration. This in itself presents challenges in evaluating the validity of any modeling effort. It is certain to be the case that a modeling effort, characterized by choices of modules, data sources, and the like, will do better in producing estimates for certain outcomes than it will do for others. Moreover, alterations in a model to improve the estimates in one area, such as AFDC participation, could actually worsen the estimates in another area, such as the estimates of benefits transferred to different types of families.

The full analysis of models, and the feedback to further development, requires that the producers and users of estimates specify the relative importance of accuracy with respect to the different estimates produced in any given effort. This specification can be characterized by a "loss function." The loss function provides a quantitative measure of the importance of errors of different types. For example, if the model were invoked solely to estimate total costs of an AFDC option and if policy makers were increasingly unhappy the farther the estimate was from the true cost, the loss function might be posed simply as the square of the distance between the estimate and the true answer.

There are many possible alternative specifications of loss functions that could be employed. In the previous example, if the policy makers were more distressed by underestimates of the budget implications than by overestimates, a nonsymmetric loss function might be appropriate. Alternatively, other loss functions could be defined in terms of the deviations of the estimates from certain aspects of the overall distribution of program effects.
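The two kinds of loss function just described can be written down directly. The following is a minimal sketch: the function names, the penalty factor of 2, and the cost figures are assumptions chosen for illustration, not values proposed by the panel.

```python
def squared_error_loss(estimate, truth):
    """Symmetric loss: the square of the distance between estimate and truth."""
    return (estimate - truth) ** 2

def asymmetric_loss(estimate, truth, under_penalty=2.0):
    """Squared error in which underestimates count `under_penalty` times as much.

    The default penalty of 2.0 is an arbitrary illustrative choice.
    """
    error = estimate - truth
    weight = under_penalty if error < 0 else 1.0
    return weight * error ** 2

# Hypothetical cost estimates (in millions of dollars) around a true cost of 100.
true_cost = 100.0
low_estimate, high_estimate = 90.0, 110.0

symmetric = (squared_error_loss(low_estimate, true_cost),
             squared_error_loss(high_estimate, true_cost))
asymmetric = (asymmetric_loss(low_estimate, true_cost),
              asymmetric_loss(high_estimate, true_cost))
print(symmetric, asymmetric)
```

Under the symmetric loss the two estimates are equally bad; under the nonsymmetric loss the underestimate is penalized twice as heavily, which is the situation in which policy makers are more distressed by budget underestimates than by overestimates.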

Past work has seldom employed explicit loss functions. This is unfortunate, because it leads to considerable ambiguity in evaluating model outcomes. In the context of formal estimation of the variance of model outputs, lack of a well-defined loss function presents severe problems. Therefore, one aspect of the future work on model validation should involve developing more explicit notions of an appropriate loss function. This work will require direct interaction between users and producers of policy analysis. An extra benefit from being explicit about the loss function is the considerable guidance gained thereby for any model development process. The loss function provides direct information for analysts about which aspects of a modeling endeavor are most in need of attention and possible modification.

REVIEW OF VALIDATION STUDIES

As part of our examination of the state of microsimulation model validation, we searched for and collected validation studies for two purposes: to obtain information on the performance of the models currently in use and to gather examples of the methods that others have used for microsimulation validation. Although the literature review that we commissioned was not fully comprehensive, we are reasonably certain that we have not missed any major validation studies of microsimulation models.⁵

⁵	The discussion here summarizes the material in Cohen (Chapter 7, in Volume II).

We found only 13 validation studies of microsimulation models. When one considers that microsimulation modeling techniques have been used for policy analysis for over 20 years, and that during this time at least 6 major and 10-20 minor microsimulation models have been developed (see Chapter 4), it is surprising that so little effort has been devoted to determining the uncertainty present in these models. We are not the first to come to this conclusion. Betson (1988:12) notes, "Efforts to explore the statistical properties of the estimates derived from microsimulation models have been scant." Doyle and Trippe (1989:v) agree: "Despite the fact that microsimulation models have been used extensively to set public policy, little effort had been invested in ascertaining the quality of the simulation models until very recently." Burtless (1989:49) adds, "No behavioral predictions from microsimulation models have been compared with actual historical experience from a different period than the one used to derive the original behavioral estimates. We can increase the public's confidence in microsimulation results (and probably improve the reliability of behavioral routines) if we periodically compare the model predictions against actual experience." Finally, there is the quote by Kenneth Arrow that led off this chapter.

Table 9-1 lists the studies that we reviewed and some of their characteristics. The second column of the table indicates the model(s) covered; the third and fourth columns indicate whether the study included a sensitivity analysis or external validation of the model; the fifth column indicates whether the study stated that the results of the analysis subsequently led directly to changes in the model; and the sixth column indicates the number of replications that were involved in each study. The entries for each of these characteristics were not always obvious, and many are subject to debate. For example, it was not always clear whether a finding was used to modify the model, and it was not always clear whether an analysis of a dynamic model over several years was a single replication, because some factors undoubtedly changed during the period of analysis.

These 13 studies demonstrate, first, the lack of any (at least formal) sensitivity analysis for many of the models; our review found only five models that have had a formal sensitivity analysis. This lack greatly hinders the feedback process of model improvement. Our review also found only nine studies involving an external validation, and then only one of them clearly with more than one replication. These studies do not allow us to develop direct measures of model performance, except for individual situations that are unlikely to apply generally and that have great uncertainty associated with them. Lastly, although not covered in the table, to our knowledge there has been only one recent effort to make use of sample reuse methods to determine the variance of the output from microsimulation models (Wolfson and Rowe, 1990).

As shown in Table 9-1, model validation can lead to model improvement. This is all the more reason to be concerned that very few microsimulation models have been validated. Because there have been so few attempts to develop error estimates for the output of any microsimulation model, there is no answer today for questions such as the relative contribution of sampling variance to total uncertainty for microsimulation models.

TABLE 9-1 Summary of Literature Review: Microsimulation Model Validation

Study	Model	Sensitivity Analysis	External Validation	Feedback Improvement	Replications
Hendricks and Holden (1976a)	DYNASIM	Yes	Yes	Yes	1
Hendricks and Holden (1976b)	DYNASIM	Yes	No	No	N.A.
General Accounting Office (1977)	TRIM	Yes	Yes	?	1
Holden (1977)	DYNASIM	Yes	No	Yes	1
Hayes (1982)	MICROSIM	Yes	Yes	Yes	1
Jefferson (1983)	DYNASIM	No	Yes	No	1
Haveman and Lacker (1984)	DYNASIM and PRISM	No	No	?	N.A.
ICF (1987)	TRIM2 and HITSM	No	Yes	Yes	1
Kormendi and Meguire (1988)	TRIM2	Yes	Yes	No	2
Betson (1988)	KGB	Yes	No	No	1
Doyle and Trippe (1989)	MATH	Yes	Yes	Yes	1
Beebout and Haworth (1989)	MATH	No	Yes	No	1
Burtless (1989)	MATH and TATSIM	No	Yes	No	1
NOTE: N.A., not available.
SOURCE: Data from Cohen (Chapter 7, in Volume II).

Our review does give us some reason for optimism. Much of the work that was done is of very high quality, indicating that there is an appetite for such analysis and that this type of work is currently feasible. We discuss below three of these studies to give a sense of some of the methods used and the results of the application of those methods.

Doyle and Trippe

Doyle and Trippe (1989) conducted a two-phase validation of the MATH model forecasts of the parameters of the food stamp program in 1984. In the first phase, they compared administrative data for August 1984 with simulated values from the model implemented with a March 1985 CPS database, using program parameters for that time. This procedure removed the contribution of forecasting error, which is usually present in MATH simulations based on databases that have been aged forward from an earlier year. They also compared the CPS-based results with simulations based on the 1984 SIPP.

In the second phase, Doyle and Trippe directly evaluated the aging module by projecting an unaged 1980 database (from the March 1981 CPS) to 1984, making use of historical data from the March 1985 CPS to generate the control totals, again eliminating the contribution of forecasting error due to incorrect control totals. They compared the distribution of household characteristics for the low-income population—including characteristics controlled for and those not controlled for in the aging process—across the unaged 1980 database, the aged 1984 database, and the actual 1984 database (from the March 1985 CPS).

Doyle and Trippe began their analysis with the following a priori assumptions about possible causes of problems originating in the data inputs for MATH: data limitations in understanding behavioral decisions; weak macroeconomic projections; nonsampling errors in the March CPS, such as underreporting of income; undercoverage of selected population groups in the March CPS; and the omission, in the March CPS, of variables such as assets that are necessary to determine program eligibility and benefit level. For comparison values in the first-phase study, Doyle and Trippe used the administrative data on the food stamp caseload, while recognizing that these data are also subject to error. To account for the sampling error present in the administrative estimates, they used confidence intervals about the administrative estimates for comparison, rather than only the estimates themselves. In addition, they used tolerance levels that attempted to represent differences that were not important in a subject-matter sense. They defined the tolerance levels to be the greater of two values: 5 percent of the value of the administrative estimate or twice the sampling standard error.
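The tolerance rule described above (the greater of 5 percent of the administrative estimate or twice its sampling standard error) is simple enough to state as code. This is a sketch under stated assumptions: the function names and the caseload figures are hypothetical, and comparing the absolute difference against the tolerance is one plausible reading of how the rule was applied.

```python
def tolerance(admin_estimate, sampling_se):
    """Greater of 5 percent of the administrative estimate or twice its standard error."""
    return max(0.05 * abs(admin_estimate), 2.0 * sampling_se)

def within_tolerance(simulated, admin_estimate, sampling_se):
    """True if the simulated value differs from the administrative value by no more
    than the tolerance (an assumed way of applying the rule)."""
    return abs(simulated - admin_estimate) <= tolerance(admin_estimate, sampling_se)

# Hypothetical caseload figures (thousands of households), invented for illustration.
admin_value, admin_se = 1000.0, 20.0
print(tolerance(admin_value, admin_se))                 # the 5 percent rule binds here
print(within_tolerance(1040.0, admin_value, admin_se))
print(within_tolerance(1060.0, admin_value, admin_se))
```

When the sampling error of the administrative estimate is large, the second term takes over, so a noisy comparison value is not held to a tighter standard than it can support.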

Doyle and Trippe's first-phase analysis demonstrated that MATH was effective in estimating program costs and caseloads given by administrative data and in providing estimates of the distribution of the caseload by important characteristics, including household size and gross monthly income. However, they showed that, in other respects, MATH was less effective. A major problem was that MATH simulated too few food stamp households that receive public assistance and have school-age children and too many food stamp households that include elderly people and have earnings. Doyle and Trippe thought that these problems likely stemmed from errors in the March CPS, and so they examined whether using SIPP data to perform the simulations in place of the March CPS would alleviate this problem, but the discrepancies persisted. The analysis also determined that the MATH model's procedure for imputing asset values, which are not reported in the March CPS, is inadequate and needs to be improved.

The second-phase analysis conducted by Doyle and Trippe examined the benefits of using the aging module rather than working with an unaged file, in the usual modeling situation in which one must develop estimates for a future year. As would be expected, aging was beneficial for variables that are used as controls in the aging process. However, aging was either not beneficial or somewhat harmful for variables that are not included as controls, such as the distribution of households by income as a percentage of the poverty threshold.

Haveman and Lacker

Haveman and Lacker (1984) analyzed the differences between DYNASIM and PRISM in their estimation of future public and private pension benefits. The initial database for DYNASIM was the March 1973 CPS-SSA exact-match file; the initial database for PRISM was an exact match of the 1978 CPS-SSA exact-match file with the May 1979 and March 1979 CPS. Haveman and Lacker (1984:3) comment: "Although the microdata simulation procedures on which these models rest are marked improvements over previous methodologies, the ability to project retirement income with accuracy has yet to be demonstrated…For the two models under consideration here, the baseline projections do diverge." For example, for the year 2000, DYNASIM projected the average private pension benefits for 65-year-old males to be $3,509; PRISM projected a value of $6,160.

Haveman and Lacker put forward five possible sources of these discrepancies: (1) different initial samples, (2) use of different specifications for the endogenous relationships, (3) relationship estimates from different data sets, (4) different judgments for situations in which no data existed, and (5) use of different exogenous parameter values. Because of budget constraints, Haveman and Lacker could not carry out a sensitivity analysis; instead, they did a qualitative assessment of the likely source of the differences. They admit that this approach does not permit a clear answer as to which model's projections are more reliable. For example, they note that DYNASIM takes race, education, and marital status into consideration in its mortality module; PRISM does not. On the other hand, PRISM makes use of disability. However, without exchanging these modules in the two models, it is not clear how much difference this makes.

Haveman and Lacker do point out that it is often possible to perform the equivalent of a sensitivity analysis by simply examining the competing algorithms. For example, they argue that the PRISM estimates of outlays in private defined-benefit plans would be nearly 50 percent higher than the estimates from DYNASIM as a result of indexing nominal-valued constants to different series. However, Haveman and Lacker admit to the limitations of their approach (1984:109):

Overall, we found both models to be impressive and highly innovative pieces of model-building research. Their impressiveness, however, is simultaneously their weakness. The enormous complexity which they embody makes them, effectively, black boxes. The input into them can be seen, understood, and judged. The projections which they yield can be understood and are illuminating. Yet how the assumptions and other inputs came to yield the printed-out projections cannot be seen, understood, or judged. The interaction of the complex relationships, transition matrices, time-triggered status changes, random drawings from unknown pools, and constraints to insure comparability is so complicated that little intuition or 'feel' is possible for why the resulting projections are what they are. The evidence which would lead a reviewer to believe the predictions of one model more than that of another is slim, indeed.

They recommend three approaches for investigating microsimulation models: backcasting, sensitivity analyses in which certain modules are switched between the models, and the analysis they performed for PRISM and DYNASIM.

The kind of careful scrutiny of DYNASIM and PRISM that Haveman and Lacker gave in their paper, in conjunction with an external validation, would greatly expedite determination of the reasons for any discrepancies between the models and the truth. In addition, work of this sort provides modelers with obvious examples of alternative methodologies that can be used in a sensitivity analysis. Therefore, although the Haveman and Lacker analysis is not a validation per se, it represents an important component of validation.

Kormendi and Meguire

Kormendi and Meguire (1988) examined the performance of TRIM2 in estimating the number of households that participate in some form of welfare assistance program and the benefits they receive. They took two different approaches to this validation of TRIM2. The first approach was randomly to perturb parameters in the AFDC participation module of TRIM2, according to reasonable estimates of the variability of these estimated parameters, and then to examine the resulting variability in the estimates of participation and level of benefits. This exercise represents an interesting form of sensitivity analysis.

The results of this first part of the validation of TRIM2 indicated that the variability of simulated benefits was less than that of participating units. In nearly all cases, the coefficients of variation (i.e., the standard errors as percents of the estimates) of simulated benefits were less than 3 percent. At the same time, the coefficients of variation of simulated units ranged from 4 to 18 percent.
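The perturbation exercise and the coefficient of variation it reports can be illustrated with a toy stand-in for the model. Everything specific in the sketch below is an assumption: the one-parameter "simulation," the participation rate of 0.60, its standard error of 0.03, and the use of a normal distribution for the draws. It is not a representation of the TRIM2 participation module.

```python
import random
import statistics

def coefficient_of_variation(values):
    """Standard deviation of the outputs as a percent of their mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

def perturbed_outputs(simulate, point_estimate, std_error, n_draws=500, seed=0):
    """Rerun `simulate` with the parameter drawn around its point estimate."""
    rng = random.Random(seed)
    return [simulate(rng.gauss(point_estimate, std_error)) for _ in range(n_draws)]

def toy_participating_units(participation_rate, eligible_units=4_000_000):
    """Hypothetical one-line model: participating units = rate x eligible units."""
    return participation_rate * eligible_units

# Assumed parameter estimate 0.60 with assumed standard error 0.03.
outputs = perturbed_outputs(toy_participating_units, 0.60, 0.03)
cv_units = coefficient_of_variation(outputs)
print(f"CV of simulated units: {cv_units:.1f} percent")
```

Because the toy model is linear in its single parameter, the coefficient of variation of the output is close to that of the parameter itself (about 5 percent); in a full model with many interacting parameters the relationship would have to be found by simulation, as Kormendi and Meguire did.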

In the second part of the analysis, which Kormendi and Meguire label dynamic validation, they used TRIM2 to simulate the changes in welfare participation and benefits that occurred from 1979 to 1985 and from 1983 to 1985, comparing the results to those from the March 1986 CPS. The first time period spanned the enactment of the 1981 Omnibus Budget Reconciliation Act, which made a number of cutbacks in welfare programs. The second period represented a time when no major legislation was enacted.

Kormendi and Meguire considered the 50 states and the District of Columbia as mini-TRIM2 models, which allowed them to make use of regression analysis in their validation of TRIM2. One example of this was to regress, for several outputs, the forecasts derived from the 1979 and 1983 baseline files on the "true" values from the March 1986 CPS and compare the estimated regression coefficients with 1, which they defined as one type of unbiasedness. They also used reverse regression to examine whether the forecasts might be improved through correction of the bias.
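The unbiasedness check described above amounts to fitting a straight line to forecast and "true" values across jurisdictions and asking whether the slope is near 1. The sketch below shows the calculation with ordinary least squares; the four data points are invented for illustration (the study used the 50 states and the District of Columbia), and no standard errors or significance tests are computed.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y on x, returning (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical jurisdiction-level values (e.g., thousands of participating units).
true_values = [10.0, 20.0, 30.0, 40.0]
forecasts_a = [12.0, 22.0, 32.0, 42.0]   # uniformly 2 too high, slope still 1
forecasts_b = [10.0, 15.0, 20.0, 25.0]   # understates differences across jurisdictions

# Regress the forecasts on the "true" values, as in the text.
slope_a, intercept_a = ols_slope_intercept(true_values, forecasts_a)
slope_b, intercept_b = ols_slope_intercept(true_values, forecasts_b)
print(slope_a, intercept_a, slope_b, intercept_b)
```

The first set of forecasts passes the slope test despite a constant offset, which is why a slope of 1 is described in the text as only one type of unbiasedness.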

Another application of regression used by Kormendi and Meguire was in the context of a sensitivity analysis. They attempted to attribute the changes from 1979 to 1985 to either administrative or economic-demographic changes. To do this, they ran TRIM2 based on the March 1986 CPS on the 1979 law, and they ran TRIM2 based on the March 1980 CPS on the 1985 law, in addition to the usual runs with the March 1980 CPS on 1979 law and the March 1986 CPS on 1985 law. With this approach, they were able to measure the percentage change in forecasts conditional on the law remaining constant, adding this measure as a covariate in the regression described above. Hence, they were able to control for variation due to changes in economic and demographic conditions.

Their results showed that the forecast errors of TRIM2 for the period 1979-1985 were much smaller for benefits than for units: the absolute value of the errors for units generally ranged from 5 to 30 percent; those for benefits ranged from 1 to 6 percent. These results are somewhat surprising in view of the fact that TRIM2 baseline files are routinely calibrated to administrative values for participating AFDC units by state and not to the values for total benefits.

A VALIDATION STUDY OF TRIM2: THE PANEL'S EXPERIMENT

The panel, in conjunction with the Urban Institute, decided to perform an illustrative validation of TRIM2—our "experiment." There were four primary goals of the experiment: (1) to determine the resources required to conduct a sensitivity analysis and an external validation and, more generally, to examine the feasibility of microsimulation model validation; (2) to identify those modules that, when altered, have an appreciable effect on model outputs; (3) to obtain an admittedly limited measure of model validity against comparison values from administrative records (limited because only one time period was examined); and (4) to illustrate some of the data analytic techniques that analysts can use to help answer these questions. In addition, we were interested in attempting to identify component alternatives that were superior to those currently in TRIM2, although this was not a major objective of the study.⁶

Description

The first task in planning the experiment was to identify several modules currently in use in TRIM2 that could be replaced by alternative modules that were, a priori, reasonable substitutes. Some of the alternative modules either had been used previously or had been considered as substitutes in place of the current algorithms. TRIM2 would then be run with various combinations of alternatives for these modules to determine which choices resulted in the greatest variability in the model's outputs. In addition, using administrative quality control estimates as surrogates for the truth, we could determine which alternatives were more successful in approximating the "true" values.

We decided to use TRIM2 with 1983 data (based on the March 1984 CPS) to estimate the costs of the 1987 law for the AFDC program in 1987, as well as a number of other distributional characteristics of the 1987 program in 1987. We chose the years 1983 and 1987 because the March 1988 CPS (which we needed to generate known population control totals) was the latest file available when the panel began its experiment; we wanted at least a 3-year forecasting window; and definitional and other comparability problems began cropping up for CPS data as the forecasting horizon grew appreciably longer. Although we avoided some comparability problems, the period we examined and the March 1984 and 1988 CPS files exhibited some unique features that limit the generalizability of the results. Of course, every time period and every database are unique in some respects, and therefore no single experiment can be used to make general inferences about the efficacy of a model.

⁶	Cohen et al., in Volume II, provide a full discussion of the experiment. The analysis data set is available from the Committee on National Statistics upon request.

We based our choice of modules for the experiment on considerations of ready availability of alternatives and substantive interest. Because of expected difficulties in reprogramming, we excluded some modules of strong interest on grounds of excessive cost. Our criteria resulted in the choice of three modules: adjustment for population undercoverage, imputation of monthly employment and earnings, and aging. The following descriptions of our modules also show our "name" for each module alternative.

Adjustment for population undercoverage ("adjustment"). TRIM2 currently does not attempt any correction for undercoverage of certain population groups in the March CPS. By modifying models in Fein (1989), we derived a logistic regression model specifying rates of undercoverage for households with various characteristics. The effectiveness of this logistic regression model was limited by the variables that existed in the TRIM2 database and by the necessity of using the same weight for all individuals in a household to adjust for undercoverage. Nevertheless, the model provided an opportunity to see if weighting for undercoverage would make a difference in the model's outputs.

Imputation of monthly employment and earnings ("months"). Because the March CPS provides information only on annual employment and earnings, while the AFDC program operates on a monthly accounting basis, TRIM2 includes a module "MONTHS" that endeavors to capture monthly variation in employment, unemployment, and earnings ("current"). A simpler procedure employed in an earlier version of TRIM2, which we refer to as "old MONTHS," simulates a maximum of two spells during the year, one working and one not working ("old").

Aging. Although static aging modules are available for TRIM2, the model as currently applied does not invoke them. We specified three static aging alternatives and not aging ("none") (see details in Giannarelli, 1989). The first aging alternative was to invoke the demographic aging routine to reweight the records in the March 1984 CPS to match target values for population totals by age, sex, and race generated from the March 1988 CPS ("demo"). The second aging alternative, unemployment aging coupled with demographic aging, additionally invoked the routine to adjust labor force activity on the demographically aged file to meet targets from the March 1988 CPS for unemployment during the week of the survey and the preceding calendar year ("unemp"). The third aging alternative, full aging, additionally invoked the routine to adjust income amounts for price changes and economic growth between 1983 and 1987 ("full"). In every case, known control totals were used from the March 1988 CPS to eliminate the source of variability due to erroneous demographic and macroeconomic forecasts.

We therefore specified 16 different "alternative" versions of TRIM2 (2 x 2 x 4) for three modules. It is important to note that we selected the alternatives for each of the three modules either because, a priori, they were not clearly worse than the existing module in TRIM2 (notably the aging alternatives and undercount adjustment) or because they were substantially simpler and, we had reason to believe, not substantially worse than the existing TRIM2 module (notably old MONTHS).
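The 16 versions are simply every combination of the module alternatives named above. A short sketch of the enumeration follows; the variable names are ours, and numbering the runs with adjustment varying slowest and aging fastest is an assumption made for convenience.

```python
from itertools import product

# Module alternatives, using the short names given in the text.
ADJUSTMENT = ("No", "Yes")                     # undercoverage adjustment
MONTHS = ("Current", "Old")                    # MONTHS versus old MONTHS
AGING = ("None", "Demo", "Unemp", "Full")      # static aging alternatives

# Enumerate the 2 x 2 x 4 design; the last factor (aging) varies fastest.
runs = {
    run_id: {"adjustment": adj, "months": months, "aging": aging}
    for run_id, (adj, months, aging) in enumerate(product(ADJUSTMENT, MONTHS, AGING), start=1)
}

for run_id, spec in runs.items():
    print(run_id, spec["adjustment"], spec["months"], spec["aging"])
```

Each entry identifies one model version to be run with 1987 law on the 1983 database; the baseline and IQCS comparison runs described next are in addition to these 16.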

In addition to the 16 variations simulating 1987 law with the 1983 database, we had available: (1) the 1983 baseline files, both adjusted ("A.Base") and unadjusted ("Base") for undercoverage, which simulated 1983 law with the March 1984 CPS; (2) the 1987 baseline files, both adjusted and unadjusted for undercoverage, which simulated 1987 law with the March 1988 CPS; and (3) tabulations from the Integrated Quality Control System ("IQCS") of characteristics of the AFDC recipient population in both 1983 and 1987. Throughout the remainder of this section, we treat the 1983 and 1987 IQCS data as the "truth," that is, the comparison values. We have remarked elsewhere about the dangers in this assumption and have noted methods that might be used to deal with the problem of sampling error in the comparison values. In addition, the quality control data are believed to be subject to nonsampling errors in measurement of characteristics such as the composition of the AFDC unit's household. Table 9-2 shows the features of all 22 runs in the experiment.

Results

The first question that occurs is what differences in the outputs resulted from the alterations in the three modules. In other words, do the alternatives make a difference?

For estimates of change in the total number of AFDC participants from 1983 to 1987, the estimates from the 16 versions of TRIM2 range (in units of 1,000) from -213 to +293; the comparison value from the IQCS is 98 (2.7% of total AFDC participants in 1983). For estimates of change in total benefits (not adjusted for inflation), the estimates from the 16 versions of TRIM2 range (in millions of dollars) from $1,578 to $3,742; the comparison value is $2,499 (18.2% of total AFDC benefits in 1983). That is, in using alternatives that were a priori thought to have similar success in modeling AFDC, the alternatives disagree about whether the number of participants is going up or down, and by at least twice the size of the observed change of 100,000 people. The alternatives also provide estimates of increased costs in total benefits that range from $1.6 billion to $3.7 billion. Moreover, these estimates of change are with a forecast horizon of only 4 years.

Table 9-3 provides ranges for estimates of both level and change for other statistics of interest, many of which are percentages of dichotomized variables. (The latter are analyzed in their original, undichotomized form below.) Note the instances when a model version incorrectly estimated the direction of change. Also note that, for estimates of change for three variables (race of head, earnings, and age of youngest child), the range of values does not include the comparison value. In general, the results in Table 9-3 show that there are important differences in most of the estimates when alternate modules are used.

TABLE 9-2 Description of TRIM2 Experimental Runs

Run Identification	AFDC Law	CPS Year	Adjustment	Months	Aging
Base_83,83	1983	1983	No	Current	-
Base_87,87	1987	1987	No	Current	-
1	1987	1983	No	Current	None
2	1987	1983	No	Current	Demo
3	1987	1983	No	Current	Unemp
4	1987	1983	No	Current	Full
5	1987	1983	No	Old	None
6	1987	1983	No	Old	Demo
7	1987	1983	No	Old	Unemp
8	1987	1983	No	Old	Full
A.Base_83,83	1983	1983	Yes	Current	-
A.Base_87,87	1987	1987	Yes	Current	-
9	1987	1983	Yes	Current	None
10	1987	1983	Yes	Current	Demo
11	1987	1983	Yes	Current	Unemp
12	1987	1983	Yes	Current	Full
13	1987	1983	Yes	Old	None
14	1987	1983	Yes	Old	Demo
15	1987	1983	Yes	Old	Unemp
16	1987	1983	Yes	Old	Full
IQCS83		IQCS data for 1983
IQCS87		IQCS data for 1987
NOTE: See text for definitions and descriptions of terms.

These observations are not meant to imply any weakness of TRIM2 relative to other models, microsimulation or otherwise, that have the goal of providing estimates of characteristics. No such conclusions can be drawn because this analysis does not include other modeling approaches. What our experiment does demonstrate is the large amount of variability that results from reasonable changes to the basic model.⁷

The observed variability has a great deal of structure because underlying the 16 observations for each response lies a 2 x 2 x 4 factorial design with one replication per cell. To help understand the variability, we used analysis of variance to measure the relative sizes of the main effects, to test for significance of the main effects, and to test for significance of the interactions. (See the report of this analysis in Cohen et al., in Volume II.)
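With one observation per cell, the main effect of a factor level can be estimated as the mean response at that level minus the grand mean. The sketch below shows only that step, not the significance tests of a full analysis of variance. The response values are hypothetical and built to be purely additive so that the recovered effects are easy to check; they are not results from the experiment.

```python
from itertools import product
from statistics import mean

FACTORS = {
    "adjustment": ("no", "yes"),
    "months": ("current", "old"),
    "aging": ("none", "demo", "unemp", "full"),
}

def main_effects(responses):
    """Mean response at each factor level minus the grand mean.

    `responses` maps (adjustment, months, aging) tuples to one model output.
    """
    grand_mean = mean(responses.values())
    effects = {}
    for position, (factor, levels) in enumerate(FACTORS.items()):
        for level in levels:
            at_level = [v for key, v in responses.items() if key[position] == level]
            effects[(factor, level)] = mean(at_level) - grand_mean
    return effects

# Hypothetical additive responses (invented numbers, one per cell of the design).
ADJ = {"no": 0.0, "yes": 20.0}
MON = {"current": 0.0, "old": 2.0}
AGE = {"none": 0.0, "demo": 10.0, "unemp": 20.0, "full": 30.0}
responses = {
    (a, m, g): 100.0 + ADJ[a] + MON[m] + AGE[g]
    for a, m, g in product(*FACTORS.values())
}

effects = main_effects(responses)
print(effects[("adjustment", "yes")], effects[("months", "old")], effects[("aging", "full")])
```

In this invented example the adjustment and aging effects dwarf the months effect. Interactions, which this additive example rules out by construction, would appear as cell values that the main effects alone cannot reproduce.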

⁷	Note that these ranges are not intended to be interpreted as any type of confidence interval.

TABLE 9-3 Observed Differences for Various Estimates from TRIM2 Validation Experiment

Variable	IQCS87 Level	Range of Level, Runs 1-16	IQCS87 Minus IQCS83	Range of Change, Runs 1-16 Minus 83 Base
Type of AFDC unit (% basic)	90.40	87.99, 91.90	1.20	-0.45, 3.52
No. of adults (% 2)	9.58	6.32, 10.43	-0.65	-3.41, 0.42
No. of children (% > 2)	25.98	26.36, 28.21	0.58	0.29, 1.79
Total no. in unit (% > 3)	28.20	28.72, 31.10	0.13	-0.27, 1.46
Age of youngest child (% < 5)	53.58	51.96, 55.25	-0.09	-2.76, -0.64
Race of head (% black non-Hispanic)	41.30	39.50, 42.75	-2.84	-1.85, 0.01
Earnings of adults (% none)	92.31	91.22, 94.88	-2.26	-0.91, 1.67
Marital status of head (% no spouse)	86.82	86.01, 90.42	1.61	-0.45, 3.60
Sex of head (% female)	86.77	82.96, 86.89	1.30	-0.50, 3.53
Age of head (% < 20)	5.73	6.51, 7.30	-1.18	-1.51, -0.62
NOTE: Differences among model runs and between a model run and the IQCS may not be statistically significant.

The principal finding from our analysis of estimates of change was that changing from unadjusted to adjusted data and changing the type of aging used make a substantial difference in most of the output estimates. Changing from MONTHS to old MONTHS makes a much smaller difference. In addition, some interaction effects were observed, which complicates the identification of better and worse modules. We note that no interactions were observed for the parallel analysis of estimates of levels.

Because we also had comparison values, we were able to compare outputs of the different model versions with the comparison values. There are many ways to examine whether some models are "better" than others with respect to closeness to the comparison values (see Cohen et al., in Volume II). First, in conjunction with the analysis of variance, looking at the comparison values indicates which main effects are bringing the overall mean for the 16 model versions closer to or further away from the comparison value; this analysis provides an indication of which alternatives are more promising. Second, the errors themselves are easy to display in tabular form, which provides a visual impression of a superior version, if one exists. A difficulty with this type of presentation is the vast difference in size of the variables for which estimates are given. A third possibility is to replace the errors by their ranks over the 16 model versions; one could use this procedure as a descriptive tool or use available nonparametric statistics to base inferences on these ranks. The general impression we gained from these analyses is that the results show no clear pattern, no clearly superior version of TRIM2. For various outputs, different versions show alternately strong and weak results, but there are no general patterns.
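The rank-based comparison can be sketched as follows. This is a descriptive illustration only: the discrepancy measures are three rows of Table 9-4 (runs 1-16), used as stand-ins because they are printed per run, and the helper `average_ranks` is an assumption introduced here, not part of the panel's analysis.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Rows of Table 9-4 for runs 1-16 (smaller = closer to the 1987 IQCS).
DISCREPANCY = {
    "No. of adults": [10, 27, 39, 45, 11, 27, 41, 45,
                      33, 8, 45, 48, 35, 39, 48, 51],
    "Gross income of unit": [568, 508, 461, 396, 506, 456, 423, 362,
                             436, 508, 362, 310, 419, 412, 338, 287],
    "Size of benefit": [85, 114, 101, 106, 87, 116, 108, 109,
                        50, 85, 120, 119, 99, 127, 121, 119],
}

ranks_by_variable = {name: average_ranks(vals) for name, vals in DISCREPANCY.items()}

# Mean rank of each run across the variables; ranks remove the scale differences.
mean_rank = [
    sum(ranks[run] for ranks in ranks_by_variable.values()) / len(ranks_by_variable)
    for run in range(16)
]
for run, r in enumerate(mean_rank, start=1):
    print(f"run {run:2d}: mean rank {r:.2f}")
```

Ranking puts variables of very different magnitude on a common footing, which is the difficulty with the raw tabular display noted above. A version that ranks well on one variable and poorly on another ends up with a middling mean rank, which is one way the absence of a clearly superior version shows up.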

We also examined, for some of the frequency table outputs, how closely the various versions of TRIM2 corresponded to the same frequencies from the 1987 IQCS. Note that here we are not examining estimates of change because there is no simple way to examine changes in a distribution. (We dichotomized some of these frequencies in the previous analysis, which is one way of focusing on a characteristic of a frequency table that makes analysis of change feasible.)

Table 9-4 presents the χ² test of independence for a 2 x r contingency table, in which one row contains the frequencies from one model version and the second row the frequencies from the IQCS data for 1987. Notationally,

$$Q = \sum_{i=1}^{2} \sum_{j=1}^{r} \frac{\left(n_{ij} - n_{i\cdot}\, n_{\cdot j}/N\right)^{2}}{n_{i\cdot}\, n_{\cdot j}/N},$$

in which $n_{ij}$ is the number of persons in category j estimated by run i; $n_{i\cdot}$ is the sum over categories, that is, the total number of people "produced" by run i; $n_{\cdot j}$ is the sum over runs, that is, the total number of people in a certain category; and N is the total number of people "produced" by the two runs (in which one run, in this case, is the IQCS data). The statistic Q has a chi-square distribution with r - 1 degrees of freedom. (This test can also be used for more than two sets of estimates, if desired.) This statistic, along with the associated degrees of freedom, is presented in Table 9-4 for all 16 model versions, for a variety of model outputs.
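A minimal sketch of this computation, assuming Q is the usual Pearson statistic for a 2 x r table built from the quantities defined above. The function name `chi_square_q` and the counts are hypothetical and are not TRIM2 or IQCS output.

```python
def chi_square_q(row_a, row_b):
    """Pearson chi-square statistic for a 2 x r table.

    Returns (Q, degrees of freedom). Assumes every category has a
    nonzero combined count, so that no expected frequency is zero.
    """
    if len(row_a) != len(row_b):
        raise ValueError("both rows need the same categories")
    table = [row_a, row_b]
    row_totals = [sum(row) for row in table]             # n_i.
    col_totals = [a + b for a, b in zip(row_a, row_b)]   # n_.j
    n_total = sum(row_totals)                            # N
    q = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n_total
            q += (observed - expected) ** 2 / expected
    return q, len(row_a) - 1


# Hypothetical counts, for illustration only.
model_counts = [30, 50, 20]
iqcs_counts = [40, 40, 20]
q, df = chi_square_q(model_counts, iqcs_counts)
print(f"Q = {q:.3f} on {df} degrees of freedom")
```

Identical distributions in the two rows give Q = 0, and Q grows as the model's distribution departs from the comparison distribution. The value is then read against a chi-square reference with r - 1 degrees of freedom.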

One difficulty in the interpretation of the χ² statistics shown is that the statistics for different rows are not comparable because they are associated with different degrees of freedom. However, one can compare model versions within each row and form an assessment of whether some model versions consistently outperform others. (Table 9-4 provides the χ² values for degrees of freedom from 1 through 8 at the 99 percent confidence limit. A higher value in the table for a model version indicates that it differs from the 1987 IQCS by an amount greater than one could expect by chance.) For some variables, such as gross income of unit, all model versions obviously differ substantially from the IQCS.

As in the analysis of change, we found that no model version appears to have any noticeable advantage over other versions. For example, model 1 (the current version of TRIM2) has a low χ², relatively speaking, for the dichotomized variables unit size, number of adults, number of children, age of head of household, marital status of head of household, and size of benefit.

TABLE 9-4 χ² Goodness-of-Fit Statistics for Distributions, from TRIM2 Validation Experiment

	Run Identification
Variable	1	2	3	4	5	6	7	8
Total no. in unit	26	37	28	28	27	36	28	27
No. of adults	10	27	39	45	11	27	41	45
No. of children	2	5	6	6	2	4	5	5
Age of youngest child	11	14	12	12	11	16	13	12
Gross income of unit	568	508	461	396	506	456	423	362
Earnings of adults	15	19	18	8	14	26	23	17
Type of AFDC unit	15	3	13	15	17	3	12	14
Race of head	4	9	2	2	6	10	2	2
Sex of head	28	19	1	0	31	21	1	0
Age of head	30	47	45	48	29	49	45	46
Relationship of unit head to household head	1.2	0.9	0.9	0.9	1.2	0.9	0.9	0.9
Marital status of head	1	1	15	20	1	2	16	19
Size of benefit	85	114	101	106	87	116	108	109
* D.F. indicates degrees of freedom. The χ² values at the 99 percent confidence limit are as follows (a higher value in the table indicates that a model version differs from the 1987 IQCS by an amount greater than one could expect by chance):

D.F. = 1, χ² = 6.635	D.F. = 5, χ² = 15.086
D.F. = 2, χ² = 9.210	D.F. = 6, χ² = 16.812
D.F. = 3, χ² = 11.341	D.F. = 7, χ² = 18.475
D.F. = 4, χ² = 13.277	D.F. = 8, χ² = 20.090

Therefore, for these variables, model 1 was one of the more successful models for approximating the IQCS data. At the same time, model 1 has a relatively high χ² for the variables gross income of unit, type of unit, and sex of head of household. This lack of general superiority or inferiority is true for all 16 model versions.

TABLE 9-4 (continued)

Variable	9	10	11	12	13	14	15	16	IQCS83	D.F.*
Total no. in unit	48	19	40	40	49	49	42	41	73	4
No. of adults	33	8	45	48	35	39	48	51	438	2
No. of children	4	6	6	6	6	5	6	6	41	3
Age of youngest child	12	10	5	5	10	7	6	6	77	4
Gross income of unit	436	508	362	310	419	412	338	287	259	8
Earnings of adults	16	38	8	13	20	24	20	29	426	8
Type of AFDC unit	9	0	5	6	12	7	5	6	75	2
Race of head	7	16	6	8	5	5	5	6	493	3
Sex of head	14	3	0	0	20	19	0	0	48	1
Age of head	115	101	40	40	32	38	39	40	272	7
Relationship of unit head to household head	1.3	1.1	0.9	0.9	1.2	0.9	1.0	1.0	0	4
Marital status of head	1	4	20	23	0	1	21	24	57	1
Size of benefit	50	85	120	119	99	127	121	119	2015	7

The last column of Table 9-4 displays the results from using the 1983 IQCS, which assumes that the characteristics of the caseload in 1987 remain unchanged from those in 1983. Under some circumstances, the comparison values for the beginning of a period provide an interesting challenge to the model versions. If a study examines a situation in which a substantial policy change has occurred, the performance of the old IQCS data provides a standard that a reasonable model should exceed, namely, the model should do better than an estimate based on the assumption of no change over the period. Our experiment (as discussed further, below) examined a situation in which the policy change was modest, but economic changes in the period were relatively large. Under these circumstances, the comparison is less important because the noise is in some sense too large a fraction of the signal. Nevertheless, the 1983 IQCS data set does not compete well with the 16 versions of TRIM2 in the analysis (shown in Table 9-4), but it does outperform many TRIM2 versions in estimating total participants and other aggregates (see Cohen et al., in Volume II). In many situations, this type of comparison is extremely informative in providing a naive estimate of how well one can do with a very simple model. Also, this comparison provides an estimate of how much variability is natural to the problem, which can be compared with the variability left unexplained by the model versions.
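The no-change benchmark can be read off Table 9-4 mechanically. The sketch below takes three rows of that table together with the 99 percent critical values from its note. For each variable it counts how many of the 16 runs exceed the critical value and how many come closer to the 1987 IQCS than the 1983 IQCS does. The function name `summarize` and the choice of rows are illustrative assumptions.

```python
# 99 percent critical values of chi-square, from the note to Table 9-4.
CRITICAL_99 = {1: 6.635, 2: 9.210, 3: 11.341, 4: 13.277,
               5: 15.086, 6: 16.812, 7: 18.475, 8: 20.090}

# variable: (statistics for runs 1-16, statistic for IQCS83, degrees of freedom)
ROWS = {
    "No. of children": ([2, 5, 6, 6, 2, 4, 5, 5, 4, 6, 6, 6, 6, 5, 6, 6], 41, 3),
    "Type of AFDC unit": ([15, 3, 13, 15, 17, 3, 12, 14, 9, 0, 5, 6, 12, 7, 5, 6], 75, 2),
    "Gross income of unit": ([568, 508, 461, 396, 506, 456, 423, 362,
                              436, 508, 362, 310, 419, 412, 338, 287], 259, 8),
}


def summarize(runs, naive, df):
    """Compare model runs with the critical value and with the no-change benchmark."""
    crit = CRITICAL_99[df]
    return {
        "runs_beyond_chance": sum(q > crit for q in runs),
        "runs_beating_naive": sum(q < naive for q in runs),
        "naive_beyond_chance": naive > crit,
    }


for name, (runs, naive, df) in ROWS.items():
    print(name, summarize(runs, naive, df))
```

The three rows were picked to show contrasting cases: one where every run sits below the critical value, one where the runs are split, and one (gross income of unit) where every run differs from the 1987 IQCS by more than the 1983 IQCS does.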

Limitations of the Experiment

Our experiment was designed both to illustrate the types of methods that can be used to validate a model's outputs and to provide some indication of the current performance of TRIM2. With respect to the second of these goals, the experiment is limited in a variety of ways.⁸ First, the experiment studied TRIM2 during only one time period. With respect to estimating error properties, even in a replicable situation, one replication is very limiting: a model can perform better than it would on average, or worse. Moreover, forecasting situations should not be considered replicates without further investigation. Different time periods will typically present different challenges to a model. In particular, any characteristics of either the 1984 or 1988 March CPS data, or of the period under study, that are peculiar to those data sets or to that time period reduce the opportunity for generalizing from our results.

With regard to peculiarities of the data, an analysis conducted as part of the experiment showed that simulations of 1987 law using the 1983 baseline file sometimes outperformed simulations of 1987 law using the 1987 baseline file. This finding triggered a more extensive analysis of the quality of the March 1988 and 1984 CPS data (see Giannarelli, 1990). That analysis documented that the March CPS files typically include some states that have insufficient simulated units eligible for AFDC compared with administrative counts of participants—a phenomenon that complicates the calibration effort. It turned out that the March 1988 CPS had an unusually large number of such states (8), which made the calibration in that year less successful than usual. The number of simulated eligible units dropped by large percentages in some states from the previous year (e.g., by 29% in Connecticut, 22% in Michigan, and 32% in New Mexico). Although sampling variability in the CPS appears to explain some of these changes, Michigan's drop remains something of a mystery.

It is important to point out that, although microsimulation modelers are aware of quality problems with the March CPS data, they do not regularly investigate changes in quality from year to year. Hence, we have an example of how validation can lead to more pointed validation and to identification of problems for further investigation and possible correction.

The period from 1983 to 1987 was also special because of the large drop in the unemployment rate, from 8 percent in 1983 to 5 percent in 1987, and the differential impact of the change in unemployment on different subpopulations. We note that the changes in welfare program regulations between 1983 and 1987 were relatively minor, which limits our findings to periods when there are few changes in law.

⁸	Indeed, the limitations of the experiment, including that only one time period was examined, constrained our analysis to emphasize primarily descriptive rather than inferential techniques and interpretations of the data. However, inferential analysis, including hypothesis testing, with the goal of identifying differences and patterns that are statistically significant versus those that are not (which is not often possible with the descriptive approaches that we used) has a great deal to offer when the number of replications increases. We certainly encourage the use of inferential techniques, such as nonparametric analysis of variance (see, e.g., Lehmann, 1975) and anticipate that expertise with respect to which models and techniques are most applicable will follow as experience is gained with these types of validation studies.

The limitation imposed by having only one time replication is somewhat offset by having outputs for several characteristics that focus on fairly distinct portions of the model and by having outputs for states that can serve as mini models (for which we have not done an extensive analysis). Clearly though, there are circumstances that would cause a model to perform poorly for all states for a single time period but would not be indicative of the model's overall efficacy, due to the likely correlation among states for some response variables.

The limitations resulting from examining TRIM2 at only one time help demonstrate that validation should not be an occasional examination of a model. Rather, model validation should be a continuous process that accumulates knowledge about potential model weaknesses, the size of errors, and situations in which a model is less reliable.

Another limitation of our experiment was that, in addition to simulating just one time period, we simulated just one change in law. Had time and resources permitted, we would have liked to simulate an alternative policy— for example, a mandated AFDC minimum benefit nationwide—that would have represented a major policy change. Of course, we could not have conducted an external validation of such an alternative policy because it was never enacted. However, we could have conducted sensitivity analyses of both policies and obtained information relevant to the question of whether, in fact, microsimulation model estimates of differences between two policies are less variable than estimates for a particular policy, because of sources of error affecting both policies to about the same extent.
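The question raised here can be stated with a standard identity. Writing $\hat{A}$ and $\hat{B}$ (labels introduced only for this illustration) for the model's estimates under the two policies, the variance of the estimated difference is

$$\operatorname{Var}(\hat{A} - \hat{B}) = \operatorname{Var}(\hat{A}) + \operatorname{Var}(\hat{B}) - 2\operatorname{Cov}(\hat{A}, \hat{B}).$$

If sources of error affect both policies to about the same extent, the covariance term is large and positive, and the difference is less variable than the estimate for policy A alone whenever $2\operatorname{Cov}(\hat{A}, \hat{B}) > \operatorname{Var}(\hat{B})$. Whether that condition holds for a given model and pair of policies is an empirical matter of the kind that sensitivity analyses of both policies could inform.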

Yet another limitation of our experiment (noted above) is that the surrogate for the truth that we used throughout may have weaknesses. First, the IQCS data are from a sample survey and therefore subject to sampling error. They are also subject to bias from a number of sources. For example, different states have different collection procedures for their quality control data, which may lead to different kinds of biases across states. In our experiment, we ignored the problems raised by the use of an imperfect surrogate for the truth. Whenever feasible, analysts should search for the sources of major discrepancies between model estimates and the comparison values in both data systems. At the same time, analysts should ignore discrepancies between model estimates and the comparison values that are smaller than what would be explained by ordinary sampling variability.

Another limitation of the experiment was that we examined only one model, due to the time and resource constraints under which we, the sponsoring agencies, and the agencies' contractors operated. Ideally, it would have been desirable to expand the experiment to include other major models in use today, such as MATH, HITSM, DYNASIM2, PRISM, and MRPIS. However, we should note that not all of these models are directly comparable because they are often used for different purposes and, by design, cannot necessarily all produce estimates of the same variables. Therefore, adding the model class as another factor in an analysis of variance, for example, may not generally be feasible. However, in cases in which models are directly comparable, one could certainly expand our basic experiment in that direction.

If one could overcome the problems of comparability, there are major advantages to be gained through comparing the effectiveness of different broad modeling strategies. For example, the question of the relative advantages of static versus dynamic aging could be addressed in this way. Comparing several models with the truth (or a reasonable surrogate), along with a comprehensive analysis of model differences exemplified by Haveman and Lacker (1984), will often yield great insight into the strengths and weaknesses of the various models.

Major Conclusions

The panel's experiment was successful in demonstrating that microsimulation model validation is feasible. With current methods, analysts can measure the degree of variability attributable to the use of alternative components. Such information helps indicate overall model uncertainty, as well as which components to examine further to make improvements to a model. Thus, sensitivity analysis methods, especially when augmented with comparison values in an external validation, provide a great deal of data with which to direct efforts at model development as well as to measure model uncertainty.

Our experiment demonstrated that there is considerable uncertainty due to changes in the three modules we studied in TRIM2. Therefore, the choice of which model version to use makes a difference. Yet it is not clear that any of the 16 versions has any advantage over the others. Certainly, for individual responses, particular versions fared better. However, given that the experiment is only one replication, it would be foolish to assume that our results provide confirmation of any real modeling advantage.

Because the experiment did not attempt to measure the variance of any of the versions of TRIM2, we have no idea of the relative sizes of various sources of uncertainty in relation to variance. Therefore, it is difficult to assign a priority to development of variance estimates vis-à-vis use of sensitivity analysis. We do believe that it is important to investigate both.

We stress that our experiment was purely illustrative. The benefits from a continued process of validation are rarely evidenced through study of a single situation. There is an important question about the degree to which different studies of this sort of the same model in different modeling situations would represent replications in any sense. However, even if the studies are not replications, use of these methods will provide evidence of general trends in model performance. Their use will generate a great deal of information as to the situations under which a model performs well and provides accurate information.

While we have made a convincing case for the feasibility of sensitivity analyses and external validation, the experiment was not cheap. The Urban Institute estimated staff costs (including overhead) to conduct the experiment of about $60,000 for 1,400 person-hours of effort or, roughly, 35 person-weeks. These estimates are probably low because it was difficult for the Urban Institute to separate activities that were needed for the experiment from their own day-to-day efforts; in addition, they do not take into account the time taken to specify the experiment and analyze the data. Moreover, we do not have estimates of computer costs. Overall, it is clear that the way in which TRIM2 (and most other microsimulation models) is currently configured can make a sensitivity analysis very costly.

The costs were dramatically affected by our interest in trying out different forms of aging. The overall cost would have been substantially reduced, possibly by a factor of 2 or 3, had an easier module been selected for the experiment. But there were other modules that we did not investigate because the costs of working with them would have been higher still. It is obvious that, for model validation to become a routine part of the model development and policy analysis process, the structure of the next generation of models must facilitate the type of module substitution that is used in sensitivity analysis.

In summary, our experiment gave mixed signals on the effectiveness of TRIM2. That TRIM2 is sensitive to the inclusion or exclusion of various factors is apparent. Our results suggest that, in some instances, nothing was gained by implementing TRIM2 rather than available IQCS data. However, in other instances, TRIM2 indeed provided very valid estimates. Our main goal was to show how one might undertake sensitivity and validity studies for microsimulation models. It is quite reasonable to speculate that similar studies on other microsimulation models will produce comparably mixed results, namely, that the model under study will prove to be useful for some variables but not as good as had been believed for others. This knowledge can only be valuable to the analysts using models to inform policy makers as well as to those involved in making improvements to the models. We have made a small start toward this end.

STRATEGIES FOR VALIDATING MICROSIMULATION MODELS: RECOMMENDATIONS

Our validation study of TRIM2 illustrates both the benefits and the costs of serious attempts to investigate the quality of estimates produced by microsimulation models. The benefits, even in our very limited study, seem clear to us. We determined that TRIM2 estimates are sensitive to alternative choices for model components. We also observed weaknesses in the March CPS database for modeling income support programs. These weaknesses in the March CPS were more or less well known, but our findings underscore the need to investigate them further and to take corrective action of some kind.

The costs of validation are also evident from our experiment, in terms of both time and resources. Indeed, the kinds of external validation studies, sensitivity analyses, and variance estimation procedures that we outline for microsimulation models may well appear to involve a dismayingly high expenditure of staff and budget resources, particularly in light of the limited resources that have been allocated to these activities in the past. Clearly, in the context of an ongoing policy debate, there is no possibility of applying such evaluations to even a fraction of the estimates and proposals that are modeled.

However, we believe that the cost-benefit ratio for microsimulation model validation can be improved substantially through several mechanisms. First, implementation of the next generation of models with new computer technology, as recommended in Chapter 7, should dramatically reduce the costs and increase the scope of feasible validation studies, particularly if the modeling software is designed—as it should be—with validation in mind. Second, improvements in model documentation and archiving that we recommend (see Chapter 10) should make it easier to carry out validation studies, particularly external validations. Third, academic researchers should find model validation questions of considerable interest, particularly when they are able to access the models directly through new technology, and their work on validation methodology and applications should prove fruitful. Finally, we believe that, as policy analysis agencies and their contractors gain more experience with validation, the task will become easier and more rewarding, particularly when validation results prove helpful in making decisions about priorities for investment in models.

Although the greatest improvements in microsimulation modelers' ability to carry out validation studies will come with the implementation of new technology, we believe that more validation can and should be accomplished in the short term with the current models. We outline a set of institutional arrangements that we believe will facilitate cost-effective model validation in the near term. We also recommend agency support of research on model validation methods and agency adoption of the "quality profile" concept as a way of communicating information about model strengths and weaknesses to a broad user community and a way of organizing a continuing program of model validation targeted to priority areas for improvement. We urge policy analysis agencies to allocate the necessary resources and make the needed commitment so that validation becomes a regular part of the microsimulation modeling enterprise.

Institutional Arrangements for Model Validation

In formulating our recommendations for model validation, we took cognizance of the severe time pressures for producing estimates that characterize the policy analysis process. We also took cognizance of the relatively limited capabilities of current models for cost-effective validation. Hence, we do not suggest that each and every set of policy estimates be evaluated, either in real time or after the fact; such a recommendation would deservedly be ignored. We present instead a plan for a reasonable approach to the validation task.

In our view, major contracts for the development, maintenance, and application of microsimulation models should target a percentage of funds sufficient to carry out validation studies that can provide useful information to analysts and policy makers who are engaged in shaping legislation on a real-time basis. For those agencies that maintain and apply their own models in-house, rather than contracting for these services, the agency should allocate its own modeling budget in this way.⁹ These contracts should also include an allocation of funds earmarked to implement model revisions on the basis of the results of validation studies conducted by the modeling contractor and others.

With regard to specific types of validation, the contractor would be expected to provide estimates of variability (such as a bootstrap estimate) and the results of sensitivity analyses for key sets of model estimates, where an example of a "key set" might be the first set of estimates prepared for the initially proposed version of the Family Support Act. The contractor would not hold up delivery of these estimates until the validation was finished, but would endeavor to complete the validation as soon as possible. The results of the validation for one set of estimates would help interpret the quality of the estimates for alternative proposals (unless major new provisions were added). The sensitivity analysis would focus on those model components the analysts believe are most likely to have an impact on the particular set of estimates. In other words, the validation performed by the contractor would be "rough and ready," focused on helping to inform the policy debate. (See Chapter 3 for a discussion of the issues involved in communicating validation results to decision makers.)
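To make "a bootstrap estimate" of variability concrete, here is a minimal sketch under strong simplifying assumptions. The records are synthetic (weight, simulated benefit) pairs, and the resampling treats them as a simple random sample, which ignores the complex design of a survey such as the March CPS. The names `weighted_total` and `bootstrap_se` are hypothetical, and in a real application the estimator would be a full model run on each resampled file.

```python
import random
from statistics import stdev


def weighted_total(records):
    """Weighted sum of simulated benefits over (weight, benefit) records."""
    return sum(weight * benefit for weight, benefit in records)


def bootstrap_se(records, estimator, n_reps=500, seed=0):
    """Standard deviation of the estimator over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(records)
    replicates = []
    for _ in range(n_reps):
        resample = [records[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return stdev(replicates)


# Synthetic microdata, for illustration only: about a quarter of the units
# receive a benefit of 300, and the rest receive nothing.
data_rng = random.Random(42)
records = [
    (data_rng.uniform(500, 1500), data_rng.choice([0.0, 0.0, 0.0, 300.0]))
    for _ in range(200)
]

estimate = weighted_total(records)
se = bootstrap_se(records, weighted_total)
print(f"estimate = {estimate:,.0f}, bootstrap SE = {se:,.0f}")
```

A figure produced this way reflects only sampling variability in the input file. It says nothing about error from the choice of modules, which is what the sensitivity analyses address, so the two kinds of information complement each other.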

In addition to the validation efforts performed by the modeling contractor, we believe it is essential for policy analysis agencies to commission independent validation studies that include external validation as well as sensitivity analysis. In principle, independent evaluation is preferable to evaluation performed by the developer and user of a model (just as academic journals appoint independent reviewers for articles submitted for publication). In practice, independent evaluation is preferable as well, given the pressures confronting a modeling contractor to respond to insistent and frequently changing policy demands for large volumes of estimates prepared within short time frames. Agency staff could carry out independent validation studies, but they, too, are usually under severe pressures from the demands of current policy issues.

⁹	About 10-15 percent of funds might suffice for validation activities on an ongoing basis. However, given the relative lack of investment in microsimulation model validation to date, it may be that the percentage of funds earmarked in major contracts for validation purposes should be higher until sufficient experience is gained with validation techniques.

Hence, our recommendation is that, for every major microsimulation model development, maintenance, and application contract that policy analysis agencies let, they also let another contract to a separate organization to carry out longer term, more comprehensive validation studies of the particular model(s). The validation contractor would be expected to carry out external validation studies of selected model estimates and also to conduct extensive sensitivity analyses in order to identify areas for needed model improvement or revision.

To implement a program of independent evaluations will clearly entail working out a number of practical matters. There will need to be ways to guard against conflicts of interest, such as a validation contractor deliberately downgrading a model in order to boost the chances for that contractor's own model winning the next bidding round. Most important, there will need to be cost-effective ways of providing validation contractors with access to the models they are evaluating and to modeling experts, without impairing the ability of the modeling contractors to respond to agency needs for real-time policy analysis. A possible approach is for a knowledgeable programmer to bring a second copy of the model to work on-site with the validation contractor; alternatively, staff of the validation contractor could work on-site with the modeling contractor. We are confident that workable arrangements can be devised. Looking ahead, we note that implementation of a new generation of models with computer technology that facilitates their use should make it much easier to deal with these problems and to expand the scope of the validation studies that are feasible to perform. Enhanced documentation will also facilitate independent validation of the type that we describe.

Recommendation 9-1. We recommend that policy analysis agencies commit sufficient resources and accord high priority to studies validating the outputs of microsimulation models. Specifically, we recommend:

Agencies, in letting major contracts for development, maintenance, and application of microsimulation models, should allocate a percentage of resources for model validation and revisions based on validation results. The types of validation studies to be carried out by the modeling contractor should include estimates of variance and focused sensitivity analyses of key sets of model outputs. The goal of these efforts should be to provide timely, rough-and-ready assessments of selected estimates that are important for informing current policy debates.
In addition, agencies, when practical, should let separate microsimulation model validation contracts to independent organizations or in other ways arrange to carry out comprehensive, in-depth evaluations. The types of studies to be performed by a validation contractor should include external validation studies that compare model outputs with other values and detailed sensitivity analyses. The goal of these longer range efforts should be to identify priority areas for model improvement.
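
The two kinds of study named in Recommendation 9-1 can be illustrated with a small Python sketch. Everything in it is a hypothetical assumption made for illustration: the toy program rule in `simulate_participants`, the parameter names, the household data, and the benchmark figure. It is not drawn from any actual microsimulation model. The sensitivity analysis varies one parameter at a time and reports the relative change from the baseline estimate; the external validation step compares the baseline estimate with an outside value.

```python
def simulate_participants(households, income_limit, participation_rate):
    """Hypothetical model: weighted count of eligible households that participate."""
    eligible = sum(h["weight"] for h in households if h["income"] < income_limit)
    return eligible * participation_rate


def sensitivity_analysis(households, baseline, alternatives):
    """Vary one parameter at a time; return the baseline estimate and the
    relative change in the estimate for each alternative value."""
    base = simulate_participants(households, **baseline)
    changes = {}
    for name, values in alternatives.items():
        for value in values:
            params = dict(baseline, **{name: value})  # override a single parameter
            estimate = simulate_participants(households, **params)
            changes[(name, value)] = (estimate - base) / base
    return base, changes


def relative_error(model_estimate, benchmark):
    """External validation: relative difference between model and benchmark."""
    return (model_estimate - benchmark) / benchmark


# Toy microdata (assumed): 300 households with incomes 0, 10, ..., 2990.
households = [{"income": 10.0 * i, "weight": 100.0} for i in range(300)]
baseline = {"income_limit": 1000.0, "participation_rate": 0.6}
alternatives = {
    "income_limit": [900.0, 1100.0],
    "participation_rate": [0.5, 0.7],
}

base_estimate, changes = sensitivity_analysis(households, baseline, alternatives)
for (name, value), change in changes.items():
    print(f"{name}={value}: {change:+.1%} relative to baseline")

# Made-up benchmark standing in for, say, an administrative caseload count.
benchmark = 5500.0
print(f"relative error vs. benchmark: {relative_error(base_estimate, benchmark):+.1%}")
```

In this toy setting the estimate responds more strongly to the participation rate than to the income limit, which is the kind of finding that could point to a priority area for model improvement.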

Research on Model Validation Methods

In addition to letting the kinds of validation contracts that we describe above, it would be useful for policy analysis agencies to support research specifically designed to develop improved methods for microsimulation model validation. For example, a useful topic for investigation could be ways to increase the cost-effectiveness of techniques such as the bootstrap for estimating the variance in model outputs. For this kind of work, the agencies could let separate methodological research grants to academic researchers. The agencies could also take steps to interest the National Science Foundation and perhaps the National Institute of Standards and Technology in supporting this type of research, which could well have application for validation of complex models in other fields.
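
As a minimal sketch of the bootstrap idea mentioned above, the following Python example resamples a toy microdata file with replacement and reruns a hypothetical simulation on each replicate. The function names (`simulate_total_benefits`, `bootstrap_variance`), the flat-benefit eligibility rule, and the data are assumptions made for illustration and do not describe any actual model. The sketch captures only sampling variability in the input data, not other sources of error.

```python
import random
import statistics


def simulate_total_benefits(households, income_limit=1000.0, benefit=200.0):
    """Hypothetical model: each household below the income limit receives a
    flat benefit; the result is the weighted total cost."""
    return sum(h["weight"] * benefit for h in households if h["income"] < income_limit)


def bootstrap_variance(households, model, n_replicates=200, seed=0):
    """Estimate the variance of a model output by resampling households
    with replacement and rerunning the model on each replicate."""
    rng = random.Random(seed)
    n = len(households)
    estimates = []
    for _ in range(n_replicates):
        replicate = [households[rng.randrange(n)] for _ in range(n)]
        estimates.append(model(replicate))
    return statistics.variance(estimates), estimates


# Toy microdata (assumed): 500 households with uniformly distributed incomes.
data_rng = random.Random(42)
households = [
    {"income": data_rng.uniform(0.0, 3000.0), "weight": 100.0} for _ in range(500)
]

point_estimate = simulate_total_benefits(households)
variance, replicates = bootstrap_variance(households, simulate_total_benefits)
standard_error = variance ** 0.5

print(f"point estimate: {point_estimate:,.0f}")
print(f"bootstrap standard error: {standard_error:,.0f}")
```

Each replicate requires a full run of the model, so the cost grows in proportion to `n_replicates`. For a large model that cost is one reason the cost-effectiveness of such techniques is a natural research topic.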

Such work should be attractive to researchers, although the difficulty of providing them with access to current microsimulation models is an impediment. In the short term, perhaps an effective strategy would be to support fellowships for researchers to carry out methodological work on-site with the agencies' modeling contractors. The fellowships could be similar to those currently offered by several federal statistical agencies, including the Bureau of Labor Statistics, the Census Bureau, and the National Center for Education Statistics. (These programs are supported by a combination of National Science Foundation and agency funds and are administered through the American Statistical Association.) Over the longer term, on the assumption that the next generation of models is successfully implemented with new computer technology, the agencies should find it quite easy to attract academic interest in the kinds of methodological work needed to improve model validation methods. Indeed, academic researchers would then also be able to conduct model validations themselves.

Recommendation 9-2. We recommend that policy analysis agencies provide support, through such mechanisms as grants and fellowships, for research on improved methods for validating microsimulation model output.

Quality Profiles

Finally, as a way of organizing an ongoing, comprehensive program of validating microsimulation models and communicating the results of validation studies to users, we urge policy analysis agencies to adopt a concept that is gaining ground in statistical agencies, namely, that of developing "quality profiles." A quality profile is a document that brings together all of the available information about sources of error that may affect estimates from a survey or other data collection effort (see Bailar, 1983, and the discussion in Chapter 5). The profile identifies measures and procedures for monitoring errors; assembles what is currently known about each source of error and its impact on the estimates; provides comparisons with estimates from other data sources; and outlines needed research and experimentation designed to gain better understanding of sources of error and to lead to the development of techniques to reduce their magnitude.

Clearly, developing a profile for a microsimulation model is a much bigger task than developing one for a single survey; however, the effort to conceptualize the sources of error and to bring together what is known about them in a document can be very helpful. Analysts who make use of the model output can benefit from the knowledge and caveats provided in a quality profile; model developers can use a profile to guide methodological work on understanding and reducing sources of error and to build a cumulative body of knowledge about their models' strengths and weaknesses.

Recommendation 9-3. We recommend that policy analysis agencies support the development of quality profiles for the major microsimulation models that they use. The profiles should list and describe sources of uncertainty and identify priorities for validation work.
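
One way to picture the contents of a quality profile is as a simple structured record. The Python sketch below is only an illustration under stated assumptions: the class names, field names, model name, and example entries are all hypothetical and are not taken from any existing profile. It shows a profile that lists sources of uncertainty and orders them by validation priority, as Recommendation 9-3 describes.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorSource:
    """One source of uncertainty recorded in a profile (hypothetical fields)."""
    name: str
    description: str
    known_impact: str = "not yet assessed"
    validation_priority: int = 3  # 1 = highest priority


@dataclass
class QualityProfile:
    """A quality profile for one model: its error sources and their priorities."""
    model_name: str
    sources: list = field(default_factory=list)

    def add(self, source: ErrorSource) -> None:
        self.sources.append(source)

    def priorities(self) -> list:
        """Return the error sources ordered from highest to lowest priority."""
        return sorted(self.sources, key=lambda s: s.validation_priority)


# Illustrative entries only; the model name and assessments are made up.
profile = QualityProfile("EXAMPLE-MODEL")
profile.add(ErrorSource(
    "input sampling variability",
    "Variance arising from the survey sample used as the input database.",
    validation_priority=2,
))
profile.add(ErrorSource(
    "participation assumptions",
    "Uncertainty in assumed program participation rates.",
    validation_priority=1,
))
profile.add(ErrorSource(
    "imputation of missing items",
    "Error introduced when missing survey responses are imputed.",
))

for source in profile.priorities():
    print(f"[{source.validation_priority}] {source.name}: {source.known_impact}")
```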