
VARIANCE ESTIMATION OF MICROSIMULATION MODELS THROUGH SAMPLE REUSE 247
alternative projections. Even so, the Census Bureau still provides a range, without specified coverage probabilities, for its population projections that derives from a sensitivity analysis.
The identification of factors that properly contribute to variance, or those that properly belong to the definition of the model and therefore contribute to bias and should be investigated through use of a sensitivity analysis, is not always straightforward. Certainly, any decisions that are made after examining the data are properly considered to add to the variance of the results. So, for example, the decision of how to use control totals after viewing the preliminary results contributes to variance. But the question of what happens before or after seeing the data applies to a small fraction of the decisions made.
To give some flavor of the difficulties that can arise, suppose one decides to deflate some income figures in a survey used in a microsimulation model. The choice of which series of deflators to use could be considered to be a feature of the model and not one that contributes to variance. However, the various deflation factors have a variance associated with them (since they are often derived from a sample survey), and there is more than one series of deflators that could be used. So one could conceivably estimate the variance due to the variability of each deflator and one could also estimate the variability due to the choice of deflator.
There is no clear resolution of these matters. The definition of model, the level of comfort of a statistician with subjective assignment of probabilities, and the intended use of the conditional variances and conditional confidence intervals are all factors that contribute to a decision for particular situations. The most important point is not to ignore sources of variability.
BOOTSTRAPPING MICROSIMULATION MODELS
This section discusses the application of the bootstrap and other sample reuse methods to the estimation of the variance of microsimulation models. As noted above, the variance of the results from microsimulation models stems from several sources. To start, assume that one is primarily concerned with the variability that results from the sampling involved in generating the primary input data set. This approach ignores the variability from the use of various control totals and regression equations, from the use of imputation and statistical matching, from the use of demographic and macroeconomic projections, and from the use of aging modules. These sources of variability are likely to dominate the variability due to sampling in the primary input database (although the relative magnitude of the various sources of uncertainty is currently unknown).
To assess the variance due to the sample variance in the primary input data set, one can create something like bootstrap pseudosamples or half-sample replicates and then run the microsimulation model for the current program and for the proposed program to arrive at differences for each of the individuals on each of the replicate data sets. The variance of various functions of these differences, including the total difference, can then be computed as indicated above. (Note that this process is not simply estimating the variance of a weighted sum of differences because the input data set can affect the use of the control totals, the participation formula, the aging routine, etc.) This procedure would provide a reasonable estimate of the variance attributed to the sample variance in the primary input data set.
At this point it is important to point out two peculiarities of the application of the bootstrap to microsimulation. First, many of the differences are equal to 0 since the modifications that are proposed are often targeted at a particular subpopulation. For the people unaffected by the proposed changes, the payments before and after the proposed change are likely to be the same, and the difference will therefore be 0. As a result, the distribution of output values may tend to be fairly nonnormal. This spike at 0 may enable the separation of the bootstrap process into two pieces, the estimation of the frequency of families with no difference and the variability of the change for families with change. However, there are several difficulties with this approach, and more research on it is needed.
Second, the emphasis on estimating the differences between programs is a variance-reducing device for the inference, due to the close correlation of the results from similar programs. The conditional inference that was mentioned above can essentially become less conditional because many of the components have an equal effect before and after the proposed change. For example, the method used to distribute annual incomes to months may not matter much when the proposed change has to do with the age at which an applicant is eligible.
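As a rough illustration of the replicate procedure described above, the following Python sketch resamples a toy input file, runs a stand-in "model" under a current and a proposed rule on each pseudosample, and computes the variance of the total difference across replicates. Everything in it is an assumption made for illustration: the benefit rules, the thresholds, the record layout, and the function names are hypothetical, and simple with-replacement resampling of records ignores the complex design of a real survey. A real application would rerun the full model (reweighting, participation, aging) on each pseudosample.

```python
import random
import statistics


def current_rule(income):
    # Hypothetical current program: benefit phases out at an income of 10,000.
    return max(0.0, 0.3 * (10_000 - income))


def proposed_rule(income):
    # Hypothetical proposal: the phase-out threshold is raised to 12,000.
    return max(0.0, 0.3 * (12_000 - income))


def simulate_total(sample, rule):
    # Toy stand-in for a full microsimulation run: weighted total benefit.
    return sum(weight * rule(income) for income, weight in sample)


def bootstrap_difference_variance(sample, n_reps=200, seed=0):
    """Return the mean and variance of the total difference over pseudosamples."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        # Draw a pseudosample of the same size, with replacement.
        pseudo = rng.choices(sample, k=len(sample))
        # Run both programs on the SAME pseudosample and keep the difference.
        diff = simulate_total(pseudo, proposed_rule) - simulate_total(pseudo, current_rule)
        diffs.append(diff)
    return statistics.mean(diffs), statistics.variance(diffs)


# Synthetic input file of (income, weight) records, for illustration only.
data_rng = random.Random(42)
sample = [(data_rng.uniform(0, 40_000), 1.0) for _ in range(500)]

mean_diff, var_diff = bootstrap_difference_variance(sample)
print(f"mean total difference: {mean_diff:.1f}, bootstrap variance: {var_diff:.1f}")
```

Records with incomes above both thresholds contribute a difference of exactly 0 in every replicate, which is the spike at 0 noted in the text.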
However, the variance reduction gained by focusing on differences between programs does not always apply, since microsimulation models are also used for assessing the costs of new programs.
The above process for estimating the variance due to sampling variance in the input data set fails to take into consideration some potentially large sources of variability. First, microsimulation models reweight the original sample survey for a variety of purposes, especially to statically age the data, to account for undercoverage of populations, and, more generally, to reconcile margins to accepted control totals. In addition, microsimulation models make use of various types of regression equations that have often been estimated on other data sets, for example, to estimate participation rates and impute variables not in the primary database. It is important to recognize that these control totals and regression coefficients sometimes arise from sample surveys and should not always be treated as fixed constants, but as random variables that have an associated variance.
If one first assumed that a microsimulation model uses at most two or three control totals or regression coefficients to modify the original sample survey data set (a rather heroic assumption), one could assume that these estimated parameters are random variables that were generated by a multivariate normal distribution. Presumably, the means could be taken at the observed values, and the variances often would be estimable. The difficulty would be in estimating the correlations. However, once these correlations were estimated, for each pseudosample one would simply draw a random vector from the multivariate normal distribution to arrive at the parameter estimates to use with that pseudosample. This approach is sometimes referred to as a parametric bootstrap, since it assumes a particular family of distributions. (It does not seem important whether one chooses multivariate normality or some other distribution, but it is not certain that this is so.)
A good illustration of the need to estimate this component of the variance is the use of a microsimulation model to estimate the cost of tax-deferred individual retirement accounts (IRAs). A parameter used in that particular model, which was not well estimated, gave the proportion of eligible people who would participate in the program. The underestimation of this parameter caused the cost estimates to greatly underestimate the actual cost of the program. A better procedure would have been to indicate the variability of the estimated program costs as a function of the variability in that estimated parameter.
Clearly, if one were making use of dozens or even hundreds of parameter estimates, the estimation of the covariance matrix becomes essentially impossible, since the number of correlations needing to be estimated is on the order of n², where n is the number of parameters. Since the use of this many control totals and regression coefficients is not that unusual for many microsimulation models, one would have to assume that a large number of the parameter estimates were either constant or independent, which might be reasonable in some situations. But ignoring this contribution to the variability of the estimate could result in overly optimistic variance estimates.
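The parametric bootstrap step described above can be sketched in Python for the simplest case of two estimated parameters. This is only an illustrative sketch under stated assumptions: the toy cost model, the parameter values, their standard errors, and the correlation of 0.3 are all hypothetical, and the function names are invented for the example. For each pseudosample, a participation rate and a control total are drawn from a bivariate normal distribution centered at the observed values. The resulting variance is then compared with the variance obtained when the parameters are held fixed.

```python
import math
import random
import statistics


def draw_bivariate_normal(rng, means, sds, rho):
    # Build two correlated normal draws from two independent standard normals.
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x1 = means[0] + sds[0] * z1
    x2 = means[1] + sds[1] * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return x1, x2


def program_cost(benefits, participation_rate, control_total):
    # Toy model: reweight eligible records to the control total,
    # then apply the participation rate to the weighted benefits.
    eligible = [b for b in benefits if b > 0]
    if not eligible:
        return 0.0
    weight = control_total / len(eligible)
    return participation_rate * weight * sum(eligible)


def bootstrap_cost_variance(benefits, means, sds, rho,
                            redraw_params=True, n_reps=500, seed=0):
    rng = random.Random(seed)
    costs = []
    for _ in range(n_reps):
        pseudo = rng.choices(benefits, k=len(benefits))
        if redraw_params:
            # Parametric step: new parameter vector for this pseudosample.
            rate, total = draw_bivariate_normal(rng, means, sds, rho)
            rate = min(max(rate, 0.0), 1.0)  # keep the rate a valid proportion
        else:
            rate, total = means  # parameters treated as fixed constants
        costs.append(program_cost(pseudo, rate, total))
    return statistics.variance(costs)


# Synthetic per-record benefits; zero means the record is not eligible.
data_rng = random.Random(1)
benefits = [max(0.0, data_rng.gauss(500.0, 400.0)) for _ in range(400)]

means = (0.40, 1_000_000.0)  # hypothetical observed rate and control total
sds = (0.05, 20_000.0)       # hypothetical standard errors
rho = 0.3                    # hypothetical correlation between the two

var_fixed_params = bootstrap_cost_variance(benefits, means, sds, rho, redraw_params=False)
var_with_params = bootstrap_cost_variance(benefits, means, sds, rho, redraw_params=True)
print(var_fixed_params, var_with_params)
```

With these illustrative inputs, the variance that includes the parameter draws is considerably larger than the variance from resampling alone, in line with the warning that ignoring parameter uncertainty yields overly optimistic variance estimates.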
Another contribution to the variability of a microsimulation model is the variability from performing an imputation or a statistical match. For a statistical match, there are two closely related ways in which this component of the variance could be measured (similar ideas apply to imputations). First, one could statistically match each of the pseudosamples, that is, embed the statistical match inside the bootstrap process. This approach requires that the statistical match be performed several times, and for statistical matches that are costly, computationally or otherwise, this would be prohibitive. In particular, this approach would probably not be feasible for constrained statistical matching. Second, when a statistical match is relatively expensive, one could use Rubin's (1986) notion of multiple imputation and create several statistical matches to the original sample within the framework of one run of the statistical match routine. One could then estimate a contribution to the variance from the statistical match and add this variance to the variability due to the remaining sources. However, this approach assumes that the statistical match is independent of the other sources of variability; one could not estimate any interaction of this process with reweighting or other processes. It is hard to say when that assumption would or would not be reasonable.
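The additive calculation just described can be sketched as follows. This is a hedged illustration, not a prescribed method: the function name, the five model estimates, and the value used for the variance from the remaining sources are all hypothetical. The between-match variance is computed from the estimates produced under several alternative statistical matches and added to the variance from the other sources, which presumes the independence noted in the text. The optional factor (1 + 1/m) is the finite-number-of-imputations correction from the usual multiple imputation combining rule; applying it in this setting is an assumption.

```python
import statistics


def combined_variance(estimates_by_match, variance_other_sources,
                      finite_m_correction=True):
    """Add between-match variance to the variance from the remaining sources."""
    m = len(estimates_by_match)
    if m < 2:
        raise ValueError("at least two statistical matches are needed")
    # Variability of the model estimate across the alternative matches.
    between = statistics.variance(estimates_by_match)
    factor = 1.0 + 1.0 / m if finite_m_correction else 1.0
    # Simple addition is valid only if the match is independent of other sources.
    return variance_other_sources + factor * between


# Hypothetical model estimates from five alternative statistical matches.
estimates = [101.0, 99.0, 100.5, 98.5, 101.0]
other_sources = 4.0  # hypothetical variance from sampling, reweighting, etc.

total_plain = combined_variance(estimates, other_sources, finite_m_correction=False)
total_corrected = combined_variance(estimates, other_sources)
print(total_plain, total_corrected)
```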