STATISTICAL MATCHING AND MICROSIMULATION MODELS

matching, although it is possible, but difficult, to apply this technique to the case of constrained statistical matching. Rather than selecting the closest match in file B to each record in file A, identify the closest k records. It is unclear what k should be; it would depend on the size of the classes within which matching is permitted, choosing larger k's for larger classes. It is likely that setting k to values close to 5 would work most of the time. Three statistically matched files can then be created: (1) the usual unconstrained statistical match, using the closest match in file B to every record in file A and assuming conditional independence; (2) a negative conditional correlation statistical match, for which one chooses to match a particular one of the k nearest records in file B to a record in file A, where the record is chosen so that "high" values of Y are paired with "low" values of Z, and vice versa; and (3) a positive conditional correlation statistical match, similar to (2). If there is a particular variable contained in Y and another variable contained in Z that one has primary interest in, "high" and "low" can simply mean above and below that variable's mean. However, if there are several variables contained in Y and Z that are important and if the conditional independence assumption is a concern, then either one could repeat this process for each pair of interest, or one could use a multivariate notion of "high" and "low."

After forming these three statistically merged data files, one would repeat the analysis on each file. If the results were similar, the assumption of conditional independence probably is not crucial; otherwise, the results are open to question.
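The three-file procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure taken from the text: the record layout (dictionaries with keys `x`, `y`, `z`), the function names, the squared Euclidean distance, and the rule of picking the candidate with the most extreme Z among the k nearest are all hypothetical choices. Matching classes are omitted, and a single Y and a single Z variable are assumed, with "high" meaning above the mean of Y in file A.

```python
import random

def k_nearest(x, donors, k):
    """Return the k donor records closest to x on the matching variables.

    Squared Euclidean distance is an assumption made for this sketch.
    """
    return sorted(
        donors,
        key=lambda d: sum((p - q) ** 2 for p, q in zip(x, d["x"])),
    )[:k]

def three_matches(file_a, file_b, k=5):
    """Build three unconstrained matches of file B (X, Z) onto file A (X, Y).

    Each result is a list of (y, z) pairs, one per record in file A.
    """
    y_mean = sum(rec["y"] for rec in file_a) / len(file_a)
    matched = {"independent": [], "negative": [], "positive": []}
    for rec in file_a:
        cands = k_nearest(rec["x"], file_b, k)
        lowest_z = min(cands, key=lambda d: d["z"])
        highest_z = max(cands, key=lambda d: d["z"])
        y_is_high = rec["y"] > y_mean  # "high" = above the mean of Y
        picks = {
            "independent": cands[0],  # usual closest-record match
            "negative": lowest_z if y_is_high else highest_z,
            "positive": highest_z if y_is_high else lowest_z,
        }
        for name, donor in picks.items():
            matched[name].append((rec["y"], donor["z"]))
    return matched

def covariance(pairs):
    """Covariance of Y and Z over a list of (y, z) pairs."""
    n = len(pairs)
    y_bar = sum(y for y, _ in pairs) / n
    z_bar = sum(z for _, z in pairs) / n
    return sum((y - y_bar) * (z - z_bar) for y, z in pairs) / n

# Synthetic data, generated only to exercise the sketch.
rng = random.Random(0)
file_a = [{"x": (rng.gauss(0, 1),), "y": rng.gauss(0, 1)} for _ in range(200)]
file_b = [{"x": (rng.gauss(0, 1),), "z": rng.gauss(0, 1)} for _ in range(300)]

matches = three_matches(file_a, file_b, k=5)
for name, pairs in matches.items():
    print(name, round(covariance(pairs), 3))
```

In practice one would rerun the substantive analysis, rather than a simple covariance, on each of the three files and compare the results. If they differ materially, the conditional independence assumption is doing real work in the matched file.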
CONCLUDING NOTE

The specific application of statistical matching as input into microsimulation models (possibly the most extensive use of the methodology, but certainly not the only one) makes certain demands on the data set that must be recognized when producing statistically matched files for this purpose. Microsimulation models often operate on data sets that are fairly large. If the model is of national scope and is based on individuals or households, files on the order of 50,000 or more are typical. The use of data sets of this size or larger makes constrained statistical matching computationally intensive, especially considering the costs involved with repeating the matching process when estimating the variance of such a process with a sample reuse technique. In addition, the complexity of the policy issues (for example, eligibility for various welfare programs, income taxes, health expenditures) requires that the data sets cover a wide range of variables. If there are a large number of matching variables, say, more than five or six, matching error increases. If there are a large number of Y or Z variables, there are likely to be several uncorrelated pairs, which complicates the choice of a distance function in the match. Furthermore, the extensive use of controlling to accepted totals on the
statistically matched files needs to be considered. Rubin's point about the relative efficacy of constrained versus unconstrained statistical matching depends strongly on whether various control totals are going to be used after the statistical match. Also, Klevmarken's points about the limits of statistical operations that one can safely apply to a statistically matched data set have only been considered in the regression context. His points should also be considered for other models such as logistic regression (found in some participation models of microsimulation models) and iterative proportional fitting. Finally, it is not at all clear what impact processes such as aging the data, statically or dynamically, or the use of various behavioral models have on a statistically matched data set. There is the possibility that the sensitivity of the results to the conditional independence assumption is heightened through the use of such data-intensive procedures. The use of what one might call "classical" statistical matching in microsimulation models, that is, adopting the conditional independence assumption without evidence, is very likely to misinform. At the very least, some of the sensitivity analysis described above should be performed to assess the likely effect of a failure of the assumption. If the results are not sensitive to the conditional independence assumption, and the bias introduced through the matching process is also tested and found to be small, then the results are likely to be useful. In the event that the results are sensitive to the conditional independence assumption, the matching bias, or both, a "classical" statistical match should not be used. These conclusions are true (almost) regardless of the application of the statistical match.
They are even more crucial for statistical matching as input into microsimulation models, since these files are further manipulated by aging routines, monthly allocation routines, behavioral models, various sorts of controlling to independent totals, etc.

Rodgers (1984:101) summarized:

"On the basis of these simulations, which confirm the caution arising from the absence of any mathematical justification for statistical matching, it seems clear that statistical matching may not in general be an acceptable procedure for estimating relationships between Y and Z variables, or for any type of multivariate analysis involving both Y and Z variables."

Paass (1985:9.3-15) summarized:

"At the current state of knowledge SM [statistical matching] is more an art than an exact and reliable technique. Therefore SM methods should be employed only if the CIA [conditional independence assumption] can be verified or replaced by additional information and the demands on the data are not very high."

It seems as if microsimulation models place very high demands on data, and those words of caution should be heeded. However, it is important to remember the important function statistical