STATISTICAL MATCHING AND MICROSIMULATION MODELS 66

considering that the real payoff in matching margins arises when samples are not large and census data on the margins exist; and then this constrained approach, which matches sample margins, is not as appropriate as methods designed to match population margins, such as ratio and regression adjustment, which can be applied after the matched file is created. There has not been a comprehensive comparison of the use of iterative proportional fitting or related methods on unconstrained statistically matched files with constrained statistical matching, though Rubin's argument encourages this comparison.

This trade-off of computational ease and nearest-neighbor matching versus the maintenance of marginal distributions has resulted in several compromises between unconstrained and constrained statistical matching. Paass (1985) notes that, to save computational costs, one need not find the optimal solution to the associated linear programming problem, just a good one. The statistical matching algorithm used in the Social Policy Simulation Database (SPSD) uses an approximately constrained procedure. As Armstrong (1989:6) notes:

    Categories are formed on file A and file B using all the X variables except one, denoted X1. File A and file B records are sorted within each category using X1. Then the files are matched within each category according to sorted order: the file A record with the largest value of X1 is matched with the file B record with the largest value of X1, etc. File B records are duplicated or skipped when the numbers of file A and file B records within a particular category differ.

This last problem, the need to duplicate and skip records, can be quite extreme, as indicated in the examples described below.

Choosing the Matching Variables

The decision regarding which jointly occurring variables to use as matching variables has received little attention in the literature on statistical matching.
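The rank-matching step Armstrong describes can be sketched as follows. This is a minimal illustration of the general technique, not the SPSD implementation; the record layout (dictionaries keyed by variable name) and the rank-mapping rule used to duplicate or skip file B records are assumptions for the sketch.

```python
from collections import defaultdict

def rank_match(file_a, file_b, category_vars, x1):
    """Approximately constrained match: within each category
    (defined by all X variables except x1), sort both files by x1
    and pair records rank for rank, duplicating or skipping file B
    records when the category counts differ."""
    groups_a, groups_b = defaultdict(list), defaultdict(list)
    for rec in file_a:
        groups_a[tuple(rec[v] for v in category_vars)].append(rec)
    for rec in file_b:
        groups_b[tuple(rec[v] for v in category_vars)].append(rec)

    pairs = []
    for key, recs_a in groups_a.items():
        recs_b = sorted(groups_b.get(key, []), key=lambda r: r[x1])
        if not recs_b:
            continue  # no donor records available in this category
        recs_a = sorted(recs_a, key=lambda r: r[x1])
        n_a, n_b = len(recs_a), len(recs_b)
        for i, rec_a in enumerate(recs_a):
            # Map rank i in file A onto the corresponding rank in
            # file B; this duplicates B records when n_b < n_a and
            # skips some when n_b > n_a.
            j = min(i * n_b // n_a, n_b - 1)
            pairs.append((rec_a, recs_b[j]))
    return pairs
```

With three file A records and two file B records in a category, one B record is matched twice, exactly the duplication the quoted passage warns about.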
One proposed method is to choose an especially important variable, Yi, regress Yi on the candidate variables, and then choose X(A) to be those variables with regression coefficients significant at some level. Given that one is also interested in a particular Zj that is (possibly) rather highly correlated with Yi, this group of X variables, X(A), should also be at least modestly correlated with Zj, although the discussion below on partial correlation shows how modest this correlation might be, even when the correlations between Yi and X(A) are rather high. However, the X(A) variables could be very poor predictors for other Y's and Z's. Therefore, it would be better to choose the matching variables so that they correlate in a more multivariate sense with both Y and Z, although this objective may not be possible with Y or Z variables that are uncorrelated with
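The regression-based selection rule can be sketched as follows, using ordinary least squares and a t-statistic cutoff. This is an illustrative sketch under assumed conventions (the variable names and the cutoff of 2.0, roughly a 5% two-sided test for large n, are not from the text):

```python
import numpy as np

def select_matching_vars(X, y, names, t_crit=2.0):
    """Regress y (an important Yi) on the candidate matching
    variables X (n x p) and keep those whose OLS coefficient
    t-statistic exceeds t_crit in absolute value."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])        # add an intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)         # residual variance
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)      # coefficient covariance
    t = beta / np.sqrt(np.diag(cov))
    # Skip the intercept (index 0) when selecting variables.
    return [names[j] for j in range(p) if abs(t[j + 1]) > t_crit]
```

As the text notes, variables selected this way may still predict other Y's and Z's poorly; the sketch screens against a single Yi only.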
each other. One should also be aware that the relationship between Y and X(A) may not be even approximately linear, in which case the correlation coefficient will not be that informative as to the ability to match Z's using X. The issue of choosing the matching variables so that both Y and Z are predicted by X(A) might be addressed by computing the canonical correlation between Y and X and choosing the variables in X with the highest weight in the canonical correlation. There are obviously other possibilities from multivariate analysis, but they will all have the same problem in the event that the Y variables are not highly correlated with each other: the match will be much more effective for some members of Y than for others.

The above methods would provide weights to help decide which X's to use in matching. A related approach by Kadane (1978) and Paass (1985), commented on by Sims (1978), is to use various Mahalanobis distance functions, which differ with respect to the covariance matrix used, to help weight the distances from (Xi, Yi, Ẑi) to (Xj, Ŷj, Zj), where the circumflex indicates a value that is imputed using either an assumed conditional independence assumption or an assumed covariance for Y and Z, resulting in the estimates of interest. That is, a Mahalanobis distance function would provide weights for the various X's in the measurement of distance between data points. Kadane suggests that the measurement of the distance between two records could be weighted using either an estimated covariance matrix for (X, Y, Z) or, more simply, a covariance matrix for X. Paass uses a covariance matrix that is a function of only X and Z. Sims believes that making use of variables other than X in the covariance matrix introduces a bias and that the covariance matrix should therefore simply be a function of X alone.
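To make the Mahalanobis weighting concrete, the following sketch computes unconstrained nearest-neighbor matches on the common X variables with a covariance matrix estimated from X alone (the simplest of the choices discussed above). It is a schematic under that assumption, not any of the cited authors' implementations:

```python
import numpy as np

def mahalanobis_match(XA, XB):
    """For each file A record, find the nearest file B record under
    the Mahalanobis distance d(a, b)^2 = (a - b)' S^{-1} (a - b),
    where S is the covariance matrix of the common X variables,
    estimated here from the pooled files (a function of X alone).
    Returns an array of file B indices, one per file A record."""
    S = np.cov(np.vstack([XA, XB]), rowvar=False)
    S_inv = np.linalg.inv(np.atleast_2d(S))
    matches = []
    for a in XA:
        diffs = XB - a                            # (n_B, p) differences
        d2 = np.einsum("ij,jk,ik->i", diffs, S_inv, diffs)
        matches.append(int(np.argmin(d2)))        # unconstrained nearest neighbor
    return np.array(matches)
```

Using an estimated covariance for (X, Y, Z), as Kadane suggests, would change only the matrix S; the distance computation is otherwise identical.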
The second issue, the nonlinearity of the relationship between Yi and X, might be addressed through the use of recursive partitioning (see Breiman et al. for details). This methodology successively partitions the Y-X space so that points in the same partition, defined by the values those points take for the X variables, have very similar values for Yi. One could then match records within the same partition, possibly making use of some loss function within the partition or assigning a match at random. It is also possible that some clustering notions might be effective in deciding which X variables should be used to perform the statistical match. Paass (1985) discusses some related possibilities. When Y is multivariate, the same question noted above of which correlations are most important arises when using recursive partitioning, with the same desire of forming some kind of univariate compromise from among the Y's.

In a simulation study, Armstrong (1989) found that, as a rule, all available information should be used to decide matches. He states (1989:44):

    The evaluation study reported here provides strong evidence that the use of categories of variables found on only one input data file (Y and Z) as well as