STATISTICAL MATCHING AND MICROSIMULATION MODELS 64

other file, which would require that either X(A) or X(B) be reexpressed with the aid of other variables. Second, the distance (or closeness) function suggested above is sometimes defined so that the distance is infinite between records that differ on certain key variables; that is, a match between records that differ on these variables is not allowed. For example, it may be desired that no record for a person over the age of 50 be merged with a record for a person under the age of 20, and the file treatment can be so structured. Third, the two data files to be merged often represent samples from slightly different universes. For example, one file might represent taxpayers, while the other represents all households. Before statistical matching can be done, one data file must be augmented or the other reduced so that the universes represented are identical.

Constrained and Unconstrained Statistical Matching

There are at least two types of statistical matching, constrained and unconstrained, which have rather different properties. In unconstrained matching, each record in file A is matched to the record in file B for which $\|X(A) - X(B)\|$ is a minimum, for some distance norm $\|\cdot\|$. That is, the record in B is chosen that is closest to the record in A with respect to some distance measure involving the X variables. This is essentially the same as nearest-neighbor regression, a nonparametric method of estimating a regression surface. Assuming that the Z variables have some smooth relationship to the matching X variables, it seems reasonable to hope that records with close or identical values of X have close values of Z. The matched record is usually given the sampling weight that the record in A had in file A.
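The unconstrained procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure used in any particular matching study: the toy records, the field names (`age`, `income`, `weight`, `z`), the city-block distance, and the helper names `distance` and `unconstrained_match` are all hypothetical. The age rule from the example above is encoded as an infinite distance.

```python
import math

# Hypothetical toy files. X variables: age and income (in thousands).
# File A carries the sampling weight; file B carries the Z variable.
file_a = [
    {"id": "a1", "age": 25, "income": 30.0, "weight": 3.0},
    {"id": "a2", "age": 55, "income": 52.0, "weight": 2.0},
    {"id": "a3", "age": 18, "income": 5.0, "weight": 1.0},
    {"id": "a4", "age": 26, "income": 31.0, "weight": 1.5},
]
file_b = [
    {"id": "b1", "age": 19, "income": 6.0, "z": 100.0},
    {"id": "b2", "age": 27, "income": 33.0, "z": 400.0},
    {"id": "b3", "age": 60, "income": 50.0, "z": 900.0},
]


def distance(a, b):
    """City-block distance on the X variables (an assumed choice of norm).

    Returns infinity when one person is over 50 and the other under 20,
    so that such a match is never allowed.
    """
    if (a["age"] > 50 and b["age"] < 20) or (a["age"] < 20 and b["age"] > 50):
        return math.inf
    return abs(a["age"] - b["age"]) + abs(a["income"] - b["income"])


def unconstrained_match(file_a, file_b):
    """For each A record, attach Z from the closest admissible B record."""
    matched = []
    for a in file_a:
        best = min(file_b, key=lambda b: distance(a, b))
        if math.isinf(distance(a, best)):
            continue  # no admissible B record exists for this A record
        # The matched record keeps the sampling weight it had in file A.
        matched.append({**a, "b_id": best["id"], "z": best["z"]})
    return matched


matched = unconstrained_match(file_a, file_b)
for record in matched:
    print(record["id"], "->", record["b_id"], "weight", record["weight"])
```

In this toy data the B record `b2` is the nearest neighbor of both `a1` and `a4`, so it is used twice; nothing in the unconstrained procedure prevents a B record from being reused or left unused.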
The benefits of unconstrained statistical matching are that the closest possible match is used, so that fewer assumptions about the smoothness of the Z-X relationship have to be made, and that the matching is computationally quite inexpensive. One problem is that the marginal information for Z on the merged file is not identical to the marginal information for Z in file B. In other words, totals, means, standard deviations, correlations, etc., from the merged file for data that are present only in file B will not agree with the same statistics computed from file B. As Rodgers (1984:92) noted:

    Unconstrained statistical matching has the advantage of permitting the closest possible match for each A record, but at the cost of increasing the sample variance of estimators involving the Z variables. An unconstrained match amounts to taking a simple random sample, with replacement, of the records in file B. The distributions of the imputed Z variables added to file A, then, are distributions of the selected sample rather than the distributions as observed in file B.

The alternative is constrained statistical matching. Define $w_{Ai}$ as the weight given to record i in file A, $w_{Bj}$ as the weight given to record j in file B, and $w_{ij}$ as the weight in the matched file given to the pairing of record i in file A to record j in file B. Assigning $w_{ij}$ equal to 0 means that the match between
these two records is not made. If the weight of record i in file A is equal to 3, this record could potentially be matched to one, two, or three records in file B. An obvious objective or constraint (missing in unconstrained matching) is for the marginal distributions for Y and Z in the matched file to be the same as in files A and B, respectively. One would like to combine records such that X(A) is close to X(B), but one is willing to settle for a record that is close to, though possibly not, the nearest neighbor in order to maintain the marginal distributions.

The problem can be stated mathematically. Writing $d_{ij}$ for the distance between X(A) for record i and X(B) for record j, and with $w_{Ai}$, $w_{Bj}$, and $w_{ij}$ the weights defined above, the goal is to minimize

$$\sum_{i} \sum_{j} d_{ij}\, w_{ij}$$

subject to

$$\sum_{j} w_{ij} = w_{Ai} \ \text{for all } i, \qquad \sum_{i} w_{ij} = w_{Bj} \ \text{for all } j, \qquad w_{ij} \ge 0 .$$

This formulation is in the form of a transportation problem, a particular kind of linear programming problem. There exist efficient algorithms for the solution of transportation problems. Even so, this application of transportation problems has a large number of constraints, equal to the sum of the number of records in files A and B. (Of course, if the distance function is infinite for a large percentage of the possible matches, the problem is effectively smaller than this formulation implies.) To merge two files of the size of the CPS might involve 120,000 constraints. Barr and Turner (1978) have identified some clever shortcuts to streamline the transportation algorithm for this application, but the procedure is still computationally very demanding for large files. (For illuminating and simplified examples of both constrained and unconstrained statistical matching, see Rodgers.)

The advantages of constrained statistical matching are that the marginal distributions of Z (and, imperfectly, the correlations between Z and X(A)) are maintained in the matched file, but at the cost of not using the nearest-neighbor record and with the addition of considerable computation.
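A small worked case may help make the constrained match concrete. The sketch below is not the Barr and Turner algorithm and not a general transportation solver. It assumes a single matching variable X, an absolute-difference distance, no forbidden (infinite-distance) pairs, and equal weight totals in the two files. Under those assumptions the transportation problem is solved by pairing the records in rank order of X; with several X variables or forbidden pairs, a linear programming solver would be needed instead. The function name `constrained_match_1d` and the toy data are hypothetical.

```python
def constrained_match_1d(file_a, file_b):
    """Constrained match for one X variable with |x_a - x_b| distance.

    file_a, file_b: lists of (x, weight) tuples whose weights sum to the
    same total. Returns (i, j, w_ij) triples, indexing the input lists.
    """
    total_a = sum(w for _, w in file_a)
    total_b = sum(w for _, w in file_b)
    if abs(total_a - total_b) > 1e-9:
        raise ValueError("files must represent the same weighted total")

    # Walk through both files in increasing order of X.
    order_a = sorted(range(len(file_a)), key=lambda i: file_a[i][0])
    order_b = sorted(range(len(file_b)), key=lambda j: file_b[j][0])
    remaining_a = [w for _, w in file_a]
    remaining_b = [w for _, w in file_b]

    pairs = []
    pa = pb = 0
    while pa < len(order_a) and pb < len(order_b):
        i, j = order_a[pa], order_b[pb]
        # Pair as much weight as both records still have available.
        w = min(remaining_a[i], remaining_b[j])
        if w > 0:
            pairs.append((i, j, w))
        remaining_a[i] -= w
        remaining_b[j] -= w
        # Move past whichever record (possibly both) is now used up.
        if remaining_a[i] == 0:
            pa += 1
        if remaining_b[j] == 0:
            pb += 1
    return pairs


# Hypothetical toy files of (x, weight); both weight totals equal 6.
file_a = [(20, 3), (40, 2), (60, 1)]
file_b = [(22, 2), (35, 2), (58, 2)]

pairs = constrained_match_1d(file_a, file_b)
for i, j, w in pairs:
    print(f"A record {i} paired with B record {j}, weight {w}")
```

Here the A record with weight 3 is split across two B records, and every B record's full weight is used, so the weighted distribution of any Z variable carried over from file B is unchanged in the matched file.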
The failure to maintain the marginal distributions of Z and the relationship of Z to X from file B can have a deleterious effect on the validity of the results of analyzing the merged file. Rodgers (1984) suggests that these problems often cause statistical matches to misinform. Thus, he believes that a good measure of the trouble that one might have with unconstrained statistical matching is the difference between the means, variances, and correlations in file B and in the merged file. Paass (1985:9.3-6) agrees: "Moreover the analysis of the XZ-, the YZ- and the overall distribution lead to the conclusion that unconstrained SM [statistical matching] methods often induce more errors than constrained approaches." Rubin (1986:89) disagrees with this to some extent:

    I regard the automatic matching of margins to the original files as a relatively minor benefit of the constrained approach in most circumstances, especially