V(Y,Z) can have a deleterious effect on creating the merged file. Some initial research on the trade-off between the bias from assuming the conditional correlation to be 0 and the variance from estimating V(Y,Z) from a small sample has been performed by Singh et al. (1990).

Multiple Matching and File Concatenation

Rubin (1986) develops a number of possible alternatives to statistical matching. To quote (Rubin, 1986:89):

    The method for creating one file that is proposed here treats the two data bases as two probability samples from the same population and creates one concatenated data base with missing data, where the missing data are multiply imputed to reflect uncertainty about which value to impute.

Rubin does not merge the records as in statistical matching. Instead, the files are concatenated. Thus, there are nA records from file A with missing values for Z, and these are followed by nB records from file B with missing values for Y. The problem is then one of missing data. Missing values for Z, denoted Ẑ, are estimated by regressing Z on X(B). The same is done to fill in missing Y values, denoted Ŷ. Of course, one need not use linear regression to obtain fitted values; any model could be used, including nonlinear ones. Then, for each record originally from file A, the observed Z value that is closest to Ẑ is used to fill in the missing value, and similarly for the Y values missing from file B. (Rubin [1986] focuses on a univariate problem, but the multivariate extension is immediate.)

This idea has at least two advantages. First, there is an implied distance norm that arises naturally from possibly separate models for Y and Z. Second, all of the X information present in the two separate files is present in the concatenated file, rather than setting aside half of the data, as is typically the case in statistical matching. Note that if one were to fill out the records with the fitted values rather than the observed values nearest to the fitted values, the bias arising from the lack of matches in remote areas of the X-space would be reduced to some extent. Of course, to do this one must trust some model in places where the data are thinnest, and this type of extrapolation is potentially hazardous.

To understand the proper weight, wABi, for each record in the concatenated file, consider the ith record in file A with weight wAi and the jth record in file B with weight wBj. These records represent an ideal in which each file is a probability sample from the same population of n individuals with different patterns of missing values, and the process described above is used to fill in these missing values. In this ideal sense, each triplet (Y, X, Z) (with either Y or Z imputed) is considered as potentially sampled from both file A and file B. In order for the usual totals to be unbiased estimators of their population
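The fill-in step described above can be sketched in a few lines of code. The following Python fragment is only illustrative: the function and variable names are invented for this example, a single X variable and an ordinary linear regression are assumed (Rubin notes that any model could be used), and the question of weighting the concatenated records is not addressed.

import numpy as np

def fill_in_concatenated(yA, xA, xB, zB):
    # File A observes (Y, X); file B observes (X, Z).  Z is filled in for
    # the file-A records by (1) regressing Z on X within file B,
    # (2) computing fitted values Zhat for the file-A records, and
    # (3) donating the observed Z in file B closest to each Zhat.
    # The missing Y values in file B are filled in symmetrically.
    def impute(x_donor, v_donor, x_recipient):
        # Fit a linear regression v ~ x on the donor file.
        X = np.column_stack([np.ones(len(x_donor)), x_donor])
        beta, *_ = np.linalg.lstsq(X, v_donor, rcond=None)
        # Fitted values for the recipient records.
        Xr = np.column_stack([np.ones(len(x_recipient)), x_recipient])
        v_hat = Xr @ beta
        # Donate the observed value closest to each fitted value.
        idx = np.abs(v_donor[None, :] - v_hat[:, None]).argmin(axis=1)
        return v_donor[idx]

    zA_filled = impute(xB, zB, xA)   # fill Z for the nA file-A records
    yB_filled = impute(xA, yA, xB)   # fill Y for the nB file-B records

    # Concatenated file: nA + nB complete (Y, X, Z) records.
    y = np.concatenate([yA, yB_filled])
    x = np.concatenate([xA, xB])
    z = np.concatenate([zA_filled, zB])
    return y, x, z

Replacing the donated values zA_filled and yB_filled with the fitted values themselves would correspond to the variant discussed above: it reduces the bias from sparse regions of the X-space at the cost of trusting the model where the data are thinnest.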