Suggested Citation:"Multiple Matching and File Concatenation." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

V(Y,Z) can have a deleterious effect on creating the merged file. Some initial research on the trade-off between the bias from assuming the conditional correlation to be 0 and the variance from estimating V(Y,Z) from a small sample has been performed by Singh et al. (1990).

Multiple Matching and File Concatenation

Rubin (1986) develops a number of possible alternatives to statistical matching. To quote (Rubin, 1986:89):

    The method for creating one file that is proposed here treats the two data bases as two probability samples from the same population and creates one concatenated data base with missing data, where the missing data are multiply imputed to reflect uncertainty about which value to impute.

Rubin does not merge the records as in statistical matching. Instead, the files are concatenated: the nA records from file A, with missing values for Z, are followed by the nB records from file B, with missing values for Y. The problem then becomes one of missing data. Missing values for Z, denoted Ẑ, are estimated by regressing Z on X(B); the same is done to fill in the missing Y values, denoted Ŷ. Of course, one need not use linear regression to obtain fitted values; any model could be used, including nonlinear ones. Then, for each record originally from file A, the observed Z value closest to Ẑ is used to fill in the missing value, and similarly for the Y values missing from file B. (Rubin [1986] focuses on a univariate problem, but the multivariate extension is immediate.) This idea has at least two advantages. First, there is an implied distance norm that arises naturally from the possibly separate models for Y and Z. Second, all of the X information present in the two separate files is present in the concatenated file, rather than half of the data being set aside, as is typically the case in statistical matching.
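The fill-in procedure just described can be sketched in a few lines. The following is an illustrative sketch only, with synthetic data and hypothetical variable names; note also that it produces a single fill-in, whereas Rubin's proposal calls for multiple imputation to reflect uncertainty about which value to impute.

```python
# Sketch of file concatenation with nearest-observed-value imputation.
# Data, sample sizes, and the linear model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# File A observes (X, Y); file B observes (X, Z).
nA, nB = 6, 8
XA = rng.normal(size=nA)
YA = 2.0 * XA + rng.normal(scale=0.1, size=nA)
XB = rng.normal(size=nB)
ZB = -1.0 * XB + rng.normal(scale=0.1, size=nB)

def fit_predict(x_obs, v_obs, x_new):
    """OLS of v on (1, x) in the file where v is observed,
    then fitted values for the file where v is missing."""
    G = np.column_stack([np.ones_like(x_obs), x_obs])
    beta, *_ = np.linalg.lstsq(G, v_obs, rcond=None)
    return beta[0] + beta[1] * x_new

# Fitted values for the missing variables.
Z_hat_A = fit_predict(XB, ZB, XA)   # Ẑ for file A records
Y_hat_B = fit_predict(XA, YA, XB)   # Ŷ for file B records

# Fill in with the *observed* value closest to the fitted value,
# rather than the fitted value itself.
Z_fill = ZB[np.abs(ZB[None, :] - Z_hat_A[:, None]).argmin(axis=1)]
Y_fill = YA[np.abs(YA[None, :] - Y_hat_B[:, None]).argmin(axis=1)]

# Concatenated file of nA + nB complete (X, Y, Z) records.
concatenated = np.column_stack([
    np.concatenate([XA, XB]),
    np.concatenate([YA, Y_fill]),
    np.concatenate([Z_fill, ZB]),
])
print(concatenated.shape)  # (14, 3)
```

Because every filled-in value is an observed value from the other file, no record carries a model-generated number outside the support of the data, which is exactly the property the next paragraph weighs against the bias from sparse regions of the X-space.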
Note that if one were to fill in the records with the fitted values themselves, rather than with the observed values nearest to the fitted values, the bias arising from the lack of matches in remote areas of the X-space would be reduced to some extent. Of course, to do this one must trust some model in the regions where the data are thinnest, and this type of extrapolation is potentially hazardous.

To understand the proper weight, wABi, for each record in the concatenated file, consider the ith record in file A with weight wAi and the jth record in file B with weight wBj. These records represent an ideal in which each file is a probability sample from the same population of n individuals, with different patterns of missing values; the process described above is used to fill in those missing values. Now, in this ideal sense, each triplet (Y, X, Z) (with either Y or Z imputed) is considered as potentially sampled from both file A and file B. In order for the usual totals to be unbiased estimators of their population

This volume, second in the series, provides essential background material for policy analysts, researchers, statisticians, and others interested in the application of microsimulation techniques to develop estimates of the costs and population impacts of proposed changes in government policies ranging from welfare to retirement income to health care to taxes.

The material spans data inputs to models, design and computer implementation of models, validation of model outputs, and model documentation.
