National Academies Press: OpenBook
Suggested Citation:"File Treatment." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

STATISTICAL MATCHING AND MICROSIMULATION MODELS

that contains both groups of variables is often difficult to accomplish, given budget constraints, the interest in reducing respondent burden, and the need to protect the privacy and confidentiality of respondents. Yet information to inform decision making is needed. One technique for addressing this problem, which has been used for over two decades, is statistical matching. This chapter presents a general critique of statistical matching and some possible alternatives for overcoming the identified problems. (For a broad overview of statistical matching, see Radner et al. [1980].)

Definition of Statistical Matching

Mathematically, statistical matching is defined as follows. Let us call the first data set, with variables Y and X(A), data set A, where both Y and X(A) can denote several variables. The Y variables are the variables of interest, and the X(A) variables will be used for purposes of matching. The second data set, B, has variables Z and X(B) on it. The Z variables are the variables of interest, and the X(B) variables will be used for purposes of matching with data set A. Statistical matching creates complete records of the form {Y X(A) Z} (or possibly some combination of X(A) and X(B) in place of X(A)) by joining records when X(A) is "close" to X(B), for some definition of close. The process of statistical matching makes rather strong assumptions about the relationships between variables Y and Z. This issue is addressed below.

Like imputation, statistical matching is a form of nonparametric regression used to fill in missing data values.¹ However, statistical matching is in two important ways more extreme. First, imputation is typically used to fill in a relatively small percentage of the data; statistical matching is typically used on 100 percent of the records. Second, imputation typically makes use of complete records to fill in missing values for other records; statistical matching makes use of a conditional independence assumption since no complete records exist. The validity of this conditional independence assumption is often untestable.

File Treatment

Before two files can be statistically matched, the files may require some treatment. First, the variables X(A) and X(B) may not be immediately comparable. For example, a variable representing income may include some components on one file, say, income from interest and dividends, that are not included on the

¹ The term imputation is used here narrowly as a technique for replacing missing values for one or more response categories. Other analysts use the term in a broader sense for the technique of creating all of the values for one or more missing variables that were never asked in a survey (or never collected in an administrative records system). Imputation of the latter type (for example, on the basis of regression equations estimated from another data source) may exhibit some of the same problems as statistical matching.
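
As an illustration of the definition of statistical matching given earlier, the following is a minimal sketch, not a method taken from the chapter. It assumes that "close" means smallest Euclidean distance between X(A) and X(B), and that a record from data set B may be joined to more than one record from data set A. The names RecordA, RecordB, distance, and statistical_match, and all data values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RecordA:
    y: float                 # variable of interest, observed only on data set A
    x: Tuple[float, ...]     # matching variables X(A)


@dataclass
class RecordB:
    z: float                 # variable of interest, observed only on data set B
    x: Tuple[float, ...]     # matching variables X(B)


def distance(xa: Tuple[float, ...], xb: Tuple[float, ...]) -> float:
    """Euclidean distance; an assumed definition of 'close'."""
    return sum((a - b) ** 2 for a, b in zip(xa, xb)) ** 0.5


def statistical_match(
    file_a: List[RecordA], file_b: List[RecordB]
) -> List[Tuple[float, Tuple[float, ...], float]]:
    """Build records of the form (Y, X(A), Z).

    Each A record is joined to the B record whose X(B) is nearest to its
    X(A). B records may be reused (an assumption of this sketch).
    """
    matched = []
    for rec_a in file_a:
        donor = min(file_b, key=lambda rec_b: distance(rec_a.x, rec_b.x))
        matched.append((rec_a.y, rec_a.x, donor.z))
    return matched


# Hypothetical data: X holds (age, indicator); Y and Z are arbitrary values.
file_a = [
    RecordA(y=10.0, x=(30.0, 1.0)),
    RecordA(y=20.0, x=(55.0, 0.0)),
]
file_b = [
    RecordB(z=100.0, x=(32.0, 1.0)),
    RecordB(z=200.0, x=(50.0, 0.0)),
    RecordB(z=300.0, x=(70.0, 1.0)),
]

matched_file = statistical_match(file_a, file_b)
for record in matched_file:
    print(record)
```

In this sketch the pairing of a Y value with a Z value is determined entirely by the X variables, since no record ever carries both Y and Z. That is where the conditional independence assumption described above enters.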


This volume, second in the series, provides essential background material for policy analysts, researchers, statisticians, and others interested in the application of microsimulation techniques to develop estimates of the costs and population impacts of proposed changes in government policies ranging from welfare to retirement income to health care to taxes.

The material spans data inputs to models, design and computer implementation of models, validation of model outputs, and model documentation.
