STATISTICAL MATCHING AND MICROSIMULATION MODELS

that contains both groups of variables is often difficult to accomplish, given budget constraints, the interest in reducing respondent burden, and the need to protect the privacy and confidentiality of respondents. Yet information to inform decision making is needed. One technique for addressing this problem, which has been used for over two decades, is statistical matching. This chapter presents a general critique of statistical matching and some possible alternatives for overcoming the identified problems. (For a broad overview of statistical matching, see Radner et al. [1980].)

Definition of Statistical Matching

Mathematically, statistical matching is defined as follows. Let us call the first data set, with variables Y and X(A), data set A, where both Y and X(A) can denote several variables. The Y variables are the variables of interest, and the X(A) variables will be used for purposes of matching. The second data set, B, has variables Z and X(B) on it. The Z variables are the variables of interest, and the X(B) variables will be used for purposes of matching with data set A. Statistical matching creates complete records of the form {Y X(A) Z}, or possibly with some combination of X(A) and X(B) in place of X(A), by joining records when X(A) is "close" to X(B), for some definition of close. The process of statistical matching makes rather strong assumptions about the relationships between variables Y and Z. This issue is addressed below.

Like imputation, statistical matching is a form of nonparametric regression used to fill in missing data values.1 However, statistical matching is in two important ways more extreme. First, imputation is typically used to fill in a relatively small percentage of the data; statistical matching is typically used on 100 percent of the records.
Second, imputation typically makes use of complete records to fill in missing values for other records; statistical matching makes use of a conditional independence assumption since no complete records exist. The validity of this conditional independence assumption is often untestable.

File Treatment

Before two files can be statistically matched, the files may require some treatment. First, the variables X(A) and X(B) may not be immediately comparable. For example, a variable representing income may include some components on one file, say, income from interest and dividends, that are not included on the

1 The term imputation is used here narrowly as a technique for replacing missing values for one or more response categories. Other analysts use the term in a broader sense for the technique of creating all of the values for one or more missing variables that were never asked in a survey (or never collected in an administrative records system). Imputation of the latter type (for example, on the basis of regression equations estimated from another data source) may exhibit some of the same problems as statistical matching.
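
As an illustration of the definition of statistical matching given earlier, the following minimal Python sketch joins each record of data set A to the record of data set B whose matching variables are closest. It is only one possible reading of "close": the Euclidean distance, the unconstrained matching (a B record may serve as donor more than once), and all names (`RecordA`, `RecordB`, `distance`, `statistical_match`) and data values are assumptions made for illustration, not taken from the chapter.

```python
from dataclasses import dataclass


@dataclass
class RecordA:
    y: float   # variable of interest observed only on file A
    x: tuple   # matching variables X(A)


@dataclass
class RecordB:
    z: float   # variable of interest observed only on file B
    x: tuple   # matching variables X(B)


def distance(xa, xb):
    """Euclidean distance between two vectors of matching variables
    (an assumed definition of 'close')."""
    return sum((a - b) ** 2 for a, b in zip(xa, xb)) ** 0.5


def statistical_match(file_a, file_b):
    """For every A record, attach Z from the nearest B record.

    Y and Z are never observed together, so the joint (Y, Z) values
    produced here rest entirely on the matching variables.
    """
    matched = []
    for rec_a in file_a:
        donor = min(file_b, key=lambda rec_b: distance(rec_a.x, rec_b.x))
        matched.append({"y": rec_a.y, "x_a": rec_a.x, "z": donor.z})
    return matched


# Hypothetical data: x = (age, household size)
file_a = [RecordA(y=10.0, x=(30, 1)), RecordA(y=20.0, x=(55, 3))]
file_b = [
    RecordB(z=1.0, x=(28, 1)),
    RecordB(z=2.0, x=(60, 3)),
    RecordB(z=3.0, x=(45, 2)),
]

matched = statistical_match(file_a, file_b)
for row in matched:
    print(row)
```

In this sketch every record of file A receives a Z value, which reflects the point that statistical matching is applied to 100 percent of the records. Nothing in the procedure checks whether Y and Z are related beyond what the matching variables explain; that is the conditional independence assumption at work.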