Methods for Treating Missing Data
The 2000 census, continuing with a pattern typical of all recent censuses, experienced an appreciable amount of nonresponse for characteristics information from households, especially on the long form. Overall, the rate of nonresponse for most items on the long form in 2000 was typically at least comparable to, and for many items was considerably greater than, that of previous censuses. Details on the frequency of item nonresponse are provided in Chapter 7. The Census Bureau, faced with this degree of nonresponse, again used a version of sequential hot-deck imputation, as described in Appendix G, and summarized below. The effectiveness of this treatment of nonresponse, in comparison with other methods currently available, and current techniques that might be used to estimate the variance of statistics derived from long-form-sample data in the presence of nonresponse, are examined in this appendix.
Nonresponse is usefully separated into two broad types: (1) item nonresponse, where a household provides responses for some but not all items on the questionnaire, and (2) unit nonresponse, where a household does not provide any information on a questionnaire.1 Item nonresponse is often amenable to more sophisticated missing data methods because the responses that are available may be used to help predict the missing values. In contrast, for unit nonresponse, the only information available in the decennial census is the geographic location of the residence, and while geographic information is useful in somewhat the same way as responses to other items, it generally has limited value in comparison to information that is more specific to the household. The lack of information limits the techniques available to the Census Bureau, and as a result censuses have addressed unit nonresponse in the long-form-sample weighting process. This appendix is focused primarily on techniques that address item nonresponse, though some of the discussion will also apply to techniques for addressing unit nonresponse.2
Untreated, nonresponse can cause two problems. First, nonresponse can result in statistical bias. (Statistical bias is a measure of the difference in the expected value of a statistic and its true value.) It is generally the case that the data that one receives from respondents are different distributionally from the data that would have been provided by nonrespondents. This is why so-called complete-case analysis is problematic, since the restriction of the analysis to those cases that have a complete response fails to adequately represent the contribution from those that have missing data. These distributional differences may be present not just unconditionally, but (often) also conditionally given responses to certain items. For example, data for nonrespondents may be different from data for respondents because the respondents have a different demographic or socioeconomic profile. Even within demographic or socioeconomic groups or other conditioning variables that one might choose, data for nonrespondents may still have a different distribution than that for respondents. Methods that do not take these differences into account can introduce a statistical bias, which can be appreciable with substantial amounts of nonresponse. For example, if nonrespondents tend to commute greater distances than respondents, methods to treat nonresponse that are not sensitive to this will produce estimates of average commuting distance that are generally too low.
Second, nonresponse often is either ignored or not fully represented in estimates of the variances of statistics computed using data sets with missing data, which will result in variance estimates that are statistically biased low. In other words, improper treatment of missing data in the estimation of variances will result in statistics that are represented as being more reliable than they in fact are, which will cause hypothesis tests and confidence intervals based on the (treated) data to understate the variation in the data. As will be discussed below, the current treatment of nonresponse in the long-form sample does result in estimated variances that are likely too low. This is a challenging problem with no easy solutions, but there are new approaches that are extremely promising that are discussed below.
Generally speaking, there is considerably more nonresponse to the additional long-form items than to the basic items asked of everyone. In addition, the relative lack of useful correlations among the basic variables reduces, though it does not eliminate, the benefits one may derive from some of the techniques described here. Therefore, the panel believes that the selection of treatments to address nonresponse for basic items does not have as much urgency as that for the additional long-form items, and as a result, we focus in this appendix on missing value treatments for the long-form sample. However, techniques that are found useful for treatment of nonresponse in long-form-sample items also should be examined for their potential benefits in treating nonresponse in the basic items on the short and long forms.
Before continuing, it is important to stress that the bias due to missing data is often not measurable, or only measurable using relatively costly efforts at additional data collection, including reinterviews, or through use of alternative sources of information such as administrative records. Even state-of-the-art techniques for treatment of missing data are ultimately limited in their usefulness either by increased amounts of nonresponse or by lack of information concerning the nature of the nonresponse. Therefore, the highest priority is to reduce the extent of missing data. There is a great deal of work by cognitive survey scientists in motivating people to provide data, and efforts should obviously continue along these lines, possibly more so due to the higher degree of nonresponse to the long form in the 2000 census. (Fortunately, the proposed American Community Survey, designed as a replacement for long-form-type information, is a continuous survey and is therefore more amenable to this type of research.) The techniques in the literature are meant to address at most moderate amounts of nonresponse; they are not intended for situations in which the nonresponse is on the order of 30 percent or more. (There are examples in the literature in which these techniques have been applied to more extreme cases; however, in those cases the reliability of the resulting estimates was not intended to be of the quality typically expected for census output.)
F.1.a Mechanisms for Nonresponse
When considering nonresponse, it is important to try to understand the likely mechanism underlying the nonresponse, since this has important implications for the methods used to treat it. This is obviously more difficult for unit nonresponse due to the lack of available information, but any progress in this direction is useful. Mechanisms underlying nonresponse can be roughly classified into three categories:
missing data are missing completely at random—in this case the indicator variable for whether or not a response is provided is independent of all other data for the household, whether collected or not, and therefore, cases with missing values have the identical distribution as those with complete responses;
missing data are missing at random—in this case the indicator variable for response is independent of missing values, but could be dependent on responses, and therefore the missing values are, conditional on responded information, distributed the same as responses; and
missing data are not missing at random—in this case the indicator variable for response is dependent on missing values, and therefore, even conditionally on collected data, the distribution of missing values will depend on the values of uncollected data.
This third category is clearly the most difficult situation to be in. It is important to examine which situation is likely to be in effect for missing values for various items on the census questionnaire in order to identify the proper treatments to minimize bias. For instance, are people at the high end of the income spectrum likely to omit their responses to income questions, or is this more typical of those at the low end of the spectrum? Is nonresponse for income more or less common for those in various demographic groups, or in various occupations, or in various geographic regions?
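The three mechanisms listed above can be stated compactly. The notation here is an assumption introduced for illustration and does not appear in the source: $R$ is the response indicator for a household, $Y_{\mathrm{obs}}$ the collected values, and $Y_{\mathrm{mis}}$ the values that were not collected.

```latex
\begin{align*}
\text{Missing completely at random:} \quad
  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{Missing at random:} \quad
  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}) \\
\text{Not missing at random:} \quad
  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \text{ depends on } Y_{\mathrm{mis}}
    \text{ even after conditioning on } Y_{\mathrm{obs}}
\end{align*}
```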
Consider a situation in which responses are missing for some households for commuting distance. If these missing values are missing completely at random, using a hot-deck procedure to substitute random commuting distances from other respondents would not result in a statistical bias for assessment of average commuting distance. On the other hand, if these missing values are missing at random but are not missing completely at random—in particular, that conditional on income, nonresponse for commuting distance is independent of other items on the census questionnaire—then using a hot deck procedure to substitute random commuting distances from other respondents with the same income level would not result in a statistical bias. Relevant to the current Census Bureau missing data procedure, if it is the case that many variables are somewhat homogeneous over small geographic distances, and the data are missing at random by conditioning on possibly unknown covariates, forcing a hot deck to make use of only nearby respondents for donor values can be a sensible procedure since it implicitly conditions on all other data for the household. However, this assumption does not always even approximately obtain.
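The commuting-distance example can be illustrated with a small simulation. The sketch below is not Census Bureau code; the population, the two income classes, the response rates, and the function name `simulate` are all hypothetical choices made for illustration. Response depends on income but not on the commute itself, so the data are missing at random given income, and a hot deck that ignores income is compared with one that draws donors within the same income class.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Compare an unconditioned hot deck with one conditioned on income.

    All numbers below are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    households = []
    for _ in range(n):
        income = rng.choice(["low", "high"])
        # High-income households are assumed to commute farther on average.
        commute = rng.gauss(10.0 if income == "low" else 30.0, 3.0)
        # Response depends on income only (missing at random given income).
        p_respond = 0.9 if income == "low" else 0.5
        households.append((income, commute, rng.random() < p_respond))

    true_mean = mean(c for _, c, _ in households)

    # Donor pools: all respondents, and respondents split by income class.
    donors_all = [c for _, c, r in households if r]
    donors_by_income = {
        k: [c for inc, c, r in households if r and inc == k]
        for k in ("low", "high")
    }

    # Hot deck ignoring income: any respondent can be a donor.
    naive = [c if r else rng.choice(donors_all) for _, c, r in households]
    # Hot deck conditioned on income: donor must share the income class.
    conditioned = [
        c if r else rng.choice(donors_by_income[inc])
        for inc, c, r in households
    ]
    return true_mean, mean(naive), mean(conditioned)


if __name__ == "__main__":
    t, naive_mean, cond_mean = simulate()
    print(f"true {t:.2f}  unconditioned {naive_mean:.2f}  conditioned {cond_mean:.2f}")
```

Under these assumed settings the unconditioned hot deck should understate the average commute, because high-income households are under-represented among donors, while the conditioned hot deck should land close to the true mean.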
If the missing values are not missing at random, which is referred to as nonignorable nonresponse, while there are approaches that can be tried, they are very dependent on assumptions and much more difficult to implement in a large, multidimensional data set like the decennial census. So, if those people with very long commuting distances fail to respond as often as those with shorter commutes and nothing else is known about the relationship of commuting distance to other variables, then there may be nothing that can be done to address the resulting bias in the context of decennial census output.
Of course, in real applications, cases do not fall neatly into the above categories. One could imagine a response indicator variable that had a distribution that was predominantly missing at random but that had a relatively small contribution to its distribution that was nonignorable. In that case, making use of a method that was based on the assumption that the data were missing at random would provide a substantial reduction in bias, though it would fail to eliminate it.
A short summary is as follows: (1) if the missing values are missing completely at random, any sensible missing data method will be effective; (2) if the missing values are missing at random, taking into consideration the variables that differentiate the respondents from the nonrespondents is the key to proper treatment of the nonresponse; and (3) if the nonresponse mechanism is not ignorable, then accounting for variables that explain some nonresponse is likely to reduce bias as much as possible without the use of arbitrary assumptions.
With respect to variance estimation, it is still typical for government agencies, based on survey data, to publish variance estimates that understate the variance present in the data due to nonresponse. Specifically, standard complete-data-set variance estimates applied to data sets with missing data values filled in by imputation obviously fail to include the contribution to variance from the variability in the imputations. In addition, failure to correctly model the nonresponse, say by using the assumption that the missing data are missing completely at random when they are only missing at random, or by conditioning on the wrong variables when assuming that missing values are missing at random, not only causes bias in the estimates, but also causes bias in estimates of the variance.
F.1.b Implementation Considerations
A key constraint concerning treatment of missing values in the decennial census is that the Census Bureau needs to provide data products (population counts, cross-tabulations, averages, and public use microdata samples) that can be used by the wide variety of census data users for their varied purposes. Certainly, if one was only interested in the treatment of missing data for a single response to estimate a national average, one could make use of a
method optimized for that situation, and that method might be very sophisticated and computationally intensive, given the focus of the problem. However, the Census Bureau cannot do this for each of the myriad potential uses of decennial census data, and it cannot ask its data users to implement their own treatments for nonresponse. This is because it is the job of the Census Bureau to provide official products from the decennial census and, more pragmatically, because access to individual-level data—needed to implement most procedures—is precluded by Title 13 of the U.S. Code. An exception is that of public use microdata sample files (PUMS), but the ability of PUMS users to carry out their own missing data treatment is severely complicated by the reduced sample size that is available to them. What is needed is a treatment that can be used for a large, multidimensional data set that can accommodate the needs of a wide variety of analysts. (Nevertheless, the Census Bureau should continue the current practice of flagging data values that result from the use of treatments for nonresponse.) Understanding this, the Census Bureau has used, in the most recent censuses, an omnibus treatment of nonresponse that is effective for a large database with a considerable number of response variables (and variable types, i.e., continuous, discrete, and categorical) that supports a wide range of analyses. Any new approach must retain this characteristic.
The multidimensional character of the decennial census data set raises an important issue. There are many situations in which the values provided (or that would have been provided, absent nonresponse) for variable A are correlated with the values for variable B, and vice versa. Often, many more than two variables are involved in these dependencies. In order to minimize bias, assuming the nonresponse is missing at random, it is important to use all the available information that is relevant for imputing values. A multivariate approach to treatment of missing values therefore can have a distinct advantage over a univariate approach, since then the dependence between variables jointly missing can be properly taken into account.
In addition to the multivariate aspect of missing data, especially in the long-form sample, there are computational considerations that have historically helped to select the missing data treatments. Certainly, given the memory size and processing speed typical of the best widely available mainframe computers in the 1980s, keeping the entire decennial census database in active memory, or even the database for large states, and repeatedly accessing individual data elements, was outside the bounds of what was generally feasible. Therefore, what was needed was a methodology for the treatment of nonresponse that kept a large fraction of the database in remote memory while processing only small pieces of the database sequentially. This was one of the arguments in support of the use of sequential hot-deck imputation. However, computational power and speed have changed dramatically in the last decade. Assuming Moore’s Law is even approximately true, the entire 2010 decennial census database (even if the long-form sample is included in the 2010 census) will almost certainly be able to be manipulated in 2011 in the active memory of the more powerful desktop computers. Therefore, for future censuses, the treatment for missing data no longer needs to be sequential, and treatments can be considered that require multiple accesses to the database.
F.2 OUTLINE OF THE CURRENT METHODOLOGY
For the past five decennial censuses, the Census Bureau has used variants of the following treatment for nonresponse in the long-form sample (see Appendixes G and H for further details). With respect to unit nonresponse, as mentioned above, the Census Bureau treats this as an additional sampling mechanism. Iterative proportional fitting is used to weight long-form-sample population counts to complete population counts as a variance reduction technique that also treats unit nonresponse.
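Iterative proportional fitting (often called raking) can be sketched for the simplest two-way case. This is a generic illustration, not the Census Bureau's weighting procedure: the function name `rake`, the example table, and the control totals are hypothetical, and the actual weighting dimensions are described elsewhere (Appendixes G and H).

```python
def rake(table, row_targets, col_targets, iterations=50):
    """Iterative proportional fitting of a two-way table of weighted counts.

    table       -- list of rows of weighted sample counts (hypothetical input)
    row_targets -- complete-count control totals for each row
    col_targets -- complete-count control totals for each column
    """
    fitted = [row[:] for row in table]
    for _ in range(iterations):
        # Scale each row so that it sums to its control total.
        for i, target in enumerate(row_targets):
            total = sum(fitted[i])
            if total > 0:
                fitted[i] = [v * target / total for v in fitted[i]]
        # Scale each column so that it sums to its control total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in fitted)
            if total > 0:
                for row in fitted:
                    row[j] *= target / total
    return fitted


if __name__ == "__main__":
    sample = [[10.0, 20.0], [30.0, 40.0]]
    fitted = rake(sample, row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])
    for row in fitted:
        print([round(v, 3) for v in row])
```

The ratio of each fitted cell to its original sample count plays the role of a weight adjustment, which is how a sample with unit nonresponse can be made to agree with complete-count totals on the chosen margins.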
With respect to item nonresponse, the Census Bureau makes use of sequential hot-deck imputation. (The quick description here ignores the issue of how this technique is initialized and other details.) In this procedure, the long-form data set for each state is processed in one pass through the data. Assume that a given household has missing values for variables X1, X2, and X3, and assume also that the Census Bureau, using its accumulated imputation rules, believes that to predict a household’s responses for variables X1, X2, and X3 it is useful to condition on the values for variables C1, C2, C3, and C4 consistent with a missing-at-random assumption. In processing the long-form data file, a large number of one-way, two-way, three-way, etc., matrices are constantly updated. The matrices are made up of
cells that contain the data from the most recently processed household with complete information that match the specific combination of values for the variables defining the matrices’ dimensions. (In addition, the census file is sorted by some characteristics not included in the matrix, providing data that are for households close geographically as well as similar for their characteristics.) This essentially provides a geographically proximate donor housing unit with identical information for these matching variables for a current household with item nonresponse. Then the values for variables X1, X2, and X3 from the donor household occupying that cell are used to fill in the missing values for the current household.
This description is somewhat inadequate in two ways. First, there are also imputation rules that are used in the imputation of numeric responses. For example, there may be additional variables not used in the matching, say, A1, A2, and A3, that have values that, when combined with the imputed values for the variables X1, X2, X3, and with the matching variables C1, C2, C3, and C4, result in unusual or impossible combinations. Adjustments are therefore made to handle these situations. Second, the choice of variables to condition on can depend on the values of other collected variables. That is, there may be a variable Z for a household, which may take on some values that would require matching on C1, C2, and C3 to identify the donor, and other values that would require matching on C1, C2, and C4. So the imputation can become fairly elaborate.
Besides the benefits of using a geographically (and other characteristics) proximate responding household that is likely to share many nonmatched characteristics with the current household, this process is quick to implement since it only requires a single pass through the data file (barring initialization). It also permits conditioning on important variables, thus avoiding the assumption that nonresponses are missing completely at random and only relying on a specific missing-at-random assumption.
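The single-pass logic of a sequential hot deck can be sketched as follows. This is a deliberately simplified illustration, not the Census Bureau's implementation: it uses one donor matrix rather than many, omits initialization and the imputation rules discussed above, and the function name, record layout, and the `imputed` flag field are hypothetical.

```python
def sequential_hot_deck(records, match_vars, target_vars):
    """One pass through records that are assumed to be sorted geographically.

    Each record is a dict; a missing value is represented by None.
    The most recent complete record in each matching cell is the donor.
    """
    last_donor = {}  # cell key (values of match_vars) -> donor's target values
    output = []
    for rec in records:
        key = tuple(rec[v] for v in match_vars)
        filled = dict(rec)
        filled["imputed"] = []  # flag which items were filled in
        missing = [v for v in target_vars if rec[v] is None]
        if not missing:
            # A complete record replaces the stored donor for its cell.
            last_donor[key] = {v: rec[v] for v in target_vars}
        elif key in last_donor:
            for v in missing:
                filled[v] = last_donor[key][v]
                filled["imputed"].append(v)
        # With no donor yet in the cell, the values are left missing here;
        # a production system would rely on initialized starting values.
        output.append(filled)
    return output


if __name__ == "__main__":
    records = [
        {"C1": "own", "X1": 50, "X2": 3},
        {"C1": "rent", "X1": 20, "X2": 1},
        {"C1": "own", "X1": None, "X2": None},
        {"C1": "rent", "X1": None, "X2": 2},
    ]
    for row in sequential_hot_deck(records, ["C1"], ["X1", "X2"]):
        print(row)
```

Because the donor table holds only the latest complete record per cell, memory use is independent of file size, which is the computational advantage noted above; the cost is that each imputation rests on a single household.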
The U.S. Census Bureau’s sequential hot-deck imputation methodology is somewhat related to that used in the Canadian Census, referred to either as nearest-neighbor imputation or new imputation methodology (NIM). As described in Bankier (1999) and Bankier et al. (2002), and originally based on the work by Fellegi and Holt (1976), missing values are filled in through use of a nearest neighbor donor, where nearest neighbor is defined by (1) the donor
household lies within the same geographic region as the household in need of imputation, and (2) given households satisfying (1), the donor household must minimize a metric measuring closeness of the donor and the current household with respect to a number of selected matching variables. As above, there is the complication that variables not used in the matching in conjunction with both the variables used for matching and the imputations may collectively fail a number of edits that households are required to pass. An imputation that permits a household to pass all edits is referred to as a feasible imputation, and among the nearest neighbors, a random feasible imputation is selected for use. In a very rough sense, NIM trades off the computational benefits of the sequential hot-deck procedure by looking around locally for a better match. However, this “better” match may be more distant geographically, in which case it is also poorer with respect to an important variable, geography.
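The nearest-neighbor idea can be sketched in a few lines. This is only a rough illustration of the description above and not the Canadian NIM algorithm itself: the distance metric (a count of mismatched matching variables), the `region` field, the edit rule, and the function name are all assumptions made for the example.

```python
import random


def nearest_neighbor_impute(recipient, donors, match_vars, target_vars,
                            passes_edits, rng=None):
    """Return a feasible imputed copy of recipient, or None if no donor works.

    passes_edits -- hypothetical callable returning True when a record
                    satisfies all edit rules.
    """
    rng = rng or random.Random(0)

    def distance(donor):
        # Assumed metric: number of matching variables that disagree.
        return sum(1 for v in match_vars if donor[v] != recipient[v])

    candidates = []
    for donor in donors:
        if donor["region"] != recipient["region"]:
            continue  # condition (1): same geographic region
        trial = dict(recipient)
        for v in target_vars:
            if trial[v] is None:
                trial[v] = donor[v]
        if passes_edits(trial):  # keep only feasible imputations
            candidates.append((distance(donor), trial))

    if not candidates:
        return None
    best = min(d for d, _ in candidates)
    # Choose at random among the feasible imputations from the closest donors.
    return rng.choice([t for d, t in candidates if d == best])


if __name__ == "__main__":
    recipient = {"region": "A", "age_group": "child", "tenure": "rent", "marital": None}
    donors = [
        {"region": "A", "age_group": "adult", "tenure": "rent", "marital": "married"},
        {"region": "A", "age_group": "child", "tenure": "own", "marital": "single"},
        {"region": "B", "age_group": "child", "tenure": "rent", "marital": "single"},
    ]
    no_married_children = lambda r: not (
        r["age_group"] == "child" and r["marital"] == "married"
    )
    print(nearest_neighbor_impute(
        recipient, donors, ["age_group", "tenure"], ["marital"], no_married_children
    ))
```

Unlike the sequential hot deck, this search examines every donor in the region for each recipient, which is the computational cost traded for a closer match.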
Finally, with respect to variance estimation, as mentioned above, the weighting resulting from unit nonresponse is subsumed into the sampling weights. Therefore, the sampling variances that are published for population counts and other summary statistics, such as means, appropriately represent the contribution to variance from unit nonresponse, assuming the missing-at-random assumption is reasonable. However, no data products from the decennial census long-form sample include an estimate of the contribution to variance from item nonresponse.
F.3 PROBLEMS WITH THE CURRENT METHODOLOGY
The current methodology has two basic deficiencies that can be at least partially addressed by techniques that have recently been proposed in the statistical research literature and, to some extent, implemented in practice. First, the current methodology of sequential hot-deck imputation is somewhat inefficient due to its reliance on a single donor for imputation instead of using more of the local information, including households that do not match. The basic problem is that data from an individual donor household can be odd. Second, the failure to represent the variance from item nonresponse in data products is a serious problem for census data users because it fails to provide them with accurate information concerning the quality of the information they are using.
More specifically, with respect to the first deficiency, a single-donor imputation is subject to the vagaries of individual data elements. Certainly, a particularly unusual household would be a poor donor because it would provide in essence a very unlikely prediction for the household with missing values. Along the same lines, a very typical donor would not necessarily provide a good imputation for an unusual household with missing data. Through use of more of the information in an area, a better sense of the relationships between variables and the remaining uncertainty can be attained, providing a better imputation. Using the information from a local area has been precluded until recently by the computational demands that such use would entail. However, advances in computing make a number of methods for using this information quite feasible.
A related issue is the necessarily multivariate nature of imputation. We separate this into two issues. First, even a single-variable imputation is dependent on other values provided by the household. The sequential hot-deck procedure addresses this need for multivariate dependence through use of the matching variables and the various imputation rules. For dependencies that are essentially deterministic, this approach is likely reasonable. Certainly, the computer code that specifies which variables should condition the imputation of which other variables, and which rules are useful for ensuring that imputations do not violate any understanding of how variables interrelate, represents an important knowledge base that needs to be preserved (though we are curious about how this knowledge base grows, and is pruned, over time and whether that process can be improved).3 On the other hand, some of these specifications for conditioning and some of these rules are probably not uniformly obeyed and should probably be applied more probabilistically through use of a statistical model (see, e.g., Chiu et al., 2001). Further, it would be interesting to examine whether the selection of matching variables and the use of imputation rules would benefit from being applied subnationally rather than nationally.
The second multivariate issue concerns simultaneous imputation of more than one variable (X1, X2, and X3 in the notation above). While the current procedure used by the Census Bureau does impute some variables simultaneously, there are situations in which the sequential hot deck does not impute in a way that would provide proper estimates of the correlation between variables. Roughly speaking, relationships for variables that are closely related (e.g., income questions) may be handled in a manner that preserves correlations, but correlations between, say, income and education questions will likely not be properly accounted for. This is because the hot deck uses strata based on cross-classification, which sharply limits the number of characteristics that can be considered, when it is known that there is information in other variables. This motivates the methods described below.
Finally, the most important deficiency with the current method that the Census Bureau uses to address item nonresponse is the failure to represent this nonresponse in its estimates of the variance of its data products. This failure is probably negligible for much complete-count-based statistical output, but will not be negligible for many long-form-sample-based statistics nationally, and for many other long-form-sample statistics for local areas or subpopulations. For the purposes of this discussion, the decennial census data products can be categorized into four types: (1) population counts, (2) cross-tabulations, (3) means and other summary statistics, and (4) public use microdata sample files. Until recently, the only technique that was available for estimating the contribution of nonresponse to variance was probably multiple imputation (see Little and Rubin, 1987). Multiple imputation could have been applied in 2000 to estimate the variance for population counts and various summary statistics. However, providing all users with, say, five replicate cross-tabulations or five PUMS files was probably not feasible. However, large amounts of disk space are becoming so cheap and widely available and this technique is sufficiently easy to apply that this option will certainly be more acceptable in the near future.
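The way multiple imputation carries item nonresponse into a variance estimate can be shown with the standard combining rules from that literature: the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance of the m point estimates. The sketch below is a generic illustration; the function name and the example numbers are hypothetical and are not census results.

```python
from statistics import mean, variance


def combine_multiple_imputations(estimates, within_variances):
    """Combine m completed-data estimates using the standard combining rules.

    estimates        -- point estimate from each of the m imputed data sets
    within_variances -- complete-data variance estimate from each data set
    Returns (combined estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)         # combined point estimate
    u_bar = mean(within_variances)  # average within-imputation variance
    b = variance(estimates)         # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


if __name__ == "__main__":
    # Five hypothetical replicate estimates of the same statistic.
    q, t = combine_multiple_imputations([10, 12, 11, 9, 13], [4, 4, 4, 4, 4])
    print(q, t)
```

In this made-up example the complete-data variance alone would be 4, while the combined variance is larger because the five imputed data sets disagree; that difference is the contribution from nonresponse that a single imputation cannot reveal.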
F.4 NEW APPROACHES TO IMPUTATION AND THEIR ADVANTAGES
While the current methodology used to treat missing data in the long-form sample was quite effective, given the various requirements and implementation constraints discussed above, for the 1980 and 1990 and possibly the 2000 censuses, there remains room for improvement through use of a number of techniques that have been recently proposed for use in this area.
The fundamental idea is to try to make greater use of data from households in the local area to assist in the imputation while retaining the benefits currently expressed in the choice of variables for conditioning and the imputation rules. A first step away from the current technique would be to retain the notion of the use of matching donors, but make use of information from more than the last household processed. An idea that would be relatively easy to implement is that of fractionally weighted imputation, proposed by Fay (1996), in which for some small integer m, m matching donors provide imputations, which are averaged, each receiving weight 1/m. One benefit of this technique is that the Rao and Shao (1992) variance estimation procedure can be directly applied to the resulting database for a large class of estimators (though this variance estimator has limitations described below). Further, most of the software currently used could be retained. The choice of m involves an interesting trade-off between geographic proximity and stable estimation. To implement this technique, it would probably not be reasonable to simply expand the imputation matrices to retain m donors, since this would likely restrict the donors to “one side” of the current household. Instead, it would be preferable to search the area for the best m donors.
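A minimal sketch of fractionally weighted imputation follows, assuming the m best donors have already been found and ordered by closeness. The function names, the data, and the choice m = 3 are hypothetical; this illustrates only the 1/m weighting described above, not Fay's full procedure or the associated variance estimator.

```python
def fractional_impute(recipient_weight, donor_values, m):
    """Split one recipient into up to m fractional records.

    donor_values -- donor values, assumed already ordered from best to worst match
    Returns a list of (value, weight) pairs, each carrying an equal share
    of the recipient's weight.
    """
    donors = donor_values[:m]
    if not donors:
        raise ValueError("at least one donor is required")
    share = recipient_weight / len(donors)
    return [(value, share) for value in donors]


def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


if __name__ == "__main__":
    respondents = [(20.0, 1.0), (30.0, 1.0)]  # (commute, sampling weight)
    fractions = fractional_impute(1.0, [10.0, 20.0, 30.0, 90.0], m=3)
    print(fractions)
    print(weighted_mean(respondents + fractions))
```

Keeping the fractional records, rather than storing only their average, preserves the spread of donor values in the file, at the cost of a larger data set.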
A second step would be to try to use data from local households that do not match through use of statistical models. The introduction of statistical models into long-form item imputation is probably the direction that needs to be pursued for substantial improvement (see Box F.1). One attempt at use of statistical modeling for imputation has been carried out by Chiu et al. (2001). More directly relevant for the issues considered here is the work by Thibaudeau (1998) and Raghunathan et al. (2001).
In Thibaudeau (1998), basic-item imputations, which are simply categorical variables indicating membership in demographic subgroups and whether the residence is owned or rented, are generated for data collected in the 1998 census dress rehearsal in Sacramento, California. Thibaudeau uses a hierarchical log-linear model, fit at the tract level, which is composed of several two-way interactions of variables for a household and its immediate “neighbor” on the
Decisions as to which of many algorithms to use for imputation will involve the trade-off between an algorithm such as sequential hot-deck imputation in which no explicit model is used to identify a possible value for imputation, versus an algorithm that uses an explicit model. It is useful to examine some of the trade-offs and related considerations comparing and contrasting these two approaches.
Sequential hot-deck imputation selects donor households by requiring them to match recipient households on several variables, in addition to being closely proximate geographically. The matching implicitly assumes that the matching variables, and all of their interactions, are important for predicting the variable (variables) with missing values. Unfortunately, the requirement of a match may result in use of a donor household not as proximate geographically. Further, as mentioned above, reliance on data from one household for the imputation is inefficient, whereas model-based methods generally provide fairly well-behaved predictions. However, inefficiency of a single donor approach can be readily addressed by using a procedure such as Fay’s fractionally weighted imputation.
A model-based algorithm would propose that non-matching households within some geographic area be used to help fit parameters from some statistical model that would specify how the various variables are related to a variable with missing data. The hope would be that these models would be able to rely on a small number of parameters, since they may only have main effects and interactions, not using the higher-order interactions implicit in the matching.
With respect to geography, an empirical question is which type of algorithm would use households less proximate to the household of interest. A reasonable guess is that even with fractionally weighted imputation, one will usually need to use households further away with a model-based approach, though with the use of nonmatching donors, it is conceivable that one may be able to fit the parameters using households no more distant than the matching donor used in a sequential hot deck.
The obvious benefit from the use of a model rather than a proximate matched donor (or donors) is that models borrow more information to help reduce variance. However, there is a real opportunity for introduction of bias. So we have a standard bias versus variance trade-off. With a proximate matched donor, the imputation is likely to have minimal bias—unless the matching variables are poorly chosen—but it can have a substantial variance. The reason for the opportunity for additional bias with the model-based approach is that a model can provide poor imputations not only if the covariates are chosen poorly (in the same way as if matching variables are chosen poorly), but also if the model relating how the variables for imputation are related to the covariates is seriously wrong. Therefore, some efforts at model validation are needed, but unfortunately there are a lot of models here. For example, Thibaudeau fit his models at the tract level, and the United States has approximately 65,000 tracts.
There are two final smoothing possibilities that this problem raises. First, it would be bothersome if all of the local models used for imputation had strongly different parameter values. One would expect to see slow changes in the parameters as one moved from one area to another. Therefore, one interesting possibility would be to smooth the estimated parameters toward one another after fitting, using some approach for combining information. Another smoothing idea is to try to blend the benefits of sequential hot-deck imputation with those of a model-based approach, by taking some linear combination of the two.
census file. Very roughly speaking, this model has similar structure to sequential hot-deck imputation conditioned on matching variables, since these two-factor interactions are similar to the use of the same variables in setting up the various matrices containing donor records for neighbors. Imputations and parameter estimates are produced jointly using a form of the EM algorithm (see Box F.2), specifically, the data augmentation Bayesian iterative proportional fitting algorithm (DABIPF) found in Schafer (1997). The parameter estimates are interesting to examine to understand the local nature of various main effects and interactions in producing the imputations. Thibaudeau’s framework also provides immediate variance estimation through use of the posterior predictive distribution from his model, which also provides a basis for evaluating his procedure in comparison with sequential hot-deck imputation. The entire fitting process took 12 hours on a computer that can currently be emulated by many standard desktop computers; however, the population size of the dress rehearsal was only one-two thousandths that of the United States. Still, it is very reasonable to believe that this approach will be feasible before the 2010 census if the long form is used, and almost immediately for the American Community Survey. Schafer (1997) provides a more general approach that could be used to provide imputations for discrete and continuous data as well as categorical data. The main question is whether a reasonably robust model could be identified that would provide good imputations for long-form items.
In Raghunathan et al. (2001), an imputation procedure is described that has been implemented recently in the SAS software package as part of the IVEware system. This method is also closely related to the methods described in Schafer (1997). Using a Bayesian framework, explicit models for selected parameters and data are used, relating an individual response variable to missing values, conditional on the fully observed values and unknown parameters. Bayesian simulation is used to update a noninformative prior to form a posterior distribution for both the parameters and the missing values. Imputations are therefore draws from the posterior predictive (marginal) distribution for the missing values. The models are regression-type models, including logistic, Poisson, and generalized logit, to be able to fit categorical, discrete, and continuous data-type variables. In addition to the regression models, IVEware can accommodate restrictions of the regression models to relevant subpopulations, especially including local areas, and the imposition of logical bounds or constraints for the imputed values.
Box F.2: The EM Algorithm
The EM algorithm (E is for expectation, M is for maximization of likelihood) is a broadly applicable method for providing maximum-likelihood estimates in the presence of missing data. The EM algorithm was identified and initially investigated in its general form by Dempster et al. (1977). (It had been applied in several different missing-data situations for decades without the common structure of the techniques being recognized.) Begin with initial values for the parameters of the distribution generating the data and the sufficient statistics for this distribution. (Sufficient statistics are often small-dimensional functions of the data that allow one to estimate the parameters of a distribution.) Then the EM algorithm can be informally described, for well-behaved data distributions, as an iterative application of the following two steps, continuing until convergence:
E-Step: Fill in missing portions of the sufficient statistics due to the missing data with their expectation given the observed data and the current estimated values for the parameters of the data distribution.
M-Step: Using these estimated sufficient statistics, carry out a standard maximum-likelihood calculation to update the estimates of the parameters of the data distribution.
So, very crudely, the parameters help one to identify the contribution of missing data to the sufficient statistics, and then the parameters are reestimated given the updated estimated sufficient statistics.
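The E-step and M-step described in Box F.2 can be made concrete with a small textbook case, sketched below in Python. The setting is an assumption chosen for illustration and is not one of the census models discussed here: two variables are treated as bivariate normal, x is fully observed, and y is missing for some records. The sufficient statistics are the sums of y, y squared, and x times y; the function name, starting values, and data are all hypothetical.

```python
def em_bivariate_normal(x, y, tol=1e-10, max_iter=2000):
    """EM estimates for a bivariate normal when some y values are missing.

    x is fully observed; y uses None for missing values. Returns the
    estimates (mu_x, mu_y, s_xx, s_yy, s_xy).
    """
    n = len(x)
    reported = [yi for yi in y if yi is not None]
    # Statistics involving only x never depend on the missing data.
    mu_x = sum(x) / n
    s_xx = sum(xi * xi for xi in x) / n - mu_x ** 2
    # Crude starting values: reported-case moments of y, zero covariance.
    mu_y = sum(reported) / len(reported)
    s_yy = sum((yi - mu_y) ** 2 for yi in reported) / len(reported)
    s_xy = 0.0
    for _ in range(max_iter):
        # E-step: expected contribution of every record to the sufficient
        # statistics sum(y), sum(y^2), sum(x*y) under current parameters.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        t_y = t_yy = t_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + beta * (xi - mu_x)    # E[y | x]
                eyy = ey * ey + resid_var         # E[y^2 | x]
            else:
                ey, eyy = yi, yi * yi
            t_y += ey
            t_yy += eyy
            t_xy += xi * ey
        # M-step: complete-data maximum likelihood from those statistics.
        new_mu_y = t_y / n
        new_s_yy = t_yy / n - new_mu_y ** 2
        new_s_xy = t_xy / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy),
                     abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_x, mu_y, s_xx, s_yy, s_xy


# Hypothetical data: y is missing for the two records with the largest x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, None, None]
mu_x, mu_y, s_xx, s_yy, s_xy = em_bivariate_normal(x, y)
```

Because y is missing only where x is large, the mean of the reported y values (6.0) understates the mean of y. The EM estimate uses the relationship between x and y to correct for this, which is the sense in which available responses help predict missing ones.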
In the fitting process, variables are imputed individually, conditionally on all other observed and imputed variables at that point in the computation. However, Raghunathan et al. (2001) argue that by cycling through the fitting process variable by variable, the imputations can properly represent their dependence structure. As will be discussed below, Raghunathan et al. (2001) propose repeating the process of randomly drawing from the posterior predictive distribution for the missing values to form multiple imputations for variance estimation. The computational demands of IVEware are considerable, and efforts are being made to implement modifications that could permit use on data sets of the size of the ACS.
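The variable-by-variable cycling described above can be sketched for two continuous variables as follows. This is a deliberately simplified stand-in and not IVEware's algorithm or interface: it plugs least-squares estimates into a normal regression and draws only the residual, whereas the procedure of Raghunathan et al. (2001) also draws the regression parameters from their posterior distribution and supports other model types. All function names and data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    return intercept, slope, (rss / max(n - 2, 1)) ** 0.5


def impute_one(target, predictor, missing, rng):
    """Redraw the missing entries of `target` from a regression on `predictor`."""
    donors = [i for i in range(len(target)) if i not in missing]
    b0, b1, sd = fit_line([predictor[i] for i in donors],
                          [target[i] for i in donors])
    for i in sorted(missing):
        target[i] = b0 + b1 * predictor[i] + rng.gauss(0.0, sd)


def chained_imputation(u, v, cycles=10, seed=0):
    """Cycle through u and v, imputing each given the current other one."""
    rng = random.Random(seed)
    miss_u = {i for i, a in enumerate(u) if a is None}
    miss_v = {i for i, b in enumerate(v) if b is None}
    # Starting values: fill each variable with the mean of its reported values.
    mean_u = sum(a for a in u if a is not None) / (len(u) - len(miss_u))
    mean_v = sum(b for b in v if b is not None) / (len(v) - len(miss_v))
    u = [mean_u if a is None else a for a in u]
    v = [mean_v if b is None else b for b in v]
    for _ in range(cycles):
        impute_one(u, v, miss_u, rng)   # u given current v
        impute_one(v, u, miss_v, rng)   # v given updated u
    return u, v


# Hypothetical data with missing values in both variables.
u = [1.0, 2.0, None, 4.0, 5.0, None]
v = [2.1, None, 6.2, 8.1, None, 11.9]
u_done, v_done = chained_imputation(u, v)
```

Running the function again with a different `seed` gives another completed data set, which is how repeated draws would supply the multiple imputations mentioned above.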
It is clear that the area of imputation has a number of promising research avenues well underway. Given that, it would be reasonable to predict that some form of EM imputation mechanism that was feasible computationally for either the long-form sample or the ACS and preferable to sequential hot-deck imputation could be identified well before the 2010 census with concentrated research efforts.
F.5 NEW APPROACHES TO VARIANCE ESTIMATION AND THEIR ADVANTAGES
Along with the recent work on alternatives to sequential hot-deck imputation, there has been promising work done recently on estimating variances for statistics computed from data with missing values.
The first suggestion for an omnibus approach to the estimation of variances was Rubin’s proposal of multiple imputation. (For details, see, e.g., Little and Rubin, 1987.) The basic idea is to repeat the process for producing imputations for missing values in a data set m times, where the imputation mechanism can be represented as draws from a posterior predictive distribution for the missing values. The m data sets, “completed” through use of one of the m imputations, are each used to estimate a statistic S and its variance, where the variance is computed assuming that the imputations were collected data. The variance of the statistic S is estimated by separately estimating the within-imputation and the between-imputation contributions to variance, as follows:
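$$\widehat{\mathrm{Var}}(S) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{WithinVar}_i + \left(1 + \frac{1}{m}\right)\mathrm{BetweenVar},$$
where $\mathrm{WithinVar}_i$ is the standard complete-data variance of $S_i$, $S_i$ is the estimate for S computed from the ith completed data set, and BetweenVar is the variance of the $S_i$'s, namely
$$\mathrm{BetweenVar} = \frac{1}{m-1}\sum_{i=1}^{m}\left(S_i - \bar{S}\right)^2,$$
where $\bar{S}$ is the average of the $S_i$. Multiple imputation has been implemented in IVEware and in the software described in Schafer (1997).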
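For a single statistic, the combining step can be sketched in a few lines of Python. The sketch assumes the standard form of Rubin's combining rule, in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance; the function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_multiple_imputations(estimates, within_variances):
    """Combine m completed-data results into one estimate and its variance.

    `estimates` holds S_1..S_m and `within_variances` holds the
    complete-data variance computed from each completed data set.
    """
    m = len(estimates)
    if m < 2 or len(within_variances) != m:
        raise ValueError("need matching results from at least two imputations")
    s_bar = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)           # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return s_bar, total


# Hypothetical results from m = 3 completed data sets.
s_bar, total_var = combine_multiple_imputations([10.0, 12.0, 11.0],
                                                [4.0, 5.0, 3.0])
```

Here the average within-imputation variance is 4 and the between-imputation variance is 1, so ignoring the between-imputation term would understate the variance by a quarter.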
A great deal of research has been carried out examining the properties of multiple imputation (see Rubin, 1996). However, to date it is not widely used by federal statistical agencies, and in particular, it has not been used to estimate variances for data output from the long-form sample. One primary reason for this is the need to provide users with m versions of a large database (or cross-tabulation). However, the greater availability of computer memory and the wide acceptance by users of acquiring data sets electronically rather than on paper reduce the relevance of this criticism.
Another objection to the use of multiple imputation was raised by Fay (1992). Fay discovered that when the assumptions for the model underlying the imputation mechanism and the assumptions underlying the model used by the analyst on the data set with the imputations are different, multiple imputation can produce erroneous estimates of variance. A number of people have contributed to this discussion to determine how likely the situation identified by Fay is to occur in practice and what the magnitude of the bias would be, including Binder (1996); Eltinge (1996); Fay (1996); Judkins (1996); Meng (1994); Rubin (1996). We believe that imputation models exist that are suitably general to accommodate many analyses of the data without biasing the variance estimation substantially. However, this is both a theoretical and an empirical question that is worth further investigation in the census context.
Several approaches have been proposed that might avoid these two primary objections to multiple imputation. A leading alternative approach has been proposed by Rao and Shao (1992), which is based on a modification to the leave-one-out jackknife that accounts for the effect of each collected data value on the mean of the imputed values, when the objective is to estimate the variance of a mean (though this technique works for more general types of estimators). As pointed out by Judkins (1996), though Rao-Shao avoids the Fay difficulty, the algorithm is inherently univariate and therefore is not a candidate for an omnibus approach to variance estimation for the long-form sample.
Kim and Fuller (2002) have a specific proposal that is also very promising for directly estimating the variance of statistics computed from data sets with missing values, in particular for long-form-type applications. Their approach has a great deal of appeal, but it seems somewhat limited in its ability to handle highly multivariate patterns of nonresponse. This would exclude, for example, most small-area estimators. Also, it is not clear what the computing demands are for Kim and Fuller’s algorithm and therefore whether it could be implemented on as large a problem as the decennial census long-form sample.
One may wonder why the posterior predictive distribution for the imputations created from application of the EM algorithm could not be used to estimate variances due to each imputation and thus the variances for statistics computed using the imputations (as was done by Thibaudeau, 1998). In fact, this could be carried out, at least approximately, for most situations, for specific statistics of interest. However, as mentioned above, the objective is an omnibus approach to variance estimation that can be used for any statistics of interest to a census data user. For example, suppose a public use microdata sample user wanted to fit a small-area model of average income by area, explained by other area averages, and was particularly interested in the significance of percent female head of household as a covariate in this model. Use of the posterior predictive variances would still require that a considerable amount of analytic work be carried out by this user in order to compute variances of the parameter estimates in this model. Similarly, there is an issue as to how posterior predictive variances should be communicated to users interested in fitting log-linear models to census cross-tabulations.
Given the degree of item nonresponse in the 2000 census long-form sample, estimating variances in a way that incorporates the variance due to missing values is the key missing-value problem facing the Census Bureau heading into either a 2010 long-form sample or the ACS. As discussed here, there are some very exciting research efforts now ongoing that might be feasible by 2010. Even if none of the newer possibilities can be fully implemented on either a 2010 long-form sample or the ACS, multiple imputation, if used carefully, should be strongly preferable to the current practice of ignoring the contribution to variances from item nonresponse.
F.6 SUGGESTIONS FOR WORK FOR THE AMERICAN COMMUNITY SURVEY AND THE 2010 CENSUS
To conclude the above discussion, a number of suggestions are proposed for work leading up to either the 2010 census or to the full implementation of the American Community Survey, depending on how long-form-sample information is to be collected in the immediate future. The Census Bureau should:
Examine patterns of nonresponse in the 2000 long-form sample to see the extent to which assumptions about data either missing completely at random or missing at random are justified. This can be accomplished through use of reinterview studies (a 2000 reinterview study to measure response variance was carried out; see Singer and Ennis, 2003) and through use of matching to other data sources, including administrative records.
Examine the current quality of the sequential hot-deck imputation. This can be accomplished in two steps. First, to see whether hot-deck imputation accurately mimics the data generation mechanism for respondents, simulate with 2000 census files the random omission of collected data and examine the quality of the resulting imputations in comparison to the values omitted. Second, to see whether the mechanism for census long-form-item nonresponse is ignorable, use reinterview studies and matching studies to compare the sequential hot-deck imputations to collected data through these sources. If this is not feasible, late-arriving census forms might be considered to be intermediate between responses and missing values, and imputations could be simulated for and compared to those values. Such analyses should distinguish between data provided by household members and proxy responses.
Implement a comprehensive procedure for validation of imputation rules. As mentioned previously, the Census Bureau has constraints on imputations to observe various relationships with other collected and imputed information. The question is whether collected data observe these rules, and whether there are new rules not currently in place that should be added. This is essentially a data mining problem (though obviously some field work would be beneficial to decide whether data that failed edits were valid), and there is a growing body of techniques, such as classification and regression trees, and other forms of machine learning that, when applied to the decennial census long-form sample, could be used to examine the validity of current rules and the benefits of additional ones.
Initiate a comprehensive program pointed toward 2010 to examine whether some modification of the work of Thibaudeau (1998) or other EM algorithm-type approaches to long-form-item imputation are feasible and whether they provide superior imputations in comparison to sequential hot-deck imputation.
Initiate a comprehensive program pointed toward 2010 to examine whether some type of multiple imputation process would be feasible to incorporate the variance due to item nonresponse in the long-form sample or in the ACS. The goal would be to publish estimated population counts and other summary statistics (e.g., means and frequencies) with standard confidence intervals computed using variances resulting from implementation of multiple imputation, and also to release in electronic form multiply imputed versions of cross-tabulations and PUMS for the same purpose.
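The simulation suggested in the second item above, in which collected values are randomly omitted and the resulting imputations are compared with the values omitted, might be organized along the following lines. This is only a sketch: `masking_agreement`, the toy `carry_forward` rule, and the data are hypothetical, and exact agreement is used as the quality measure because the example item is categorical (a continuous item would call for an error measure instead).

```python
import random


def masking_agreement(values, impute, fraction=0.2, seed=0):
    """Share of deliberately masked values that `impute` recovers exactly.

    `values` must contain at least one reported (non-None) value.
    """
    rng = random.Random(seed)
    reported = [i for i, v in enumerate(values) if v is not None]
    k = max(1, round(fraction * len(reported)))
    masked = rng.sample(reported, k)
    trial = list(values)
    for i in masked:
        trial[i] = None             # pretend the respondent skipped the item
    completed = impute(trial)
    hits = sum(1 for i in masked if completed[i] == values[i])
    return hits / k


def carry_forward(values, default=None):
    """Toy imputation rule: reuse the nearest preceding reported value."""
    out, last = [], default
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


# Hypothetical item reported by ten neighboring households.
tenure = ["own", "own", "rent", "rent", "own",
          "own", "own", "rent", "rent", "own"]
rate = masking_agreement(tenure, carry_forward, fraction=0.3)
```

Because the omissions here are random by construction, a high agreement rate speaks only to how well the imputation mimics the data for respondents; it says nothing about whether actual nonresponse is ignorable, which is why the suggestion pairs this step with reinterview and matching studies.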
Finally, with respect to shifting long-form-type data collection from the decennial census to the ACS, it is useful to point out that the missing data problem is somewhat different for that survey. The ACS, having a lower sampling rate than the long form, has fewer neighbors in close proximity to a household with item nonresponse (and the rate of item nonresponse is likely to be lower given experience with the Census 2000 Supplementary Survey; see Section 7-C.2). This argues more strongly for a model-based approach. Also, the data file is somewhat smaller, putting fewer computational demands on an algorithm. Further, there is the possibility of stratifying responses for imputation by mode of response (telephone vs. mail). Given this, the algorithms that are feasible and optimal for these two problems may not be identical. But a first guess would be that the problems are sufficiently similar so that procedures useful for the long-form sample would be useful for ACS. An early decision on which approach is to be taken for collection of long-form information in 2010 would help focus research. Lastly, the continuous nature of the ACS would make it easier to carry out research and experimentation on possible imputation methods.