5
Weighting and Imputation
Nonresponse and undercoverage in the ACS will require various weighting and imputation schemes to produce annual and monthly estimates that accommodate both of these sources of incomplete data. The Census Bureau's current plans for the ACS involve the use of as many as 11 factors in an overall weighting scheme (see Alexander et al., 1998, for a more detailed description). The factors are designed to account for:
- oversampling of small governmental units,
- computer-assisted personal interviewing (CAPI) subsampling,
- monthly variations of the percentage of the population using different response modes,
- noninterviews (two separate factors called “noninterview factors”),
- mode bias,
- differences in nonresponse in individuals and households, so that housing unit counts agree with the totals on the master address file (two separate factors),
- differences between marginal population totals and totals based on demographic analysis resulting from census undercoverage (referred to as a person poststratification factor), and
- household-level undercoverage.
As is obvious from these descriptions, some weights need to be applied to individual records and some are applied to household records.
While the justification for many of these factors is relatively straightforward, for others, such as the noninterview adjustment or the mode bias factor, there are clearly a variety of ways to define the weights and how they are applied. How should alternatives be judged? What are the evaluation criteria? Also, the adequacy of some of the controls, such as the population controls (person poststratification factors), needs to be considered; perhaps methods could be developed in which ACS data would be used to improve various control factors.
Charles Alexander suggested that the weighting scheme currently used in the ACS pilot testing might include some unnecessary factors, since it was designed before data were available in order to have something in place for the July 1997 deadline for 1996 data. This timing resulted in a number of weighting factors that in practice are very close to one, so their utility, at least for the initial application, has been minimal. The overall approach that the Census Bureau adopted is somewhat old-fashioned, mimicking that used in the decennial census, but it is known to work. At least one of the weighting factors results from the Census Bureau's decision to have a given month's estimates make use of data collected during that month, rather than data originating from the sample selected for that month. This was done to provide data that would likely have less mean square error, based on the reasoning that it was easier for the respondent to provide reliable information for the month of data collection (reducing recall bias), but it has complicated the weighting factors needed to combat nonresponse.
The Census Bureau is not wedded to the current methodology, and there is still an opportunity to make modifications to the weightings that are used with the 1999 ACS. In addition, after data have been collected for a few years of full implementation, it is hoped that the weighting scheme can be changed to reflect specific aspects of the data. One area that is likely to change is the controlling of ACS population counts to county-level population estimates. The Census Bureau is uncomfortable assuming that the county-level population estimates are so reliable that they cannot be improved through combination with the ACS estimates, especially at the county level disaggregated by age, race, and sex. (Differences in residence rules also may make these controls questionable for this application.) For the future, the Census Bureau is even considering, through use of ACS data, development of population estimates at the tract level.
RESEARCH DIRECTIONS
Robert Bell's presentation on weighting intentionally raised questions more than it provided answers. There are at least three purposes of weighting: (1) adjusting for nonresponse and related issues, (2) poststratifying to accepted external values (such as dealing with undercoverage), and (3) treating differential sampling rates. This discussion focuses mainly on weighting for treatment of nonresponse (and mode bias) or for differential sampling rates, and the few general points made relate to the use of weighting for those purposes. However, some minor points are also made concerning a specific use of poststratification in the ACS.
The typical purpose of weighting to adjust for nonresponse is to reduce bias without greatly increasing variance. Biases can occur when three conditions hold: (1) sampling probabilities vary (by design or otherwise, as through nonresponse) with some background variable, x, such as geography, tenure, or a demographic characteristic; (2) an outcome of interest, y, also varies with x; and (3) the sampling probability is correlated with y. In this way, housing units with a high (or low) probability of being sampled are likely to have a response that is higher (or lower) on average, which causes a bias. If the nonresponse is properly modeled as a missing-at-random process (i.e., the value of the outcome, y, plays no role in the probability of response conditional on x), weighting could theoretically eliminate the bias. Although this assumption rarely holds exactly, weighting can still serve to reduce bias.
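The bias mechanism and the missing-at-random fix can be illustrated with a minimal sketch; all numbers (the strata, response rates, and outcome means) are hypothetical:

```python
# Illustrative sketch: two strata defined by a background variable x
# (say, renters vs. owners).  Response rates and outcome means both vary
# with x, so the unweighted respondent mean is biased; inverse-response-rate
# weights remove the bias under the missing-at-random assumption
# (response depends only on x, not on y).

strata = [
    # (population share, response rate, mean outcome y within stratum)
    {"share": 0.30, "resp_rate": 0.50, "y_mean": 10.0},  # e.g., renters
    {"share": 0.70, "resp_rate": 0.90, "y_mean": 20.0},  # e.g., owners
]

true_mean = sum(s["share"] * s["y_mean"] for s in strata)

# Unweighted mean over respondents: high-response strata are over-represented.
resp_share = sum(s["share"] * s["resp_rate"] for s in strata)
unweighted = sum(s["share"] * s["resp_rate"] * s["y_mean"] for s in strata) / resp_share

# Weight each respondent by 1 / (response rate of its stratum): this restores
# each stratum to its population share, eliminating the bias.
num = sum(s["share"] * s["resp_rate"] * (1 / s["resp_rate"]) * s["y_mean"] for s in strata)
den = sum(s["share"] * s["resp_rate"] * (1 / s["resp_rate"]) for s in strata)
weighted = num / den

print(round(true_mean, 2))   # 17.0
print(round(unweighted, 2))  # biased upward: 18.08
print(round(weighted, 2))    # 17.0
```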
At the same time, weighting tends to increase variance, which raises weighting choices involving tradeoffs of bias and variance.1 The ideal solution would be to compute an estimate that is a weighted mean of predicted values for strata—the predicted values making use of a model for nonresponse within the strata—weighted by the population size of the strata. Then the variance could be controlled by “shrinking” the predicted values through the type of models discussed in Chapter 2. For example, an analysis of variance model could relate stratum estimates based on the demographic or other characteristics used to define the strata.
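The shrinkage idea can be sketched as a precision-weighted compromise between each stratum estimate and the overall mean; all numbers, including the assumed between-stratum variance, are hypothetical:

```python
# Hypothetical stratum estimates: (population share, estimated mean y,
# sampling variance of that estimate).  Noisier strata are shrunk harder
# toward the overall mean -- a simple precision-weighted compromise of the
# kind the ANOVA-type models in Chapter 2 would produce.
strata = [
    (0.2, 12.0, 4.0),
    (0.5, 20.0, 1.0),
    (0.3, 30.0, 9.0),
]

overall = sum(share * y for share, y, _ in strata)

tau2 = 5.0  # assumed between-stratum variance; in practice estimated from data
shrunk = [(share, (y / v + overall / tau2) / (1 / v + 1 / tau2))
          for share, y, v in strata]

# Final estimate: shrunken stratum values weighted by population shares.
estimate = sum(share * y for share, y in shrunk)
print(round(overall, 2), round(estimate, 2))  # 21.4 20.69
```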
Practically, however, many users either cannot or would not perform this type of modeling. The Census Bureau therefore needs to use a method that takes into consideration how the data will actually be analyzed, which is to make use of an output file that is the result of the application of weights.
There are many decisions involved in weighting, such as whether to address a specific set of problems or to apply a more general weighting framework, what level of geography to use in forming weighting cells, what factors to take into account, and whether to cross-classify those factors or to do some sort of raking.2 Most of the decisions (both explicit and implicit) made by the Census Bureau in its weighting scheme for the ACS seem very reasonable. However, there are many things that could have been done in other ways. It is worth mentioning that bias correction is elusive, in the sense that correcting for bias at a given level of aggregation (e.g., geographic) is no assurance that a bias does not exist at a lower level of aggregation. So, for example, correcting a bias at the county level does not necessarily correct for bias at the tract level.

1 While this statement is true when no additional, external information is used, there are applications in which weights reduce variance through the use of external information. See, e.g., Rosenbaum (1987), which provides a technique that might be applicable to the ACS.

2 Raking refers to a constant multiplicative adjustment within a row or a column so that the rows or columns of the revised table add up to a given marginal row or column total.
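Raking, as just defined, can be sketched as iterative proportional fitting; the table entries and marginal targets below are illustrative:

```python
# A minimal sketch of raking (iterative proportional fitting) on a 2x2
# table: rows and columns are alternately rescaled by constant factors
# until the table matches the given marginal totals.  Numbers are illustrative.
def rake(table, row_totals, col_totals, iters=50):
    t = [row[:] for row in table]
    for _ in range(iters):
        for i, target in enumerate(row_totals):        # scale each row
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        for j, target in enumerate(col_totals):        # scale each column
            s = sum(t[i][j] for i in range(len(t)))
            for i in range(len(t)):
                t[i][j] *= target / s
    return t

fitted = rake([[60, 80], [40, 20]], row_totals=[140, 90], col_totals=[110, 120])
print([[round(v, 1) for v in row] for row in fitted])
```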
Because weighting involves a tradeoff between bias and variance, how should this tradeoff be managed? One criterion that makes sense is to minimize mean square error (though there are alternatives). This criterion still leaves two important issues: (1) for which outcomes is one minimizing mean square error, and (2) at what level of geography and for what period of time should the weighting focus? With respect to outcomes, it is important to decide which have the highest priority and therefore require weighting treatment: should the focus be on demographic characteristics or on attributes captured on the census long form? Reflecting the second issue, what is the relative importance of accuracy at different levels of geography, or for shorter versus longer time periods? If one tries to minimize mean square error for each month of the survey, too much emphasis will be put on reducing variance; the result will be a bias that tends to replicate month after month and that will dominate the variance for longer moving averages, such as annual estimates. Setting up this criterion is not easy, but it is very useful to do, since it puts emphasis on the outcomes and makes explicit which interactions one believes are important to consider.
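A back-of-the-envelope calculation shows why month-by-month optimization overweights variance: averaging 12 monthly estimates divides the variance by 12 but leaves a persistent bias untouched (all numbers hypothetical):

```python
# Hypothetical sketch: a persistent monthly bias that looks small next to
# the monthly variance comes to dominate the mean square error of a
# 12-month average, because averaging shrinks the variance but not the bias.
bias = 1.0        # same systematic error every month
var_month = 25.0  # variance of a single monthly estimate

mse_month = bias**2 + var_month         # 26.0: bias is under 4% of monthly MSE
mse_annual = bias**2 + var_month / 12   # ~3.08: bias is ~32% of annual MSE

print(mse_month, round(mse_annual, 2))
```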
An ACS feature worthy of attention is that the monthly estimates are based on responses collected in that month, which is quite different from basing them on the questionnaires of the panel selected for that month. Furthermore, respondents are asked to provide answers corresponding to the month of data collection, not to the month of panel selection. One major advantage is that the data are available sooner: for example, data from households in the March sample that respond by mail in March are immediately available for combination with data from the January and February sample households that also responded in March, in person and by telephone, respectively. In contrast, for sample-based cohorts, one would have to wait until May for the complete data collection for March. The Census Bureau cites a second advantage, that of reducing recall bias. While there are advantages, the approach does necessitate additional weighting, referred to as “variable monthly sampling weights,”3 to address the biases that might be caused by the decision to base estimates on the data collected during a month.
Consider a case in which the March mail response was 40 percent of the sample, the CATI response was 30 percent of the February panel, and the CAPI follow-up was 20 percent of the January panel (see Table 5-1). Then the 20 percent and the 30 percent need to be jointly reweighted to represent the 60 percent missing from the March sample, so the individuals in those cells should get a weight of 1.2. This is reasonable if there are no major month-to-month differences. However, suppose the reason there was only a 40 percent response in March was a flood that month, so that the mail responses came from people who responded early in the month and who had a high propensity to respond. In this case it is not clear that the January CAPI and the February telephone responses are representative of the same groups for March. The question is whether variable monthly sampling weights can fix systematic variations in response rates.4

TABLE 5-1 Mode and Month of Data Collection

Month of Mailout    Mail        CATI        CAPI        Total
January             Jan 55%     Feb 25%     Mar 20%*    100%
February            Feb 45%     Mar 30%*    April 25%   100%
March               Mar 40%*    April 30%   May 30%     100%

NOTE: Asterisks indicate the data collected in March.

3 The phrase “variable monthly sampling weights” will refer in what follows specifically to the weights applied to address this potential source of bias.
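The reweighting arithmetic of the March example above is simply:

```python
# Arithmetic of the March example: shares of the data collected in March,
# by the panel month it came from (hypothetical rates from the text).
mail_mar = 0.40   # March panel, responded by mail in March
cati_feb = 0.30   # February panel, CATI interview completed in March
capi_jan = 0.20   # January panel, CAPI interview completed in March

# The CATI and CAPI cells must jointly stand in for everyone missing
# from the March mail response.
missing = 1.0 - mail_mar                   # 0.60
weight = missing / (cati_feb + capi_jan)   # 0.60 / 0.50 = 1.2
print(weight)  # 1.2
```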
A second set of weighting factors that might benefit from an alternative approach is the noninterview weights. These are two-stage weights in which the first stage weights all respondents to account for nonresponse and the second stage accounts for the different modes of response. (The stages are clarified in the example that follows.) For purposes of this set of weighting factors, there are three categories of people in terms of response: (1) mail or CATI respondents, (2) CAPI respondents, and (3) CAPI nonrespondents (see Table 5-2). One approach would be to assume that the CAPI respondents provide the best information about the nonrespondents, which would argue for weighting the CAPI respondents to take CAPI nonresponse into account within each tract (see Table 5-3). The Census Bureau is concerned that this could generate relatively high weights in some tracts if the number of CAPI responses is small. Instead, the Census Bureau decided to use a two-stage procedure. In the first stage, all respondents are weighted up to account for the CAPI nonresponse (see Table 5-4). In doing so, one essentially imputes complete records through weighting, using the records of mail/CATI respondents more heavily than those of CAPI respondents, so it is necessary to reweight to remedy this.

TABLE 5-2 Respondents and Nonrespondents by Mode and Tract

Mode        Tract I    Tract II    Total
Mail/CATI   60         80          140
CAPI        40         20          60
None        10         20          30
Total       110        120         230

TABLE 5-3 Single-Stage Alternative: CAPI Respondents Reweighted for CAPI Nonresponse

            Weights               Weighted Counts
Mode        (1)       (2)         (1)       (2)       Total
Mail/CATI   1.00      1.00        60        80        140
CAPI        1.25      2.00        50        40        90
Total                             110       120       230

NOTE: Columns (1) and (2) correspond to Tracts I and II.

4 A simulation study is currently under way at the Census Bureau to examine whether using the data collected in a given month (rather than the data from the sample selected for that month) causes too much bias to support continuation of that procedure.
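The first-stage weights, and the single-stage alternative they replace, follow directly from the counts in Table 5-2; a short sketch of the arithmetic:

```python
# Hypothetical counts from Table 5-2: respondents by mode and tract,
# plus CAPI nonrespondents ("none").
tracts = {
    "I":  {"mail_cati": 60, "capi": 40, "none": 10},
    "II": {"mail_cati": 80, "capi": 20, "none": 20},
}

stage1, capi_only = {}, {}
for name, t in tracts.items():
    total = t["mail_cati"] + t["capi"] + t["none"]
    respondents = t["mail_cati"] + t["capi"]
    # Census Bureau first stage: weight ALL respondents up to the tract total.
    stage1[name] = total / respondents
    # Single-stage alternative: weight only CAPI respondents to cover
    # the CAPI nonrespondents within the tract.
    capi_only[name] = (t["capi"] + t["none"]) / t["capi"]

print(stage1)     # {'I': 1.1, 'II': 1.2}
print(capi_only)  # {'I': 1.25, 'II': 2.0}
```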
In the second stage, the mode bias factor downweights mail/CATI respondents by the ratio of the observed mode total across tracts to the total after the first-stage nonresponse weighting (see Table 5-5). The difficulty with this procedure is that the weights for each tract no longer match the number of (weighted) housing units in the sample; this is remedied by another weight in a later stage of the weighting scheme (see Table 5-6).
TABLE 5-4 First Stage: All Respondents Weighted Up for CAPI Nonresponse

            Weights               Weighted Counts
Mode        (1)       (2)         (1)       (2)       Total
Mail/CATI   1.10      1.20        66        96        162
CAPI        1.10      1.20        44        24        68
Total                             110       120       230

NOTE: To reweight Mail/CATI and CAPI respondents to equal the tract totals, apply the ratio of the tract total to the number of respondents: 1.10 = 110/100 and 1.20 = 120/100.

TABLE 5-5 Second Stage: Mode Bias Factor Applied

            Weights               Weighted Counts
Mode        (1)       (2)         (1)       (2)       Total
Mail/CATI   0.95      1.04        57.0      83.0      140
CAPI        1.46      1.59        58.2      31.8      90
Total                             115.2     114.8     230

NOTE: To make the CAPI percentage equal the observed rate, apply the four weights as follows: 0.95 = 1.10 (140/162), 1.04 = 1.20 (140/162), 1.46 = 1.10 (90/68), and 1.59 = 1.20 (90/68).

In the example provided, the two-stage weights for this factor have the advantage of considerably less variability (a range of 0.91 to 1.66) compared with the procedure that reweights only the CAPI respondents (a range of 1.00 to 2.00). In a sense, instead of a cell-based model, one is using a type of raking. The reduction in the variability of the weights is not by itself a compelling argument for this procedure, especially if there are strong effects particular to tract-mode combinations. However, in the absence of such effects, it is difficult to design alternatives that have obvious advantages while also keeping the variability of the weights from becoming too large. Therefore, depending on the data, this approach may have a considerable advantage over the single-stage procedure described above.
TABLE 5-6 Final Adjustment to Tract Housing-Unit Totals

            Weights               Weighted Counts
Mode        (1)       (2)         (1)       (2)       Total
Mail/CATI   0.91      1.08        54.4      86.8      141.2
CAPI        1.39      1.66        55.6      33.2      88.8
Total                             110.0     120.0     230.0

NOTE: To make the housing counts in each tract equal to the observed number, 0.91 = 0.95 (110/115.2), 1.08 = 1.04 (120/114.8), 1.39 = 1.46 (110/115.2), and 1.66 = 1.59 (120/114.8).
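The full sequence in Tables 5-4 through 5-6 can be reproduced in a short sketch; the counts come from Table 5-2, and the small differences from the tabled weights arise because the tables round at each stage:

```python
# Reproduce Tables 5-4 through 5-6: first-stage nonresponse weighting,
# second-stage mode bias factor, and the final adjustment back to tract
# housing-unit totals.  Counts are the hypothetical ones from Table 5-2.
counts = {
    "mail_cati": {"I": 60, "II": 80},   # mail or CATI respondents
    "capi":      {"I": 40, "II": 20},   # CAPI respondents
}
nonresp = {"I": 10, "II": 20}           # CAPI nonrespondents

tracts = ("I", "II")
tract_totals = {t: counts["mail_cati"][t] + counts["capi"][t] + nonresp[t]
                for t in tracts}        # 110 and 120

# Stage 1: weight all respondents up to the tract total (Table 5-4).
w = {m: {t: tract_totals[t] / (counts["mail_cati"][t] + counts["capi"][t])
         for t in tracts} for m in counts}

# Stage 2: mode bias factor (Table 5-5).  Across tracts, mail/CATI is scaled
# back to its observed total of 140, while CAPI is scaled up to 90 so that it
# absorbs the 30 CAPI nonrespondents.
mode_targets = {"mail_cati": 140, "capi": 90}
for m in counts:
    cur = sum(w[m][t] * counts[m][t] for t in tracts)
    for t in tracts:
        w[m][t] *= mode_targets[m] / cur

# Stage 3: restore each tract's weighted housing-unit total (Table 5-6).
for t in tracts:
    cur = sum(w[m][t] * counts[m][t] for m in counts)
    for m in counts:
        w[m][t] *= tract_totals[t] / cur

print({m: {t: round(w[m][t], 2) for t in tracts} for m in counts})
# mail/CATI: 0.91 and 1.08; CAPI: 1.39 and 1.66 -- a range of 0.91 to 1.66,
# versus 1.00 to 2.00 for the single-stage CAPI-only reweighting of Table 5-3.
```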
Another factor with some interesting alternatives is the person poststratification factor. The idea of this weighting factor is to try to correct for undercoverage by controlling the ACS population counts to independent population estimates based on age, sex, race, and Hispanic origin at the county level. The key question is whether the population estimates provide better information than the ACS data at the county level and for demographic subgroups. The estimates from the ACS probably provide valuable additional information on changes since the last census at low levels of aggregation, but for higher-level totals the direct ACS estimates may be inferior. In addition, even controlling to better demographic data may not improve the estimates of population characteristics. For example, if ACS has inferior estimates because there is differential undercoverage, weighting probably will help; however, if the estimates are inferior as a result of the different manner in which people respond to the ACS questionnaire in contrast to the census questionnaire, then the ACS estimates may be harmed through use of this weighting.
A related point is that the ACS provides an estimate of the average population over the year, while population controls represent a point-in-time estimate of population, which could make a difference in areas with seasonal populations. In reaction to this possibility, the Census Bureau has developed a different question on the 1999 ACS form about the seasonal population. (This is one of potentially several areas in which ACS could be controlled to estimates that are conceptually distinct.)
The following points require further consideration. First, given that there are both individual and household weights, any inconsistencies between person and household weights may cause problems for users. Second, any weighting method needs to take the size of counties into consideration, since counties vary tremendously in size, which affects the size of weighting cells.
In addition, there are other issues to examine. First, as the ACS accumulates enough data over time, it might be possible, using some simple time-series techniques, to model mail response (and other kinds of considerations) and thereby identify unusual differential mail response patterns to determine when certain weighting methods might have advantages.5
Second, the variable monthly sampling weights should be checked for an interaction of response mode with various characteristics, e.g., the vacancy rate. There is some evidence of the following situation: if a housing unit is vacant, it obviously will not respond by mail or telephone; it will be interviewed 2 months later with CAPI, at which point it may be occupied. However, if it had been occupied initially, there might have been a response, so there is a potential bias. The same holds for any characteristic that differs between movers and nonmovers. The variable monthly sampling weight factor was an attempt to address this problem; its success is not clear.
Third, there is the possibility of different responses to the ACS for race and ethnicity questions. The calibration that is discussed in Chapter 7 should provide information as to the degree of this problem.
Fourth, weighting rules do focus on reducing bias without explicitly considering possible increases in variance. If these estimates are going to be used as inputs to nonlinear allocation formulas, the work in Chapter 4 indicates that with a high variance even unbiased estimates can result in strongly biased allocations. Therefore, the appropriate emphasis may be less concern with bias and more with getting good estimates.
Fifth, the county-level controls for race and Hispanic ethnicity are not now very reliable. Implicit in the methodology that produces these estimates is the assumption that anything at lower levels of aggregation within a state changes in the same way as the entire state. Therefore, information since 1990 at substate levels is not used to produce the estimates. Either the ACS should be used to improve these estimates or administrative record information at that level of aggregation should be incorporated into the estimates.
FINAL POINTS
Some alternative approaches to weighting and imputation methods were examined in comparison to the current plans put forward by the Census Bureau. An examination of these alternatives on ACS pilot data will determine which techniques to apply when ACS is fully implemented.
5 It was mentioned that the average ACS nonresponse rate was around 2 percent, probably because response is currently required by law and therefore it is unlikely to matter how one treats nonresponse if the mandatory status is retained.