Appendix B
Logistic Regression for Modeling Match and Correct Enumeration Rates

It is reasonable to suspect that match rates and correct enumeration rates, in addition to being a function of the variables used to define the accuracy and coverage evaluation (A.C.E.) poststrata in 2000, may also vary across the local census offices used to manage the workload in the census. The local office identifiers are on the A.C.E. research database, but they were not included in the six logistic regression models described above or the study by Schindler (2006).

Local census office indicator variables might be predictive of match and correct enumeration rates because factors that are particular to small areas could affect ease of enumeration. For example, local economic conditions and the expertise and capabilities of local census office administrators could vary. Because of the large number of local census offices (more than 500) and the limited amount of data for each, these effects are more naturally represented as random effects. By including these random effects in the logistic regression models, the Census Bureau could estimate the effects of individual offices on match and correct enumeration rates and obtain valid estimates of the contribution of variability across offices to uncertainty about coverage rates in each area.
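To make the random-effects idea concrete, the following minimal sketch simulates match outcomes under a random-intercept logistic model. Nothing in it comes from the A.C.E. data: the function name `simulate_office_match_rates`, the baseline log-odds of 2.0, the office standard deviation of 0.5, and the 40 cases per office are all hypothetical values chosen for illustration.

```python
import math
import random


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_office_match_rates(n_offices=500, cases_per_office=40,
                                baseline_logit=2.0, office_sd=0.5, seed=0):
    """Simulate true and observed match rates for offices with random intercepts.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    true_rates, observed_rates = [], []
    for _ in range(n_offices):
        # The office effect is a draw from a normal distribution (the random effect).
        office_effect = rng.gauss(0.0, office_sd)
        p = inv_logit(baseline_logit + office_effect)
        matches = sum(rng.random() < p for _ in range(cases_per_office))
        true_rates.append(p)
        observed_rates.append(matches / cases_per_office)
    return true_rates, observed_rates


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


true_rates, observed_rates = simulate_office_match_rates()
# Observed rates combine real office-to-office variation with binomial noise.
print("variance of true office rates:    ", round(variance(true_rates), 5))
print("variance of observed office rates:", round(variance(observed_rates), 5))
```

With so few cases per office, a large part of the spread in observed rates is sampling noise. This is one reason to model the office effects as draws from a common distribution rather than to estimate more than 500 separate indicator coefficients.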

Malec and Maples (2005) explored this approach by adding local area random effects to a synthetic estimation model and then measuring the variance component of these random effects for local census offices. The ultimate objective of this approach is a small-area estimation methodology that would provide a compromise between synthetic estimation and a design-based estimator for each local office area.
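One common textbook way to form such a compromise is a variance-weighted composite of the two estimates. The sketch below illustrates that general idea only. It is not the estimator used by Malec and Maples (2005), and the function name `composite_estimate` and all numeric inputs are hypothetical.

```python
def composite_estimate(direct, direct_variance, synthetic, between_area_variance):
    """Shrink a design-based (direct) estimate toward a synthetic estimate.

    The weight on the direct estimate grows as its sampling variance
    shrinks relative to the between-area (random-effect) variance.
    """
    if direct_variance < 0 or between_area_variance < 0:
        raise ValueError("variances must be non-negative")
    total = direct_variance + between_area_variance
    if total == 0:
        # Arbitrary convention for the degenerate case of two zero variances.
        return direct
    weight = between_area_variance / total
    return weight * direct + (1.0 - weight) * synthetic


# Hypothetical office with a noisy direct match rate: the result leans
# toward the synthetic estimate.
noisy = composite_estimate(direct=0.90, direct_variance=0.004,
                           synthetic=0.94, between_area_variance=0.001)

# Hypothetical office with a precise direct match rate: the result leans
# toward the direct estimate.
precise = composite_estimate(direct=0.90, direct_variance=0.001,
                             synthetic=0.94, between_area_variance=0.004)

print(round(noisy, 3), round(precise, 3))
```

The between-area variance plays the role of the variance component for local census offices: the larger it is, the less the synthetic estimate can be trusted for any single office.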



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

COVERAGE MEASUREMENT IN THE 2010 CENSUS

Because of the complex design of A.C.E.'s postenumeration survey (weighted cases within samples of block clusters), many of the empirical correct enumeration rates and match rates used in Malec and Maples's model are more variable than the nominal sample sizes would indicate. To account for the extra variability, Malec and Maples (2005) used a pseudo-likelihood approach with effective sample sizes estimated by the bootstrap. In this approach, both logistic regression models (for match rate and correct enumeration rate) have the following generic form:

$$\log\left(\frac{p_{i,k}}{1 - p_{i,k}}\right) = \beta_i + \mu_k + \alpha_{i,k},$$

where $\beta_i$ is the fixed effect for membership in the $i$th poststratum, $\mu_k$ is a random effect for the $k$th local census office, and $\alpha_{i,k}$ is model error. Furthermore,

$$\mu_k \sim N(0, \Sigma) \quad \text{and} \quad \alpha_{i,k} \sim N\left(0, \gamma^2_{ce(i)}\right),$$

where $ce(i)$ is an index representing the collapsing of the poststrata into 11 or 8 cells, depending on whether the model is applied to the E-sample or the P-sample. Malec and Maples (2005) were able to estimate the large number of parameters in these models using Bayesian simulation.

This research suggests that inclusion of small-area effects could substantially improve coverage estimates. Several questions remain: how best to treat the complex sample design, how many random effects can be included and at what level of aggregation, the best way to estimate the model parameters, and how the model fit should be assessed. The panel is impressed with this high-caliber research that addresses an important issue in coverage modeling; further work in this area would be very valuable.

Mulry et al. (2005) examined the following anomalous results in A.C.E. More than 5 percent of incorporated places¹ in 2000 had an estimated net overcount of greater than 5 percent, and 0.5 percent had a net overcount of greater than 10 percent. This result runs counter to findings from the 1980 and 1990 coverage measurement programs on the potential net overcoverage due to true erroneous enumerations and duplications. In contrast with 2000, only 0.1 percent of places had an estimated net undercount of greater than 5 percent, and nationally, the degrees of overcoverage and undercoverage were of essentially the same magnitude. There is a concern that the lack of balance between designated erroneous enumerations and designated omissions may be due to the use of proxy status and the type of census return as poststratification variables for the E-sample but not for P-sample computations.

¹ See http://www.census.gov/dmd/www/ACEREVII_PLACES.txt for a list.

To examine this further, Mulry et al. (2005) demonstrated that by using proxy status in the E-sample poststratification, there were 91 places with a net overcount of more than 10 percent; however, if it is assumed that there was no error for proxy enumerations, there were only 16 places with net overcounts of more than 10 percent. Furthermore, if one assumes that there were no errors for proxy enumerations and no errors for late nonmail returns, there were only four places with a net overcount of more than 5 percent. Given this and given that 27 percent of proxy enumerations had insufficient information for matching and follow-up, it is clear that proxy enumerations could contribute to substantial balancing error. The Census Bureau concluded that proxy enumerations contributed to these anomalous findings, but that they were not the only cause.

Related research carried out by Spencer (2005) examined the quality of synthetic estimates for block clusters based on A.C.E. Revision II estimates, either using 938 E-sample poststrata and 648 P-sample poststrata or using the same 648 poststrata for the E- and P-samples. His findings, in which the standard of comparison was either (a) the direct dual-systems estimate or (b) the census count plus people found in the P-sample who were omitted in the census for each block cluster, suggested that coarser but consistent poststrata may have provided more accurate estimates of net coverage error than finer poststratifications based on different E- and P-sample stratifications. However, for large blocks with proxy rates greater than 10 percent, the finer and inconsistent poststrata performed better.

The specific model form for logistic regression is

$$\log\left(\frac{p}{1 - p}\right) = X\beta.$$

As described in the literature on generalized linear models, this represents a specific relationship between the mean of a random variable and a linear combination of predictors, called the link function,

$$\log\left(\frac{y}{1 - y}\right).$$

Research on the best link function is continuing at the Census Bureau, with possibilities that include logit, probit, loglog, and robit. An incorrect link function would result in poor extrapolations to situations that do not occur in the P- or E-sample data, unnecessary interaction terms in the model, and other typical results of lack of fit. The panel suggests that the Hosmer-Lemeshow goodness-of-fit test may help the Census Bureau choose the appropriate link function: that test will indicate whether an alternative link function would provide a better fit to the data.

Several complications would remain to be addressed.

Software for Alternate Link Functions. If it is discovered that an alternate link function is preferred, a modest amount of software development might be required to implement it. However, this should be relatively straightforward in either SAS or R, two standard statistical software systems that the Census Bureau uses.

Loss Function or Objective Functions for Assessing Fit of Models. Another complication is that the current loss function underlying the fitting of the coefficients of these logistic regression models is implicit in the separate likelihood equations for the two models and is therefore somewhat disconnected from the ultimate goal, which is to predict the population size or, what amounts to the same thing, net coverage error. It may be that the ultimate goal can be better represented by weighting the likelihood equations to take this modified objective function into account. The Census Bureau has done some work in this direction, and we support this research and its implementation if it is found to provide preferred estimates.

Measurement Error. Census data are subject to measurement error, and these errors will have deleterious effects on the application of logistic regression models. If the measurement error is unrelated to the outcome (match status or correct enumeration status), the effect on the data is the attenuation of relationships. In other words, the predictors will not be as effective as they would be without the measurement error. But if the measurement error is related to the outcomes, the effect could be much more complicated, including the introduction of severe biases.
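The attenuation described under Measurement Error can be illustrated with a small calculation, offered here as a sketch under stated assumptions. It considers a single binary predictor that is misrecorded with a fixed probability, independent of the outcome, and computes the log odds ratio that would be observed. The function names and all probabilities are hypothetical and are not drawn from census data.

```python
import math


def log_odds_ratio(p_y_given_x1, p_y_given_x0):
    """Log odds ratio comparing the outcome probability in two groups."""
    odds1 = p_y_given_x1 / (1.0 - p_y_given_x1)
    odds0 = p_y_given_x0 / (1.0 - p_y_given_x0)
    return math.log(odds1 / odds0)


def observed_log_odds_ratio(p1, p0, prevalence, error_rate):
    """Log odds ratio seen when a binary predictor is misrecorded.

    p1, p0      -- true outcome probabilities when the predictor is 1 or 0
    prevalence  -- true share of cases with the predictor equal to 1
    error_rate  -- chance the recorded predictor is flipped, independent
                   of the outcome (error unrelated to the outcome)
    """
    # Joint shares of (recorded predictor value, true predictor value).
    rec1_true1 = prevalence * (1.0 - error_rate)
    rec1_true0 = (1.0 - prevalence) * error_rate
    rec0_true1 = prevalence * error_rate
    rec0_true0 = (1.0 - prevalence) * (1.0 - error_rate)
    # The outcome probability within each recorded group is a mixture.
    q1 = (rec1_true1 * p1 + rec1_true0 * p0) / (rec1_true1 + rec1_true0)
    q0 = (rec0_true1 * p1 + rec0_true0 * p0) / (rec0_true1 + rec0_true0)
    return log_odds_ratio(q1, q0)


# Hypothetical match probabilities of 0.95 and 0.80 in the two true groups.
true_effect = log_odds_ratio(0.95, 0.80)
for error_rate in (0.0, 0.1, 0.3):
    seen = observed_log_odds_ratio(0.95, 0.80, prevalence=0.5,
                                   error_rate=error_rate)
    print(error_rate, round(seen, 3), "true:", round(true_effect, 3))
```

As the error rate rises, the observed coefficient moves toward zero, which is the attenuation in question. The calculation does not cover error that depends on the outcome, where the direction of the bias is not predictable in general.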