ADDITIONAL COVERAGE EVALUATION RESEARCH USEFUL FOR CENSUS ERROR REDUCTION
The Census Bureau’s research program, described in Chapter 3, will lead to important improvements in coverage measurement in 2010 for assessment of components of coverage error. This chapter discusses other research activities that are potentially valuable but not part of current plans. Greater use can be made of data from the 2000 Accuracy and Coverage Evaluation (A.C.E.), and data captured in 2010 can be structured to facilitate exploration of the relationship between census component coverage error and specific census component processes. While some of these suggestions might be implemented in time for the 2008 dress rehearsal to provide guidance for the 2010 census, the design of the latter is relatively firm, and therefore most of the benefits would not be realized until the 2020 census, though plans for implementing these ideas would need to be made prior to the 2010 census to collect and save the requisite information.
We begin by discussing the existing research literature on personal and household factors and census processes associated with components of coverage error. We argue that a key product of a census coverage measurement (CCM) program with the objective of census improvement is a database that jointly represents census processes; person, household, and area characteristics; and census component coverage error assessments. This database can support analyses of factors associated with census component coverage error, which would advance identification of census processes that can be improved. We then discuss how the Census Bureau can better use the 2000 data both to guide design of this database and to help complete the design of the 2010 coverage measurement program. We conclude with some thoughts about planning for coverage measurement in 2010 and how to report coverage error to users.
THE RESEARCH LITERATURE ON PERSON AND HOUSEHOLD CHARACTERISTICS AND CENSUS PROCESSES ASSOCIATED WITH COVERAGE ERROR
Demographic analysis and dual-systems estimation for the 1980, 1990, and 2000 censuses were not designed to identify characteristics of individuals, households, or areas that were associated with high or low rates of components of census coverage error, or processes responsible for these errors. These methods are limited for that purpose for at least two reasons.
First, demographic analysis and dual-systems estimation measure net coverage error, which obscures many offsetting census omissions and erroneous enumerations. Second, these coverage measurement programs only disaggregate coverage error by a limited set of variables: demographics (age, sex, race, ethnicity), some modest geographic detail (census region), and other variables that measure urban/rural, mail return rate (high/low), and owner/nonowner status. This is true for demographic analysis since it is limited to the information in the record systems utilized. The level of detail in dual-systems estimation has been limited by the restricted number of poststrata used and therefore to variables included in the poststratification. While many of these variables are associated with reasons for census coverage errors, relatively modest differences in net undercoverage rates between many poststrata in 1990 and in 2000 suggest that many of these associations are themselves modest. Furthermore, none of these factors has been chosen on the basis of likely links to deficient census component processes.
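As an illustration of the first limitation, the following minimal sketch shows how two domains with identical net undercoverage can differ greatly in gross error. The function name and all figures are hypothetical and chosen only for arithmetic convenience; the sketch also simplifies by folding duplicates and enumerations in the wrong location into a single erroneous-enumeration count.

```python
def coverage_summary(true_population, omissions, erroneous_enumerations):
    """Summarize net and gross coverage error for one hypothetical domain."""
    # The census count misses the omissions and wrongly adds the erroneous enumerations.
    census_count = true_population - omissions + erroneous_enumerations
    # Net undercount lets omissions and erroneous enumerations offset each other.
    net_undercount = true_population - census_count
    # Gross error counts every coverage error, with no offsetting.
    gross_error = omissions + erroneous_enumerations
    return {
        "census_count": census_count,
        "net_undercount": net_undercount,
        "gross_error": gross_error,
    }


# Two made-up domains of equal size.
domain_a = coverage_summary(100_000, omissions=1_000, erroneous_enumerations=0)
domain_b = coverage_summary(100_000, omissions=6_000, erroneous_enumerations=5_000)

print(domain_a["net_undercount"], domain_a["gross_error"])
print(domain_b["net_undercount"], domain_b["gross_error"])
```

Both domains show the same net undercount, yet the second contains eleven times as many coverage errors, which is the information a net measure cannot reveal.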
Since the past two censuses conducted coverage measurement primarily to support adjustment, it is commendable that the Census Bureau has also devoted substantial resources to the study of factors associated with census coverage error. Studies of reasons for census omissions include several participant observation studies, first in the 1970 census (Valentine and Valentine, 1971), then during the 1986 Test of Adjustment Related Operations (e.g., Garcia-Parra, 1987), the 1988 dress rehearsal (Martin, Brownrigg, and Fay, 1990), and the 1990 census (Ellis, 1995). In addition, the Census Bureau supported ethnographic studies during the 2000 census (de la Puente, 2004), as well as the 1993 Living Situation Survey, which assessed response to a variety of residence and household composition cues (see, e.g., Martin, 1999). These studies identified person- and household-level characteristics associated with the misinterpretation of the census residence rules or with noncooperation with the census, which might be due to mistrust of government or fear of exposure of illegal behavior (e.g., Brownrigg and de la Puente, 1993; Bates and Gerber, 1998; Martin, 1999).
More quantitative studies include Fein (1990), who used logistic regression to identify factors associated with census undercoverage, and studies (e.g., Dillman, Treat, and Clark, 1994) of effects of mail presentation on census mail response (and hence potential undercoverage). Analyses by Ericksen et al. (1991) suggest that census undercoverage was greater in areas with low mail response rates, high crime rates and rampant drug use, or high rates of irregular housing, for individuals with low levels of English literacy or unfamiliarity with surveys (the poor and the less well educated), in housing units that share a common address or are likely to be omitted from the census Master Address File for other reasons, and households that include distant relatives and nonrelatives. Ericksen et al. (1991) also pointed out that coverage improvement programs, in particular those more distant from Census Day, were associated with a high rate of census coverage error, especially erroneous enumerations.
The less extensive literature addressing reasons for whole-household omissions (e.g., Childers, 1992; Moriarity and Childers, 1993; and in particular Ruhnke, 2003) suggests that there are substantial problems in enumerating households in small, multiunit buildings.
There has also been some research on factors associated with erroneous enumerations and duplications, with a good example being work on factors associated with duplicates in the 2000 census, especially with respect to group quarters (see, e.g., Feldpausch, 2001; Fay, 2004; Mule, 2001, 2002).
While the research literature that has been only touched on here is considerable, the reasons for census omission and erroneous enumeration still remain poorly understood, as do the census component processes that would benefit from modification to reduce their frequency of occurrence. For example, a recent National Academies study of census residence rules (National Research Council, 2006) reported that little is known about the extent to which the following types of individuals were missed, duplicated, or erroneously included in the census: people with multiple residences and highly mobile populations (including snowbirds and sunbirds, modern nomads, commuter workers and people in commuter marriages, and migrant farm workers), individuals in complex household structures (including children in joint custody, cohabiting couples, and recent immigrants), linguistically isolated persons, people in long-term-stay hotels and motels, people dislocated by disasters, and people residing in unusual housing stock.
The extent to which this research literature has directly motivated changes in census processes is unclear, but it is probably relatively limited, given the nonspecificity of the information collected. However, as mentioned in Chapter 1, some coverage improvement programs were added in 1980 and 1990 due to findings from demographic analysis and dual-systems estimation on the high differential rate of omission of young adult black men, and a number of the design changes for 2010 were consequences of information collected by A.C.E. in 2000.
INTEGRATING CENSUS PROCESS DATA AND PERSON, HOUSEHOLD, AND AREA CHARACTERISTICS WITH CENSUS COVERAGE ERROR DATA
Each stage of the decennial census consists of a number of alternative component processes. For example, there are a number of different ways in which an address for a housing unit can be added to the Master Address File. Also, various areas of the United States are initially enumerated using mailout-mailback, update list-leave, or list-enumerate (and other less common processes). There are various stages of nonresponse follow-up and coverage follow-up. Alternative ways of being enumerated include the Be Counted program (which allows people to provide census data if they believe they were missed in the census), telephone questionnaire assistance, and processes that help households obtain foreign language questionnaires in the mail or receive other forms of language assistance, including actual enumeration. Very different techniques are used to enumerate people living in various types of institutions or other group quarters. This outline is only a hint at the many parts of a census process that is in total enormously complicated. For more details, see National Research Council (2004b:Chapter 4).
As a result, a given household might take any of a number of paths through this census process “tree” to arrive at either a proper enumeration or a coverage error. The path depends on various characteristics of the household and its occupants, for example, the type and location of the housing unit, how complicated the relationships of the residents are, and their interest in cooperation. Recording the census process path taken and the corresponding person, household, and area characteristics is therefore crucial to understanding what factors may be associated with census coverage error.
This argument points to the need for a database that represents the census processes used to enumerate housing units, characteristics of the persons and housing unit and area, along with the assessment of correctness or type of coverage error represented by these cases. The 2006 census test attempted to collect such data, showing the Census Bureau’s interest in determining the value of such a database. If a database can be created that contains this information, properly linked, statistical models can be developed that are likely to be very effective in identifying those combinations of characteristics and processes that jointly result in higher rates of census coverage errors. As mentioned in Chapter 3, searching for factors associated with census errors can be regarded as a discriminant analysis problem, since one has a number of individuals whom the census did or did not miss, or did or did not duplicate, or did or did not erroneously include, or did or did not enumerate in the proper geographic area (and there could be several definitions of proper area), along with many potential explanatory factors.
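Before any discriminant or regression model is fitted, a first look at such a linked database could simply tabulate coverage error rates for each combination of census process and housing characteristic. The following is a minimal sketch under stated assumptions: the record layout, the category labels, and every data value are invented for illustration and do not describe any actual Census Bureau file.

```python
from collections import defaultdict

# Hypothetical linked records:
# (enumeration process, housing unit characteristic, coverage outcome)
records = [
    ("mailout-mailback", "single-unit", "correct"),
    ("mailout-mailback", "single-unit", "correct"),
    ("mailout-mailback", "single-unit", "correct"),
    ("mailout-mailback", "small multiunit", "correct"),
    ("mailout-mailback", "small multiunit", "omission"),
    ("nonresponse follow-up", "small multiunit", "omission"),
    ("nonresponse follow-up", "small multiunit", "duplicate"),
    ("nonresponse follow-up", "small multiunit", "correct"),
]


def error_rates(linked_records):
    """Return the share of records with any coverage error, per (process, characteristic)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [errors, total]
    for process, characteristic, outcome in linked_records:
        cell = counts[(process, characteristic)]
        cell[1] += 1
        if outcome != "correct":
            cell[0] += 1
    return {key: errors / total for key, (errors, total) in counts.items()}


rates = error_rates(records)
# List the combinations with the highest error rates first.
for key, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(key, round(rate, 2))
```

Cells with unusually high rates would then be natural candidates for the more formal modeling described above, provided the sample in each cell is large enough to support it.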
THE POTENTIAL FOR IMPROVED ANALYSES OF 2000 CENSUS DATA
The evaluations following the 2000 census of the various census processes and A.C.E. usually did not combine information on census processes with detailed information on person, household, and area characteristics, beyond the factors used for the A.C.E. poststratification. The master trace sample (see National Research Council, 2004a) was created using 2000 census process data to provide at least a portion of the analysis capability outlined here, but its value was limited since it did not include information from A.C.E. on census coverage errors. (For details, see Hill and Machowski, 2003.) The A.C.E. data were carefully analyzed to evaluate the quality of the various sets of adjusted counts that were considered for important uses between 2001 and 2003. However, those analyses were directed at evaluating the reliability of estimates of net error, not at assessing predictors of components of census coverage error.
There are many obstacles to further analysis of the 2000 data. Many census coverage errors in 2000 were not errors under the stricter definitions given in Chapter 2.
Furthermore, the 2000 census data are now six years old and therefore may not be accessible or fully documented.
Nonetheless, analyses of the 2000 census and A.C.E. data with the current objective of census component coverage error measurement in mind might provide insights about potential modifications in census or census coverage measurement processes, and they might identify appropriate topics for further research in the 2010 census. A few possibilities include the following:
Census omissions identified by A.C.E. could be matched to the merged E-StARS administrative records database to assess characteristics that predict omissions.
Addresses of whole-household omissions could be matched to the 2000 Master Address File database, which includes a history of additions to and deletions from the Master Address File as it was created and improved. Analyses of the matched database could help determine whether the addresses for these missed housing units were ever on the Master Address File and were dropped for some reason.
The 2000 A.C.E. data might be helpful in estimating how large the coverage follow-up interview might need to be in 2010.
These and other analyses certainly have problematic aspects, and the findings would not be confirmatory, only suggestive. However, these analyses would not require any fieldwork, and they might provide important information. Also, it is true that given the innovative plans for the 2010 census (described in Chapter 3), some deficiencies discovered in the 2000 census might no longer be relevant for 2010. Nonetheless, in many respects the 2010 design is quite close to that used in 2000, and more comprehensive evaluation of the latter would provide a better basis for understanding and evaluating the outcomes of the 2010 census.
Finally, using the 2000 census data in this way will help to clarify what data need to be saved from the coverage evaluation program, various management information systems, and other data associated with the execution of the 2010 census. It will also help to understand how best to structure a database to support analysis of census component coverage errors looking toward 2020.
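The second possibility listed above, matching the addresses of whole-household omissions to the Master Address File history, might be sketched as follows. This is only an illustration: the history layout, the address identifiers, and the operation names ("listing", "canvass", "verification") are placeholders assumed for the example, not the actual structure of the 2000 Master Address File database.

```python
# Hypothetical MAF history: address ID -> ordered list of (operation, action),
# where action is "add" or "delete".
maf_history = {
    "A1": [("listing", "add"), ("canvass", "delete")],
    "A2": [("listing", "add")],
    "A3": [("listing", "add"), ("canvass", "delete"), ("verification", "add")],
}


def classify_missed_address(address_id, history):
    """Describe what the MAF history says about a missed housing unit's address."""
    events = history.get(address_id)
    if not events:
        return "never on MAF"
    last_operation, last_action = events[-1]
    if last_action == "delete":
        # The address was on the file at some point and was later removed.
        return f"dropped by {last_operation}"
    return "on final MAF"


# Addresses of whole-household omissions (made-up identifiers).
missed_addresses = ["A1", "A2", "A4"]
results = {a: classify_missed_address(a, maf_history) for a in missed_addresses}
print(results)
```

Tallying the resulting categories by operation would indicate whether particular address list operations tended to remove valid addresses.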
LOOKING TOWARD 2010
The approach we propose argues that the Census Bureau, in order to satisfy its own goals for coverage measurement in 2010, needs to retain the necessary data from the 2010 census to support analysis of possible relationships between census component coverage error and census processes. The data that are retained should include information from the CCM program in 2010 on omissions, erroneous enumerations, enumerations in the wrong location, and duplicates, as well as the characteristics of the household and the local area in question. In addition, data from the various census processing files should be included indicating the specific census processes that were used to enumerate (or not enumerate) a given housing unit. A properly structured database would link the information on census processes, people, housing units, and local areas, and the information on census coverage error, to support analysis of the combinations of factors associated with census component coverage error. Included in this database would be information to determine whether a person was a P-sample correct enumeration, omission, erroneous enumeration, duplicate, or an enumeration in the wrong geographic area. Similarly, also included would be information to ascertain whether a person was a census correct enumeration, census omission, erroneous enumeration, duplicate, or an enumeration in the wrong geographic area. Finally, it would be possible to determine, if a person was omitted in either the P-sample or the census, whether the whole household was also omitted.
Data from the various management information systems also could be folded in to represent aspects of the quality of the application of census procedures in a given area. Finally, other contextual information about each housing unit and its residents can also be folded in, possibly from the American Community Survey and E-StARS.
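One way to picture the person-level portion of such a database is sketched below. It is a hypothetical layout, not a specification: the class names, field names, and identifiers are assumptions made for illustration, and a real design would carry far more detail on processes, housing units, and areas.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CoverageStatus(Enum):
    CORRECT = "correct enumeration"
    OMISSION = "omission"
    ERRONEOUS = "erroneous enumeration"
    DUPLICATE = "duplicate"
    WRONG_LOCATION = "enumeration in the wrong geographic area"


@dataclass
class PersonRecord:
    person_id: str
    household_id: str
    census_status: CoverageStatus
    p_sample_status: Optional[CoverageStatus]  # None if the person is outside the P-sample
    process_path: List[str]  # census processes that touched this person's housing unit


def whole_household_omitted(household_id, people, source="census_status"):
    """True if every known member of the household is an omission in the given source."""
    members = [p for p in people if p.household_id == household_id]
    return bool(members) and all(
        getattr(p, source) is CoverageStatus.OMISSION for p in members
    )


# Made-up records: household h1 is missed entirely, h2 only in part.
people = [
    PersonRecord("p1", "h1", CoverageStatus.OMISSION, CoverageStatus.CORRECT, []),
    PersonRecord("p2", "h1", CoverageStatus.OMISSION, CoverageStatus.CORRECT, []),
    PersonRecord("p3", "h2", CoverageStatus.OMISSION, CoverageStatus.CORRECT, ["mailout-mailback"]),
    PersonRecord("p4", "h2", CoverageStatus.CORRECT, CoverageStatus.CORRECT, ["mailout-mailback"]),
]

print(whole_household_omitted("h1", people))
print(whole_household_omitted("h2", people))
```

Keeping the census and P-sample statuses as separate fields on the same record is what allows the whole-household question to be asked of either source.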
The structure of this database is crucially important, but it lies outside the panel’s expertise to provide its specifications. We are sympathetic with the challenge, since it is a complex undertaking to determine what data to include and how to link it to other related data. For example, it is likely to be useful to include the detailed information on the history of the formation of the Master Address File (including all of the various operations that can add or remove addresses from the list), the totality of results from nonresponse follow-up (including how many attempts at enumeration were made and whether the ultimate response was a proxy enumeration), the results from the coverage follow-up interview, the degree of item and unit nonresponse, and the various stages of matching of the postenumeration survey to the census. Representing this complexity will not be a simple matter.
Furthermore, it is unclear how much from the census processes can be saved in real time on a production basis. If constraints dictate the need to save data on a sample basis from various sources, it is unclear how that will reduce the utility of the database for answering various types of questions. We are uncertain as to the feasibility of data capture, and we hope to say more on this topic in our final report. However, the key is to try to anticipate the type of analyses that would be useful to carry out and then determine a database structure and contents that facilitate carrying out those analyses.
Once the database is available, two types of analyses should be carried out. First, many hypotheses generated from reports from the field, from census tests, and other sources can be tested using these data. For example, one might suppose that housing units that have been newly constructed are often missed in the Master Address File, or one might assume that linguistic isolation is a major cause of census undercoverage, or one might suppose that children in joint custody arrangements and people in nursing homes are often missed and often duplicated. These types of questions will be easier to address with a properly structured database. Second, in addition to these confirmatory studies, the Census Bureau should also carry out exploratory studies, examining the data for unanticipated interesting relationships between census coverage error and census processes that might indicate a census process that was not effective for a small area or subpopulation of individuals or housing units.
What is important in this analysis is practical significance. The appropriate metric is how many census coverage errors could potentially be corrected through a modification of the relevant census process, both nationally and for important geographic and demographic domains.
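This metric can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that one has error counts by domain and census process and a guess at the fraction of those errors that a modified process might avoid; the domain names, process names, counts, and reduction fractions are all hypothetical.

```python
from collections import defaultdict


def correctable_errors(error_counts, assumed_reduction):
    """Estimate coverage errors a process change might correct, nationally and by domain.

    error_counts: {(domain, process): number of coverage errors}
    assumed_reduction: {process: assumed fraction of errors a modified process avoids}
    """
    national = defaultdict(float)
    by_domain = defaultdict(float)
    for (domain, process), errors in error_counts.items():
        correctable = errors * assumed_reduction.get(process, 0.0)
        national[process] += correctable
        by_domain[(domain, process)] += correctable
    return dict(national), dict(by_domain)


# Made-up inputs.
error_counts = {
    ("urban", "process A"): 4000,
    ("rural", "process A"): 1000,
    ("urban", "process B"): 8000,
}
assumed_reduction = {"process A": 0.5, "process B": 0.25}

national, by_domain = correctable_errors(error_counts, assumed_reduction)
print(national)
print(by_domain)
```

In this made-up case process B generates more errors, but modifying process A would correct more of them nationally, which is the comparison that matters for setting priorities.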
While promoting the benefits of the construction of this analytic database, we are aware that feedback loops linking census component coverage errors to specific components of census processes are always going to be somewhat limited in their ability to pinpoint specific problematic components and to suggest alternatives. For example, knowing that there were many erroneous Be Counted enumerations in big cities is not extremely helpful toward identifying an alternative process that would reduce that error, since these cases tend to be problematic under the best circumstances. In addition, some of the situations discovered may be for such small populations that the census coverage measurement program will not have enough observations to support analysis. We intend to discuss in more detail in our final report how a feedback loop for improving census methodology might operate and what can be done to make it more effective.
ASSESSMENT OF CENSUS QUALITY WITH NEW METRICS
In Chapter 2, we suggest that coverage measurement results should be reported to inform users as to the quality of census counts. The appropriate summarization is not specified, except that the Census Bureau needs to provide assessments of net undercoverage for a variety of geographic and demographic domains. This has been accomplished in the previous two censuses with the release of information on undercoverage for census poststrata. With the new emphasis on four types of census component coverage error—rates of erroneous enumeration, duplication, enumerations in the wrong place (at various geographic resolutions), and omission—an important question is the extent to which users could benefit from having more local knowledge of these four types of errors and, if so, how should this be communicated?
It is unclear whether information on these rates for specific domains would be that useful to users, given their understandable interest in net error. Furthermore, what if the analysis of CCM data demonstrated several (nongeographic) predictor variables that were strongly associated with, say, omissions? Should the knowledge of these predictors be made available to users in some way? Should the communication be in the form of research reports without any sense of the amount of error for a given domain? This is another topic that the panel intends to consider for inclusion in the final report.
We urge the Census Bureau to initiate the development of a database that jointly represents person, household, and housing unit characteristics, census processes, and census component coverage error to facilitate the development of statistical models to help link census errors to census processes in need of improvement.
Recommendation 3: The Census Bureau should collect data in the 2010 census to support development of a database that links person, household, and housing unit characteristics, census processes, and the presence or absence of census component coverage error. This database should also represent coverage errors, including erroneous enumerations, enumerations in the wrong place, duplications, and omissions. The use of this database would better identify the sources of high rates of census component coverage error.
Finally, the panel realizes that the various research and development activities already started by the Census Bureau on contamination, KEs, identification of duplication, CCM forms and sample design, analysis of the 2006 and 2008 test results, etc., are challenging. Furthermore, the panel has made a number of suggestions for further research, especially concerning the development of the logistic regression models, and we have suggested a new framework for analysis that will require additional staff resources. Given the importance of all of this research, which in essence is guiding the development of a feedback loop to improve census-taking over time, the panel thinks that the resources currently devoted to this effort are insufficient.
Therefore, we strongly advise the Census Bureau to provide the coverage measurement group with sufficient resources to carry out its current research program, its planning activities regarding the dress rehearsal and the 2010 census, and the activities listed in this report, including searching for covariates for the logistic regression models on net coverage error, greater targeting of the design of the census coverage measurement survey, further development of the small-area random effects modeling of CCM match rate and census correct enumeration rate, use of administrative records in coverage improvement and coverage measurement, further analysis of A.C.E. data to assist in the design of the census and CCM in 2010, and creation of the database on individual and household characteristics, census component coverage error, and census processes to help diagnose reasons for census component coverage error. Unless this work is properly supported, the panel is concerned that resources will be insufficient to carry out the wide variety of research and planning activities needed in moving toward 2010.
Recommendation 4: Given the number of important research activities currently under way, the needed design of the coverage measurement programs in the dress rehearsal and in the 2010 census, and the additional research suggested by the panel, the Census Bureau should provide the coverage measurement group with sufficient resources to carry out its current research program, its planning activities regarding the dress rehearsal and the 2010 census, and the activities listed in this report.
This report provides an overview of the Census Bureau’s coverage plans for the 2010 census, along with some suggestions for additional work. In the panel’s final report, we hope to provide more direction on the following issues:
What data to save in 2010 to support the various coverage measurement models, including the feasibility of saving some data on a 100 percent basis, and the possible need for sampling from the output of some management information systems.
More work on the framework document looking into assumptions and estimation.
Random effects modeling for small-area estimation.
Variance estimation for synthetic estimation and related techniques.
Treatment of non-data-defined cases in logistic regression.
Allowable covariates in the logistic regression models (both in terms of balance issues for the E-sample and P-sample, and also due to Alho’s concern about using covariates related to census processes).
Sample design for the CCM survey, at both the state and substate levels.
The products to use to inform about census component coverage error.
Use of survey weights in logistic regression models.
Improvements in demographic analysis in 2010.
How to exploit 2000 data more for 2010 design.
Very generally, how to best operate a feedback loop for census improvement.
Issues that will arise in identifying duplicates in real time.
Finally, we also hope to consider the question of what a coverage measurement program entirely focused on measuring census component coverage error, including the use of administrative records, might look like.