Suggested Citation:"4 Data Processing and Analytic Issues." National Research Council. 2015. Realizing the Potential of the American Community Survey: Challenges, Tradeoffs, and Opportunities. Washington, DC: The National Academies Press. doi: 10.17226/21653.

4

Data Processing and Analytic Issues

Several aspects of data processing can affect the quality and usefulness of the data to users. This chapter covers four topics: the effects of the population controls, the effects of data review, the role of administrative records, and the role of small area estimation in the production of American Community Survey (ACS) estimates.

POPULATION CONTROLS

One of the major differences between the ACS and the decennial census long-form survey is the type of estimates available to serve as population controls as part of the weighting methodology. Since the long-form survey was administered as part of the decennial census enumeration, controls from the full count were used as a basis for controlling long-form estimates for small geographic areas, such as census tracts. In contrast, the ACS uses controls from the Census Bureau’s population estimates program at the county level by age, sex, race, and Hispanic origin, and at the subcounty level for the total population of incorporated cities and minor civil divisions (for those states that have those jurisdictions).
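To make the idea of a population control concrete, the following minimal sketch scales the survey weights in a single cell (for example, one county-by-age-by-sex group) so that the weighted total matches an independent control total. It is an illustration only: the function name and the numbers are hypothetical, and the actual ACS weighting involves many more steps than this single ratio adjustment.

```python
def apply_population_control(weights, control_total):
    """Scale the weights in one cell so that they sum to control_total."""
    weighted_total = sum(weights)
    if weighted_total <= 0:
        raise ValueError("weights must sum to a positive number")
    # One ratio factor is applied to every weight in the cell.
    factor = control_total / weighted_total
    return [w * factor for w in weights]


# Hypothetical cell: four sample persons whose weights sum to 400,
# controlled to an independent population estimate of 500.
adjusted = apply_population_control([90.0, 110.0, 100.0, 100.0], 500.0)
print(round(sum(adjusted), 6))
```

Because every weight in the cell is multiplied by the same factor, the relative contribution of each respondent within the cell is unchanged; only the cell total moves to the control.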

Challenges Associated with the Current Population Controls

The controls for the ACS are created using the decennial census as a base, with components of change derived from vital statistics, other administrative records, and survey data. The Census Bureau’s Population Division works with Federal-State Cooperative for Population Estimates (FSCPE) agencies in each state to produce subnational population estimates. FSCPE agencies supply vital statistics and information about group quarters. They also review and comment on the estimates produced by the Census Bureau.

Changes of address on tax returns are used to create domestic migration rates, vital statistics are used to calculate the balance of births and deaths (natural increase), and the ACS is used to determine international migration. For the population aged 65 years and older, migration is determined from Medicare records. The Census Bureau’s controls for housing units are created with data on new construction and, for most areas, a demolition model based on data from the American Housing Survey. In both cases, for population and for housing, the base for controls is the decennial census. The areas for which housing and population controls are created by this method include larger counties and groups of smaller counties.
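A simplified way to write the component method described above is the standard demographic balancing equation. The notation below is ours, introduced only for illustration, and it omits details such as the separate treatment of the group quarters population.

```latex
P_t = P_0 + (B - D) + M^{\mathrm{dom}} + M^{\mathrm{int}}
```

Here P_0 is the decennial census base, B and D are births and deaths since the census (from vital statistics), M^dom is net domestic migration (from address changes on tax returns and, for ages 65 and older, Medicare records), and M^int is net international migration (from the ACS).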

For legal or political areas at the subcounty level, such as incorporated cities and towns and minor civil divisions, the Census Bureau estimates population using an allocation based on housing units. It is important to recognize that many large jurisdictions, such as New York City’s five counties or Clark County, Nevada, do not have legal-political geographic entities smaller than the county level that are recognized by the Census Bureau for much, if not all, of their populations. Thus, in these jurisdictions, controls for the ACS are set at the county level—both housing unit controls and controls for population by age, sex, race, and Hispanic origin.
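The housing unit allocation mentioned above can be pictured with a deliberately simple sketch. The source does not spell out the details of the Census Bureau's procedure, so the purely proportional rule, the function name, and the numbers below are all assumptions made for illustration.

```python
def allocate_by_housing_units(county_population, housing_units):
    """Split a county total across places in proportion to housing units."""
    total_units = sum(housing_units.values())
    if total_units <= 0:
        raise ValueError("county must have at least one housing unit")
    return {
        place: county_population * units / total_units
        for place, units in housing_units.items()
    }


# Hypothetical county of 1,000 people with two subcounty places.
shares = allocate_by_housing_units(1000, {"Town A": 300, "Town B": 100})
print(shares)
```

Under this rule a place's population estimate can be only as accurate as its housing unit count, which is one reason the treatment of demolitions and vacant units matters for the controls.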

The panel did not have the resources to conduct a comprehensive analysis of the accuracy of population controls below the county level, but it did examine the magnitude of discrepancies between decennial census and ACS counts for age-sex strata in a systematic selection of communities. For the ACS 3- and 5-year estimates, the controls are simple averages of the most recent population estimates. In particular, for the 2012 5-year ACS, these estimates include the intercensal estimates for 2008, 2009, and 2010 and the postcensal estimates for 2011 and 2012. Thus, the population estimates all make use of data from the 2010 census and therefore are not subject to the known deterioration of accuracy of county-level population estimates in years that are increasingly further from the last census (Albright, 2011; Yovell and Devine, 2013). Although it seems reasonable to expect that the average of these estimates would approximate the April 1, 2010, census estimate, as would the distribution of characteristics by age, sex, race, and Hispanic origin, comparisons for the selected communities, ranging from the size of an average census tract to cities with populations exceeding 100,000, show large differences in many age-sex cells: see specific examples in Appendix B. In some cases, there were discrepancies by factors as large as two.
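The comparison described above can be sketched in a few lines: average five annual population estimates to form a 5-year control for one age-sex cell, then express its disagreement with the census count as a factor. The function names and all numbers are hypothetical and are not taken from Appendix B.

```python
from statistics import mean


def multiyear_control(annual_estimates):
    """Simple average of the annual population estimates for one cell."""
    return mean(annual_estimates)


def discrepancy_factor(control, census_count):
    """Ratio of the larger value to the smaller one (1.0 means agreement)."""
    if control <= 0 or census_count <= 0:
        raise ValueError("both values must be positive")
    return max(control, census_count) / min(control, census_count)


# Hypothetical age-sex cell: annual estimates for 2008 through 2012,
# compared with a hypothetical 2010 census count of 400.
annual = [180, 190, 200, 210, 220]
control = multiyear_control(annual)
factor = discrepancy_factor(control, 400)
print(control, factor)
```

A factor of two, as in this made-up cell, corresponds to the largest discrepancies the panel reports finding.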

These discrepancies between the census base and the 2008-2012 ACS estimates for variables used as controls are an important point of concern. The ACS relies on these controls as a means of “grounding” the ACS estimates, using the census as the “gold standard,” in order to compensate for differences in coverage of groups, a critical factor in many places.

It would be beneficial to have a subcounty population estimation methodology that more closely reflects and builds on decennial census estimates for small geographic areas not covered by the present program, taking into account the time since the last census. Such an approach could be applied to both postcensal and intercensal controls.

RECOMMENDATION 12: The Census Bureau should conduct research on how the decennial census can be used for controls for the American Community Survey at a finer level of geographic resolution than the controls currently used on an annual basis.

Traumatic Events

If census controls are problematic during normal times, then the challenges become even greater when conditions in a local area are affected by catastrophic events. The usefulness of the ACS data in such situations is undermined unless the population estimates that serve as controls reflect the actual conditions and rest on a solid empirical foundation.

The terrorist attacks of September 11, 2001, rendered the decennial census data for a sizable portion of Manhattan obsolete. After the attacks, through a special arrangement with the Census Bureau, a data file from the 2005 ACS was acquired for a customized set of geographic areas in Manhattan. This file provided data based on the only representative sample available of the population post-9/11. Most important, New York City engaged the Census Bureau in the years that followed 9/11 by offering a housing unit–based population estimate for Manhattan as part of the population estimates challenge program. This helped maintain the integrity of the population controls as the ACS entered full implementation.

After the fall of 2005, when Hurricane Katrina hit the Gulf Coast, the Census Bureau, in cooperation with local authorities, made an effort to provide information on population and housing. Since the administrative data sources that the Census Bureau relies on to generate population estimates under normal circumstances (birth and death records, filings with the Internal Revenue Service [IRS], Medicare records, and state counts of group quarters populations) were either incomplete or too lagged in time to reflect post-disaster population conditions in 2006, special strategies to estimate the population were adopted. The Census Bureau used U.S. Postal Service national change-of-address records to track the movements of individual households and develop special January 2006 population estimates for affected counties.

In the cases of 9/11 and Katrina, the Census Bureau successfully took on challenges in concert with local governments in an effort to provide a better picture of population and housing conditions on the ground. In October of 2012, yet another traumatic event occurred when superstorm Sandy hit the East Coast of the United States. It is clear that there is a new and very challenging environment for the application of controls to estimates in the ACS. Thus, it is important to ask whether going forward, the population and housing unit controls used in the ACS will adequately reflect the actual population and housing stock in local communities in the face of traumatic events.

The 12 months of data collection in 2013 (which will be available in fall 2014) should yield valuable information for the areas affected by Sandy, both for public-use microdata areas and smaller geographic areas. The problem is that the controls for July of 2013 may not reflect the true population of the affected areas in New York and New Jersey. The most serious issue is likely to be the time lag for changes of addresses on income tax returns from the previous year; these data may be problematic for the creation of accurate domestic migration rates. Even without the time lag, many displaced residents may keep their original addresses when filing returns, leading to the erroneous conclusion that conditions have not changed. Moreover, the Bureau has not developed controls for housing that adequately reflect the role of housing demise—demolitions and the like—since a model from the American Housing Survey is currently used in most places to gauge demolitions, not actual permit data.

Confusion over the number of demolitions caused by superstorm Sandy, especially in communities on the New Jersey shore, may have caused the number to be understated, perhaps partly due to uncertainty over the status of housing units in situations where units may be standing but uninhabitable. Units awaiting repairs or demolition or just in abeyance because of requirements from the Federal Emergency Management Agency or flood insurance regulations add to the uncertainty. Further, re-occupancy of previously existing housing may not occur for a significant period of time or may not occur at all. New federal flood insurance requirements may make re-occupancy too costly for many homeowners. Thus, some communities may have a sizable number of homes in flux regarding their condition or occupancy status. This situation will be a challenge, not only for ACS follow-up operations, but also for estimation and weighting of housing units, particularly because the identification and estimation of vacant housing units in the ACS has been especially problematic, even under more normal circumstances (see, e.g., Albright, 2011; Cresce, 2012; Yovell and Devine, 2013).

When the ACS was conceived, environmental issues were a subject of discussion, but few could anticipate how these would grow in importance, with a series of weather events that brought the problem of climate change into daily news reports. Current expectations are that these weather events will become more frequent, with several “100-year storms” each decade. Establishing mechanisms in concert with local governments so that the estimates used as controls for the years following a traumatic event adequately reflect changing conditions might not be cost neutral, but there is an opportunity here for the Census Bureau to demonstrate the usefulness of the ACS by maximizing its potential to measure “current” conditions in communities across the nation. Again, the controls need to be accurate so that any remedial action can be based on accurate estimates of an event’s impact.

RECOMMENDATION 13: The Census Bureau should conduct research on the benefits of developing procedures and standards for the creation of controls for the American Community Survey that can be put in place in times of disasters or other disruptive events. The benefits of closer collaborations with state, local, and tribal governments should be explored for the development of controls in general and for crisis situations in particular.

DATA REVIEW

Most of the ACS data review is carried out after 1 year’s worth of data are edited and imputed and the data products (including those based on the multiyear datasets) are generated. This final review before the estimates are released is performed by subject-matter analysts, whose goals are to verify that the data edits have been correctly specified, that the microdata seem reasonable, that the data products have been correctly specified and rendered, and that the supporting documentation contains no errors.

The data review has four steps: (1) review of supporting documentation; (2) edit review; (3) data review, including a process for the 1-year data and a process for multiyear data; and (4) data product review, again including a process for the 1-year data and a process for multiyear data. Examples of specific actions for steps (2)-(4) are summarized in Box 4-1.

The review relies on a number of automated tools, but it is a massive and very resource-intensive operation. Furthermore, although many of the checks are automated, issues that are flagged as part of the automated process generally require manual review. Typically, the review of 1 year’s worth of 1-year data tables takes a large number of analysts more than a month. As would be expected, most of the errors identified during the review are associated with changes, such as new questions or products introduced since the previous year’s review. Because the ACS is still fairly new, the number of changes from year to year remains relatively high. Further revisions of survey content and products are likely to occur in coming years (albeit limited by the need for continuity of measurement in order to estimate trends). Thus the ACS staff may continue to be stressed by the burden imposed on the current quality control system.

BOX 4-1
Examples of Data Review Steps

Edit Review

  • Verify edit specifications.
  • Verify variable universes.
  • Review tallies.
  • Review matrix counts.
  • Review consistency among variables.
  • Review unweighted imputation rates.
  • Examine edited frequency distributions by allocation flag values.
  • Compare unweighted unallocated and allocated relative frequency distributions.
  • Compare unweighted current-year and prior-year relative frequency distributions.

1-Year Data Review

  • Compare current-year and prior-year summary distributions.
  • Compare current-year and prior-year derived measures.
  • Compare ACS estimates with other Census Bureau estimates.
  • Review weighted imputation rates.
  • Compare weighted unallocated and allocated relative frequency distributions.
  • Verify data product specifications.
  • Verify the programming of any new or modified data products.
  • Verify any new or modified table shells.

Multiyear Data Review

  • Review variable crosswalking and inflation adjustment on the unweighted multiyear microdata, done by the Census Bureau’s Population (POP) and Social, Economic, and Housing Statistics Divisions (SEHSD).
  • Review variable crosswalks and inflation adjustment on the unweighted multiyear microdata.
  • Review the multiyear core measures and report the results to branches in POP and SEHSD.
  • Review the coordination staff materials and decide whether to clear or perform additional review, done by POP and SEHSD.
  • Review the disclosure avoidance performed on the multiyear microdata, done by selected POP and SEHSD branches.
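One of the automated checks listed in Box 4-1, comparing current-year and prior-year relative frequency distributions, might look something like the following sketch. It is not the Census Bureau's tool: the function names, the response categories, and the 5-percentage-point threshold are assumptions chosen for illustration.

```python
from collections import Counter


def relative_frequencies(values):
    """Share of responses falling in each category."""
    if not values:
        raise ValueError("no responses to tabulate")
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def flag_distribution_shifts(current, prior, threshold=0.05):
    """Return categories whose share changed by more than threshold."""
    cur = relative_frequencies(current)
    pri = relative_frequencies(prior)
    flagged = {}
    for category in sorted(set(cur) | set(pri)):
        change = cur.get(category, 0.0) - pri.get(category, 0.0)
        if abs(change) > threshold:
            flagged[category] = change
    return flagged


# Hypothetical unweighted responses to a tenure item in two years.
prior_year = ["own"] * 50 + ["rent"] * 50
current_year = ["own"] * 70 + ["rent"] * 30
flags = flag_distribution_shifts(current_year, prior_year)
print(flags)
```

A check of this kind only raises a flag; as the text notes, flagged items still generally require manual review by an analyst.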

Although some of the errors are introduced during the various stages of data preparation (such as weighting or imputation), it appears that a small number of errors are associated with problems during the fieldwork (such as field representatives not administering the questionnaire correctly). Although challenges during the fieldwork are to some degree unavoidable, the fact that the review does not commence until a full year’s worth of data are collected leads to situations in which it is too late to correct the problem, and some of the estimates have to be suppressed. This time lag can affect not only the 1-year data from the previous year but also other datasets that include the 1-year data. The Census Bureau has developed a system for ongoing monitoring of the data, but it has not been implemented, perhaps because full implementation would require changes to a large and complicated operation. However, for a survey of the scale of the ACS, it is particularly critical to ensure that problems are identified while they can still be corrected and while the consequences can be minimized. From the perspective of data processing, implementing ongoing quality control and editing processes is the most important next step. Once implemented, in the long run these changes could result in significant cost savings if they prevent potential major errors from affecting a full year’s worth of data. Some of the new systems being implemented as part of the shift to adaptive design could also facilitate this process.

RECOMMENDATION 14: As a priority, the quality control and editing processes in the American Community Survey should be ongoing and as close to the data collection as possible, to ensure that problems are identified promptly and that their impact is minimized.

RECOMMENDATION 15: The Census Bureau should evaluate whether procedural changes might improve the efficiency of the American Community Survey quality control operations.

USING ADMINISTRATIVE RECORDS

The Census Bureau has several ongoing research projects on the potential use of administrative records, many housed in its Center for Administrative Records Research and Applications, which is a new interdisciplinary group within the Research and Methodology Directorate. These projects tend to be focused on the crucial step of evaluating the scope and quality of available administrative records databases, and the immediate interest is in the possible use of administrative records for modeling missing data or increasing operational efficiencies for the ACS.

For example, the 2010 ACS Match Study (Luque and Bhaskar, 2013), a continuation of the work on the 2010 Census Match Study, evaluated administrative records coverage of 2010 ACS addresses, persons, and person-address pairs at different levels of geography as well as by demographic characteristics and response mode. The study looked at the coverage of records in several data sources, including:

  • Individual Income Tax Returns (IRS Form 1040)
  • Information Returns (IRS Forms 1099 and W2)
  • U.S. Department of Housing and Urban Development (HUD) Public and Indian Housing Information Center
  • HUD Tenant Rental Assistance Certification System
  • HUD Computerized Homes Underwriting Management System
  • Social Security Administration Supplemental Security Income records
  • Selective Service System Registration File
  • Centers for Medicare & Medicaid Services Medicare enrollee data
  • Indian Health Service Patient Registration File
  • U.S. Postal Service National Change of Address File
  • Temporary Assistance for Needy Families

In addition, the Census Bureau evaluated data from five commercial vendors: Experian, Targus, Veteran Service Group of Illinois, InfoUSA, and Melissa Data Base Source. These datasets tend to contain basic demographic information.

The ACS Match Study concluded that administrative records provided more than 90 percent coverage for addresses and persons in the 2010 ACS and around 75 percent coverage for person-address pairs. Coverage was lower for some groups, including young children, some ethnic minorities, and group quarters residents.
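The three coverage measures can be illustrated with a small sketch that treats each survey and administrative record as a (person, address) pair. The identifiers and records are made up, and the matching here is exact set membership, which is a simplification of any real record linkage process.

```python
def coverage_rate(survey_keys, admin_keys):
    """Share of distinct survey keys that also appear in the admin data."""
    survey = set(survey_keys)
    if not survey:
        raise ValueError("no survey keys supplied")
    return len(survey & set(admin_keys)) / len(survey)


# Hypothetical (person, address) pairs.
survey_pairs = [("p1", "a1"), ("p2", "a1"), ("p3", "a2"), ("p4", "a3")]
admin_pairs = [("p1", "a1"), ("p2", "a9"), ("p3", "a2"), ("p4", "a2")]

person_cov = coverage_rate([p for p, _ in survey_pairs],
                           [p for p, _ in admin_pairs])
address_cov = coverage_rate([a for _, a in survey_pairs],
                            [a for _, a in admin_pairs])
pair_cov = coverage_rate(survey_pairs, admin_pairs)
print(person_cov, address_cov, pair_cov)
```

In this made-up example every person is found in the administrative data, but only half of the person-address pairs match, because the records place two people at a different address than the survey does. The same mechanism is one plausible reason pair coverage is lower than person or address coverage.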

Another study (Bond et al., 2014) evaluated the potential for systematic biases in the Census Bureau’s ability to assign each record a unique identifier, called a protected identification key (PIK). That study found that the ability to successfully assign a PIK for person records in the ACS is lower for young children, minorities, residents of group quarters, immigrants, recent movers, low-income individuals, and unemployed individuals than for others. This result probably reflects either that the identifying information was insufficient or that the information did not uniquely match any of the administrative records used in the person validation process. However, between 2009 and 2010 (the 2 years examined in the study), changes introduced to the Census Bureau’s Person Identification Validation System greatly reduced these biases.

Other sections of this report discuss the potential use of administrative records to improve data collection operations (Chapter 3), in small area estimation (below), or as substitutes for items on the questionnaire (Chapter 6). There are a number of ways in which administrative records could also be useful in various stages of the data processing. Some of these options are discussed in this section.

Editing and Imputation

Administrative records could be used to replace misreported or missing items in the ACS directly or through modeling. Some examples of data that are available from administrative records that have been considered for this type of use include age, sex, race, Hispanic origin, earned income, welfare program participation, and food security program participation. The direct use of administrative records would involve matching individual-level ACS records to corresponding individual-level administrative records and using the information from the administrative records about the person or household to replace data that are inaccurate or missing in the ACS survey responses.
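The direct use described above can be sketched as follows, assuming the survey record has already been matched to an administrative record for the same person. The field names, the use of None to mark a missing item, and the source flag are hypothetical conventions, not the ACS edit specifications.

```python
def fill_from_admin(survey_record, admin_record, items):
    """Fill missing survey items from a matched administrative record.

    Returns the filled record and a dict flagging which items were
    taken from the administrative source.
    """
    filled = dict(survey_record)  # leave the original record untouched
    flags = {}
    for item in items:
        admin_value = (admin_record or {}).get(item)
        if filled.get(item) is None and admin_value is not None:
            filled[item] = admin_value
            flags[item] = "admin"
    return filled, flags


# Hypothetical matched records: age is missing in the survey response.
survey = {"age": None, "sex": "F", "earned_income": None}
admin = {"age": 42, "sex": "F"}
filled, flags = fill_from_admin(survey, admin, ["age", "sex", "earned_income"])
print(filled, flags)
```

Keeping a flag for every substituted value mirrors the role of allocation flags in data review: analysts can still compare reported and substituted values separately.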

Administrative records could also be used for model-based imputation to improve the accuracy of imputed values. Indeed, the Census Bureau considers modeling missing data to be one of the more promising potential future uses of administrative records. Although direct imputation has the advantage of increased accuracy, it can be more resource intensive. Moreover, relevant records for the appropriate time period would have to be available at the time when the ACS data are being processed to ensure that delays are not introduced in the ACS data release. The confidentiality considerations can also be more complex in the case of direct uses of administrative data than in the case of their use for model-based applications.

Reducing Bias

Administrative records can be used to evaluate data accuracy, including bias resulting from sampling or survey nonresponse. This evaluation can be accomplished by comparing the individual-level characteristics of the survey respondents to matched person-level information from administrative records or by comparing aggregate survey responses to aggregate administrative records. Again, individual-level comparisons are resource intensive, but they can provide more insight into the problems identified with the data. To a limited extent, administrative records, particularly other Census Bureau records, are already used to evaluate survey data once the data collection and processing are complete. Income, assisted renters, public health insurance, receipt of benefits from the Supplemental Nutrition Assistance Program (SNAP), and residence 1 year ago have been among the administrative data considered for uses of this type.

Administrative records can also be used to improve the weights applied to the data. One challenge for the ACS is that subcounty-level controls are not available from a full enumeration of the population conducted in parallel with the survey, as was the case with the census long-form sample. To reduce the level of variance in the subcounty estimates, the ACS Office uses administrative records from other federal agencies as part of a model-based estimation step in the weighting process for the multiyear data. The possibility of expanding this type of use of administrative records would be worth evaluating.

Evaluating Post-Collection Uses of Administrative Records

The research currently conducted by the Census Bureau on potential ways of integrating administrative records into the ACS is focused on the appropriate first steps in assessing feasibility, including understanding the coverage of the data in both federal and commercial databases and the extent to which the records can be matched to sample cases in the ACS. This provides an important basis for additional research projects.

Further research will be needed on the quality of the administrative records, especially in the context of comparisons to the quality of ACS data. In the case of administrative records beyond basic demographic characteristics, the extent to which the information available represents the same underlying concepts as those that the ACS is intended to measure will have to be evaluated. The reference period for which data are available and how that relates to the ACS data collection period is also an important consideration. Time is also a factor in terms of whether administrative records can be obtained on a schedule that does not adversely impact the ACS data release schedule. Finally, what types of permissions, if any, may be necessary from the individuals whose records are integrated into the ACS is important to assess for different potential uses, along with whether there are any new confidentiality concerns that could emerge.

RECOMMENDATION 16: The Census Bureau should coordinate efforts across units on research related to the potential use of administrative records, and when possible, the American Community Survey Office should build on the research being conducted in other units. Promising topics include the use of administrative records for adaptive design, as sources of data for items on the questionnaire, and to enhance estimation in the post data collection stages. (See also Recommendation 26.)

SMALL AREA ESTIMATION

Previous chapters describe the effects of the reduced sample size of the ACS relative to the decennial census long-form sample on the precision of estimates for tracts and small governmental units. Domains defined by combinations of geography with demographic or other characteristics (such as “Iranian immigrants over 65 years old in New Jersey”) suffer a similar loss of precision for direct ACS estimates. (Direct estimates refer to those based solely on data from the same primary information source in the same domain.) Estimates for areas or other domains for which direct estimates are not acceptably precise are commonly referred to as small area estimates or small domain estimates. (The latter term is technically more general but the former is more common; we use them interchangeably.)

Small area estimation generally involves introducing supplementary information beyond that included in direct estimation in each domain. Statistical models or procedures based on hypothesized relationships among the data sources are then used to obtain improved “indirect estimates.” Small area estimation is successful when at least on the average (although not necessarily for every small area) indirect estimates are closer than direct estimates to the target estimands, the quantities that would have been obtained if the primary information source had been available for the entire domain population.
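One common way to formalize this idea, shown here as an illustration and not as the method used for any particular Census Bureau product, is a composite estimator that weights the direct estimate for an area against a model-based (indirect) prediction. The notation is ours.

```latex
\hat{\theta}_i^{\mathrm{comp}}
  = \gamma_i \, \hat{\theta}_i^{\mathrm{dir}}
  + (1 - \gamma_i) \, \hat{\theta}_i^{\mathrm{mod}},
\qquad
\gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + \psi_i}
```

Here psi_i is the sampling variance of the direct estimate for area i and sigma_v^2 is the variance of the model error. Areas with small samples (large psi_i) lean more heavily on the model prediction, while areas with large samples stay close to their direct estimates.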

Supplementary information used in small area estimation may take several forms. One form is information from the primary information source extended over time (e.g., using data from previous years of the same survey to improve estimates for the current year), over space (e.g., using data from a larger surrounding area to improve estimates for a small area), over domain definition (e.g., using data from two- and four-person families to improve estimates of small area median income for three-person families), or over survey mode or method (e.g., using mail responses to predict potential in-person interview responses, as suggested in Chapter 3). Another form is information from distinct information sources that contain “auxiliary variables” related to the variables of interest in the primary information source. Typically, these auxiliary variables are measured with better precision than the primary variables because of larger sample sizes in the auxiliary data, but conceptual differences or nonsampling errors make it unacceptable to simply substitute the auxiliary variable for the primary source (e.g., income and family composition data from tax returns as an auxiliary to ACS estimates of poverty rates, data from the previous decennial census as a source for population and housing characteristics when estimates are desired for a more recent year).

Given the diversity of characteristics of primary sources (sample sizes and design, scales of measurement and distributional characteristics of variables, patterns of variation across various dimensions, units of measurement, etc.) and auxiliary data (the same characteristics and relationships to the variables of primary interest), as well as differing definitions and requirements of accuracy, a large literature of small area estimation methods has developed (for reviews of this literature, see Ghosh and Rao, 1994; Rao, 2003; Jiang and Lahiri, 2006; Pfeffermann, 2002, 2013). Despite this development of principles and methods, small area estimation is still not an “off-the-shelf” methodology: in fact, a concerted effort is typically required to develop a major new small area estimation product. Nonetheless, small area estimation methods may be the only practical alternative when it is infeasible to expand data collection to the scale required to obtain needed information through direct estimation.

The rest of this section discusses several approaches to small area estimation: current Census Bureau activities, spatial and temporal modeling, synthetic data, and general issues and principles with respect to the ACS.

Current Small Area Estimation Implementation and Development Projects Involving ACS Data

As the nation’s largest timely household survey, the ACS plays a key role in current small area estimation efforts at the Census Bureau, and is likely to continue to do so. Current and potential uses of the ACS in this work broadly fall into two categories. In the first, the ACS itself is the primary information source and contains the target variables; in the other, the ACS provides auxiliary variables for estimation of a measure on another survey. Broadly, one might think of the first of these as filling the gap left by the smaller samples of the ACS relative to the decennial census while maintaining the improved currency of the ACS, and of the second as uses of the ACS to extend the level of detail of population surveys that typically are much smaller than the ACS. A similar perspective emerged in discussions with the Census Bureau staff about their plans for small area estimation.

The first kind of use is represented by two ongoing Census Bureau series, the Small Area Income and Poverty Estimates (SAIPE) and Small Area Health Insurance Estimates (SAHIE) Programs. SAIPE was originally developed with support from the Department of Education to generate up-to-date state and county estimates of numbers and rates of children in poverty, which were required to calculate timely allocations of local school aid under Title I of the Elementary and Secondary Education Act. (The original SAIPE development program was extensively evaluated by the National Research Council [2000a, 2000b].)

The SAIPE Program produces poverty counts and rates for four age groups and estimates of median income for states and counties. Previously, the census long-form sample had been the only source for estimates at this level of detail, which could result in allocations that were based on data as much as 12 years old. Initial SAIPE releases relied on the Annual Social and Economic Supplement (March Supplement) to the Current Population Survey (CPS), a survey of approximately 60,000 households, for income data.

CPS data (averaged over 3 years) were the source of the dependent variable in a model that predicted poverty rates and population counts using auxiliary predictor variables from income tax and information returns, SNAP, and the preceding decennial census. Direct CPS estimates were combined with predictions from the model, weighted by their relative precisions. In many counties most of the weight was placed on the model, because most counties had few or no residents in the CPS sample.

Since 2005, the target variables have been drawn from single-year ACS data (Bell et al., 2007). The much larger ACS sample supports direct estimates much more precise than those from the CPS, although there are nonsampling differences between the CPS and ACS income measurements. Hence, since introduction of the ACS, more weight has been placed on direct estimates, especially in the larger states and counties, improving precision and reducing any possible biases due to error in the auxiliary-variable regression model. Furthermore, the ACS sample includes people in every county, and these data contribute to county estimates even for counties that fall below the population threshold for public reporting of 1-year data. The transition from relying primarily on direct estimates to relying primarily on the model is seamless, in the sense that their relative weights vary continuously as a function of the precision of each.
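The precision weighting described above can be illustrated with a minimal sketch. It assumes a simple inverse-variance composite of a direct estimate and a model prediction; the function name, the variance inputs, and the numbers are hypothetical and are not taken from SAIPE production methods.

```python
def composite_estimate(direct, direct_var, model_pred, model_var):
    """Combine a direct survey estimate with a model prediction.

    Each component is weighted by its relative precision (inverse variance),
    so the weight on the direct estimate is
    model_var / (model_var + direct_var).
    """
    if direct_var < 0 or model_var < 0 or direct_var + model_var == 0:
        raise ValueError("variances must be non-negative and not both zero")
    w_direct = model_var / (model_var + direct_var)
    estimate = w_direct * direct + (1.0 - w_direct) * model_pred
    return estimate, w_direct


# Hypothetical poverty rates: same direct estimate and model prediction,
# but very different sampling variances for the direct estimate.
large_county = composite_estimate(0.18, 0.0001, 0.15, 0.0009)
small_county = composite_estimate(0.18, 0.0081, 0.15, 0.0009)
print(large_county)  # weight on the direct estimate is close to 0.9
print(small_county)  # weight on the direct estimate is close to 0.1
```

Because the weight is a continuous function of the two variances, the estimate moves smoothly from the direct estimate toward the model prediction as the sample shrinks, with no threshold at which the method switches.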

The SAHIE Program produces estimates of health insurance coverage by state and county, using data inputs and methods broadly similar to those of SAIPE. An interesting feature of SAHIE is that it provides a joint distribution of insurance status and income (within age-sex-race and ethnicity demographic cells by state). Proportions for five income groups are estimated first under a normal model for logit-transformed proportions, and then with insurance rates within income-by-demography cells using a similar model. This differs from the age-stratified SAIPE estimates, for which the age distribution is estimated from the census or intercensal population estimates rather than a model. By providing estimates of a bivariate outcome, the SAHIE Program illustrates both the importance and challenges of multivariate small area estimation (Bauder et al., 2011).

Another Census Bureau small area application falling into the same general class but using a very different modeling strategy concerns estimation of the numbers of potential voters speaking a language other than English whose limited English proficiency may impede their ability to participate in elections (Joyce et al., 2014). Under the Voting Rights Act, political jurisdictions meeting criteria of rates or absolute numbers for any linguistic group are required to provide assistive materials in that group’s language. Although language group by age by detailed geography is drawn from the census, the measures of English-language proficiency are only available from the ACS. Because the areas (covered jurisdictions) may be small and the number of languages is large, the estimation problem is challenging and not suited to the type of regression model used in SAIPE and SAHIE. The strategy adopted was to form classes of areas with similar predicted rates of limited English proficiency in a language group based on ACS variables and then to use a beta-binomial model to “shrink” estimates for each area toward the mean for the class. This generic methodology supports “mass production” of the large number of estimates required.
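The idea of shrinking an area's rate toward its class mean can be sketched as follows. This is a minimal illustration assuming a Beta prior centered on the class mean with a known prior strength; the function name and the `prior_strength` parameter are hypothetical, and in practice the prior would be estimated from the data rather than fixed as it is here.

```python
def beta_binomial_shrink(successes, trials, class_mean, prior_strength):
    """Posterior mean of an area's rate under a Beta prior.

    The prior is Beta(alpha, beta) with alpha = class_mean * prior_strength
    and beta = (1 - class_mean) * prior_strength, so it is centered on the
    class mean and counts as `prior_strength` pseudo-observations.
    """
    if not 0.0 < class_mean < 1.0 or prior_strength <= 0:
        raise ValueError("class_mean must be in (0, 1); prior_strength > 0")
    alpha = class_mean * prior_strength
    beta = (1.0 - class_mean) * prior_strength
    return (successes + alpha) / (trials + alpha + beta)


# Two hypothetical areas in a class whose mean rate is 0.20.
# Both have a raw rate of 0.60, but very different sample sizes.
tiny_area = beta_binomial_shrink(3, 5, class_mean=0.20, prior_strength=50)
big_area = beta_binomial_shrink(300, 500, class_mean=0.20, prior_strength=50)
print(round(tiny_area, 4), round(big_area, 4))
```

The area with five sampled persons is pulled most of the way to the class mean, while the area with 500 stays close to its own raw rate.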

The reduced sample size of the ACS relative to the census long-form sample affects the precision of estimates for all of the variables in the ACS. This fact suggests that it might be beneficial to adopt a more generic approach to small area estimation from ACS data so the full range of data products could be released for domains whose 1- or 3-year estimates are now suppressed. Nugent and Hawala (2012) investigated an approach proposed by Schirm and Zaslavsky (2002) based on reweighting of survey data for relatively large domains to controls estimated for smaller domains, possibly using auxiliary data and/or regression estimation methods. The product of this methodology is a weighted microdata file of households, all of which are based on actual data, although some or all of the cases are “donated” from other areas. Once this file is created, all desired tabulations and other statistics can be calculated without requiring separate modeling efforts for each: the admixture of households from within and outside the small area provides some protection against inadvertent disclosure of confidential data.
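A reweighting step of this general kind can be sketched with raking (iterative proportional fitting). Raking is used here only as a familiar stand-in and is not the specific algorithm of Schirm and Zaslavsky (2002); the function name, the household categories, and the control totals are all hypothetical.

```python
import numpy as np


def rake_weights(base_weights, indicators, controls, max_iter=100, tol=1e-8):
    """Scale household weights until weighted totals match the controls.

    indicators: list of 0/1 matrices, one per control dimension, each of
        shape (n_households, n_categories) with exactly one 1 per row.
    controls: list of arrays of target totals, one per control dimension.
        The dimensions must share the same grand total.
    """
    w = np.asarray(base_weights, dtype=float).copy()
    for _ in range(max_iter):
        max_gap = 0.0
        for X, target in zip(indicators, controls):
            X = np.asarray(X, dtype=float)
            target = np.asarray(target, dtype=float)
            current = X.T @ w  # weighted total in each category
            max_gap = max(max_gap, float(np.max(np.abs(current - target))))
            factors = np.divide(target, current,
                                out=np.ones_like(target), where=current > 0)
            w = w * (X @ factors)  # each household gets its category's factor
        if max_gap < tol:
            break
    return w


# Four donor households from a larger domain, reweighted to hypothetical
# small-area controls on tenure (own, rent) and size (small, large).
tenure = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
size = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
weights = rake_weights(np.ones(4), [tenure, size],
                       [np.array([60.0, 40.0]), np.array([50.0, 50.0])])
print(weights)
```

Once the weights are in hand, any tabulation for the small area is an ordinary weighted tabulation of the same microdata records.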

An effort to apply this methodology to generate estimates for school districts, some of which are very small, was unsuccessful but informative (Nugent and Hawala, 2012). An important problem was the inconsistency between the geographical boundaries of many school districts and standard census geographies. An additional technical obstacle was the very large variation in weights in the ACS files, which contributed to problems with convergence of the algorithms. Another line of research aimed at providing generic methods for small area estimates of ACS variables used beta models as a general modeling strategy, extending the methods used in the Voting Rights Act analysis described above. One extension to this model accommodates areas in which the prevalence of a certain characteristic is either 0 percent or 100 percent (Wieczorek et al., 2012), which cannot be predicted under a standard beta model.

There are fewer examples currently for the second role of the ACS in small area estimation, in which the ACS provides auxiliary data for small area estimates of a variable appearing in another, smaller survey. This category is represented by a developmental project on state-level small area estimates for disability (Maples and Brault, 2013). Detailed information on disability is collected by the Survey of Income and Program Participation (SIPP) for a sample of about 37,000 households annually; however, the SIPP sample size and design are not capable of supporting state-level estimates.

The ACS (since 2008) contains six items about broad types of disability; the same items are asked on the SIPP (although on a separate wave of the survey than the more detailed scales). An individual-level regression model was fitted to the SIPP data predicting the SIPP disability items from age, sex, race and ethnicity, and the six ACS disability items appearing on the SIPP. Predictions from this model were calculated for state ACS samples to estimate state rates of disability on the detailed SIPP measures (the regression projection method of Kim and Rao, 2012).
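The projection step described above can be sketched in a few lines. This is a simplified illustration that assumes an ordinary least-squares fit with a single binary predictor in place of the individual-level model actually used; the function names, the toy data, and the area labels are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept, on the small detailed survey."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, np.asarray(y, dtype=float), rcond=None)
    return coef


def project_by_area(coef, X_big, weights, areas):
    """Predict the detailed outcome for each record in the large survey,
    then take the weighted mean of the predictions within each area."""
    X1 = np.column_stack([np.ones(len(weights)), X_big])
    pred = X1 @ coef
    areas = np.asarray(areas)
    weights = np.asarray(weights, dtype=float)
    return {str(a): float(np.average(pred[areas == a],
                                     weights=weights[areas == a]))
            for a in np.unique(areas)}


# Small survey: x is a broad disability item, y the detailed measure.
x_small = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_small = np.array([0, 0, 0, 1, 1, 1, 1, 0])
coef = fit_linear(x_small, y_small)

# Large survey: only the broad item is observed, with weights and areas.
x_big = np.array([1, 1, 0, 0, 0, 0, 0, 1])
w_big = np.array([1, 1, 1, 1, 2, 1, 1, 1])
area_big = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
rates = project_by_area(coef, x_big, w_big, area_big)
print(rates)
```

The large survey contributes the area-level distribution of the predictors, and the small survey contributes only the relationship between predictors and outcome.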

Spatial and Temporal Modeling

Continuous measurement and geographical detail make the ACS a natural candidate for application of spatial, time-series, and spatio-temporal modeling to improve small area estimates, which is currently a topic of research. Spatial methodology has been shown to improve the precision of the ACS small area estimates. In the univariate case, Porter et al. (2014b) demonstrated the advantages of using intrinsic conditional autoregressive models in addition to auxiliary functional covariates (e.g., Google Trends data), rather than models having no spatial dependence. For the multivariate case, Porter et al. (2014a) proposed two models: the first had a separable outcome-by-space dependence structure, whereas the second accounted for cross-dependence using a generalized multivariate conditional autoregressive (GMCAR) structure. In a state-level example, the GMCAR model yielded smaller mean square prediction errors relative to both the separable model and a multivariate model with unstructured dependence between outcomes and no spatial dependence. This approach is well suited to producing several estimates simultaneously rather than a series of separate estimates for different variables.
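The spatial dependence in an intrinsic conditional autoregressive prior can be made concrete with a small sketch. It shows only the standard structure of such a prior (a precision matrix built from an adjacency matrix, and a conditional mean equal to the neighbor average) and is not the model of Porter et al.; the four-area map and the values are hypothetical.

```python
import numpy as np


def icar_precision(adjacency):
    """Unscaled intrinsic CAR precision matrix Q = D - W, where W is a
    symmetric 0/1 adjacency matrix and D holds each area's neighbor count."""
    W = np.asarray(adjacency, dtype=float)
    if not np.array_equal(W, W.T):
        raise ValueError("adjacency matrix must be symmetric")
    return np.diag(W.sum(axis=1)) - W


def conditional_mean(adjacency, values, i):
    """Conditional mean of area i's spatial effect given all the others:
    the simple average of its neighbors' values."""
    W = np.asarray(adjacency, dtype=float)
    return float(W[i] @ np.asarray(values, dtype=float) / W[i].sum())


# Four hypothetical areas arranged in a line: 0-1, 1-2, 2-3.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
Q = icar_precision(adjacency)
print(Q)
print(conditional_mean(adjacency, [1.0, 2.0, 5.0, 4.0], 1))  # average of areas 0 and 2
```

Each row of the precision matrix sums to zero, which is why the prior constrains differences between neighboring areas rather than their overall level.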

To aggregate data to user-defined geographies, areal data spatial models could be constructed using change-of-support methodology in which demographic variables are defined on new spatial supports. Bradley et al. (2014) developed an approach that models count-valued survey data using a Poisson distribution by interpreting Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach enables ACS data users to consider spatial supports other than those released for publication.

In principle, spatio-temporal small area estimation models might be considered for the ACS. Indeed, 3- and 5-year estimates could be regarded as a crude form of temporal modeling. To date, the ACS annual time series are too short for some of the more complex temporal models, but this will change as more years of data are collected.

Synthetic Data

Typical applications of small area estimation generate estimates and standard errors for specific domains. Depending on the nature of the sampling design, it can be somewhat difficult to aggregate (or disaggregate) these estimates into other domains. Doing so may require knowledge of covariances of estimates across the original domains, which often are not available in published material. One approach to this problem is to generate and release synthetic populations in which every geographic area (at the finest level of aggregation deemed of statistical use) has a complete roster of simulated households and individuals, each having all ACS variables imputed.

Synthetic populations could simplify secondary analyses of the ACS enormously. In particular, analysts can estimate any finite population quantity of interest in any geographic region by simple unweighted tabulations. Furthermore, if the Census Bureau releases multiple copies of the simulated populations, as in multiple imputation (Rubin, 1987; Raghunathan et al., 2003), then analysts can compute the variance of any estimate as the variance, across copies, of the corresponding population quantities. These simple computations apply regardless of how an analyst aggregates the data.
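The computation an analyst would perform on released synthetic copies can be sketched as follows. This follows the simple description in the text (tabulate each copy, then summarize across copies); the exact combining rule would depend on the synthesis design, and the record layout, field names, and tract codes here are hypothetical.

```python
from statistics import mean, variance


def estimate_from_copies(copies, quantity):
    """Tabulate a population quantity in each synthetic copy, then return
    the mean across copies as the estimate and the sample variance across
    copies as a simple measure of its uncertainty."""
    values = [quantity(copy) for copy in copies]
    return mean(values), variance(values)


def poor_in_region(population, tracts=frozenset({"T1", "T2"})):
    """Unweighted count of persons in poverty in a user-defined region."""
    return sum(1 for person in population
               if person["tract"] in tracts and person["poor"])


# Three tiny hypothetical synthetic populations (lists of person records).
copies = [
    [{"tract": "T1", "poor": True}, {"tract": "T2", "poor": True},
     {"tract": "T3", "poor": True}, {"tract": "T1", "poor": False}],
    [{"tract": "T1", "poor": True}, {"tract": "T2", "poor": True},
     {"tract": "T2", "poor": True}, {"tract": "T3", "poor": False}],
    [{"tract": "T1", "poor": True}, {"tract": "T1", "poor": True},
     {"tract": "T2", "poor": True}, {"tract": "T2", "poor": True}],
]
estimate, var = estimate_from_copies(copies, poor_in_region)
print(estimate, var)
```

The region is defined by the analyst rather than by the publisher, which is the practical advantage: changing the set of tracts requires no new modeling.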

To illustrate the outline for a fully model-based approach to synthesis, the data synthesizer might start with a list of housing units with some characteristics from the sampling frame or from the decennial census. The next step would be to estimate models for unknown household characteristics given known housing unit characteristics based on ACS sample data. To borrow strength across geographic units, the models might include random effects for blocks or tracts (possibly with spatial correlation). The synthesizer would then impute unknown household characteristics by sampling from the estimated models. Having generated a synthetic roster of households, the Census Bureau would next populate them with individuals by drawing from a model for person characteristics given household characteristics, generating a complete synthetic roster from which any desired tabulations or other statistics could be prepared.

The success of a synthetic ACS approach would depend on the quality of the models used for synthesis (Reiter, 2005). Constructing these models is a substantial challenge and might require new methodological developments. Nonetheless, there are precedents for synthesis of such complex datasets, such as the synthetic SIPP (Abowd et al., 2006) and the synthetic Longitudinal Business Database (Kinney et al., 2011). It may also be possible to reduce modeling effort and sensitivity to model specification by imputing or weighting into an area the actual households with the desired characteristics but from a different area, so only summary characteristics need to be modeled rather than every detail of household relationships and personal characteristics (see Zaslavsky, 2004).

General Issues and Principles for Small Area Estimation with the ACS

The examples described above illustrate the feasibility and usefulness of small area estimation as a contributor to the production of small area statistical products and the central role of the ACS in such efforts. Because of the diversity and complexity of small area estimation methodology, as well as the diversity of data needs that might be addressed through small area estimation, the panel did not consider it to be within our scope to make specific recommendations about priorities for new programs or methodologies. Instead, we note several general principles and issues, some of which are illustrated by the above examples.

A number of methodological issues and potential solutions call for attention to realize the potential for small area estimation. First, a methodology that generates small area estimates for many variables at once would have advantages relative to a series of separate estimation projects for different variables, since interactions or cross-tabulated cells might serve analytic needs that are not met by tables for single variables. The multivariate spatial models, reweighted microdata, and data synthesis approaches described above are three possible approaches to this objective.

Second, small area estimation models in many cases generate model-based intervals with good properties as well as point estimates. Thus, a by-product of small area estimation of ACS variables may be a solution to the problem of implausible intervals (discussed further in Chapter 5).

Third, ACS data are useful as an auxiliary data source for the small area estimation of variables from other population surveys (the second type of use defined above) when the ACS includes variables predictive of the key outcomes of the other survey. This use can be a consideration in content definition for the ACS (which is discussed in Chapter 6). Because the ACS is so much larger than other population surveys, there could be considerable benefits to estimation even if such variables were included in the ACS only on a sampled basis.

Finally, the Census Bureau has a long and impressive history of protecting confidentiality of individuals’ data, and the ACS is no exception. The panel recognizes that the Census Bureau has controls in place to reduce risks of unintended disclosures and encourages the Census Bureau to continue to be vigilant in safeguarding confidentiality while preserving as much data quality as possible. One option is to offer tiered access to ACS data for different categories of researchers (see National Research Council, 2005). In this context, there is intermediate ground between tabular data and geographically nonspecific microdata released for public use, and block-level microdata accessible only in a Research Data Center. For example, virtual data enclaves like those developed by NORC and in use in Europe could be allowed for approved researchers (in academia, government, and industry), to improve access to ACS data with acceptable risks to confidentiality.

The Census Bureau will also need to be cognizant of the potential for additional disclosure risks due to use of sensitive or potentially identifying auxiliary data in small area estimates. For example, if the Census Bureau synthesizes populations by substituting values from administrative records that are also in an external database, then unusual values could result in identification. However, the information in auxiliary variables is typically aggregated and modified through complex models, reducing the risk of disclosure relative to other potential Census Bureau uses of administrative data, such as substitution for nonresponse or supplementing frame creation.

A number of organizational issues affect the prospects for a small area estimation program. First, an important limiting step for expansion of small area estimation products is the availability and quality of auxiliary data sources. Some of the most valuable administrative sources, notably IRS databases of tax and information returns, are only made available to the Census Bureau for a few applications. Such restrictions on sharing of data across agencies have limited the ability of the Census Bureau to make the best use of federal data for small area estimation, although recent encouragement from the U.S. Office of Management and Budget (2014) for increased administrative data sharing could increase collaborations. Second, administrative restrictions are exacerbated by the major effort needed to prepare administrative datasets for statistical use, including geocoding to the appropriate levels of census geography. Optimally, this effort would be spread over the maximum number of uses of the data, so preparation of data could be made a priority for administrative records staff.

Third, small area estimation involves a combination of general methodologies and survey- and subject-matter-specific expertise. The Census Bureau does have a small expert staff devoted to small area estimation methodology, and it also has staff working on small area estimation in a number of program areas, including the ACS. A cross-cutting organizational structure could connect staff working on small area estimation projects on different subject-matter topics and using different data sources, encouraging the sharing of methodology and the rotation of staff across methodologically related projects and helping to avoid duplication of effort. Given the importance of small area estimation to ACS objectives and of the ACS to other small area estimation initiatives, it would be important to maintain ACS staffing for this work on a continuing basis. Establishing and preserving links with small area estimation practitioners in other agencies would also be productive.

Finally, the Census Bureau could encourage the user and research communities to develop methodologies for small area estimation using ACS data, on both Census Bureau-designated and user-initiated topics. Mechanisms could include support through the National Science Foundation-Census Bureau Research Network and small-scale contracts, access to the Research Data Centers, challenge competitions, and a repository and search engine for small area estimation techniques and applications to which researchers could contribute.

RECOMMENDATION 17: The Census Bureau should continue its program of small area estimation using American Community Survey data, maintaining a balance of methodological research and development of production applications directed to current user needs, methods for univariate and multivariate estimation, and intramural and extramural research.

RECOMMENDATION 18: The Census Bureau should negotiate agreements with potential federal sources of auxiliary variables for small area estimation, allowing sharing of data for multiple developmental and production uses, with suitable protections of confidentiality. In particular, the Census Bureau should endeavor to broaden its data-sharing agreement with the Internal Revenue Service to facilitate statistical uses beyond those directly related to the Small Area Income and Poverty Estimates Program.
