2

Combination of Information Across Areas

Given its substantial sample size, for many purposes output from the ACS collected directly from a group or an area at a point in time will be adequate to provide useful estimates. However, the utility of the ACS will be greatly enhanced through its use in producing indirect estimates, i.e., estimates derived by combining information from other data sources or from the ACS for other time periods or for other geographic areas, through the use of statistical models. Using the census, administrative records, other household surveys, and now the ACS, statistical models hold the promise of providing timely estimates for smaller areas and groups than would otherwise be possible.

Combining information is an area in which statistics has recently made important advances (see, e.g., National Research Council, 1992). This includes progress in empirical and hierarchical Bayes' modeling (facilitated by the advances in computation provided by Markov chain Monte Carlo methods), variance component modeling, small-area estimation, and time-series analysis, along with advances in generalized linear models (GLM). This is in addition to a greater understanding of how to accommodate complex sample designs using these techniques. While these advances have demonstrated wide utility, each individual application typically presents some novel complications. Especially given the variety and number of sources of information and the variety and number of different responses of interest (information on education, welfare, unemployment, income, health, etc.), understanding how to make use of these techniques in this setting presents a difficult challenge to the Census Bureau.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




We separate this topic into two subtopics: combining information across areas for a single time period and combining information across multiple time periods (and across areas). This is mainly for convenience of discussion, since clearly both topics need to be considered simultaneously. This chapter concerns the former issue, and the next chapter concerns the latter.

Models that combine information from these various sources must take account of the following (and other) complications: the information sources to be combined could provide information for different populations (e.g., tax filers are not the same as residents of the United States), represent slightly different reference periods, and make use of different survey or data collection methods. Therefore, combining estimates from these sources may require techniques that can handle measurement error and bias that are not well modeled or estimated. In addition, estimates are typically needed at different levels of geographic aggregation, such as national, state, county, and possibly lower (e.g., census tract) levels.

Several questions concern the development of such models:

What types of models are likely to be effective?
How can estimates that are subject to measurement error and other biases be combined?
At what level of aggregation should the modeling be done? For example, should estimates be modeled at the county level and then aggregated, or should estimates be modeled at the state level and, using simple types of models such as synthetic estimation or modeling county shares, passed down to counties?
How can Bayes' (or related) methods be used to fold the direct estimates in with the model-based estimates?
What complications are posed by the sampling weights, nonresponse, and undercoverage for each of the data sources?
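One common way to fold a direct estimate in with a model-based estimate is a variance-weighted average, in the spirit of empirical Bayes small-area models. The sketch below is only an illustration and not a method prescribed by the workshop: the function name, the inputs, and the assumption that the between-area model variance is already known are all hypothetical.

```python
def composite_estimates(direct, sampling_var, model_pred, model_var):
    """Shrink each direct estimate toward its model-based prediction.

    direct       : direct survey estimates, one per area
    sampling_var : sampling variance of each direct estimate (assumed known)
    model_pred   : model-based (e.g., regression) prediction for each area
    model_var    : between-area model variance (assumed known here; in
                   practice it would have to be estimated)
    """
    if model_var < 0 or any(d < 0 for d in sampling_var):
        raise ValueError("variances must be non-negative")
    combined = []
    for y, d, m in zip(direct, sampling_var, model_pred):
        # Weight on the direct estimate: close to 1 when the sampling
        # variance is small relative to the model variance.
        gamma = model_var / (model_var + d)
        combined.append(gamma * y + (1.0 - gamma) * m)
    return combined


# Hypothetical inputs: the second area has a noisier direct estimate,
# so it is pulled further toward its model prediction.
estimates = composite_estimates(
    direct=[10.0, 20.0],
    sampling_var=[1.0, 4.0],
    model_pred=[12.0, 16.0],
    model_var=4.0,
)
```

Areas with precise direct estimates keep most of their own information, while areas with small samples borrow more from the model. This is one concrete form of the bias and variance tradeoff that these questions raise.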
For some large areas, direct estimates from a relevant household survey are likely to be recognized as standard values given their lack of measurement error (but possibly appreciable variance, depending on the area), so agreement of indirect estimates incorporating ACS information with these standards would have the advantage of consistency with an accepted estimate. To address this, one possibility is to control the indirect ACS estimates to the standard estimates. Or one might try to use the ACS information to improve on these standard values. Both approaches, controlling and smoothing,[1] are complicated by the existence of several of these standard values. Should one consider modeling each ACS response separately, or is there some kind of very general missing data technique that would put each of the important household surveys together with the ACS, possibly into a multipurpose database that could simply be aggregated to provide estimates?

[1] The term “smoothing” is used to indicate a wide variety of techniques in which two or more estimates are combined through use of weighted averages in order to reduce variance.

An important special problem relevant to this last point is that of providing population estimates for demographic groups within counties. The ACS will provide information as to the size of these populations, and simply controlling ACS estimates to the existing population estimates ignores this important source of information. Therefore, how should ACS and population estimates be combined to produce better small-area population estimates?

RESEARCH DIRECTIONS

In his presentation on the topic, Tom Louis noted that direct estimation for small areas will typically be inferior to reasonable methods in which information is combined using statistical models, and in the ACS context, that will usually mean “borrowing strength” over geography, or time, or both. The determination of how to combine information typically involves a tradeoff of bias and variance. In making this tradeoff, Bayesian formalism effectively structures the integration of information and ensures that all uncertainties are captured by the posterior distribution. This approach can combine information for relevant data sources and can properly account for missing data. Further, Bayesian methods can address nonstandard goals, which is relevant to a topic addressed in Chapter 3: the use of ACS-based estimates for input into fund allocation formulas, which often have nonstandard forms and therefore implicitly nonstandard loss functions for the associated estimates, e.g., fund allocation formulas that have eligibility thresholds. Though the Bayesian approach has these and other attractive properties, due to the national importance of the ACS in providing estimates for various official purposes, its use in this context must have good frequentist properties (good objective performance) as well.
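To illustrate why an eligibility threshold implies a nonstandard loss function, the sketch below compares a plug-in decision based on a point estimate with the posterior probability that an area exceeds the threshold. It assumes, purely for illustration, that the posterior for an area's rate is approximately normal. The function names, the 20 percent threshold, and the numbers are hypothetical and do not come from any actual allocation formula.

```python
from statistics import NormalDist


def prob_above_threshold(post_mean, post_sd, threshold):
    """Posterior probability that the true rate exceeds the threshold,
    assuming a normal posterior with the given mean and sd (post_sd > 0)."""
    return 1.0 - NormalDist(post_mean, post_sd).cdf(threshold)


def plug_in_eligible(post_mean, threshold):
    """Eligibility decided from the point estimate alone."""
    return post_mean > threshold


# Two hypothetical areas on either side of a 20 percent threshold,
# each with the same posterior standard deviation.
threshold = 0.20
for mean in (0.19, 0.21):
    p = prob_above_threshold(mean, 0.02, threshold)
    print(mean, plug_in_eligible(mean, threshold), round(p, 3))
```

The plug-in rule treats the two areas as entirely different, while the posterior probabilities show that both are close to the cutoff. An estimate tuned for mean square error is therefore not necessarily the best input to a threshold formula.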
A large body of literature validates the judgment that Bayesian methods used with care do have excellent, objective properties. In addition, they are no more complex than the application dictates, and they have the advantage of making all assumptions explicit. Bayesian methods separate the two activities of summarizing information and using it to make inferences. Multiple goals, such as those governed through the use of point estimates, estimation of ranks, and estimation of the cumulative distribution function of the underlying parameters, can be addressed individually, or a single compromise inference or estimate can be used that performs well (but not necessarily optimally) for all goals.

This point initiated discussion concerning the distinction between production of estimates for general use and production of estimates for specific purposes. The approach advocated might be more relevant to an estimate needed for input to specific fund allocation formulas. However, if an agency is producing estimates for broad national purposes, it is not clear that this approach is relevant.

In the particular application of small-area estimation involving the ACS, based somewhat on the experience of the small-area poverty estimates panel mentioned above, fixed-effects regression modeling combined with empirical and hierarchical Bayesian and random-effects modeling should be very effective in a wide variety of specific problems.

Given that one is simply aggregating lower-level estimates to provide estimates at higher levels of aggregation, a natural concern is that the aggregate estimates will not approximately equal the direct estimates at higher levels of aggregation. (Equality would not be sensible given the variance associated with all sample-based estimates.) Bayesian hierarchical linear models can be developed, in principle, that would have the property that estimates at various lower levels of geographic aggregation would sum to the corresponding estimates at higher levels of aggregation; in addition, these sums would closely approximate the direct estimates for these higher levels of aggregation. To do this it would be appropriate to develop a model at the finest level of geographic/demographic aggregation and let the Bayesian prior to posterior mapping bring in data organized at various levels of aggregation. This can be challenging, since the number of parameters can become large, but approaches to a solution exist. Unfortunately, the property that estimates sum over levels of aggregation and also that higher-level estimates closely approximate direct estimates at that level of aggregation, which is currently possible for Bayes' hierarchical linear models, may be difficult to achieve for generalized linear versions of these models.
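For contrast with the model-based approach described above, the simplest way to force lower-level estimates to agree with a higher-level figure is a ratio adjustment, often called benchmarking or controlling. The sketch below is not the hierarchical Bayesian method discussed in the presentation. It is a minimal stand-in that shows the additivity property being sought, with hypothetical function names and numbers.

```python
def benchmark_to_total(area_estimates, control_total):
    """Scale lower-level estimates so they sum to a higher-level control.

    area_estimates : model-based estimates for the component areas
    control_total  : accepted estimate for the aggregate (e.g., a state)
    """
    current_total = sum(area_estimates)
    if current_total == 0:
        raise ValueError("cannot benchmark estimates that sum to zero")
    factor = control_total / current_total
    # Every area is scaled by the same factor, so relative shares
    # among the areas are preserved.
    return [factor * estimate for estimate in area_estimates]


# Hypothetical county estimates that sum to 100, controlled to a
# state-level direct estimate of 110.
county = [10.0, 30.0, 60.0]
adjusted = benchmark_to_total(county, control_total=110.0)
```

A ratio adjustment forces exact equality with the control, which the text notes is not necessarily sensible given sampling variance in the higher-level estimate. A hierarchical model would instead let the aggregate data inform the lower-level estimates without imposing exact agreement.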
In his discussion of Tom Louis's presentation, Rod Little agreed that Bayes' hierarchical modeling, also known as full probability modeling, was attractive for complex ACS estimation tasks. Having the full posterior distribution for estimates available was important for handling loss functions other than mean square error. Of course, most analyses will continue to focus on common summaries, e.g., means and standard deviations, but there are alternatives that should be considered. One important application in which the posterior distribution would play a role is multiple imputation of missing data, which the ACS will need to accommodate.

In survey inference, the goal is to create predictive distributions for nonsampled and missing data values in a population. A well-constructed Bayes' hierarchical model can yield these predictive distributions in a manner that gives them many positive features. These models can incorporate information from disparate data sources, they can be used to treat missing data, and they can be used to appropriately reflect variables used in the sample design. Furthermore, these models allow borrowing strength across geography (and time) for good small-area estimates, and they make the use of prior information visible, which can inform users as to the specific impact of the prior on the posterior distribution. These models have the further advantage of flexibility in that sensitivity to specification of the prior distribution can be readily assessed.

While Bayes' hierarchical modeling can be used to incorporate information from administrative records, the use of such records requires that they be comparable across regions. This can be checked by keeping track of differences in programmatic rules and methods, but it can also be checked by comparing administrative record tabulations with survey data pooled over time, a technique similar to that used in the work by the panel on small-area estimates of poverty. One method for reducing regional biases, if they exist, is to form strata that cut across regional boundaries.

The criticism that these models are too complex is easily refuted. Markov chain Monte Carlo techniques make the computational complexity of Bayesian models an increasingly minor issue. Similarly, the criticism that the techniques are too dependent on assumptions is refuted, since the sensitivity to those assumptions can be assessed. Also, many simple, frequentist techniques rely (at times implicitly) on strong assumptions that may not be supported by the data. What really matters is not the complexity of the algorithm used to generate the estimates, but the complexity of the model itself and the key assumptions on which it relies. It is necessary to identify the mean structure, the variance structure, and the hierarchy between the model components. It is then necessary to examine the sensitivity of the model to misspecification.

FINAL POINTS

Bayesian models are likely to provide a natural framework for combining information from the ACS, the census, household surveys, and administrative records. The various advantages of this framework, such as the incorporation of nonresponse, were stated.
The identification of particular models for this purpose could not be done, since the ACS has not yet been fully implemented and therefore much about the data structure remains unknown.