3

Combination of Information Across Time

While very large, the ACS sample size collected from the smallest jurisdictions in a year will not be sufficient to support direct annual estimates of acceptable precision. Current plans are to not publish direct estimates for areas with populations of less than 65,000 people. For these areas, instead of the individual yearly estimates, the Census Bureau proposes to report equally weighted moving averages of the ACS yearly estimates for the most recent 2 to 5 years, depending on the size of the area. (The use of the cross-sectional models suggested above and the possible oversampling of governmental units with less than 2,500 population could reduce, but likely not eliminate, the need for some kind of borrowing of information for the smallest areas.)
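A minimal sketch of the equally weighted moving average described above, assuming a plain list of yearly estimates. The function name and the toy linear series are illustrative assumptions, not part of any Census Bureau procedure. The loop also shows where such an average is centered when the underlying series changes linearly.

```python
def equal_weight_moving_average(yearly_estimates, k):
    """Average the k most recent yearly estimates with equal weights."""
    if not 1 <= k <= len(yearly_estimates):
        raise ValueError("k must be between 1 and the number of years available")
    recent = yearly_estimates[-k:]
    return sum(recent) / k


# Toy linear series (illustrative, not ACS data): value 100 + 10*t for years t = 0..4.
SLOPE = 10
truth = [100 + SLOPE * t for t in range(5)]

for k in (2, 3, 5):
    average = equal_weight_moving_average(truth, k)
    # How many years behind the latest yearly value the average sits.
    lag_years = (truth[-1] - average) / SLOPE
    print(f"k={k}: average={average}, lag behind latest year={lag_years} years")
```

Under this linear assumption the k-year average equals the value (k - 1)/2 years before the latest yearly estimate, so a 2-year average lags by half a year and a 5-year average by 2 years.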

Assuming linear changes in the true response of interest over time, moving averages will be estimates of the situation in an area 6 months to 2 years in the past, which would still be preferable to use of the decennial census information, which is generally less current and can be as much as 10 or more years out of date. Rather than use a moving average, particularly the equally weighted moving averages under consideration, other time-series approaches are possible, especially when one considers that ACS information (unofficially) can be tabulated on a monthly basis, therefore providing 60 observations over a 5-year period. Alternate forms of time-series modeling (e.g., ARIMA¹) could reduce the variance of the resulting ACS estimates compared


1 ARIMA, autoregressive integrated moving average, is a broad class of time-series models.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

with the use of moving averages. An additional advantage of such alternatives, in comparison with equally weighted moving averages, is that the resulting estimates could be used as predictions for the current year and therefore would have less time bias. Use of these methods would also reduce some equity concerns when estimates from the ACS are used as inputs into fund allocation formulas, since it would be helpful to provide estimates with as little difference in variance as possible for areas (e.g., counties) regardless of their size.

Borrowing ACS information across time also raises a broader combination-of-information challenge than that represented by the cross-sectional models discussed above, since the ACS, most household surveys, and most administrative records data are collected annually (and sometimes monthly). These inputs for other time periods could be used in several ways to improve the above cross-sectional models.

RESEARCH DIRECTIONS

Bill Bell organized his presentation into three pieces: (1) general borrowing of information over time in repeated surveys, using time-series models, (2) the specific use of moving averages or ad hoc smoothing models for the purpose of borrowing information over time in repeated surveys, and (3) methods used in the project on small-area estimates of poverty to combine information across time and geography and the relevance of these methods to ACS.

Borrowing Information Over Time in Repeated Surveys

The original work on borrowing information over time in repeated surveys was by Scott and Smith (1974). They assume that y_t = θ_t + e_t, where y_t is the time series that is observed, θ_t is the true process of interest, and e_t is the sampling error.
In this situation, data and models are needed for both the time series of the sampling errors e_t, whose distribution is primarily determined by the sampling autocovariances, and that of the true response θ_t, which is assumed to have a stochastic (error) term (e.g., a common assumption is that θ_t = λθ_{t−1} + ε_t) and may depend on regression variables. In this context, the stochastic term for the true process is generally assumed to be correlated over time and nonstationary. Best linear unbiased prediction (based on multivariate normal conditional expectations and variances) is used to estimate the model's parameters. One difference between this and cross-sectional models is that it is more difficult to recognize uncertainty in model parameters. Variance estimation of the estimates is complicated, but there are simulation approaches to this problem. Bell was aware of two implementations of this method: by Dick Tiller at the Bureau of Labor Statistics on the state labor force time series and by the Australian Bureau of Statistics to produce regional time-series estimates.

This approach has been researched for some years, but it has been difficult for researchers to demonstrate substantial gains for the following reasons: when sampling error is low, substantial gains are not possible; when sampling error is high, although there is the potential for substantial gains, estimating the model's parameters is more difficult. A key problem in this area concerns the need for a model of the sampling errors, especially the autocovariances. Current plans do not exclude the possibility that in the ACS design the sampling errors will be approximately uncorrelated. If that turns out to be the case, it would permit a great simplification. Another issue is the consequences of uncertainty about the variances and time-series parameters, in particular the signal-to-noise ratio (i.e., model error variance relative to sampling error variance). Furthermore, there has been little study of the robustness of the resulting estimates to model misspecification.

A related topic is that of benchmarking. This is the adjustment of estimates from, for example, a monthly survey so that the estimates for a given year, when summed, agree with annual data produced from another survey or a census. This adjustment would be supported by the assumption that the annual survey had less sampling error (which is reasonable) and possibly less nonsampling error. If so, benchmarking can reduce both sampling and nonsampling error. In the small-area poverty work (National Research Council, 1998, 1999), the Census Bureau made the assumption that the CPS contained less nonsampling error than the census, which is the reverse of the more usual situation. As a result, the Census Bureau did not constrain the CPS results to agree with the census.
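The benchmarking adjustment described above can be sketched in its simplest form, pro-rata (ratio) benchmarking, in which every monthly estimate is scaled by the same factor. This is only one possible choice and is an assumption made here for illustration; the text does not say which benchmarking method any agency uses, and the function name and figures are hypothetical.

```python
def benchmark_pro_rata(monthly_estimates, annual_benchmark):
    """Scale monthly estimates so that they sum to the annual benchmark.

    Pro-rata benchmarking: one common scaling factor for all months,
    so the month-to-month proportions are left unchanged.
    """
    total = sum(monthly_estimates)
    if total == 0:
        raise ValueError("monthly estimates sum to zero; cannot rescale")
    factor = annual_benchmark / total
    return [value * factor for value in monthly_estimates]


# Hypothetical monthly survey estimates and a hypothetical annual benchmark.
monthly = [10.0] * 6 + [20.0] * 6      # sums to 180
adjusted = benchmark_pro_rata(monthly, annual_benchmark=198.0)
print(round(sum(adjusted), 6))         # matches the annual benchmark
```

A single scaling factor is the bluntest option: it moves every month by the same proportion, which is appropriate only if the benchmark source is taken to be more accurate than the monthly series throughout the year.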
Because the CPS results were not constrained to agree with the census, the interesting technical question was how to use the census data to reduce variances without substantially increasing nonsampling error. This was accomplished by using the census data to define regression predictors in the CPS equation and using the fitted CPS equation to carry out empirical Bayes smoothing, which effectively calibrates the estimates to a CPS basis. This issue will need to be addressed with the ACS with respect to surveys that are considered to be highly reliable for various outputs.

USE OF MOVING AVERAGES FOR BORROWING INFORMATION OVER TIME IN REPEATED SURVEYS

One suggested method for borrowing information that is under consideration for the ACS is the use of moving averages, e.g., estimating the true series at time t by averaging values for the observed series for the k closest time periods (where k is some small integer). Moving averages are a simple way to reduce the size of the sampling errors associated with an estimate, assuming that the sampling errors are relatively uncorrelated, with the downside that one is not then directly estimating the quantity of interest. So the use of moving averages results in a particular bias-variance tradeoff. Also, moving averages have a time delay for contemporaneous estimates, though asymmetric moving averages can provide estimates with less time delay. An interesting complication is that the application of moving averages to survey data later to be used as inputs in a model (e.g., for use in regression models as a dependent variable) may be problematic, as the moving averages will alter the statistical properties (particularly the autocorrelations) of the data.

With model-based smoothing over time, one is typically producing minimum mean square error estimates of a response at the current time, assuming that the model is correct. Moving averages, which are ad hoc procedures but still implicitly model-based, tend to disguise the underlying model that one is assuming. In various applications, the underlying model may or may not be sensible. For the specific situation where the true series follows a random walk (i.e., θ_t = θ_{t−1} + ε_t) and the sampling errors are uncorrelated over time, the resulting optimal time-series smoothing weights depend only on the signal-to-noise ratio. If one has 5 years of data, with a small signal-to-noise ratio, the optimal method approaches equal weighting. As the signal-to-noise ratio increases, one gets close to using only the direct estimate. Similar results could be obtained from other models, and further study is needed for other situations.

COMBINING INFORMATION CROSS-SECTIONALLY AND ACROSS TIME

The third part of Bell's presentation concerned how the Census Bureau might put together ACS, household survey, decennial census, and administrative records data over several years to produce small-area estimates, based on the work of the small-area poverty estimates program at the Census Bureau (described in the Appendix).
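Before turning to the poverty models, the random-walk case from the moving-average discussion can be made concrete with a small sketch. It assumes a random walk for the true series, sampling errors that are uncorrelated with a common variance, and no prior information about the starting level, so the weights are those of the best linear unbiased estimate of the current-year value. The function names are illustrative, and this is one way to obtain such weights rather than the calculation presented at the workshop.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(a) for a in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def random_walk_weights(n_years, snr):
    """Weights on y_1..y_n for estimating the current-year value theta_n.

    snr is the innovation variance divided by the sampling variance.
    Year i differs from theta_n by its own sampling error plus the
    (n - i) random-walk innovations between year i and year n; years i
    and j share the n - max(i, j) innovations after the later of the two.
    """
    cov = [[(1.0 if i == j else 0.0) + snr * (n_years - max(i, j))
            for j in range(1, n_years + 1)]
           for i in range(1, n_years + 1)]
    raw = solve(cov, [1.0] * n_years)      # generalized least squares weights
    total = sum(raw)
    return [w / total for w in raw]


for snr in (0.001, 1.0, 1000.0):
    print(snr, [round(w, 3) for w in random_walk_weights(5, snr)])
```

With a signal-to-noise ratio near zero the five weights are close to 1/5 each, and with a very large ratio nearly all the weight falls on the latest year, matching the two limits described above. At a ratio of exactly 1 the weights work out to 1/55, 2/55, 5/55, 13/55, and 34/55.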
The estimation strategy for small-area poverty estimates starts with a base model, representing the use of direct estimates for small areas, which is simply true process plus sampling error, as in the model by Scott and Smith (1974). It also uses a regression model for the true process, with an additive model error. The regression variables come from administrative records data. In the county model, the census data are also brought in as an additional covariate. In the state model, the Census Bureau incorporates information from the decennial census by adding the residuals from the analogous regression fit using census data as the dependent variable.

The Census Bureau is examining a more recent approach for the county-level model, referred to as the bivariate model, which uses two linked equations, one for the census estimate and one for the CPS estimate. They are both true process plus sampling error models in which true process is modeled using multiple regression and the error terms for these models are assumed to be correlated, which links the two models. There are formulations of this approach in which the census residuals show up in the CPS regression, but with a coefficient that varies according to the sampling variance in the census. Since the sampling variances in the state model are small, this approach makes little change to the state model, but it has an effect in the county model. A related approach would be to include the use of a measurement error model to link aggregate CPS and census responses, which can be thought of as a restricted form of the bivariate model.

The small-area poverty estimates have not yet incorporated multiple years of CPS data into the model, though the Census Bureau has experimented with state-level models that use up to 5 years of CPS estimates using a multivariate generalization of the bivariate model. The problem, similar to that discussed above, was in developing a time-series model for the model errors. Further, given the high level of sampling error for counties and most states, it was unlikely that including previous years of CPS data would be very helpful.

When considering the ACS as an added source of information, most likely as a replacement for the census in the above models, especially the bivariate model or multivariate generalizations, its large sample size could result in at least two modifications to small-area poverty modeling. First, the census may be less useful as a covariate when the ACS is included. Also, for the same reason, using past years of ACS data may have greater value than using past years of CPS data. Clearly, this area is only beginning to be explored. The most troublesome possibility is when one cannot conclude that the regression model is stable over time, because in that case one has many fewer data points with which to work.
Another difficulty occurs when modeling discrete outcomes, since there the theory is even less well developed.

Discussant Eric Slud stated that there are two distinct types of information involved in this area: statistical variation and the variation of the signal over time. Borrowing strength over time involves understanding the models that generate both; otherwise the bias could be substantial. Understanding the autocovariance of the sampling errors is extremely important. More generally, this borrowing of information over time is a highly model-dependent activity. Therefore, it is important to emphasize model-checking assessments and whether the validity of assumptions can even be assessed.

Another key question is how to model nonsampling error, which does not appear to be treated in the literature, except in applications with longer time series. A key nonsampling error here is that the CPS, the census, and the ACS present different approaches to measuring various quantities (for example, in the primary example cited, poverty). In order to make full use of these measurements in combination, one would benefit greatly from the use of a measurement error model. To support development of such a model, more matching studies are needed between the CPS and the census, and, when possible, three-way match studies involving all three data sources. These studies might also help in measuring the degree of cross-correlation among the measurements, which would be useful in a Kalman filter² approach to this problem, and they might help support simple weighted combinations of direct estimates and model-based estimates.

Generally speaking, every small area will follow its own time-series model. The only hope is that one can model these collectively in a simple form, in which case it will be possible to share information across small areas through a regression model that includes a shared random-effects component. This is a multiple time-series problem, and it has not been well researched.

The development of time-series methods that use auto- and cross-correlations that do not exist or are poorly estimated is an important concern. It supports the need for some kind of study of whether there are substantial divergences from these estimated or assumed auto- and cross-correlations and, if so, what the implications are. A first step would be checking the sensitivity of the results to the use of alternate auto- or cross-covariance forms. This could be a component of a larger study of model form that is related to nonsampling error. This step will be needed in the effort to calibrate the long form with the ACS (discussed in Chapter 7) and in calibrating the ACS with various household surveys. Such studies will also help in the interpretation of standard errors that would be produced from time-series analysis.

The floor discussion raised additional modeling ideas. One possibility would be to use direct estimates at a higher level of aggregation and moving averages of shares over time to allocate these estimates to smaller areas. The question was raised whether the ACS would release monthly estimates, which would facilitate time-series modeling.
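The share-allocation idea raised in the floor discussion can be sketched as follows. The details are assumptions made for illustration: shares are computed from each small area's own yearly estimates, averaged with equal weights over the k most recent years, and applied to a direct estimate for the larger area. The function name, area labels, and numbers are hypothetical.

```python
def allocate_by_average_shares(area_history, aggregate_estimate, k):
    """Split an aggregate direct estimate using k-year average shares.

    area_history maps each small area to its list of yearly estimates
    (oldest first); all lists must have the same length.
    """
    lengths = {len(values) for values in area_history.values()}
    if len(lengths) != 1:
        raise ValueError("all areas need the same number of years")
    n_years = lengths.pop()
    if not 1 <= k <= n_years:
        raise ValueError("k must be between 1 and the number of years")

    recent_years = range(n_years - k, n_years)
    # Total over all small areas in each recent year, used to form shares.
    totals = {t: sum(values[t] for values in area_history.values())
              for t in recent_years}

    allocation = {}
    for area, values in area_history.items():
        mean_share = sum(values[t] / totals[t] for t in recent_years) / k
        allocation[area] = mean_share * aggregate_estimate
    return allocation


# Hypothetical yearly estimates for two small areas within one larger area.
history = {"A": [10.0, 20.0, 30.0], "B": [30.0, 20.0, 10.0]}
print(allocate_by_average_shares(history, aggregate_estimate=100.0, k=3))
print(allocate_by_average_shares(history, aggregate_estimate=100.0, k=1))
```

Because the shares in every year sum to one, the averaged shares also sum to one, so the allocated values always add back to the aggregate estimate.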
Other issues raised included the benefits from incorporation of spatial autocorrelation structure in these models and ways of addressing the effects of census undercoverage in these modeling problems.

FINAL POINTS

The various requirements for the development of time-series models for combining information in this context, including estimating the sampling autocorrelation structure, were put forward. The difficulties of parameter fitting and model validation were emphasized. As a result, the borrowing of information across time could be difficult, and even if accomplished might not yield substantial gains. In particular, it may not be easy to find an estimation procedure that provides a substantial improvement over the current decision by the Census Bureau to use moving averages. However, the potential remains for improvement, and methods were discussed that might be used to move forward.

2 Kalman filters involve a "state space" representation of a time series, which assumes that a linear model (with an additive error term) represents the relationship between the observed series and a set of state space variables (representing aspects of the time series such as trend or seasonality), and a second linear model represents the relationship between the state space variables at time t and at time t − 1.
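As an illustration of the state-space form described in the footnote on Kalman filters, the following is a minimal sketch of the simplest case: a single state variable (the true level) that follows a random walk, observed with independent sampling error. The variances, starting values, and the function name are assumptions chosen for illustration; a real application would need estimated variances and a model for sampling-error autocorrelation.

```python
def local_level_filter(observations, var_sampling, var_innovation,
                       initial_mean=0.0, initial_var=1e9):
    """Kalman filter for y_t = theta_t + e_t, theta_t = theta_{t-1} + eps_t.

    Returns a list of (filtered_mean, filtered_variance) pairs, one per
    observation. A very large initial_var stands in for having no prior
    information about the starting level.
    """
    mean, var = initial_mean, initial_var
    results = []
    for y in observations:
        # Prediction step: the random walk carries the mean forward
        # and adds the innovation variance.
        var_pred = var + var_innovation
        # Update step: weight the new observation by the Kalman gain.
        gain = var_pred / (var_pred + var_sampling)
        mean = mean + gain * (y - mean)
        var = gain * var_sampling   # equals (1 - gain) * var_pred
        results.append((mean, var))
    return results


# Hypothetical yearly direct estimates with equal signal and noise variances.
series = [12.0, 15.0, 11.0, 14.0, 16.0]
for year, (m, v) in enumerate(local_level_filter(series, 1.0, 1.0), start=1):
    print(year, round(m, 3), round(v, 3))
```

The gain plays the role of the signal-to-noise tradeoff: when the sampling variance is small relative to the innovation variance the filter stays close to the latest direct estimate, and when it is large the filter leans on earlier years.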