OCR for page 44
Small-Area Income and Poverty Estimates: Priorities for 2000 and Beyond
3
Current SAIPE Models
USER OVERVIEW
The Census Bureau's Small Area Income and Poverty Estimates (SAIPE) Program produces income and poverty estimates for states and counties, including estimates of median household income, total poor, poor under age 5 (states only), poor aged 5-17 in families, and poor under age 18. These estimates, which are updated every year for states and every 2 years for counties, are termed “indirect estimates.” They are indirect because they are developed from statistical models that use data from other areas and time periods, unlike “direct estimates,” which are based solely on a survey's sample cases in the given area and period.^{1} The use of indirect estimation for producing updated state and county income and poverty estimates is necessary because there is currently no survey or administrative record data source that can provide the required estimates with sufficient reliability for intercensal years. Indirect estimates of poor school-age children for school districts are derived by using decennial census data to allocate the updated county estimates among districts.
The March Current Population Survey (CPS) collects the detailed information on income needed to produce the required income and poverty estimates. However, the sample is too small to produce sufficiently reliable direct estimates for states, let alone counties. Indeed, most counties have no CPS sample. Therefore, state and county income and poverty estimates are obtained from statistical regression models, and the SAIPE estimates are produced by using weighted averages of the regression predictions and the direct CPS estimates, when the latter are available. The weighted average approach is advantageous because it balances the model error of the regression predictions against the sampling error of the direct estimates.
1
Other terms are also used in the research literature for these concepts: for example, direct estimates are sometimes called “sample-based” estimates, and indirect estimates are sometimes called “synthetic,” “model-based,” or “model-dependent” estimates (see U.S. Office of Management and Budget, 1993).
The state-level model predictions are obtained from regression models in which a state's direct CPS estimate for the reference year is the dependent variable and the predictor variables are obtained from such sources as Internal Revenue Service (IRS) tax returns, food stamp records, population estimates from the Census Bureau's demographic estimates program, and the previous census. The SAIPE estimate for a state is then a weighted average of the model prediction and the direct estimate for the state.
The same general approach is used for the SAIPE county estimates, with the same sources of data for the predictor variables in the regression models. One difference is that 3 years of March CPS information are combined to form the dependent variables in the regression models and to calculate the direct estimates. For the poverty models, another difference is that the county models estimate numbers of poor (in logarithms), while the state models estimate the proportions of poor. For the one-third of counties that have households in the CPS sample, the model predictions are combined with the direct estimates, as is done for the state models. For the other two-thirds of counties, the model predictions are taken to be the estimates. As a last step in developing the SAIPE county poverty estimates, each of the county estimates in a state is multiplied by a constant factor that makes the sum of the adjusted county estimates equal the SAIPE state estimate.
For school districts, no administrative data are currently available from which to form predictor variables for use in poverty models. IRS and food stamp data are not available at the school district level. Counts of students approved to receive free school lunches are a potential source for all districts, but they are not now nationally available, and there are serious concerns about the comparability of the counts across all districts. Hence, the Census Bureau produces estimates for districts using a “shares” approach. This approach assumes that each school district in a county has the same proportion (share) of that county's poor school-age children in the estimation, or reference, year as it did in the 1990 census.
Then the 1990 census shares of poor school-age children for school districts within counties are applied to the updated SAIPE county estimates to produce the SAIPE school district estimates for the reference year.
The production of indirect estimates like those from the SAIPE program is a complex operation that needs to be fully evaluated. The evaluation should check on the input data from the multiple sources, it should examine the adequacy of the models used to produce the model predictions, and it should carefully assess the resulting estimates. Since flaws in any aspect of the estimation process can distort indirect estimates, an evaluation scheme of this form should be a standard component of a small-area estimation program. Moreover, the evaluation should be done every time that estimates are produced.
The panel and the Census Bureau performed detailed evaluations of the SAIPE state and county estimates of poor school-age children, which are described in the companion volume to this report (National Research Council, 2000c). These evaluations include internal assessment of the structure and functioning of the regression models, external comparisons with census data, and, for counties, external comparisons with aggregate CPS estimates. Census and CPS aggregate data are not ideal for evaluation purposes. Yet they can help answer the key question of whether the model estimates show any strong, persistent biases for areas with specific attributes (e.g., areas with large or small populations, high or low poverty rates, rapid or slow changes in poverty rates) that could have adverse consequences when the estimates are used for fund allocation or other program purposes.
SAIPE county estimates of poor school-age children have also been evaluated by consulting state demographers and others with local knowledge. Since estimates are always subject to error, whether they are produced by a model or from local (or other) information sources, one should not be overly concerned by discrepancies between individual estimates and local sources. However, local assessment may indicate persistent patterns of marked discrepancies for areas with common attributes that should be investigated.
The internal and external evaluations of the 1993 and 1995 state and county estimates led the panel to conclude that the models are working reasonably well and that these estimates are preferable to 1990 census estimates as a basis for Title I allocations (National Research Council, 1998, 1999). According to Census Bureau calculations, the SAIPE estimates, on average, have more variability due to sampling error and prediction error than the census estimates. However, the out-of-date census estimates have considerably more bias. For example, estimates produced for 1989 using the modeling approach differed much less from the 1990 census estimates than did estimates from the 1980 census.
Although the evaluations of the SAIPE state and county estimates have supported their use for fund allocation, they have identified aspects of the models that require additional research and development. Some priorities for SAIPE model development are presented later in this chapter (see also National Research Council, 2000c). In addition to research to improve the existing models, research is needed to examine how data from new sources, such as the 2000 census and the proposed American Community Survey, may contribute to the production of the SAIPE estimates. (The potential uses of these sources in the SAIPE program are discussed in Chapter 4.)
As noted above, the lack of administrative data at the school district level led the Census Bureau to use a simple shares approach based on 1990 census data for allocating the updated SAIPE county estimates of poor school-age children among school districts. Only limited evaluations of the school district estimates are possible, but it is clear that the estimates are not very reliable for most school districts. Nevertheless, the evaluations led the panel to conclude that the 1995 school district estimates were the best available for Title I allocations; for example, they were as good as or superior to 1990 census estimates or estimates based on school lunch counts. Marked improvement of the SAIPE poverty estimates for school districts and other subcounty areas will require investment in new or modified data sources that can provide the basis for improved models for these areas. (Chapter 5 identifies possible new administrative data sources that would likely improve SAIPE subcounty estimates.)
The next few sections of this chapter present a technical overview of the SAIPE models for estimates of poor school-age children for states, counties, and school districts, including a description of the Census Bureau's methods for estimating variability in the state and county estimates, and a summary of the evaluations conducted to date. The chapter then briefly summarizes the other SAIPE models (e.g., median household income and poverty for other age groups) and the Census Bureau's methods for producing small-area population estimates and their evaluation. (Population estimates are used both in the SAIPE poverty models and in Title I and other fund allocation programs.) The last section of the chapter provides recommendations to the Census Bureau for research and development to improve the current SAIPE models.
MODELS FOR POOR SCHOOL-AGE CHILDREN
State and County Models
The Census Bureau constructs separate regression models for estimating the numbers of poor school-age children at the state and county levels.^{2} In the state model, the dependent variable is an estimate of the proportion of school-age children who are poor; in the county model, it is the logarithm of the number of poor school-age children. In both cases, the dependent variable is constructed from CPS data. For both models, the deviations from the regression are assumed to follow a variance components model with two components. One component represents sampling error in the dependent variable. The other component represents the deviations in the model predictions from the true values that would occur in a model in which the dependent variable is not subject to sampling error; the Census Bureau, as is commonly done, refers to this component as model error. The state and county estimates are weighted averages of the direct CPS estimates (where available) and the regression predictions, where the weights are functions of the variance components. School district estimates are derived from county estimates under the assumption that the relative proportion (share) of the poor school-age children in a county who are in a particular school district in the reference year is the same as it was in the 1990 census.
Input Data
Both the state and county models of poor school-age children use input data from five sources: the March CPS; the previous census; the Census Bureau's population estimates program; food stamp administrative records; and IRS individual income tax returns. The dependent variable in the state regression model is formed from data from the March CPS for the reference year. The dependent variable in the county model is created as a weighted average of estimates calculated from 3 years of March CPS data, centered on the reference year, in order to improve the precision of the CPS estimates. The other four sources are used to form predictor variables for the regression models.
After examining a variety of administrative records, the Census Bureau chose food stamp and tax return data as sources of predictor variables. These sources were chosen because they contain data from which variables related to poverty can be constructed, because they are available for all states and counties, and because they are, as far as possible, constructed using the same definitions and procedures nationwide (see National Research Council, 2000c, for details of how these data are obtained). The Census Bureau receives an extract of information on tax returns each fall that were filed in April for the preceding year (the extract omits some returns, such as those filed late). The Census Bureau receives monthly counts of food stamp recipients from the U.S. Department of Agriculture for states. For most counties, the Bureau receives food stamp counts that pertain to July 1 of the reference year; for some counties the counts are an average of the monthly counts for the year. A concern with using food stamp recipient data in the state and county models is that participation rates (recipients as a proportion of people who are eligible to apply) differ across areas. These differences may have become larger due to the effects of the 1996 legislation that changed several social welfare programs (see Chapter 5).
2
More precisely, the Census Bureau's estimates pertain to related children aged 5-17 in poor families, termed “poor school-age children” in this report; see Chapter 1:fn 2.
State Model
As noted above, the state model for the proportion of school-age children who are poor is estimated for the year of interest—the reference year—using CPS data for that year (the year subscript is suppressed below). The state model is
$$y_j = \alpha_0 + \alpha_1 x_{1j} + \alpha_2 x_{2j} + \alpha_3 x_{3j} + \alpha_4 x_{4j} + u_j + e_j, \tag{3.1}$$
where:
$y_j$ = estimated proportion of school-age children in state $j$ who are in poverty based on the March CPS that collects income data pertaining to the reference year,
$x_{1j}$ = proportion of child exemptions reported by families in poverty on tax returns in state $j$,
$x_{2j}$ = proportion of people receiving food stamps in state $j$,
$x_{3j}$ = proportion of people under age 65 not included on an income tax return in state $j$,
$x_{4j}$ = residual for state $j$ from a regression of the proportion of poor school-age children estimated from the prior decennial census on the three predictor variables $(x_{1j}, x_{2j}, x_{3j})$ for the census reporting period,
$u_j$ = model error for state $j$, and
$e_j$ = sampling error of the dependent variable for state $j$.
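The construction of the census residual predictor $x_{4j}$ can be illustrated with a small sketch. It assumes an ordinary least squares fit with an intercept, which the text does not specify; the function name `ols_residuals` and all data values are hypothetical and serve only to show the mechanics.

```python
def ols_residuals(y, X):
    """Residuals from an OLS fit of y on the columns of X plus an intercept.

    Assumption: plain OLS with an intercept; the report does not state how
    the census regression that defines x4 is fitted.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y stored as an augmented matrix.
    aug = [
        [sum(r[a] * r[b] for r in rows) for b in range(k)]
        + [sum(r[a] * yi for r, yi in zip(rows, y))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return [yi - fi for yi, fi in zip(y, fitted)]


# Hypothetical census-year data for six states (illustrative values only).
census_poverty_rate = [0.18, 0.12, 0.22, 0.15, 0.10, 0.25]
predictors = [  # (x1, x2, x3) measured for the census reporting period
    (0.17, 0.10, 0.08),
    (0.11, 0.07, 0.05),
    (0.20, 0.13, 0.09),
    (0.16, 0.08, 0.07),
    (0.09, 0.06, 0.06),
    (0.24, 0.14, 0.12),
]

# x4 for each state: the part of the census poverty rate that the three
# administrative predictors do not explain.
x4 = ols_residuals(census_poverty_rate, predictors)
```

Because the residuals are orthogonal to the census-year predictors by construction, $x_{4j}$ carries only the census information not already captured by the census-year values of $x_{1j}$, $x_{2j}$, and $x_{3j}$.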
The $u_i$ are independent of $e_j$ for all $i$ and $j$. Also, it is assumed that $u_j \sim NI(0, \sigma_u^2)$ and that $e_j \sim NI(0, \sigma_{e_j}^2)$, where $\sim NI(\mu, \sigma^2)$ is read “distributed normally and independently with mean $\mu$ and variance $\sigma^2$.” The $\sigma_{e_j}^2$ are estimated from CPS data using a generalized variance function (GVF) procedure documented in Otto and Bell (1995).
The coefficients for model (3.1) and the model error variance ($\sigma_u^2$) are estimated by maximum likelihood, treating the estimated $\sigma_{e_j}^2$ as known. The SAIPE estimate of the proportion of school-age children living in poverty in a state is a weighted average of the model-based estimate ($\hat{y}_j$) and the CPS-based direct estimate for the state ($y_j$), where the weights are proportional to the estimated precision of the two components. The SAIPE estimate for the proportion of school-age children in poverty in state $j$ is
$$\tilde{y}_j = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_{e_j}^2}\, y_j + \frac{\hat{\sigma}_{e_j}^2}{\hat{\sigma}_u^2 + \hat{\sigma}_{e_j}^2}\, \hat{y}_j, \tag{3.2}$$
where $\hat{y}_j = \hat{\alpha}_0 + \hat{\alpha}_1 x_{1j} + \hat{\alpha}_2 x_{2j} + \hat{\alpha}_3 x_{3j} + \hat{\alpha}_4 x_{4j}$, $\hat{\sigma}_u^2$ is the maximum likelihood estimate of $\sigma_u^2$, $(\hat{\alpha}_0, \hat{\alpha}_1, \hat{\alpha}_2, \hat{\alpha}_3, \hat{\alpha}_4)$ is the maximum likelihood estimate of $(\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4)$, and $\hat{\sigma}_{e_j}^2$ is the estimate of the variance of the CPS estimate $y_j$, based on CPS data. (Both “estimator” and “predictor” are used in the literature to describe $\tilde{y}_j$.)
An initial estimate of the number of poor school-age children for a state is obtained by multiplying the estimated proportion poor ($\tilde{y}_j$) by the estimated total number of noninstitutionalized school-age children in the state, which is obtained from the Census Bureau's program of population estimates.
The initial state-level estimates of the number of poor school-age children are then ratio adjusted to sum to the CPS national estimate of poor school-age children. Thus, the final estimate of the number of poor school-age children in state $j$ is
$$\hat{Y}_j = \tilde{y}_j \hat{N}_j \, \frac{\sum_k Y_k^{CPS}}{\sum_k \tilde{y}_k \hat{N}_k}, \tag{3.3}$$
where $Y_j^{CPS}$ is the CPS estimate of the number of poor school-age children in state $j$, $\hat{N}_j$ is the estimated number of noninstitutionalized school-age children in state $j$ from the Census Bureau population estimates, and the summation is over all states. Historically, the ratio adjustment in (3.3) has changed the estimates by less than 1 percent.
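The two steps just described, the precision-weighted average and the ratio adjustment to the CPS national total, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the variance values, and the control total are hypothetical, and the variance components are taken as already estimated rather than fitted by maximum likelihood.

```python
def composite_proportion(direct, model_pred, model_var, sampling_var):
    """Weighted average of a direct estimate and a regression prediction.

    The weight on the direct estimate is model_var / (model_var + sampling_var),
    so a noisier direct estimate receives less weight.
    """
    weight_direct = model_var / (model_var + sampling_var)
    return weight_direct * direct + (1.0 - weight_direct) * model_pred


def ratio_adjust(initial_counts, control_total):
    """Scale initial counts by one common factor so they sum to control_total."""
    factor = control_total / sum(initial_counts)
    return [factor * c for c in initial_counts]


# Hypothetical inputs for three states (illustrative values only).
direct = [0.20, 0.15, 0.25]              # direct CPS proportions
model_pred = [0.18, 0.16, 0.22]          # regression predictions
sampling_var = [0.0004, 0.0009, 0.0001]  # assumed GVF-based sampling variances
model_var = 0.0001                       # assumed model error variance
children = [1_000_000, 500_000, 2_000_000]  # assumed population estimates
cps_national_total = 730_000             # assumed sum of direct CPS state counts

composite = [
    composite_proportion(d, m, model_var, s)
    for d, m, s in zip(direct, model_pred, sampling_var)
]
initial_counts = [p * n for p, n in zip(composite, children)]
final_counts = ratio_adjust(initial_counts, cps_national_total)
```

In this sketch the second state has the largest sampling variance, so its composite estimate sits closest to the regression prediction, while the third state's estimate falls midway between the two inputs.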
County Model
The state model uses proportion poor as the dependent variable and proportions as explanatory variables. The county model is slightly different in that it uses the logarithm of number poor as the dependent variable and is a model linear in logarithms. The county model is
$$z_{ji} = \beta_0 + \beta_1 w_{1ji} + \beta_2 w_{2ji} + \beta_3 w_{3ji} + \beta_4 w_{4ji} + \beta_5 w_{5ji} + v_{ji} + a_{ji}, \tag{3.4}$$
where:
$z_{ji}$ = log (3-year weighted average of number of poor school-age children in county $i$ of state $j$ based on 3 years of March CPS data),^{3}
$w_{1ji}$ = log (number of child exemptions reported by families in poverty on tax returns in county $i$ of state $j$),
$w_{2ji}$ = log (number of people receiving food stamps in county $i$ of state $j$),
$w_{3ji}$ = log (estimated population under age 18 in county $i$ of state $j$),
$w_{4ji}$ = log (number of child exemptions on tax returns in county $i$ of state $j$),
$w_{5ji}$ = log (number of poor school-age children in county $i$ of state $j$ in the previous census),
$v_{ji}$ = model error for county $i$ of state $j$, and
$a_{ji}$ = sampling error of the dependent variable for county $i$ of state $j$.
It is assumed that $v_{ji} \sim NI(0, \sigma_v^2)$, that $v_{ji}$ is independent of $a_{km}$ for all $ji$ and $km$, and that $a_{ji} \sim NI(0, n_{ji}^{-1}\sigma_a^2)$, where $n_{ji}$ is the CPS sample size for county $i$ of state $j$.^{4} Although the variables carry a state identification, there are no state effects in the model.
The between-county variance component, $\sigma_v^2$, is estimated using data from the 1990 census. A model, analogous to (3.4), is constructed in which the dependent variable is obtained from the 1990 census long form and the predictor variables are for the census reporting year. In this model, the census sampling variance (corresponding to $n_{ji}^{-1}\sigma_a^2$) is estimated using a generalized variance function and is then treated as fixed in fitting the model by maximum likelihood. The maximum likelihood parameter estimates obtained from the census data are estimated census regression coefficients and the estimated model error variance, $\hat{\sigma}_v^2$. The assumption is made that the model error variances in the census regression and the county model regression (3.4) are the same. Documentation of the estimation approach is provided by Fisher (1997); see also National Research Council (2000c:Ch.4).
3
The number of poor school-age children is the product of the weighted 3-year average CPS poverty rate for related children aged 5-17 and the weighted 3-year average CPS number of related children aged 5-17; see National Research Council (2000c:Ch.4) for derivation of the weights.
4
The assumption that the variance of $a_{ji}$ is simply inversely proportional to sample size is only an approximation, given the clustered CPS sample design. A different formulation may be preferable; see the discussion below of improved estimation of variance components.
Data from the CPS and from the census regression are used to estimate $\sigma_a^2$ and the vector $(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5)$ of equation (3.4). The estimate $\hat{\sigma}_v^2$ is treated as fixed in the final estimation. Counties that are in the CPS sample and that have one or more poor school-age sampled children are included in the estimation data set for the county model, and those with no poor school-age sampled children are omitted.
The predictor of the logarithm of the number of poor school-age children in county $i$ of state $j$ is
$$\tilde{z}_{ji} = \frac{\hat{\sigma}_v^2}{\hat{\sigma}_v^2 + n_{ji}^{-1}\hat{\sigma}_a^2}\, z_{ji} + \frac{n_{ji}^{-1}\hat{\sigma}_a^2}{\hat{\sigma}_v^2 + n_{ji}^{-1}\hat{\sigma}_a^2}\, \hat{z}_{ji}, \tag{3.5}$$
where $\hat{z}_{ji} = \hat{\beta}_0 + \hat{\beta}_1 w_{1ji} + \hat{\beta}_2 w_{2ji} + \hat{\beta}_3 w_{3ji} + \hat{\beta}_4 w_{4ji} + \hat{\beta}_5 w_{5ji}$ and $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4, \hat{\beta}_5)$ is the maximum likelihood estimator of the regression vector. For counties with no direct CPS estimate, the predictor is the regression prediction $\hat{z}_{ji}$. An initial predictor of the number of poor school-age children for county $ji$ is obtained by transforming back to the original scale:
$$\tilde{Y}_{ji} = \exp(\tilde{z}_{ji} + c_{ji}), \tag{3.6}$$
where $c_{ji}$ adjusts for the bias introduced by exponentiation, which is a nonlinear transformation. This bias adjustment is derived from the expression for the mean of the lognormal distribution (see Fisher, 1997).^{5}
The final county estimates for a state are ratio adjusted so that the sum of the county estimates in a state is equal to the estimated state total obtained from the state model. Thus, the estimate for county $ji$ is
$$\hat{Y}_{ji} = \tilde{Y}_{ji}\, \frac{\hat{Y}_j}{\sum_i \tilde{Y}_{ji}}, \tag{3.7}$$
where the summation is over the counties in state $j$, and $\hat{Y}_j$ is the state estimate defined in equation (3.3). Unlike the ratio adjustment for the state estimates, these adjustments are often large and highly variable across states. For the final county estimates of poor school-age children in 1993, the average state ratio adjustment (the SAIPE state estimate divided by the sum of the initial county estimates, known as the state raking factor) was 1.07; two-thirds of the factors were between 0.98 and 1.16. For 1995, the average state raking factor was 0.97; two-thirds of the factors were between 0.88 and 1.06. The correlation between raking factors for states in 1993 and 1995 is low, which implies that there was little systematic variation by state across these years.
5
Another possibility would be to use the procedure in Duan (1983).
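The county steps (combining on the log scale, transforming back, and raking to the state total) can be sketched in a few lines. This is an illustration only: the function names and every number are hypothetical, the bias adjustments are supplied as placeholder inputs rather than derived as in Fisher (1997), and the adjustment is assumed to enter additively on the log scale.

```python
import math


def county_log_predictor(z_direct, z_model, model_var, unit_sampling_var, n):
    """Combine the direct and regression values on the log scale.

    Counties with no usable direct estimate (n == 0 or z_direct is None)
    receive the regression prediction unchanged.
    """
    if n == 0 or z_direct is None:
        return z_model
    sampling_var = unit_sampling_var / n
    weight_direct = model_var / (model_var + sampling_var)
    return weight_direct * z_direct + (1.0 - weight_direct) * z_model


def rake_to_state(initial, state_total):
    """Return the state raking factor and the raked county estimates."""
    factor = state_total / sum(initial)
    return factor, [factor * v for v in initial]


# Hypothetical inputs for three counties in one state.
z_direct = [6.0, None, 5.0]   # log of 3-year CPS average; None means no sample
z_model = [6.2, 4.0, 5.1]     # regression predictions on the log scale
n_sample = [50, 0, 10]        # CPS sample sizes
model_var = 0.04              # assumed model error variance
unit_sampling_var = 4.0       # assumed sampling variance for a sample of size 1
bias_adj = [0.02, 0.05, 0.04] # placeholder bias adjustments
state_total = 700.0           # assumed state estimate from the state model

log_pred = [
    county_log_predictor(zd, zm, model_var, unit_sampling_var, n)
    for zd, zm, n in zip(z_direct, z_model, n_sample)
]
initial = [math.exp(z + c) for z, c in zip(log_pred, bias_adj)]
raking_factor, final = rake_to_state(initial, state_total)
```

A single factor is applied to every county in a state, so raking changes the level of the county estimates but not their relative sizes within the state.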
School District Procedure
Because of the lack of administrative data at the school district level for constructing predictor variables, the school district estimates of poor school-age children are produced by a shares approach rather than by regression modeling. This shares approach allocates the updated county estimates among school districts in the same proportions that poor school-age children were distributed across the districts in the 1990 census. Although the general approach is simple, a number of complications arise in its application (see National Research Council, 2000c:Ch.7, for further details).
First, school district boundaries change over time. To address this problem, the Census Bureau conducts a survey every 2 years in which officials in every state are asked to update the boundaries for the districts in their state. Using these boundaries, the 1990 census blocks are allocated to school districts, and the census counts of poor school-age children are summed for the blocks in each district. When school district boundaries cut through blocks, the block counts are proportionately allocated.
Second, some school districts cross county boundaries. These districts are divided into parts by county, and the shares approach is applied to school district parts within each county. The estimate for a school district is then obtained by adding together the estimates for its parts.
Third, some school districts cover only selected grades (e.g., kindergarten through grade 8), with the result that some blocks are in more than one school district. This problem is addressed by allocating the poor children in the appropriate age range to each district.
Fourth, for many districts the census estimates of poor school-age children are subject to substantial levels of sampling error because they are derived from data collected from the census long-form sample. To reduce this sampling error, the estimates for the district parts are ratio adjusted to make the total number of school-age children from the long-form sample conform to the number of school-age children from the complete census.
The estimated number of poor school-age children in school district part $d$ in county $i$ in state $j$ for the reference year is given by
$$\hat{Y}_{jid} = \frac{R_{jid}}{\sum_d R_{jid}}\, \hat{Y}_{ji}, \tag{3.8}$$
where $R_{jid}$ is the ratio-adjusted 1990 census estimate of poor school-age children in that district part, the summation is over the district parts in county $i$ of state $j$ (so that $R_{jid}/\sum_d R_{jid}$ is the district part's 1990 census share of the county's poor school-age children), and $\hat{Y}_{ji}$ is the updated county estimate given by (3.7). The ratio-adjusted estimate $R_{jid}$ is given by $R_{jid} = C_{jid} A'_{jid} A_{jid}^{-1}$, where, in district part $d$ in county $i$ in state $j$, $C_{jid}$ is the estimated number of poor school-age children from the long-form sample, $A'_{jid}$ is the number of school-age children from the complete census, and $A_{jid}$ is the estimated number of school-age children from the long-form sample.
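As a hedged illustration of the shares procedure, the sketch below ratio adjusts the long-form counts for each district part and then allocates a county estimate in proportion to the adjusted counts. The function names, district labels, and all counts are hypothetical, and the normalization of the adjusted counts to shares within the county is an assumption of this sketch.

```python
def ratio_adjusted_poor(long_form_poor, full_count_children, long_form_children):
    """Long-form poor count scaled by (complete-count / long-form) children."""
    return long_form_poor * full_count_children / long_form_children


def district_part_estimates(county_estimate, parts):
    """Allocate a county estimate to district parts by their 1990 census shares.

    parts maps a district-part label to a tuple of
    (long-form poor, complete-count children, long-form children).
    """
    adjusted = {name: ratio_adjusted_poor(*vals) for name, vals in parts.items()}
    total = sum(adjusted.values())
    return {name: county_estimate * r / total for name, r in adjusted.items()}


# Hypothetical county with three district parts (illustrative counts only).
parts = {
    "district_A": (120.0, 1000.0, 950.0),
    "district_B": (60.0, 400.0, 420.0),
    "district_C": (20.0, 300.0, 300.0),
}
county_estimate = 250.0  # assumed updated county estimate

estimates = district_part_estimates(county_estimate, parts)
```

For a district that crosses county lines, the estimates for its parts would be computed county by county in this way and then added together.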
Evaluations
As recommended by the National Research Council panel, the Census Bureau conducted an extensive set of evaluations of the SAIPE estimates of poor school-age children for states and counties. Due to data constraints, more limited evaluations were conducted of the estimates of poor school-age children for school districts. The companion technical documentation volume to this report describes the methods and results of the state, county, and school district evaluations in detail (National Research Council, 2000c:Ch.6, Ch.7). Below we summarize the principal evaluation methods under two headings—internal evaluation and external evaluation—and highlight key results.
Internal Evaluations of State and County Models
For each year for which the state and county models were estimated, an internal evaluation was conducted of the underlying assumptions and features of the models. Internal evaluations were also conducted of alternative forms of the county model. Such evaluations, which principally involved examination of the residuals from the regression before taking
[...]
for school districts. Because school district boundaries change, it is necessary in estimating numbers of school-age children (and total population) for school districts to obtain updated boundaries for the reference year and to retabulate the 1990 census within-county shares according to the new boundaries.
Evaluations
Repeated evaluations of the accuracy of the population estimates, conducted by comparing estimates developed from the previous census to counts from the current census, show several patterns. The proportional differences of the estimates in comparison with the census are larger on average for small areas than for large ones; the proportional differences tend to be larger for areas in which the population is changing rapidly than for areas that are more stable; and the proportional differences for age groups tend to be higher than those for the total population. Furthermore, estimates produced by using components of population change are usually more accurate than those produced by such methods as the raking-ratio adjustment (used for county age estimates) or the shares method (used to produce school district estimates).
Evaluations of 1990 population estimates for counties and school districts show that, for the total population, the average absolute difference between the 1990 population estimates based on updating the 1980 census values and the 1990 census counts was 2.3 percent of the average population for counties and 9.6 percent of the average population for school districts. For all children aged 5-17, the average absolute difference between the 1990 population estimates and the 1990 census counts was 4.9 percent of the average number of school-age children for counties and 12.0 percent of the average number of school-age children for school districts. These differences are much smaller than the average absolute difference for poor children aged 5-17, which was 10.7 percent of the average number of poor school-age children for counties and 22.2 percent of the average number of poor school-age children for school districts (National Research Council, 2000c:Ch.7; see fn. 12 above for the average absolute difference formula).^{21} It will be important to repeat these evaluations using 2000 census data.
21
A difference between the comparisons of population estimates and those of poverty estimates is that the census comparison estimates for poor school-age children are from the long-form sample and, hence, are subject to error from sampling variability. This error results in an overestimate of the difference between the SAIPE poverty estimates and the census poverty numbers that would be obtained from a complete enumeration.
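The measure quoted in these comparisons can be sketched as below. The formula itself appears in a footnote that is not reproduced in this excerpt, so the sketch follows one plausible reading of the wording: the mean absolute difference across areas, expressed as a percentage of the mean census value. The function name and the data are hypothetical.

```python
def avg_abs_diff_pct(estimates, census_values):
    """Mean absolute difference as a percentage of the mean census value.

    Assumption: this matches the wording "average absolute difference ...
    percent of the average population"; the report's own formula is not
    shown in this excerpt.
    """
    n = len(census_values)
    mean_abs_diff = sum(abs(e - c) for e, c in zip(estimates, census_values)) / n
    mean_census = sum(census_values) / n
    return 100.0 * mean_abs_diff / mean_census


# Hypothetical estimates and census counts for three areas.
pct = avg_abs_diff_pct([110.0, 190.0, 300.0], [100.0, 200.0, 300.0])
```

Under this reading, large areas dominate both the numerator and the denominator, which differs from averaging each area's own percentage error.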
An additional evaluation found that use of population estimates instead of census counts had only a modest effect on the accuracy of the estimated numbers of poor school-age children for counties. The analysis compared 1990 census estimates of poor school-age children in 1989 with 1989 estimates from two variants of the SAIPE county model. Each variant predicted the log poverty rate for school-age children; one variant converted estimated poverty rates to estimated numbers of poor school-age children by using 1980 census-based population estimates for school-age children for 1990; the other variant converted rates to numbers by using 1990 census population counts. The average absolute difference between the model-based estimates of poor school-age children and the 1990 census estimates was only slightly higher for the first variant than for the second variant (see National Research Council, 2000c:App.C).
PRIORITIES FOR SAIPE MODEL DEVELOPMENT
Evaluations of the SAIPE estimates indicate that, although the estimates are generally better than the available alternatives for states and counties and at least as good as the available alternatives for school districts, they are subject to appreciable levels of error, particularly for small counties and school districts. Thus, efforts to improve the accuracy of the estimates for such purposes as fund allocations are well warranted. In addition, since there is currently a 3- to 4-year lag between the production of the estimates and the year to which they relate, it is highly desirable to seek ways to improve the timeliness of the estimates. This section describes some research priorities for improving the accuracy and timeliness of the state, county, and school district estimates, which the panel believes could be implemented in the next estimation cycle.
Research and development for the population estimates is heavily dependent on enhancements to administrative records. Possible improvements to these estimates are discussed in Chapter 5, which deals with such enhancements.
Research Priorities for the State and County Models
The focus of this discussion is on research activities that should be undertaken in an attempt to improve the SAIPE state and county estimates in the near term. The following areas for research and development are discussed below: the incorporation of state random effects in the county model; the incorporation of counties with CPS households but with no sampled poor school-age children in the county modeling; the possible use of time-series and multivariate models; and improved estimation of the components of variance in both the state and county models.
However, before turning to those activities, the panel offers a broader perspective on the SAIPE Program. The program produces a variety of different estimates (e.g., numbers in poverty in different age bands) at different levels (states, counties, and school districts). Currently, these estimates are produced somewhat independently of one another, and the state and county models are formulated differently in a number of respects. From a theoretical perspective, a preferred approach would be to use a single integrated hierarchical model that would produce all the estimates at both the state and county levels. This approach would not only ensure consistency for the estimates, but it would also likely improve their precision, in part because the estimates for one age band would be able to “borrow strength” from the data available for another age band through the use of a multivariate model.
A further extension of this approach would be to incorporate data for other time periods in the model. For example, sample data are available from the March CPS every year, and data from prior years can provide valuable information in predicting the values for the current year. The same will also be true for the American Community Survey after 2003, if it is implemented as currently planned.
Although such an overarching model may be attractive from a theoretical perspective, its full implementation is almost certainly impracticable, at least in the near term. Nonetheless, the panel considers that it would be useful for the Census Bureau to keep such a model in mind as it develops its longer term plans for the SAIPE program. Even if the single overall model cannot be achieved, model enhancements that move the estimation procedures closer to the ideal may be possible and should be pursued.
Incorporation of State Random Effects in the County Model
State estimates obtained from the county model by aggregating the county estimates within each state are made to conform to the state estimates from the state model by a ratio adjustment, the state raking factor. As noted above, these raking factors vary considerably across states. Several sources could contribute to this variability, including the different measurement scales used in the state and county models (proportions for the former, logarithms of numbers for the latter), the use of 3-year averages of CPS estimates as the dependent variable in the county model versus single-year estimates in the state model, sampling variability, and, possibly, individual state effects that are not captured in the county model. Preliminary work by the panel suggests that a sizable proportion of the variation in the state raking factors is due to sampling variability. Further investigation should be carried out to better understand the causes of this variation.
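The ratio adjustment described above can be illustrated with a minimal sketch. The function name and the numbers are hypothetical and are not drawn from SAIPE data; the sketch shows only the arithmetic of a single state raking factor.

```python
def rake_to_state(county_estimates, state_estimate):
    """Scale county estimates so that they sum to the state-model estimate.

    Returns the raked county estimates and the state raking factor.
    """
    county_total = sum(county_estimates)
    if county_total <= 0:
        raise ValueError("county estimates must sum to a positive number")
    # The raking factor is the ratio of the state-model estimate to the
    # sum of the county-model estimates for that state.
    factor = state_estimate / county_total
    return [c * factor for c in county_estimates], factor


# Hypothetical state with three counties.
raked, factor = rake_to_state([1200.0, 800.0, 500.0], 2750.0)
print(round(factor, 4), [round(r, 1) for r in raked])
```

A factor far from 1 indicates that the county model and the state model disagree about the state total, which is why the spread of these factors across states is of interest.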
In an effort to determine whether the state raking factors could reflect state effects that are missing from the county model, the Census Bureau examined a county regression model that included fixed state effects. The use of this model did not reduce the spread of the raking factors; rather, it increased it. Also, while the addition of fixed state effects reduced some nonrandom residual patterns in the regression output, a fixed state effects model estimated for 1989 did not perform better than other models in comparison with 1990 census estimates.
An alternative approach for incorporating state effects in the county model is to treat them as random rather than fixed effects. This formulation leads to a nested model in which the model error is the sum of a county-within-state random effect and a state random effect. Fuller and Goyeneche (1998) describe the model and report on a preliminary evaluation of it. Their evaluation suggests the presence of a small state random effect. The Census Bureau should conduct a thorough evaluation of this model to examine all of its properties.
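One way to write such a nested model is sketched below. The notation is an assumption introduced here for illustration and is not taken from Fuller and Goyeneche (1998): i indexes states, j indexes counties within a state, y_{ij} is the dependent variable of the county model, and e_{ij} is the sampling error of the direct estimate.

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + v_i + u_{ij} + e_{ij},
\qquad
v_i \sim (0,\ \sigma_v^2), \quad
u_{ij} \sim (0,\ \sigma_u^2)
```

Here v_i is the state random effect and u_{ij} is the county-within-state random effect, so the model error is v_i + u_{ij}. Setting \sigma_v^2 = 0 recovers a county model with no state effect.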
Including Counties with No Poor Sampled School-Age Children
As described above, the current county model is expressed in terms of logarithmic transformations of the 3-year average numbers of poor school-age children (the dependent variable) and the values of the predictor variables. Although this form of transformation makes the distributions of the variables more symmetric, possibly makes the functional relationship between the dependent variable and the predictor variables more linear, and provides reasonably homogeneous error variances, it has the disadvantage of not accommodating zero input values. Thus, counties with some CPS-sampled households but no CPS school-age children living in poverty in the 3-year average are excluded from the estimation of the regression coefficients in the county model. A large number of CPS counties are excluded from the regression data set for this reason: 304 of 1,488 counties for the 1993 model and 262 of 1,247 counties for the 1995 model.^{22} Although the model estimates the numbers of poor school-age children in these excluded counties relatively well (see National Research Council, 2000c:Ch.6), dropping such a large fraction of counties diminishes the model's face validity and produces estimates with higher variability than if these counties were included.
^{22} In addition, a small number of counties with CPS sampled households (41 for the 1993 model and 27 for the 1995 model) are excluded from the regression data set because the sampled households lacked any school-age children.
One solution to this problem is to shift the starting point of the logarithmic transformation (i.e., using log (z + c), c > 0) to allow inclusion of all counties that have sampled households in the CPS or to use some other form of transformation. A preferable, but less straightforward, solution is to use generalized linear modeling (see McCullagh and Nelder, 1989), an approach that has been developed to provide models for variables with a wide variety of distributional forms. In this particular case, the Poisson distribution is a natural one to consider, since data on counts–for which zero is a natural observation–are typically modeled well using this distribution. Applying the generalized linear modeling framework, all counties included in the CPS can be used to estimate the regression coefficients, and best linear unbiased predictors (BLUPs) can be used to combine the model and direct estimates.
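The two options above can be contrasted in a small sketch. It is illustrative only: the data are made up, the model has a single predictor, and the fit ignores the CPS sample design and any random effects, so it is not the SAIPE county model. The point is simply that both log(z + c) and a Poisson regression with a log link accept a county whose observed count is zero.

```python
import math


def shifted_log(z, c=1.0):
    """Shifted log transform log(z + c); defined at z = 0 when c > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.log(z + c)


def fit_poisson(x, y, iterations=50):
    """Fit log(mu) = a + b * x by Newton-Raphson on the Poisson log-likelihood."""
    # Start from the intercept-only fit, which needs only a positive mean count.
    a, b = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a 2 x 2 system solved directly.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b


# Hypothetical counties: x is a predictor on the log scale, y is the observed
# count of sampled poor school-age children. The first county has a zero count.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0, 1, 3, 6, 12]

print([round(shifted_log(z), 3) for z in y])  # the zero count is retained
a, b = fit_poisson(x, y)
print(round(a, 3), round(b, 3))
```

The shifted log depends on the arbitrary constant c, whereas the Poisson fit needs no such constant, which is one reason the generalized linear modeling route may be preferable.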
While the application of generalized linear modeling is fairly routine in many applications, the complex sample design of the CPS must be taken into account in the estimation of the regression coefficients and in estimating the variances of the model predictions. Recent developments in generalized linear mixed models (e.g., Robinson, 1991; Zeger and Karim, 1991) provide the basis for developing approaches that can reflect the sampling design.
The Census Bureau has recently conducted research on a hierarchical Bayesian modeling approach that makes it possible to include counties in the model that have some sampled CPS households but none with poor school-age children (see Fisher and Asher, 1999b). This work should continue.
Time-Series and Multivariate Modeling
As noted above, a unified overall model that provides all the SAIPE estimates and that incorporates data from other time periods is theoretically attractive, but not practical, at least in the immediate future. However, there are possibilities for using multivariate and time-series approaches in more limited ways. The panel recommends that the Census Bureau continue and expand its research in these areas.
Fay (1987) provides an early example of a multivariate approach, applied to the estimation of median income in four-person families by state. The dependent variables in his trivariate model were the state median incomes of four-person, three-person, and five-person families. In estimating the median income for four-person families, the model borrows strength from the regressions for the other two dependent variables by allowing for a correlation of the model errors in the regressions. This kind of approach could, for instance, be applied in SAIPE in an attempt to improve the estimates of poor children aged 5-17 by incorporating estimates for other age ranges in the state and county models.
Bell (1997a) applied a bivariate model for the county estimates of poor school-age children in which the two dependent variables were the 3-year average of CPS data for the reference year (described above) and the 1990 census estimate. The purpose of this model was to make more complete use of census data, through a correlation of the model errors for the two regressions. The panel evaluated several versions of the bivariate model for 1993 estimates, and the results were promising (National Research Council, 2000c:App.B). These models were not pursued for use at that time, primarily because it was not possible to conduct external evaluations of them. However, they have the potential to improve the county estimates, and further research on their application in SAIPE should be conducted.
The above approach could also be generalized to a time-series structure. Census Bureau staff have begun work on assessing the potential benefits of using multiple years of CPS data in the state model but have not yet completed their analyses.
Multivariate and time-series approaches will become increasingly important as data from new sources–such as data from several years of the American Community Survey–become available. The Census Bureau should pursue work on these types of models, which will need extensive development and evaluation to see if they have advantages and to ensure that they do not introduce unanticipated problems. In the longer term, it may be possible to adapt time-series approaches to develop forecasts of income and poverty in order to make the estimates more timely for program use (see “Improving Timeliness” below for approaches to improve timeliness in the near term).
Improved Estimation of Variance Components
Both the state and county models have two variance components, model error and sampling error. Model error is assumed to be independently and identically distributed across areas (states or counties). Sampling error depends on the CPS sample size and poverty rate in the area, as well as the complex stratified multistage CPS sample design. Estimates of these variance components are needed for three purposes: they are used in the maximum likelihood estimation of the regression coefficients in the models; they are used in computing the standard errors of the state and county estimates; and they are used to determine the weights for forming the weighted averages of the model estimates and direct estimates in equations (3.2) and (3.5). The last purpose is most important for the state estimates since, unlike most counties, all states have CPS samples of sufficient size to produce direct estimates that can usefully contribute to the weighted average.
Different approaches are used to estimate the two variance components in the state and county models. In the state model, sampling error variance is estimated by using a generalized variance function (GVF) that reflects the effects of the CPS sample design, and the model error variance is then obtained through maximum likelihood estimation, essentially subtracting the total sampling error variance from the total variance. In the county model, the model error variance is equated to the model error variance in a corresponding regression model for 1990 census data; that model error variance is estimated in the manner described for the state model error variance, with the census sampling error being estimated with a GVF for the census long-form sample. The total sampling error variance in the county model is then obtained by maximum-likelihood estimation and partitioned among counties in inverse proportion to CPS sample size. Both of these approaches are problematic, and further research is needed for both models.
In the case of the state model, the maximum-likelihood estimation has led to zero estimates of model error variance in 6 of the 7 years for which the state model was estimated, with the consequence that the direct estimates are assigned zero weight in the weighted averages. The untenable result of a zero model error variance likely derives from a misspecification of the GVF for the CPS that results in overestimation of the sampling error variance.
Research is needed to improve the estimation of the sampling error variance for the state model. The use of a Bayesian model to account for the uncertainty in the estimates of the model error variance is another approach that should be pursued. Bell (1999) has explored such a model, which yields positive estimates of model error variance that could be useful for producing the state model estimates. Pending the outcome of these two areas of research, some simple adjustments should be examined and applied as appropriate. For example, minimum weights that are a function of the CPS sample size in each state could be assigned to the direct estimates for each state.
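The role of the variance components in the weighted averages, and the effect of a minimum weight, can be sketched as follows. The sketch assumes the standard form in which the weight on the direct estimate is the model error variance divided by the sum of the model error and sampling error variances. The floor n / (n + k) and the constant k are hypothetical choices made only for illustration; they are not a recommendation of the panel or a Census Bureau procedure.

```python
def direct_weight(model_var, sampling_var, sample_size=None, k=2000.0):
    """Weight given to the direct estimate in the weighted average.

    If sample_size is supplied, the weight is not allowed to fall below
    sample_size / (sample_size + k), a hypothetical sample-size-based floor.
    """
    total_var = model_var + sampling_var
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    weight = model_var / total_var
    if sample_size is not None:
        weight = max(weight, sample_size / (sample_size + k))
    return weight


def combine(direct, model_prediction, weight):
    """Weighted average of the direct estimate and the model prediction."""
    return weight * direct + (1.0 - weight) * model_prediction


# With a zero estimate of model error variance, the direct estimate is ignored.
w_zero = direct_weight(model_var=0.0, sampling_var=4.0)
print(w_zero, combine(10.0, 20.0, w_zero))

# With the hypothetical floor, a state with a large CPS sample keeps some weight.
w_floor = direct_weight(model_var=0.0, sampling_var=4.0, sample_size=2000)
print(w_floor, combine(10.0, 20.0, w_floor))
```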
For the estimation of the variance components in the county model, reliance on the assumption that the model error variance for the CPS equation is the same as that for the 1990 census equation is questionable. An alternative approach is that used with the state model, that is, estimating the sampling error variance from a GVF and obtaining the model error variance by maximum likelihood estimation. The Census Bureau has examined an empirically based GVF in which sampling error variance of the county direct estimates is inversely proportional to the square root of CPS sample size. This approach improves upon the current method (see Fisher and Asher, 1999a), but more research is needed. An alternative approach that should also be explored is to estimate a within-county design effect based on counties with reasonable numbers of CPS sample segments. This design effect could then be used to develop a GVF from which sampling errors could be estimated for all counties with some CPS sample.
A complication that arises in modeling GVFs for the direct county estimates is that the sampling errors of these estimates are affected not only by the clustered CPS sample within counties, but also by the poverty rates in those counties, rates that can be estimated only imprecisely. Future research should consider alternative methods of estimating county poverty rates for use in the GVFs, including smoothing the estimates in some manner.
Reducing the Variability in the 1990 Census School District Estimates
Essentially, the school district model distributes the updated county estimate of the number of poor school-age children between the school districts (or parts of school districts) in the county in proportion to the estimated shares that the districts (or parts) had of the county's poor school-age children at the last census (see “School District Procedure” above). The census numbers of poor school-age children in the school districts are estimated from the census long form. Since these estimated numbers are based on small long-form sample sizes for many school districts, they are subject to substantial sampling error (see National Research Council, 2000c:Ch.7).
To improve the precision of census long-form estimates, the Census Bureau builds in adjustments as part of regular census data processing to make long-form totals conform to short-form totals for key short-form items for weighting areas (subcounty areas or sometimes entire counties that have a specified minimum number of sample persons). For the purpose of estimating school district shares, the Census Bureau extended this approach by forcing the long-form estimate of the number of school-age children in each school district to conform to the short-form number of such children. In essence, the procedure estimated the proportion poor of school-age children in a district from the long form and then applied that proportion to the short-form number of school-age children in the district.
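The conformance step described above amounts to a simple ratio calculation, sketched here with hypothetical names and numbers.

```python
def conformed_poor_count(longform_poor, longform_children, shortform_children):
    """Apply the long-form proportion poor to the short-form child count."""
    if longform_children <= 0:
        raise ValueError("long-form estimate of school-age children must be positive")
    proportion_poor = longform_poor / longform_children
    return proportion_poor * shortform_children


# Hypothetical district: the long form estimates 30 poor among 150 school-age
# children, and the short form counts 1,000 school-age children.
print(round(conformed_poor_count(30.0, 150.0, 1000.0), 1))
```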
This adjustment improved the precision of the school district census estimates of poor school-age children by a small, but important, amount. Further improvements might be obtained by extending the adjustment to forcing long- and short-form totals to agree on characteristics that are related to poverty, such as race, ethnicity, home tenure (owner, renter), family type, and type of residential area (central city, urban, rural), at the school district level. Although only a modest improvement in the school district census estimates may be achieved with these further adjustments, any improvement would be helpful.
Another approach for improving the census school district estimates is to use a smoothing procedure to reduce the sampling errors in the long-form estimates of the proportions poor of school-age children. These smoothed proportions would then be multiplied by the short-form numbers of school-age children to produce the census estimates of numbers of poor school-age children. Thus, for example, a school district's proportion poor could be estimated by a weighted average of its estimated proportion poor from the long form and the overall proportion poor for the county in which it is located, with the weight given to the long-form estimate depending on the school district's long-form sample size. This procedure, which reduces sampling error at the cost of potentially introducing some bias, is likely to be effective for school districts (or parts of districts) that have small long-form samples.
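A minimal sketch of such a smoothing procedure follows. The weight n / (n + k) and the constant k are assumptions chosen for illustration; the text above specifies only that the weight should depend on the district's long-form sample size.

```python
def smoothed_proportion(district_prop, county_prop, sample_size, k=50.0):
    """Weighted average of the district and county proportions poor.

    The weight on the district's own long-form estimate grows with its
    long-form sample size; k is a hypothetical tuning constant.
    """
    weight = sample_size / (sample_size + k)
    return weight * district_prop + (1.0 - weight) * county_prop


# A district with a small long-form sample is pulled toward the county
# proportion; a district with a large sample is left nearly unchanged.
small = smoothed_proportion(0.40, 0.20, sample_size=10)
large = smoothed_proportion(0.40, 0.20, sample_size=5000)
print(round(small, 3), round(large, 3))

# The smoothed proportion is then applied to the short-form child count.
print(round(small * 800, 1))
```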
Improving Timeliness
The Census Bureau currently produces income and poverty estimates from the SAIPE Program with a lag of about 3 years. For example, the school district estimates of school-age children in 1996 who were in poverty in 1995 were released in early 1999 for use in Title I allocations for the 1999-2000 and 2000-2001 school years. Although these estimates are considerably more current than estimates based on the 1990 census, they are still out of date by 3 or 4 years when they are used. Since there can be substantial changes in income and poverty in short time periods (see National Research Council, 2000c:Ch.3), it is important to explore methods for reducing this time lag.
One reason for the time lag for SAIPE poverty estimates is the length of time it takes to obtain population estimates for use in the state and county models. The population estimates are not available until more than 2 years after the income reference year.^{23} A different approach would be to use the population estimates for July of the income reference year rather than the population estimates for July of the following year. This approach would have the advantage of reducing the time lag of the poverty estimates. Alternatively, population estimates could perhaps be developed for January of the year following the income reference year, which would be more timely than the estimates for July of the following year and yet would reflect the CPS concept of measuring poverty for the previous calendar year.
^{23} Preliminary estimates are available a year earlier (e.g., spring 1999 for July 1998 estimates), but evaluation has shown that they may differ from the second round of estimates by as much as 3 percent for state estimates and more than 5 percent for county estimates.
Another source of delay for the SAIPE poverty estimates is the lag in obtaining the food stamp data used in the county model. Monthly food stamp counts for states are available with little delay from the U.S. Department of Agriculture, so the state model uses a 12-month average of food stamp data, centered on January 1 following the income reference year, as a predictor variable. The delay results from the construction of the food stamp predictor for the county model. That predictor makes use of county-level food stamp counts for July of the income reference year (for some counties, the data are the average of the monthly counts for the year), which take much longer to obtain than the state totals. In some instances, the counts must be collected from individual states, and the complete data set is not usually available until 2 years after its reference date. The food stamp predictor in the county model is then formed by raking the county counts to the slightly more current state numbers used in the state model.
In the interest of timeliness, a study should be carried out to investigate the effects of basing the food stamp predictor in the county model on counts from an earlier period, such as data for July of the year prior to the income reference year. Even though the county-level data for July are raked to the state food stamp numbers for the reference year, the use of earlier data for counties may affect the performance of the food stamp predictor variable in the county model. The recommended study should evaluate the extent of any such effects.
Yet another issue that should be examined is the year of the state estimates to which the county estimates are raked. The current practice is to rake the county estimates to state estimates for the middle year of the 3 years of CPS data that are used for the dependent variable in the county model. An alternative approach would be to rake the county estimates to state estimates for the most recent of the 3 years. In effect, such raking would update the county, and hence the school district, estimates by 1 year under a modeling assumption about the uniformity of the distribution of the temporal changes in poverty across counties within states. This assumption only has to be approximately correct for this procedure to provide a benefit. Another possible approach–that could be combined with raking the county estimates to the state estimates for the latest year–would be to construct the dependent variable in the county model as a weighted average of the 3-year CPS estimates that gives more weight to the most recent year.
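The weighted-average dependent variable mentioned above can be sketched as follows. The weights shown are hypothetical; the choice of weights would itself be a matter for the recommended research.

```python
def weighted_three_year(estimates, weights=(0.2, 0.3, 0.5)):
    """Weighted average of three annual CPS estimates, oldest year first."""
    if len(estimates) != 3 or len(weights) != 3:
        raise ValueError("expected three estimates and three weights")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical county with rising poverty counts over three years.
annual = [100.0, 110.0, 120.0]
equal = weighted_three_year(annual, (1 / 3, 1 / 3, 1 / 3))
recent = weighted_three_year(annual)
print(round(equal, 1), round(recent, 1))
```

When poverty is changing, weights that favor the most recent year move the dependent variable toward the latest year, at the cost of giving more influence to that year's sampling error.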
CONCLUSION
The panel commends the Census Bureau for investigating several of the research topics the panel identified for the current SAIPE state and county models. Work on technical aspects of the models and on the timeliness of the estimates is important in the near term. Also important is work on the role that new data sources could play in improving the state and county income and poverty estimates and the estimates of poor school-age children for school districts. We discuss data sources in the next two chapters.