The development of model-based estimates for small areas is a major, continuing research and development effort for which extensive evaluation is required. For updated estimates of poor school-age children for counties, a thorough assessment of all aspects of the estimation procedure is necessary to have confidence in the estimates–whether the estimates are used by the Department of Education to allocate Title I funds to counties (as was the practice before the 1999-2000 school year) or whether they are used to develop estimates for school districts.
The Census Bureau's county estimates of poor school-age children are produced by using a county regression model and a state regression model (see Chapter 4).^{1} A comprehensive evaluation of these two components of the estimation procedure should include both “internal” and “external” evaluations.
The first test of a regression model is that it perform well when evaluated internally, that is, for the set of observations for which it is estimated. Such an internal evaluation is primarily an investigation of the validity of the model's underlying assumptions and features, which for a regression model is typically based on an examination of the residuals from the regression–the differences between the predicted and reported values of the dependent variable for each observation.
^{1} Population estimates of school-age children are provided to accompany the estimates of poor school-age children to permit calculating poverty rates–see Chapter 8 for a description of the methods used for postcensal population estimates and for evaluation results.
Small-Area Estimates of School-Age Children in Poverty: Evaluation of Current Methodology
6
Evaluations of County Estimates
In an external evaluation, the estimates from a model are compared with target or “true” values that were not used to develop the model. Ideally, an internal evaluation of regression model output should precede external evaluation. Changes made to the model to address concerns raised by the internal evaluation would likely improve its performance in the external evaluation.
Since there are no absolute criteria for what are acceptable evaluation results, one method for determining if the performance of a model can be improved is to examine alternative models. Such comparisons may indicate changes that would be helpful for a model; they may also suggest that an alternative model is preferable. Both internal and external evaluations should be carried out for alternative models.
OVERVIEW OF EVALUATIONS
1993 Estimates
When the original 1993 county estimates of poor school-age children were provided to the panel, the Census Bureau had not had time to complete a full evaluation of them. Subsequently, the panel developed a set of evaluation criteria, and the panel and the Census Bureau conducted a series of internal and external evaluations. The focus of the evaluation effort was on alternative county models, particularly the assumptions underlying the regression equations and how the estimates of poor school-age children in 1989 from each model compared with 1990 census estimates. The state model was examined as well, both directly and as it contributed to the county estimates of poor school-age children. The evaluations included:
(1) internal evaluation of the regression output for alternative county models estimated for 1993 and 1989;
(2) comparison of estimates of poor school-age children for 1989 from alternative county models with 1990 census estimates, a form of external evaluation;
(3) examination of the original 1993 county estimates to identify possibly anomalous estimates that were then reviewed with knowledgeable local people, another form of external evaluation; and
(4) evaluation of the state model, including examination of regression output and external evaluation in comparison with 1990 census estimates.
The internal evaluation of regression output and the comparison of model-based estimates of poor school-age children for 1989 with 1990 census estimates–evaluations (1) and (2) above–were carried out for the four single-equation county models that were considered serious candidates to produce revised 1993 county estimates of poor school-age children (see Chapter 5 and Appendices B and C):
(a) log number model (under 21), the original model that the Census Bureau used to produce the original 1993 county estimates of poor school-age children;
(b) log number model (under 18), the revised model that the Census Bureau used to produce the revised 1993 county estimates of poor school-age children;
(c) log rate model (under 21); and
(d) log rate model (under 18).
In addition, the 1990 census comparisons (2) were performed for some other estimation procedures that relied much more heavily than did the four candidate models on estimates from the 1980 census (see below, “Comparisons with 1990 Census Estimates”). Since the Department of Education used estimates of poor school-age children from the previous census for allocations of Title I funds prior to the 1997-1998 school year, these estimation procedures were included in the evaluation in order to see how well the regression models compared with some simple procedures for updating the census estimates.
The internal evaluation of regression output (1) and the comparison of estimates of poor school-age children for 1989 with 1990 census estimates (2) examined residuals and model differences from the census, respectively, for categories of counties. The following characteristics were used for categorizing counties: census geographic division; metropolitan status of county; population size in 1990; population growth from 1980 to 1990; percentage of poor school-age children in 1980; percentage of Hispanic population in 1990; percentage of black population in 1990; persistent poverty from 1960 to 1990 for rural counties; economic type for rural counties; percentage of group quarters residents in 1990; number of households in the CPS sample in 1988-1991 (or whether the county had sampled households); and (for 1990 census comparisons only) percentage change in the poverty rate for poor school-age children from 1980 to 1990 (see details in Table 6-4, below).
1995 Estimates
Because the 1995 county estimates were developed by using a procedure similar to that used to develop the revised 1993 county estimates, the focus of the evaluation effort for the 1995 estimates shifted to how the state and county models behaved over several time periods, and specifically, to determining whether there were persistent biases or other problems. The evaluations of the 1995 county estimates included:
internal evaluation of the regression output for the 1995 county model estimated for 1995, 1993, and 1989 (using uncorrected and corrected tax return data);
comparison of estimates of poor school-age children that were developed from the 1995 form of the county model for 1995, 1993, and 1989 with CPS estimates for groups of counties, a form of external evaluation; and
evaluation of the state model, including examination of regression output for 1996, 1995, 1993, 1992, 1991, 1990, and 1989 and consideration of the state raking factors by which county model estimates are adjusted to make them consistent with the state model estimates.
COUNTY MODEL INTERNAL EVALUATION
1993 Evaluations
The panel and the Census Bureau examined the underlying assumptions and other features of the four models, (a)-(d), that were considered candidates for producing revised 1993 county estimates of poor school-age children, through evaluation of the regression model output for 1989 and 1993.^{2} Although such an evaluation is not likely to provide conclusive evidence with which to rank the performance of alternative models, particularly when they use different transformations of the dependent variable, examination of the regression output is helpful to determine which models perform reasonably well.
The assumptions and features investigated for the four models fall into two groups: those concerning the functional form of the regression model and those concerning the error distribution. Because properties of the error distribution affect the ability to fit a model, studies of these two types of assumptions are not entirely separable.^{3}
The assumptions and features examined in the first group are linearity of the relationship between the dependent variable and the predictor variables; constancy of the assumed linear relationship over different time periods; and whether any of the included predictor variables are not needed in the model and, conversely, whether other potential predictor variables are needed in the model. The assumptions examined in the second group are normality (primarily symmetry and moderate tail length) of the distribution of the standardized residuals;^{4} whether the standardized residuals have homogeneous variances, that is, whether the variability of the standardized residuals is constant across counties and does not depend on the values of the predictor variables; and absence of outliers. Each assumption is discussed in terms of the methods used for evaluation and the results of the evaluation for the four candidate models.
^{2} The evaluation of the county regression output pertains to the regression models themselves, that is, before the predictions are combined with the direct CPS estimates in a “shrinkage” procedure or raked to the estimates from the state model (see Chapter 4). For these models, the regression output comprises the model predictions for counties with at least one household with poor school-age children in the CPS sample. For the two log number models, the predictions are the log number of poor school-age children; for the two log rate models, the predictions are the log proportion of poor school-age children.
^{3} These assumptions were also examined for the analogous 1990 census regressions. However, since the census equations only affected the weights for the weighted least squares regression and the extent of “shrinkage” in combining model estimates and direct estimates for counties with households in the CPS sample, analyses of the 1990 census regressions are not discussed here.
Linearity of the relationships between the dependent variable and the predictor variables was assessed graphically, by observing whether there was evidence of curvature in the plots of standardized residuals against the predictor variables in the model. In addition, plots of standardized residuals against CPS sample size and against the predicted values from the regression model were also examined for curvature.
The only evidence of nonlinearity is for the log number (under 21) model (a) for 1989. For that year, the standardized residuals appear to have a very modest curvature when plotted against the predicted values.
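The graphical check for curvature described above can be complemented by a simple numerical contrast. The sketch below is illustrative only and is not the procedure the panel or the Census Bureau used: the function name `curvature_contrast`, the three-way split, and the synthetic data are all assumptions made for the example. It compares the mean standardized residual in the middle third of a predictor with the average of the means in the outer thirds.

```python
import random
from statistics import mean


def curvature_contrast(x, std_resid):
    """Mean residual in the middle third of x minus the average of the
    mean residuals in the lower and upper thirds.

    Values near zero are consistent with a linear relationship; values
    far from zero suggest curvature. (Hypothetical helper.)
    """
    pairs = sorted(zip(x, std_resid))
    n = len(pairs)
    lower = [r for _, r in pairs[: n // 3]]
    middle = [r for _, r in pairs[n // 3 : 2 * n // 3]]
    upper = [r for _, r in pairs[2 * n // 3 :]]
    return mean(middle) - (mean(lower) + mean(upper)) / 2


# Synthetic data (assumption): predicted values on a log scale.
random.seed(0)
fitted = [random.uniform(5.0, 12.0) for _ in range(3000)]
flat = [random.gauss(0.0, 1.0) for _ in fitted]  # no curvature
center = mean(fitted)
curved = [r + 0.2 * (f - center) ** 2 for r, f in zip(flat, fitted)]

print(round(curvature_contrast(fitted, flat), 2))
print(round(curvature_contrast(fitted, curved), 2))
```

With the synthetic inputs, the contrast for `flat` stays close to zero while the contrast for `curved` is strongly negative, which is the numerical counterpart of a U-shaped residual plot.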
Constancy over Time of the assumed linear relationship of the dependent and predictor variables was assessed through comparison of the regression coefficients on the predictor variables for 1989 and 1993. While major changes in economic conditions are expected to cause some changes in the coefficients, a relatively stable regression equation would be desirable.
Table 6-1 shows the regression coefficients for the predictor variables for the four candidate models for 1989 and 1993. In the log number models (a, b) for 1989 and 1993, the coefficients for the three “poverty level” predictor variables— child exemptions reported by families in poverty on tax returns (column 1), food stamp recipients (column 2), and poor school-age children from the previous census (column 5)—are similar. There are substantial differences across the two time periods in the estimated coefficients for the other two variables—population (under age 21 or under age 18, column 3) and total number of child exemptions on tax returns (column 4). However, the sum of these two coefficients is generally close to 0 in each model in each year. Because these two variables are highly positively correlated, the predictions from equations with a similar sum for the two coefficients will be similar.
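The near-cancellation of the population and total-child-exemption coefficients can be checked directly from the published values. The short sketch below copies the column 3 and column 4 coefficients for the log number models from Table 6-1; the dictionary layout and variable names are assumptions made for the example.

```python
# Coefficients on population (column 3) and total child exemptions
# (column 4) for the log number models, copied from Table 6-1.
pair_coefficients = {
    ("a: under 21", 1989): (0.76, -0.81),
    ("a: under 21", 1993): (0.03, 0.03),
    ("b: under 18", 1989): (1.79, -1.80),
    ("b: under 18", 1993): (0.65, -0.59),
}

# Sum of the two coefficients for each model and year.
pair_sums = {
    key: round(b3 + b4, 2) for key, (b3, b4) in pair_coefficients.items()
}

for key, total in pair_sums.items():
    print(key, total)
```

If the two logged variables differed only by a constant, their joint contribution to the prediction would be the coefficient sum times one of them plus a constant, which is why equations with similar sums give similar predictions even when the individual coefficients differ widely.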
^{4} The standardization of the residuals involved estimating the predicted standard errors of the residuals, given the predictor variables, and dividing the observed residuals by the predicted standard errors. The predicted standard error of the residual for a county is a function of the estimated model error variance and the estimated sampling error variance (see Belsley, Kuh, and Welsch, 1980).
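As a rough illustration of the standardization described in this footnote, the sketch below divides a residual by a predicted standard error. It assumes the simplest possible form, in which the predicted standard error is the square root of the sum of the model error variance and the county's sampling error variance; the actual function used by the Census Bureau is not given here and may include further terms. The function name and the numbers are hypothetical.

```python
from math import sqrt


def standardized_residual(reported, predicted, model_var, sampling_var):
    """Residual divided by its predicted standard error.

    Assumption: the predicted standard error is
    sqrt(model_var + sampling_var); the source states only that it is
    a function of these two variances.
    """
    predicted_se = sqrt(model_var + sampling_var)
    return (reported - predicted) / predicted_se


# Hypothetical county: log number reported 7.0, model prediction 6.5.
print(standardized_residual(7.0, 6.5, model_var=0.09, sampling_var=0.16))
```

Under this sign convention, a negative standardized residual means the model prediction exceeds the reported value.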
TABLE 6-1 Estimates of Regression Coefficients for Four Candidate County Models for 1989 and 1993

Log number models, Predictor Variables^{a}

| Model and Year | Counties (Number) | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|---|
| (a) Log Number (under 21) | | | | | | |
| 1989 | 1,028 | 0.52 (.07) | 0.30 (.05) | 0.76 (.22) | −0.81 (.22) | 0.27 (.07) |
| 1993 | 1,184 | 0.31 (.08) | 0.30 (.07) | 0.03 (.21) | 0.03 (.21) | 0.40 (.09) |
| (b) Log Number (under 18) | | | | | | |
| 1989 | 1,028 | 0.50 (.06) | 0.23 (.05) | 1.79 (.27) | −1.80 (.27) | 0.32 (.07) |
| 1993 | 1,184 | 0.38 (.08) | 0.27 (.07) | 0.65 (.24) | −0.59 (.24) | 0.34 (.09) |

Log rate models, Predictor Variables^{b}

| Model and Year | Counties (Number) | (1) | (2) | (3) | (4) |
|---|---|---|---|---|---|
| (c) Log Rate (under 21) | | | | | |
| 1989 | 1,028 | 0.32 (.07) | 0.29 (.04) | −0.73 (.19) | 0.40 (.07) |
| 1993 | 1,184 | 0.23 (.08) | 0.31 (.06) | −0.07 (.18) | 0.41 (.09) |
| (d) Log Rate (under 18) | | | | | |
| 1989 | 1,028 | 0.29 (.07) | 0.26 (.04) | −1.13 (.24) | 0.43 (.07) |
| 1993 | 1,184 | 0.26 (.08) | 0.30 (.06) | −0.42 (.20) | 0.38 (.09) |

NOTES: All predictor variables are on the logarithmic scale for numbers and rates. Standard errors of the estimated regression coefficients are in parentheses. The four models were estimated for each year with maximum likelihood. The original 1994 population estimates were used for the 1993 models; 1990 census population estimates were used for the 1989 models.
^{a} Predictor variables: (1) number of child exemptions reported by families in poverty on tax returns; (2) number of people receiving food stamps; (3) population (under age 21 or under age 18); (4) total number of child exemptions on tax returns; (5) number of poor school-age children from previous (1980 or 1990) census.
^{b} Predictor variables: (1) ratio of child exemptions reported by families in poverty on tax returns to total child exemptions; (2) ratio of people receiving food stamps to total population; (3) ratio of total child exemptions on tax returns to population (under age 21 or under age 18); (4) ratio of poor school-age children to total school-age children from previous (1980 or 1990) census.
The sum of all coefficients in each equation for models (a) and (b) ranges from 1.04 to 1.07 and is significantly greater than 1. A sum equal to 1 would mean that county population size itself has no effect on the estimated number of poor school-age children and that the model is expressible as a model with the poverty rate as the dependent variable and rates as predictor variables. Because the sum is greater than 1, the estimated number of poor school-age children is a larger percentage of the population in the larger counties. While this result is difficult to explain as a function of county size, it may be that size reflects the effects of variables not included in the models.
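The link between the coefficient sum and a rate model can be made explicit with a short derivation. The notation is introduced here for illustration and is not taken from the report: y is the number of poor school-age children, x_1 through x_5 are the five predictors with x_3 = P the population, and the β_j are the regression coefficients. As a simplification, every ratio below uses P as its denominator, whereas the Census Bureau's log rate models use a different denominator for each ratio.

```latex
\begin{align*}
\log y &= \beta_0 + \sum_{j=1}^{5} \beta_j \log x_j ,
  \qquad x_3 = P, \\
\log \frac{y}{P} &= \beta_0
  + \sum_{j \neq 3} \beta_j \log \frac{x_j}{P}
  + \Bigl( \sum_{j=1}^{5} \beta_j - 1 \Bigr) \log P .
\end{align*}
```

The second line follows by subtracting log P from both sides and regrouping. When the coefficients sum to 1, the last term vanishes and the model involves only rates. When the sum exceeds 1, the last term is positive and grows with log P, so the predicted rate is higher in larger counties with the same predictor ratios.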
In the log rate models (c, d), the coefficients for the three “poverty rate” predictor variables–ratio of child exemptions reported by families in poverty on tax returns to total child exemptions (column 1), ratio of food stamp recipients to the total population (column 2), and ratio of poor school-age children to total school-age children from the previous census (column 4)–are all positive and about the same size.^{5} The coefficients for the ratio of total child tax exemptions to the population (under age 21 or under age 18, column 3) are negative, as is also generally the case for the coefficients of the related variable (total number of child tax exemptions) in the log number equations. There are substantial differences in the estimated coefficients for the ratio of total child tax exemptions to the population in the log rate models across time periods and some differences between the coefficients in the two models.
Inclusion or Exclusion of Predictor Variables. The possibility that one or more predictor variables should be excluded from a model was assessed by looking for insignificant t-statistics for the estimated values of individual regression coefficients.^{6} The need to include a predictor variable, or possibly to model some categories of counties separately, was assessed by looking for nonrandom patterns, indicative of possible model bias, in the distributions of standardized residuals displayed for the various categories of counties.^{7}
The only predictor variables with nonsignificant t-statistics are the population under age 21 (column 3 in Table 6-1) and total child exemptions on IRS income tax returns (column 4) for the log number (under 21) model (a) in 1993, and the ratio of child tax exemptions to the population under age 21 (column 3) for the log rate (under 21) model (c) in 1993. All other regression coefficients are significantly different from 0 at the 5 percent level. Application of Akaike's information criterion (AIC) confirmed the superiority of using the population under age 18 as a predictor variable in preference to the population under age 21 in the log number model. (The test was not performed for the log rate model.)
^{5} The coefficients are also similar to the coefficients for the corresponding variables–number of child exemptions reported by families in poverty on tax returns, number of food stamp recipients, and number of poor school-age children from the previous census–in the log number equations.
^{6} Although the performance of a predictive regression model is best assessed in terms of the joint impact of the predictor variables, examining the individual predictor variables can suggest ways in which a model might be improved.
^{7} The distributional displays examined for this and other model assumptions were box plots.
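The significance statements above can be reproduced approximately from Table 6-1 by dividing each coefficient by its standard error. The sketch below copies the published coefficients and standard errors; the data layout, the function name, and the use of 1.96 as the two-sided 5 percent critical value are assumptions made for the example, and the published values are rounded.

```python
# (coefficient, standard error) pairs copied from Table 6-1, keyed by
# (model, year); list position corresponds to predictor columns 1, 2, ...
table_6_1 = {
    ("a", 1989): [(0.52, .07), (0.30, .05), (0.76, .22), (-0.81, .22), (0.27, .07)],
    ("a", 1993): [(0.31, .08), (0.30, .07), (0.03, .21), (0.03, .21), (0.40, .09)],
    ("b", 1989): [(0.50, .06), (0.23, .05), (1.79, .27), (-1.80, .27), (0.32, .07)],
    ("b", 1993): [(0.38, .08), (0.27, .07), (0.65, .24), (-0.59, .24), (0.34, .09)],
    ("c", 1989): [(0.32, .07), (0.29, .04), (-0.73, .19), (0.40, .07)],
    ("c", 1993): [(0.23, .08), (0.31, .06), (-0.07, .18), (0.41, .09)],
    ("d", 1989): [(0.29, .07), (0.26, .04), (-1.13, .24), (0.43, .07)],
    ("d", 1993): [(0.26, .08), (0.30, .06), (-0.42, .20), (0.38, .09)],
}

CRITICAL = 1.96  # assumption: normal approximation, two-sided 5 percent


def nonsignificant(table, critical=CRITICAL):
    """Return (model, year, column) for every |coef / se| below critical."""
    flagged = []
    for (model, year), terms in table.items():
        for column, (coef, se) in enumerate(terms, start=1):
            if abs(coef / se) < critical:
                flagged.append((model, year, column))
    return flagged


print(nonsignificant(table_6_1))
```

Under these assumptions the function flags exactly the three coefficients named in the text: columns 3 and 4 of model (a) in 1993 and column 3 of model (c) in 1993.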
For most ways of categorizing counties, the standardized residuals do not exhibit systematic patterns. The exceptions are that all four models in 1989 tend to overpredict poor school-age children in counties with a high percentage of Hispanic residents (i.e., the standardized residuals tend to be negative for these counties) and that the log number (under 21 and under 18) models (a, b) in 1993 and 1989 tend to overpredict poor school-age children in counties that are in metropolitan areas but are not the central county in the area.
Normality of the standardized residuals was evaluated through use of Q-Q plots, which match the observed distribution of the residuals with the theoretical distribution, and other displays of the distribution. All four models exhibit some skewness in their standardized residuals, with the log rate models (c, d) showing somewhat more skewness than the log number models (a, b). For none of the models does the skewness appear sufficiently marked to be a problem.
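For readers unfamiliar with Q-Q plots, the sketch below shows how the plotted points are formed and adds a moment-based skewness statistic as a numerical companion. It is a generic illustration, not the panel's code: the plotting positions (i − 0.5)/n, the function names, and the synthetic data are assumptions.

```python
import random
from statistics import NormalDist, mean, pstdev


def qq_points(std_resid):
    """Pair each sorted residual with the standard normal quantile at
    plotting position (i - 0.5) / n (one common convention)."""
    ordered = sorted(std_resid)
    n = len(ordered)
    normal = NormalDist()
    return [
        (normal.inv_cdf((i - 0.5) / n), r)
        for i, r in enumerate(ordered, start=1)
    ]


def skewness(values):
    """Moment-based sample skewness; zero for a symmetric sample."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


# Synthetic standardized residuals (assumption), roughly normal.
random.seed(2)
resid = [random.gauss(0.0, 1.0) for _ in range(1000)]

points = qq_points(resid)
print(points[0], points[-1])  # extreme quantile pairs
print(round(skewness(resid), 2))
```

Points that fall close to the 45-degree line indicate agreement with the normal distribution; systematic departures in one tail correspond to nonzero skewness.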
Homogeneous Variances. The homogeneity of the variance of the standardized residuals was assessed using a variety of statistics and graphical displays (see Appendix B). Examination of them clearly demonstrates some variability in the size of the absolute standardized residuals as a function of the predicted value (number or proportion of poor school-age children) and the CPS sample size for all four models. With regard to CPS sample size, one would expect the standardized residual variance to remain constant over the distribution of CPS sample size; however, it increases with increasing CPS sample size.
The heterogeneity of the variance of the residuals suggests that there may be a problem with the model specification or in the assumptions that were used to calculate the standardized residuals. However, adjusting a model to remove this type of heterogeneity is likely to have only a small effect on the estimated regression coefficients or the model estimates. The effect on estimates of poor school-age children would stem from two factors: a shift in the weights assigned to each county in fitting the regression model, which would very likely result in only a modest change in the estimated regression coefficients; and a change in the weight given to the direct estimates, which could have an appreciable effect on the estimates only for the few counties with large CPS sample sizes.
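One simple way to look for the kind of heterogeneity discussed above is to compare the variance of the standardized residuals across bands of CPS sample size. The sketch below is an assumption-laden illustration: the band cutoffs, the function name, and the synthetic data (constructed so that the spread grows with sample size) are hypothetical and are not the statistics reported in Appendix B.

```python
import random
from statistics import pvariance


def variance_by_group(std_resid, sample_size, cutoffs):
    """Variance of standardized residuals within sample-size bands.

    cutoffs must be increasing; band k holds counties whose sample size
    is at least cutoffs[k - 1] and below cutoffs[k]. Empty bands yield None.
    """
    groups = [[] for _ in range(len(cutoffs) + 1)]
    for r, n in zip(std_resid, sample_size):
        band = sum(n >= c for c in cutoffs)
        groups[band].append(r)
    return [pvariance(g) if g else None for g in groups]


# Synthetic data (assumption): residual spread increases with sample size.
random.seed(1)
sizes = [random.choice([5, 20, 80]) for _ in range(3000)]
resid = [random.gauss(0.0, 1.0 + 0.01 * n) for n in sizes]

demo = variance_by_group(resid, sizes, cutoffs=[10, 50])
print([round(v, 2) for v in demo])
```

If the variance estimates behind the standardization were correct, the band variances would all be close to 1; a steady increase across bands mirrors the pattern described in the text.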
Outliers. The existence of outliers was evaluated through examination of plots of the distributions of the standardized residuals and plots of standardized residuals against the predictor variables and through analysis of patterns in the distribution of the 30 largest absolute standardized residuals for the various categories of counties. However, it is difficult to evaluate the evidence for outliers that results from a least squares model fit, which has the property that it may miss influential outliers. In addition, since the four models are so similar and make use of the identical data, it is unlikely that an observation that was a marked outlier for one model would not also be a marked outlier for the other models.
An examination of the distributions of the standardized residuals suggests that none of the four models is especially affected by outliers, although the 1993 models have more outliers than the 1989 models, and nonrural counties and metropolitan counties that are not central counties have somewhat more outliers than other categories of counties. This analysis is only a start. It would be useful to extend this analysis, using other statistics and various graphical techniques, to identify the counties that are not well fit by robustly estimated versions of these models in order to determine any characteristics that outlier counties have in common.
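The tabulation of the largest absolute standardized residuals by county category can be sketched as follows. The record layout, the function name, and the example values are hypothetical; only the idea of ranking by absolute standardized residual and counting the top 30 by category comes from the text.

```python
from collections import Counter


def largest_residual_counts(records, top=30):
    """Count categories among the `top` largest absolute residuals.

    records: iterable of (category, standardized_residual) pairs.
    """
    ranked = sorted(records, key=lambda rec: abs(rec[1]), reverse=True)
    return Counter(category for category, _ in ranked[:top])


# Hypothetical records; real input would have one entry per county.
records = [
    ("metro, not central", 3.1),
    ("rural", -2.9),
    ("metro, not central", 0.2),
    ("rural", 0.1),
    ("metro, not central", -2.5),
]
print(largest_residual_counts(records, top=3))
```

The counts are informative only relative to each category's share of all counties, so a category that holds many of the largest residuals but also many counties is not necessarily overrepresented.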
Summary. The panel concluded that the analysis of the regression output for the four candidate county models for 1989 and 1993 largely supports the assumptions of the models: there is little evidence of important problems with the assumptions. The analysis does not strongly support one model over another, although it does support use of the population under age 18 instead of the population under age 21 as a predictor variable in the log number model.
All of the models exhibit a few common problems. First, they all behave somewhat differently for larger urban counties and counties with large percentages of Hispanic residents than for other counties. Second, all models show evidence of some variance heterogeneity with respect to both CPS sample size and the number or proportion of poor school-age children.
1995 Evaluations
The internal evaluation for the 1995 county model, which is essentially the log number (under 18) model (b) evaluated above, focused on comparisons of the properties of the model when estimated for different time periods. The analysis looked in particular at three characteristics: the constancy of the regression coefficients for the predictor variables over time; distributions (box plots) of the standardized residuals for categories of counties to determine if there were any nonrandom patterns that persisted over time; and the phenomenon observed in the 1993 evaluations by which the variance of the standardized residuals was related to CPS sample size and the predicted value of the dependent variable (variance heterogeneity).
Constancy of the Regression Coefficients. Because the county model is refitted for each prediction year, constancy of the regression coefficients for the predictor variables over time is not as important as it would be if the estimated regression coefficients from the model were used for predictions for subsequent years. Also, major changes in economic conditions would be expected to cause some changes in the coefficients. Nonetheless, it is desirable for the coefficients to be in the same direction and not fluctuate wildly in size over time.
TABLE 6-2 Estimates of Regression Coefficients for Census Bureau 1995 County Model, Estimated for 1989, 1993, and 1995

Predictor Variables^{a}

| Year | No. of Counties | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|---|
| 1989 (revised IRS data) | 1,028 | 0.52 (.06) | 0.29 (.06) | 1.55 (.31) | −1.56 (.30) | 0.26 (.06) |
| 1989 (original IRS data) | 1,028 | 0.50 (.06) | 0.23 (.05) | 1.79 (.27) | −1.80 (.27) | 0.32 (.07) |
| 1993 | 1,184 | 0.38 (.08) | 0.27 (.07) | 0.65 (.24) | −0.59 (.24) | 0.34 (.09) |
| 1995 | 985 | 0.31 (.10) | 0.29 (.08) | 0.88 (.25) | −0.80 (.25) | 0.33 (.09) |

NOTE: All predictor variables are on the logarithmic scale for numbers. Standard errors of the estimated regression coefficients are in parentheses.
^{a} Predictor variables: (1) number of child exemptions reported by families in poverty on tax returns; (2) number of people receiving food stamps; (3) population under age 18; (4) total number of child exemptions on tax returns; (5) number of poor school-age children from previous (1980 or 1990) census.
Table 6-2 shows the regression coefficients for the predictor variables for the 1995 county model estimated for 1995 and 1993 and for 1989 with both the original and revised IRS data (see Chapter 4).^{8} The coefficients for the three “poverty level” predictor variables–child exemptions reported by families in poverty on tax returns (column 1), food stamp recipients (column 2), and poor school-age children from the previous census (column 5)–are fairly similar in the equations for all three time periods. There are more substantial differences across the three time periods in the size of the estimated coefficients for the other two variables–population under age 18 (column 3) and total number of child exemptions on tax returns (column 4). However, the sum of these two coefficients is close to zero in each year. Because the two variables are highly positively correlated and close in magnitude, the predictions from equations with a similar sum for the two coefficients will be similar.
^{8} The regressions for 1995 and for 1989 with corrected IRS data also used modified food stamp data (i.e., the county food stamp data were raked to the adjusted state food stamp data, as described in Chapter 4).
Finally, the sum of all the coefficients is close to 1 for all 3 estimation years: 1.01 for 1995, 1.05 for 1993, and 1.06 for 1989 with the revised IRS data. It is desirable for the coefficients in a model of this form to sum to 1, which indicates that the model predictions do not vary by the scale of the predictor variables. If the sum of the coefficients is much greater than or less than 1, the model should be examined to determine if additional predictor variables or other changes in the model may be needed.
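The coefficient sums quoted above can be recomputed from Table 6-2. The sketch below copies the published (rounded) coefficients; the dictionary layout and variable names are assumptions made for the example.

```python
# Coefficients for predictor columns 1-5, copied from Table 6-2.
table_6_2 = {
    "1989 (revised IRS data)": (0.52, 0.29, 1.55, -1.56, 0.26),
    "1989 (original IRS data)": (0.50, 0.23, 1.79, -1.80, 0.32),
    "1993": (0.38, 0.27, 0.65, -0.59, 0.34),
    "1995": (0.31, 0.29, 0.88, -0.80, 0.33),
}

# Sum of all five coefficients for each estimation year.
total_sums = {label: round(sum(c), 2) for label, c in table_6_2.items()}

# Sum of the population (column 3) and total child exemption (column 4)
# coefficients, which the text describes as close to zero.
pair_sums = {label: round(c[2] + c[3], 2) for label, c in table_6_2.items()}

for label in table_6_2:
    print(label, total_sums[label], pair_sums[label])
```

The totals for 1995, 1993, and 1989 with the revised IRS data match the 1.01, 1.05, and 1.06 reported in the text.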
Patterns of Residuals. Given typical random variation, it is likely that the distributions of standardized residuals will display apparently nonrandom patterns for some categories of counties in a particular year. However, if the distributions display the same patterns across years, it is evidence of model bias. The persistence of the same patterns should be investigated to determine ways to eliminate or reduce the bias, for example, by adding a variable to the equation. (There are ample degrees of freedom in the county model to permit the inclusion of additional predictor variables.)
Investigation of the standardized residuals for categories of counties for the county model estimated for 1995, 1993, and 1989 reveals little evidence of persistent bias. However, there is some suggestion that the model tends to consistently overpredict the number of poor school-age children in smaller size counties (i.e., the model estimates are somewhat higher than the CPS direct estimates for smaller counties). It also tends to overpredict the number of poor school-age children in counties that are in metropolitan areas but are not the central county in the area. These patterns, while not strong, are evident in the regression output for all 3 years. The tendency for the model to overpredict the number of poor school-age children in counties with a high percentage of Hispanics that was evident for 1989 in the 1993 model evaluations did not persist over time.
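The search for patterns that persist across estimation years can be expressed as a simple rule: flag a category of counties when its typical standardized residual has the same sign in every year. The sketch below is hypothetical in its function name, threshold parameter, and example values. It follows the sign convention used earlier in the chapter, in which negative standardized residuals correspond to overprediction.

```python
def persistent_patterns(mean_resid_by_year, min_abs=0.0):
    """Flag categories whose mean standardized residual has the same
    sign, and exceeds min_abs in size, in every year.

    mean_resid_by_year: {category: {year: mean standardized residual}}
    """
    flagged = {}
    for category, by_year in mean_resid_by_year.items():
        values = list(by_year.values())
        if all(v > min_abs for v in values):
            flagged[category] = "underprediction"
        elif all(v < -min_abs for v in values):
            flagged[category] = "overprediction"
    return flagged


# Hypothetical mean standardized residuals by county category and year.
example = {
    "smaller counties": {1989: -0.2, 1993: -0.1, 1995: -0.3},
    "high percent Hispanic": {1989: -0.4, 1993: 0.1, 1995: 0.0},
}
print(persistent_patterns(example))
```

A category flagged in this way is a candidate for further investigation, for example by testing whether an added predictor variable removes the pattern; a category with mixed signs across years, like the second one in the example, is not flagged.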
Variance Heterogeneity. The regression output for the 1995 county model clearly demonstrates variability in the size of the absolute standardized residuals as a function of the predicted value (log number of poor school-age children) and the CPS sample size. If the variance estimates for the model are correct, then the standardized residual variance should remain constant over the distribution of CPS sample size. However, it increases with increasing CPS sample size. This phenomenon was evident in the evaluations conducted for the 1993 county model, and it is evident in all 3 years for which the 1995 county model was estimated.
As noted for the 1993 evaluations above, adjusting a model to remove this type of heterogeneity is likely to have only a small effect on the estimated regression coefficients or the model estimates (although it will affect the estimated confidence intervals around the model estimates). Nonetheless, it is clear that the current method for estimating the variance of the sampling errors—ai in equation
categories of counties, in large part because of the small sample sizes for the CPS estimates, even when aggregated for 3 years. Some of the differences are very large, larger than any of the differences seen in the model-1990 census comparisons above. Generally, the larger model-CPS aggregate differences are for categories of counties with smaller numbers of CPS sample households. For example, the model-CPS aggregate differences often exceed 5 percent for counties grouped into the nine geographic divisions, but they are all less than 5 percent for counties grouped into the four geographic regions.^{19}
In addition, the model-CPS aggregate differences for 1989 frequently differ from the model-1990 census differences. This finding is expected, given that the measurement of poverty differs between the census and the CPS because of the many differences in data collection procedures.
Despite the sample size limitations, Table 6-6 can inform an assessment of the performance of the county model if the results are used with caution. Of particular interest are instances in which the model-CPS aggregate differences are both large and in the same direction (plus or minus) for all 3 years for which the county model is estimated. Such findings suggest a possible systematic bias in the model that should be investigated to determine the nature of the bias and what steps could be taken to eliminate or reduce it (e.g., by adding a predictor variable to the model). Several persistent patterns are evident in the model-CPS aggregate differences:
The model shows a tendency to underpredict the number of poor school-age children in the largest counties, those with 250,000 or more population. This finding is consistent with the results from analyzing the distribution of the standardized residuals from the regression output. The extent of the underprediction is not large, but it appears to be significant given the large number of CPS households in the largest counties.
The model shows a tendency to underpredict the number of poor school-age children in counties with large percentages of Hispanic residents (10% or more). There is a similar, although less pronounced, tendency for the model to underpredict the number of poor school-age children in counties with large percentages of blacks. It is likely that counties with large percentages of Hispanics or blacks are not homogeneous (e.g., large-percentage black counties include both inner-city and rural areas). Hence, further research is needed to determine whether the underprediction is more or less pronounced for particular subgroups of these counties and, consequently, what steps are appropriate to ameliorate the bias in the model.
^{19} For future evaluations of this type, the standard errors of the differences should be computed so that significant differences between the model estimates and the CPS 3-year aggregate estimates can be identified.
The model estimates are consistently very different from the weighted CPS estimates for some categories of rural counties classified by economic type. In particular, the model estimates for rural counties characterized as government are much higher than the corresponding weighted CPS estimates. Although the comparisons by economic type are based on small CPS sample sizes, it seems worthwhile to examine some of these counties to see if a reason for these large differences can be found.
Finally, the model shows a tendency to underpredict the number of poor school-age children in counties that experienced the largest declines in the poverty rate for school-age children from 1980 to 1990. As was noted above, this finding is consistent with the knowledge that any regression model can only partially predict which cases will have the most extreme values of the outcome variable.
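The screening rule described above (differences that are both large and in the same direction in all 3 years) can be sketched as follows. This is an illustration only. The percentage difference is assumed to be the model total minus the CPS total, expressed as a percentage of the CPS total; the 5 percent threshold is borrowed from the discussion of geographic divisions as an arbitrary cutoff; and the category names and totals are hypothetical.

```python
def percent_difference(model_total, cps_total):
    """Model-minus-CPS difference as a percentage of the CPS total
    (assumed definition)."""
    return 100.0 * (model_total - cps_total) / cps_total


def persistent_large_differences(totals, threshold=5.0):
    """Flag categories whose differences are large and same-signed in every year.

    totals: {category: {year: (model_total, cps_total)}}
    Returns {category: [percent differences in year order]}.
    """
    flagged = {}
    for cat, by_year in totals.items():
        diffs = [percent_difference(m, c) for _, (m, c) in sorted(by_year.items())]
        same_sign = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
        if same_sign and all(abs(d) >= threshold for d in diffs):
            flagged[cat] = [round(d, 1) for d in diffs]
    return flagged


# Made-up totals for illustration only.
totals = {
    "category A": {1989: (90, 100), 1993: (92, 100), 1995: (88, 100)},
    "category B": {1989: (103, 100), 1993: (98, 100), 1995: (101, 100)},
}
print(persistent_large_differences(totals))
```

As footnote 19 notes, a rule of this kind would be more informative if the standard errors of the differences were available, since a fixed percentage threshold ignores how much CPS sample each category contains.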
Local Assessment of 1993 County Estimates
The panel performed another type of external evaluation of the original 1993 county estimates of poor school-age children: the use of local knowledge.^{20} Using the original 1993 model estimates for all 3,143 counties in the United States, the analysis first sought to identify groups of counties for which the 1993 estimates seemed unusually high or low in relation to prior levels and trends (e.g., from 1980 to 1990) in the number and proportion of poor school-age children and known social and economic trends for these groups of counties. Then, local informants (including staff and members of local councils of government, economic development authorities, welfare agencies, state demographic units, state data centers, and other agencies) were contacted to obtain their assessment of the reasonableness of the implied trends in poverty for school-age children given their knowledge of local socioeconomic conditions.^{21}
County Analysis
Changes in the number and proportion of poor school-age children implied by the 1993 estimates were examined for counties categorized by several characteristics, including: population size and metropolitan status; population change; percentage of immigrants; college-dominated counties; reservation and Native American counties; for nonmetropolitan counties, whether predominantly agricultural; and several classifications by geographic location (e.g., state and the regions identified by the U.S. Department of Agriculture).
^{20} This evaluation was carried out at the University of Wisconsin-Madison by Dr. Paul Voss, a member of the panel, with the assistance of Richard Gibson and Kathleen Morgen (see Voss, Gibson, and Morgen, 1997).
^{21} The discussion refers to “implied” trends because the Census Bureau's county model is not designed to directly estimate change over time.
The analysis identified a number of categories of counties for which further investigation of the reasonableness of the 1993 estimates seemed warranted:
Large metropolitan central city counties had a high implied percentage change in the number of school-age children in poverty between 1989 and 1993—42 percent. This change declined systematically with decreasing size for metropolitan counties and continued to decline to the most remote, rural nonmetropolitan counties, for which the implied change in the number of school-age children in poverty was −6 percent.
Counties with higher levels of international immigration had higher implied increases in the number and proportion of poor school-age children.
Counties with higher percentages of Native Americans had lower implied increases in the number and proportion of poor school-age children. There was no particular pattern for counties with reservations.
Farm counties had an implied decline in the number and proportion of poor school-age children, while nonfarm metropolitan counties had an implied increase.
When the country was divided into the 26 regions identified by the U.S. Department of Agriculture, several regions were identified on the extremes of change in the number and proportion of poor school-age children. High implied increases were found in the Northern Metropolitan Belt, the Florida Peninsula, the Southwest, Northern New England, Mohawk New York and Pennsylvania, Lower Great Lakes Industrial, Southern Piedmont, and the Northern Pacific Coast. Small implied increases were found in the Central Corn Belt, the Southern Appalachian Coal Region, the Coastal Plain Cotton Region, the Northern Great Plains, and the Rockies, Mormon, Columbia River Region. The single region with an implied decrease in the number and proportion of poor school-age children was the Mississippi Delta.
Some of these implied changes are apparently related to the general effect of population size, discussed above. However, the findings in this regional analysis, in particular, suggested which states and counties to follow up in discussions with local officials.
Local Input
When counties that share certain characteristics appeared also to share a common pattern of change in the number and proportion of poor school-age children, a variety of individuals with local knowledge were contacted. Initially, 70 individuals associated with state data centers or state data center affiliate units were contacted; they provided a series of responses and referrals to other state
and local officials. In addition, 26 states that appeared to have a sizable number of counties that shared a common implied trend in poverty for school-age children were targeted for intensive contact.
The nature of the responses varied considerably. In some states, the original 1993 county estimates released by the Census Bureau had not been examined, and there appeared to be little interest in discussing them. In other states, the estimates had been looked at, but the general admonitions about standard errors that accompanied their release had dampened interest in studying them in detail. In contrast, several states had carried out in-depth analyses of the estimates. Of the 26 states targeted for intensive follow-up, 8 provided detailed explanations (supported by examples) of trends suggested by the original 1993 county estimates, and 7 more states provided in-depth responses supported by their own analyses.
Almost every state agency contacted expressed specific doubts about the original 1993 estimates for one or more counties—too high here, too low there. In general, however, there was no consensus that the trends implied by the original 1993 county estimates were wrong, even in states for which large numbers of counties experienced apparent declines in the number and proportion of poor school-age children. Of the 26 states, 21 provided explanations as to why the original 1993 estimates appeared to show poverty trends in a specific direction or why the direction of change is too difficult to know. The most common explanations included comments about the size of the county, its rural agricultural nature, the fact that it is a diverse metropolitan county, immigration from abroad, and economic growth or economic decline. Occasionally, reference was made to a military base, an Indian reservation, or a university as an explanation for an apparent trend in poverty for school-age children. In three states, concern was expressed about the role of Food Stamp Program data in the estimation model, as these data were deemed to be unreliable.
In summary, individuals with local knowledge expressed a high level of concern about the statistical reliability of the original 1993 county estimates. This concern is largely due to the Census Bureau's own cautions in this regard, coupled with specific county estimates that seem, on the basis of local knowledge, to be highly doubtful. These concerns notwithstanding, no categories of counties were identified that experienced apparent trends in the number and proportion of poor school-age children between 1989 and 1993 that were not accepted by local informants. Although the trends for a few counties were not accepted locally, the analysis found no strong indicators of potential bias for groups of counties sharing common characteristics in the county model.
Summary
Considering the external evaluations of alternative models that were conducted by comparison with 1990 census estimates, the external evaluations of 3
years of estimates that were conducted for the 1995 county model by comparison with weighted direct CPS estimates, and the local assessment of the 1993 county estimates, the panel concluded that the county model is working reasonably well. However, further investigation is needed of categories of counties for which the model appears to overpredict or underpredict the number of poor school-age children, particularly when that phenomenon is evident for several periods.
STATE MODEL EVALUATION
The state model plays an important role in the production of county estimates of poor school-age children. Evaluations conducted of the state model for the assessment of the revised 1993 county estimates included an internal evaluation of the regression output for 1989 and 1993 and an external evaluation that compared 1989 estimates from the model with 1990 census estimates of proportions of poor school-age children. The results in each case supported the use of the model. However, the state model evaluations were more limited than the county model evaluations, as alternative state model formulations were not evaluated explicitly.
For the assessment of the 1995 county estimates, further evaluations were conducted of the state model. In particular, the model was estimated for 7 years—1989, 1990, 1991, 1992, 1993, 1995, and 1996—and the regression output for those years was examined to determine if there were any systematic biases in the model estimates. (The model was not estimated for 1994 because the redesign of the CPS sample, consequent to the 1990 census, was partly but not completely phased in for the March 1995 CPS.) Also, there was an evaluation of the state raking factors for 1993 and 1995.
State Model Regression Output
The state regression model is a poverty rate model with the variables not transformed (see equation (2) in Chapter 4). The analysis of the regression output for the state model, estimated for each year from 1989 through 1993 and for 1995 and 1996, examined the same assumptions that were examined for the 1995 county model estimated for 1989, 1993, and 1995. The analysis is somewhat less informative for the state model than for the county model because there are about 1,000 counties with poor school-age children in the CPS, but only 51 states (including the District of Columbia), and states are collectively much more homogeneous than counties with respect to poverty rates and other characteristics. In addition, with respect to both internal and external evaluation, some categories of states do not contain enough states for analysis, thereby reducing the utility of evaluation.
Nonetheless, examination of the regression output for the state model helps assess the validity of its assumptions. With a few exceptions, the analysis supports the assumptions underlying the state model (see below); there is little evidence of significant problems with the model formulation (although there may be other models that fit just as well).
Linearity
Plots of standardized residuals against the four predictor variables in the state model—the proportion of child exemptions reported by families in poverty on tax returns, the proportion of people receiving food stamps, the proportion of people under age 65 who were not included on a tax return, and a residual from the analogous regression equation using the previous census estimate as the dependent variable—support the assumption of linearity. Furthermore, the standardized residuals, when plotted against the model's predicted values, provide no evidence of the need for any transformation of the variables. This result helps justify the decision not to use the log transformation of the proportion poor as the dependent variable.
Constancy Over Time
Table 6-7 shows the regression coefficients for the predictor variables for the state model for each of the years from 1989 to 1996, excluding 1994. The coefficients for all four poverty-rate predictor variables are positive in all 7 years and generally similar across all years. All of the coefficients are significant at the 5 percent level except that the coefficient of the proportion of people under age 65 who were not included on an income tax return (column 3) is not significant in 1989.
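The significance statement above can be checked roughly from the published coefficients and standard errors. The sketch below transcribes the values from Table 6-7 and computes the ratio of each coefficient to its standard error. The cutoff of 2.0 is an assumed approximation to a two-sided 5 percent test; the exact test and degrees of freedom used in the evaluation are not stated here.

```python
# (coefficient, standard error) pairs transcribed from Table 6-7,
# in column order (1) through (4).
TABLE_6_7 = {
    1989: [(0.52, 0.09), (0.71, 0.20), (0.23, 0.13), (0.71, 0.34)],
    1990: [(0.46, 0.09), (0.65, 0.20), (0.42, 0.15), (1.07, 0.36)],
    1991: [(0.46, 0.10), (0.52, 0.21), (0.59, 0.14), (0.84, 0.37)],
    1992: [(0.41, 0.10), (0.71, 0.21), (0.42, 0.13), (1.38, 0.37)],
    1993: [(0.28, 0.12), (1.14, 0.25), (0.51, 0.14), (1.24, 0.39)],
    1995: [(0.57, 0.12), (0.79, 0.25), (0.32, 0.13), (1.54, 0.36)],
    1996: [(0.37, 0.12), (0.97, 0.26), (0.59, 0.14), (1.02, 0.36)],
}


def weak_coefficients(table, critical=2.0):
    """Return (year, column) pairs whose |coefficient / SE| falls below
    the assumed critical value."""
    return [
        (year, col)
        for year, row in sorted(table.items())
        for col, (coef, se) in enumerate(row, start=1)
        if abs(coef / se) < critical
    ]


print(weak_coefficients(TABLE_6_7))  # [(1989, 3)]
```

The only ratio below the cutoff is column (3) in 1989 (0.23 / 0.13, about 1.8), which agrees with the exception noted in the text.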
Inclusion or Exclusion of Predictor Variables
The standardized residuals for the state regression model were grouped into four categories for each of the following characteristics: census region, population size in 1990, 1980 to 1990 population growth, percentage of black population in 1990, percentage of Hispanic population in 1990, percentage of group quarters residents in 1990, and percentage of poor school-age children in 1979 (from the 1980 census). The distributions of the standardized residuals for each category were then displayed using box plots. For none of these box plots is there an obvious pattern to the standardized residuals across categories, with one exception: in 1989, 1990, 1991, and 1993, the model underpredicts the proportion of poor school-age children in the West Region (i.e., the model estimates are lower than the CPS direct estimates for this group of states). The Census Bureau experimented with adding a West Region indicator predictor variable to the model. The coefficient of this variable has a negative sign for all 7 years; however, it is significant for only 1991, 1992, and 1993. For those 3 years, the model with the West Region variable performs better for states in the West Region. A further examination of the residuals from the state model without the West Region predictor variable for individual Western states reveals that the model fairly consistently underpredicts the proportion of poor school-age children in some Western states but just as consistently overpredicts the proportion of poor school-age children in other Western states. Further investigation is needed to explain these patterns.
TABLE 6-7 Estimates of Regression Coefficients for the 1995 State Model, Estimated for 1989-1993, and 1995-1996
Year | Predictor Variable^{a} (1) | (2) | (3) | (4)
1989 | 0.52 (.09) | 0.71 (.20) | 0.23 (.13) | 0.71 (.34)
1990 | 0.46 (.09) | 0.65 (.20) | 0.42 (.15) | 1.07 (.36)
1991 | 0.46 (.10) | 0.52 (.21) | 0.59 (.14) | 0.84 (.37)
1992 | 0.41 (.10) | 0.71 (.21) | 0.42 (.13) | 1.38 (.37)
1993 | 0.28 (.12) | 1.14 (.25) | 0.51 (.14) | 1.24 (.39)
1995 | 0.57 (.12) | 0.79 (.25) | 0.32 (.13) | 1.54 (.36)
1996 | 0.37 (.12) | 0.97 (.26) | 0.59 (.14) | 1.02 (.36)
NOTES: All predictor variables are in terms of rates. Standard errors of the estimated regression coefficients are in parentheses.
^{a} Predictor variables: (1) ratio of child exemptions reported by families in poverty on tax returns to total child exemptions; (2) ratio of people receiving food stamps to total population; (3) ratio of people under age 65 who were not included on an income tax return to total population under age 65; (4) residual from a regression of poverty rates for school-age children from the prior decennial census (1980 or 1990) on the other three predictor variables.
Normality, Homogeneous Variances, and Outliers
The distribution of the standardized residuals from the state regression model shows some small degree of skewness, especially in the 1992 equation. However, the skewness does not appear sufficiently marked to be a problem. Also, the residual plots and the box plots of the distributions of the standardized residuals against the categories of states show little evidence of heterogeneous variance. Finally, there is no evidence of outliers from examination of the residual plots or displays of the distributions of the standardized residuals from the state regression model.
Model Error Variance
One problem in the state model concerns the variance of the model error (u_i in equation (2) in Chapter 4). In the state model, the variances of the sampling errors (e_i in equation (2)) are estimated directly from the CPS data using a generalized variance function. The total model error variance is calculated using maximum likelihood estimation. The result of this calculation is an estimate of zero for the model error variance in the equation for every year except 1993. This result, which implies (absent sampling variability) that the model gives perfect predictions of state poverty rates for school-age children, is not credible. In the shrinkage estimate, it produces a zero weight for the direct estimates even when those estimates are quite precise, as is the case for several large states in the CPS sample. Even a small model error variance can substantially change the weight on the relatively high-precision direct estimates when they are combined in a shrinkage procedure with the model estimates.
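The effect of a zero model error variance on the shrinkage weight can be illustrated with a short sketch. It assumes the usual form of a shrinkage (empirical Bayes) estimator, in which the weight on the direct estimate is the model error variance divided by the sum of the model error variance and the sampling error variance. The text confirms only that a zero model error variance yields a zero weight on the direct estimate; the formula, function names, and numbers below are assumptions for illustration.

```python
def shrinkage_weight(model_var, sampling_var):
    """Weight placed on the direct estimate (assumed standard form):
    model_var / (model_var + sampling_var)."""
    return model_var / (model_var + sampling_var)


def shrinkage_estimate(direct, model, model_var, sampling_var):
    """Weighted combination of the direct estimate and the model estimate."""
    w = shrinkage_weight(model_var, sampling_var)
    return w * direct + (1.0 - w) * model


# Hypothetical large state with a fairly precise direct estimate (percent poor).
direct, model, sampling_var = 22.5, 21.5, 2.5

# Zero model error variance: the direct estimate is ignored entirely.
print(shrinkage_weight(0.0, sampling_var))
print(shrinkage_estimate(direct, model, 0.0, sampling_var))

# A modest non-zero model error variance shifts real weight to the direct estimate.
print(shrinkage_weight(1.0, sampling_var))
print(shrinkage_estimate(direct, model, 1.0, sampling_var))
```

In this made-up case, moving the model error variance from 0 to 1.0 raises the weight on the direct estimate from 0 to about 0.29, which is the sensitivity the paragraph above describes for states with high-precision direct estimates.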
To evaluate the effects of using zero model error variance in the estimation, the panel examined tables that compared the model estimates of the proportion of poor school-age children to the CPS direct estimates by state for 1989-1993 and 1995-1996; as an illustration, Table 6-8 shows this comparison for 1995. This examination demonstrated two important points. First, there are some appreciable differences between the model estimates and the direct estimates. For example, for Mississippi in 1995, the difference is over 7 percentage points. Therefore, if a non-zero estimate for model error variance is produced, it might have important consequences for the state estimates of poor school-age children. Second, while there are some appreciable differences, the model estimates were within two standard errors of the direct estimates for almost all states in each year. The range of model estimates that exceeded that limit in either a positive or negative direction was from one state in 1992 to six states in 1996. (Mississippi's difference in 1995 was not statistically significant at the 5 percent level.) For no single state did the model estimates exceed two standard errors of the direct estimates for more than 3 of the 7 years for which the state model was estimated. (And this analysis ignores the variance of the model estimates, which means that a yet smaller number of differences are statistically significant.) These results suggest that the state model is performing reasonably well: differences between model and direct estimates are neither unusually large nor strongly persistent. However, more work should be conducted to evaluate the current procedures for estimating the sampling error variance of the state model and the effects on the model estimates.
TABLE 6-8 CPS Direct Estimate and Regression Model Estimate of Percentage of School-Age Children in Poverty by State, 1995
State | (1) CPS Direct Estimate | (2) Lower Confidence Bound on Direct Estimate | (3) Upper Confidence Bound on Direct Estimate | (4) State Model Regression Estimate | (5) Regression Estimate Minus Direct Estimate, (4) – (1)
Alabama | 22.2 | 16.5 | 27.9 | 23.4 | 1.2
Alaska | 6.3 | 1.6 | 11.1 | 10.9 | 4.5
Arizona | 23.0 | 16.8 | 29.2 | 21.1 | −1.9
Arkansas | 21.4 | 14.0 | 28.7 | 24.0 | 2.6
California | 22.5 | 19.4 | 25.7 | 21.5 | −1.0
Colorado | 9.4 | 5.1 | 13.8 | 11.8 | 2.3
Connecticut | 15.6 | 7.3 | 24.0 | 12.6 | −3.0
Delaware | 15.6 | 8.3 | 23.0 | 12.8 | −2.8
District of Columbia | 30.2 | 17.9 | 42.4 | 33.8 | 3.7
Florida | 21.1 | 16.8 | 25.4 | 20.7 | −0.4
Georgia | 14.8 | 8.2 | 21.3 | 21.4 | 6.7
Hawaii | 14.1 | 7.9 | 20.3 | 11.9 | −2.2
Idaho | 15.4 | 9.9 | 20.9 | 12.7 | −2.7
Illinois | 19.4 | 14.6 | 24.2 | 15.7 | −3.7
Indiana | 12.9 | 9.0 | 16.8 | 12.6 | −0.4
Iowa | 15.2 | 8.9 | 21.4 | 11.2 | −3.9
Kansas | 10.6 | 4.8 | 16.4 | 12.7 | 2.1
Kentucky | 18.9 | 13.4 | 24.4 | 22.9 | 4.0
Louisiana | 24.2 | 15.6 | 32.9 | 28.0 | 3.8
Maine | 10.7 | 4.1 | 17.4 | 13.8 | 3.1
Maryland | 12.8 | 5.0 | 20.5 | 11.5 | −1.3
Massachusetts | 16.5 | 11.5 | 21.5 | 13.3 | −3.2
Michigan | 14.2 | 10.0 | 18.3 | 17.2 | 3.0
Minnesota | 9.5 | 5.5 | 13.4 | 10.0 | 0.6
Mississippi | 34.9 | 25.6 | 44.3 | 27.4 | −7.6
Missouri | 9.4 | 3.5 | 15.2 | 17.0 | 7.7
Montana | 17.4 | 9.4 | 25.3 | 18.4 | 1.0
Nebraska | 11.4 | 7.1 | 15.7 | 10.0 | −1.4
Nevada | 9.8 | 4.0 | 15.6 | 11.8 | 2.0
New Hampshire | 4.2 | 0.6 | 7.8 | 6.5 | 2.3
New Jersey | 9.3 | 6.5 | 12.0 | 12.3 | 3.0
New Mexico | 34.0 | 27.8 | 40.3 | 28.6 | −5.5
New York | 22.7 | 19.1 | 26.3 | 23.1 | 0.4
North Carolina | 19.7 | 13.8 | 25.5 | 17.1 | −2.6
North Dakota | 10.3 | 5.3 | 15.2 | 14.1 | 3.8
Ohio | 16.6 | 11.1 | 22.2 | 15.1 | −1.5
Oklahoma | 22.6 | 13.1 | 32.1 | 22.5 | −0.1
Oregon | 12.5 | 7.1 | 17.9 | 12.4 | −0.1
Pennsylvania | 16.1 | 12.5 | 19.7 | 15.3 | −0.9
Rhode Island | 16.4 | 10.7 | 22.2 | 15.1 | −1.3
South Carolina | 30.8 | 21.9 | 39.7 | 21.9 | −8.9
South Dakota | 16.7 | 8.7 | 24.8 | 17.3 | 0.6
Tennessee | 18.4 | 9.1 | 27.7 | 18.7 | 0.3
Texas | 22.4 | 19.3 | 25.5 | 24.3 | 1.9
Utah | 7.3 | 3.9 | 10.8 | 7.5 | 0.2
Vermont | 11.3 | 3.2 | 19.4 | 11.6 | 0.3
Virginia | 14.3 | 7.6 | 21.1 | 14.5 | 0.1
Washington | 15.8 | 7.9 | 23.7 | 12.4 | −3.4
West Virginia | 23.0 | 13.2 | 32.9 | 25.7 | 2.7
Wisconsin | 11.1 | 4.0 | 18.1 | 12.2 | 1.2
Wyoming | 10.5 | 6.3 | 14.7 | 12.2 | 1.7
NOTE: Confidence bounds are plus or minus two standard errors on the direct estimate (95% confidence interval, obtained using direct estimates of the CPS standard errors).
SOURCE: Data from U.S. Census Bureau.
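The two-standard-error comparison discussed above amounts to checking whether each state's model estimate falls inside the confidence bounds on its direct estimate. The sketch below applies that check to five rows transcribed from Table 6-8; the selection of states and the function name are for illustration only. Like the analysis in the text, it ignores the variance of the model estimates.

```python
# (direct estimate, lower bound, upper bound, model estimate), in percent,
# transcribed from Table 6-8 for a handful of states.
ROWS_1995 = {
    "California": (22.5, 19.4, 25.7, 21.5),
    "Georgia": (14.8, 8.2, 21.3, 21.4),
    "Mississippi": (34.9, 25.6, 44.3, 27.4),
    "Missouri": (9.4, 3.5, 15.2, 17.0),
    "New Jersey": (9.3, 6.5, 12.0, 12.3),
}


def outside_bounds(rows):
    """Return the states whose model estimate lies strictly outside the
    two-standard-error bounds on the direct estimate."""
    return sorted(
        state
        for state, (_, lower, upper, model) in rows.items()
        if model < lower or model > upper
    )


print(outside_bounds(ROWS_1995))  # ['Georgia', 'Missouri', 'New Jersey']
```

Mississippi, despite a difference of more than 7 percentage points, is not flagged because its direct estimate has wide bounds, which is consistent with the parenthetical remark in the text.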
1990 Census Comparisons
Fay and Train (1997) compare 1989 estimates of the proportion of poor school-age children from the state model with 1990 census estimates. They find that the differences between the model and census estimates are much smaller than the differences between the 1989 CPS direct estimates and the 1990 census estimates and considerably smaller than the differences between the 1980 census estimates and the 1990 census estimates. These findings, which are presented graphically in Fay and Train (1997), support the use of a model-based approach to producing updated state estimates of poor school-age children instead of relying on estimates from the previous census or from the CPS alone. Similarly, a formal hypothesis test performed for the state model (Fay, 1996) supports the conclusion that the model-based estimates for 1993 are preferable to estimates
from the 1990 census.^{22} Comparable evaluations have not been performed for alternative state models or for categories of states.
State Raking Factors
The final stage in producing updated estimates of the number of poor school-age children for counties is to ratio adjust, or rake, the estimates from the county model for consistency with the estimates from the state model. The county model-1990 census comparisons found that the raking procedure was beneficial to the county estimates. The raking factors vary considerably across states. For 1995, the raking factors range from 0.71 to 1.14 (two-thirds fall between 0.88 and 1.06); for 1993, the raking factors range from 0.91 to 1.31 (two-thirds fall between 0.98 and 1.16).
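The ratio adjustment described above can be sketched as follows: within a state, the raking factor is the state model estimate divided by the sum of the county model estimates, and every county estimate is multiplied by that factor. This is a minimal illustration with hypothetical county names and numbers; it omits the earlier steps (such as transforming predicted log values back to numbers) that precede raking.

```python
def rake_counties(county_estimates, state_estimate):
    """Ratio-adjust county estimates so that they sum to the state estimate.

    county_estimates: {county: estimated number of poor school-age children}
    Returns (raking factor, {county: raked estimate}).
    """
    county_total = sum(county_estimates.values())
    factor = state_estimate / county_total
    raked = {county: value * factor for county, value in county_estimates.items()}
    return factor, raked


# Hypothetical state with three counties.
counties = {"County A": 1200.0, "County B": 800.0, "County C": 500.0}
factor, raked = rake_counties(counties, state_estimate=2250.0)
print(round(factor, 2))
print({county: round(value) for county, value in raked.items()})
```

In this example the county estimates sum to 2,500 against a state estimate of 2,250, so the raking factor is 0.9, which falls inside the range of factors reported for 1995.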
The Census Bureau determined that the correlation between the raking factors for states in 1993 and 1995 is low, which implies that there is little systematic variation by state across these years. Also, some variation in the raking factors is expected given the form of the county model and the need to transform the predicted log values of poor school-age children to estimated numbers before the raking is performed. Other sources of this variability could include the use of 3-year averages of CPS estimates as the dependent variable in the county model versus single-year estimates in the state model, sampling variability, and, possibly, individual state effects that are not captured in the county model (see Chapter 9 and National Research Council, 2000:Ch.3). Preliminary work by the panel suggests that a large proportion of the variation in the state raking factors is due to sampling variability. Further investigation should be carried out to better understand the causes of this variation.
^{22} The test assumes that the objective is to predict poverty rates that reflect the CPS measurement of poverty and not the decennial census measurement.