Suggested Citation:"Appendix B: Regression Diagnostics on Alternative County Regression Models." National Research Council. 2000. Small-Area Estimates of School-Age Children in Poverty: Evaluation of Current Methodology. Washington, DC: The National Academies Press. doi: 10.17226/10046.

APPENDIX B

Regression Diagnostics on Alternative County Regression Models

An internal evaluation of a regression model, or “regression diagnostics,” involves an assessment of its underlying assumptions and features. Chapter 6 reports the results of such an evaluation for four county models, estimated for 2 years, 1989 and 1993. These four models, which were considered serious candidates to produce revised county estimates of poor school-age children in 1993, have the following designations: (a) log number model (under 21, the original county model); (b) log number model (under 18, the revised county model); (c) log rate model (under 21); and (d) log rate model (under 18).

This appendix summarizes the results of an internal evaluation for 13 county models, listed below (see Chapter 5 and Appendix A for the model specifications). Twelve of the models were considered in the first round of model evaluations; they include models (a), (b), and (c). The other model, the log rate (under 18) model (d), was added for the second round of evaluations, which considered the four candidate models (a-d).

Of the 13 county models, 7 are single-equation models, in which the dependent variable is from 3 years of the CPS. For 1993 estimates of poor school-age children, the dependent variable is a weighted average of data from the March 1993, 1994, and 1995 CPS, covering income years 1992, 1993, and 1994. For 1989 estimates of poor school-age children, produced for evaluation purposes, the dependent variable is a weighted average of data from the March 1989, 1990, and 1991 CPS, covering income years 1988, 1989, and 1990.

The other 6 county models are bivariate models in which two equations are jointly estimated to develop estimates of poor school-age children in 1993. In one equation, the dependent variable is a weighted average of data from the March 1993, 1994, and 1995 CPS, covering income years 1992, 1993, and 1994. In the second equation, the dependent variable is from the 1990 census, covering income year 1989.

The regression coefficients for all the CPS models are presented in Table B-1; Table B-2 shows the regression coefficients for the 1990 census equation for the 6 bivariate models (see pages 188-190).

Single-Equation Models	Bivariate Models
Log number under 21 (1989, 1993)	Log number under 21 (1993)
Log number under 18 (1989, 1993)
Log number under 21, fixed state effects (1989, 1993)	Log number under 21, fixed state effects (1993)
Log rate under 21 (1989, 1993)	Log rate under 21 (1993)
	Log rate under 21, fixed state effects (1993)
Log rate under 18 (1989, 1993)
Rate under 21 (1989, 1993)	Rate under 21 (1993)
	Rate under 21, fixed state effects (1993)
Hybrid log rate-number under 21 (1989, 1993)
NOTE: The years for which coefficients were fit are in parentheses; for the bivariate models, the year shown is for the CPS equation.

REGRESSION DIAGNOSTICS METHODS

Regression diagnostics is an analysis of the extent to which the various assumptions on which a regression model is based are supported by the data. The following six assumptions were examined for the 13 county models of poor school-age children (see Chapter 6):

linearity of the relationship between the dependent variable and the predictor variables;
constancy over time of the assumed linear relationship and in the estimated coefficients of the predictor variables;
which variables are needed in the model, specifically, whether any of the included predictor variables are not needed in the model and, conversely, whether other potential predictor variables are needed in the model;
normality (primarily symmetry and moderate tail length) of the distribution of the standardized residuals;¹
whether the standardized residuals have homogeneous variances; that is, whether the variability of the standardized residuals is constant across counties and does not depend on the values of the predictor variables; and
the absence of outliers, which can be considered to be the absence of an extremely long tail to the error distribution.

Various techniques are useful for examining the degree to which each of these six assumptions holds. The techniques described below, which the panel and the Census Bureau implemented to evaluate the 13 county models, are not the only ones that can be used to examine these assumptions, but they are among those usually included in such an evaluation. In addition to these general techniques, specific analyses were conducted to evaluate the bivariate model formulation in comparison with the single-equation model formulation and the use of the population under age 18 in comparison with the population under age 21 as a predictor variable in the log number model.

Linearity Linearity of the relationships between the dependent variable and the predictor variables was assessed graphically, by observing whether there was evidence of curvature in the plots of standardized residuals against predictor variables in the model. In addition, plots of residuals against CPS sample size and against the predicted values from the regression model were examined for curvature.
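The graphical check for curvature can be complemented by a simple numeric screen. The following is a minimal sketch on synthetic data, not the procedure used by the panel or the Census Bureau: it fits a one-predictor least squares line, then averages the raw residuals within low, middle, and high thirds of the predictor. The function names, the synthetic data, and the choice of three bins are assumptions made for illustration; the panel examined standardized residuals in plots.

```python
import random
import statistics


def ols_fit(x, y):
    """Least squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def binned_mean_residuals(x, resid, n_bins=3):
    """Mean residual within equal-count bins of x, ordered from low to high x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = len(order) // n_bins
    means = []
    for b in range(n_bins):
        hi = (b + 1) * size if b < n_bins - 1 else len(order)
        idx = order[b * size:hi]
        means.append(statistics.fmean(resid[i] for i in idx))
    return means


def residual_bin_means(x, y):
    """Fit the line, then summarize residuals by bins of the predictor."""
    intercept, slope = ols_fit(x, y)
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return binned_mean_residuals(x, resid)


rng = random.Random(0)
x = [rng.uniform(0.0, 4.0) for _ in range(600)]
# Synthetic responses: one truly linear, one with an added quadratic term.
y_linear = [1.0 + 0.5 * xi + rng.gauss(0.0, 0.3) for xi in x]
y_curved = [1.0 + 0.5 * xi + 0.4 * (xi - 2.0) ** 2 + rng.gauss(0.0, 0.3) for xi in x]

means_linear = residual_bin_means(x, y_linear)
means_curved = residual_bin_means(x, y_curved)
print("linear:", [round(m, 2) for m in means_linear])
print("curved:", [round(m, 2) for m in means_curved])
```

For the linear response the three bin means stay near zero, while for the curved response the outer bins are positive and the middle bin is negative, which is the pattern a residual plot would show as a bend.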

Constancy For the single-equation models that could be fit for both 1989 and 1993, the regression coefficients were compared to determine if the values remained roughly constant over time.
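One way to put the comparison of coefficients over time on a common scale is to divide each 1989 to 1993 difference by its approximate standard error. The sketch below uses the coefficients and standard errors reported in Table B-1 for the log number (under 21) model. It assumes that the two years' estimates are independent, which the source does not state, and it is offered only as an illustration, not as the comparison the panel carried out.

```python
import math

# (coefficient, standard error) pairs from Table B-1,
# log number (under 21) model, predictor variables 1 to 5.
coef_1989 = [(0.52, 0.07), (0.30, 0.05), (0.76, 0.22), (-0.81, 0.22), (0.27, 0.07)]
coef_1993 = [(0.31, 0.08), (0.30, 0.07), (0.03, 0.21), (0.03, 0.21), (0.40, 0.09)]


def difference_z(est_a, est_b):
    """z-score for the difference of two estimates, ASSUMING independence."""
    (b_a, se_a), (b_b, se_b) = est_a, est_b
    return (b_a - b_b) / math.sqrt(se_a ** 2 + se_b ** 2)


z_scores = [difference_z(a, b) for a, b in zip(coef_1989, coef_1993)]
for k, z in enumerate(z_scores, start=1):
    print(f"predictor {k}: z = {z:+.2f}")
```

Under this assumption, the shifts for predictor variables 3 and 4 (population and total child exemptions) exceed 2 in absolute value, while the food stamp coefficient is unchanged.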

Inclusion or Exclusion of Predictor Variables The possibility that one or more predictor variables should be excluded from a model was assessed by looking for insignificant t-statistics for the estimated values of individual regression coefficients. The need to include additional predictor variables was assessed by looking for nonrandom patterns, indicative of possible model bias, in the distributions of standardized residuals displayed for various categories of counties. (See Chapter 6 for the categories examined in various model evaluations; the distributional displays examined for this and other model assumptions were box plots.)

¹	See Chapter 6 for the procedure used to standardize the residuals, which are the differences between the predicted and reported values of the dependent variable for each observation.

TABLE B-1 Estimates of Regression Coefficients for the CPS Equation for 13 County Models

	Predictor Variables^a
Model	1	2	3	4	5
Log Number (under 21)
1989	0.52	0.30	0.76	−0.81	0.27
	(.07)	(.05)	(.22)	(.22)	(.07)
1993	0.31	0.30	0.03	0.03	0.40
	(.08)	(.07)	(.21)	(.21)	(.09)
Log Number (under 18)
1989	0.50	0.23	1.79	−1.80	0.32
	(.06)	(.05)	(.27)	(.27)	(.07)
1993	0.38	0.27	0.65	−0.59	0.34
	(.08)	(.07)	(.24)	(.24)	(.09)
Log Number (under 21), Fixed State Effects
1989	0.36	0.27	0.45	−0.56	0.51
	(.13)	(.07)	(.25)	(.25)	(.10)
1993	0.50	0.17	−0.03	−0.07	0.45
	(.12)	(.09)	(.25)	(.25)	(.11)
Hybrid Log Rate-Number (under 21)
1989	0.55	0.27	0.35	−1.34	0.25
	(.06)	(.05)	(.21)	(.21)	(.06)
1993	0.37	0.26	−0.33	−0.59	0.37
	(.07)	(.06)	(.18)	(.18)	(.08)
Bivariate Log Number (under 21)
1993	0.57	0.45	0.19	−0.20	NA
	(.06)	(.05)	(.20)	(.20)
Bivariate Log Number (under 21), Fixed State Effects
1993	0.83	0.34	0.21	−0.38	NA
	(.09)	(.07)	(.24)	(.24)

	Predictor Variables^b
Model	1	2	3	4
Log Rate (under 21)
1989	0.32	0.29	−0.73	0.40
	(.07)	(.04)	(.19)	(.07)
1993	0.23	0.31	−0.07	0.41
	(.08)	(.06)	(.18)	(.09)
Log Rate (under 18)
1989	0.29	0.26	−1.13	0.43
	(.07)	(.04)	(.24)	(.07)
1993	0.26	0.30	−0.42	0.38
	(.08)	(.06)	(.20)	(.09)
Rate (under 21)
1989	0.25	0.46	−0.16	0.56
	(.06)	(.08)	(.03)	(.06)
1993	0.09	0.60	−0.05	0.52
	(.06)	(.11)	(.03)	(.10)
Bivariate Log Rate (under 21)
1993	0.57	0.40	−0.12	NA
	(.05)	(.04)	(.16)
Bivariate Log Rate (under 21), Fixed State Effects
1993	0.75	0.35	−0.01	NA
	(.08)	(.05)	(.19)
Bivariate Rate (under 21)
1993	0.38	0.89	−0.05	NA
	(.04)	(.06)	(.03)
Bivariate Rate (under 21), Fixed State Effects
1993	0.44	0.85	−0.05	NA
	(.06)	(.08)	(.04)
NOTES: All predictor variables are on the logarithmic scale for numbers and rates. Standard errors of the estimated regression coefficients are in parentheses. Estimated coefficients for the state indicator variables are not shown. The models were estimated with maximum likelihood. NA: not applicable.
^a Predictor variables: (1) number of child exemptions reported by families in poverty on tax returns (1989 or 1993); (2) number of people receiving food stamps (1989 or 1993); (3) population (under age 21 or under age 18, 1990 or 1994); (4) total number of child exemptions on tax returns (1989 or 1993); (5) number of poor school-age children from previous (1980 or 1990) census.
^b Predictor variables: (1) ratio of child exemptions reported by families in poverty on tax returns to total child exemptions (1989 or 1993); (2) ratio of people receiving food stamps (1989 or 1993) to total population; (3) ratio of total child exemptions on tax returns (1989 or 1993) to population (under age 21 or under age 18); (4) ratio of poor school-age children from previous (1980 or 1990) census.

TABLE B-2 Estimates of Regression Coefficients for the 1990 Census Equation for the 1993 Bivariate Models

	Predictor Variables^a
Model	1	2	3	4
Bivariate Log Number (under 21)	0.71	0.31	0.48	−0.51
	(.01)	(.01)	(.03)	(.03)
Bivariate Log Number (under 21), Fixed State Effects	0.71	0.33	0.45	−0.48
	(.02)	(.01)	(.03)	(.03)

	Predictor Variables^b
Model	1	2	3	4
Bivariate Log Rate (under 21)	0.66	0.30	−0.23	NA
	(.01)	(.01)	(.02)
Bivariate Log Rate (under 21), Fixed State Effects	0.67	0.30	−0.22	NA
	(.01)	(.01)	(.02)
Bivariate Rate (under 21)	0.56	0.75	−0.05	NA
	(.01)	(.01)	(.01)
Bivariate Rate (under 21), Fixed State Effects	0.55	0.78	−0.05	NA
	(.01)	(.02)	(.01)
NOTE: See notes to Table B-1.
^a Predictor variables: (1) number of child exemptions reported by families in poverty on tax returns in 1989; (2) number of people receiving food stamps in 1989; (3) population under age 21 in 1990; (4) total number of child exemptions on tax returns in 1989.
^b Predictor variables: (1) ratio of child exemptions reported by families in poverty on tax returns to total child exemptions in 1989; (2) ratio of people receiving food stamps in 1989 to total population; (3) ratio of total child exemptions on tax returns in 1989 to population under age 21.
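The screen for predictor variables that might be excluded can be illustrated with the coefficients reported in Table B-1. The sketch below divides each 1993 coefficient of the log number (under 21) model by its standard error and flags ratios smaller than 1.96 in absolute value. The 1.96 cutoff (a two-sided 5 percent normal critical value) is an assumption made for illustration; the source does not state the cutoff applied to these coefficients.

```python
# (coefficient, standard error) from Table B-1,
# log number (under 21) model, 1993 CPS equation, predictor variables 1 to 5.
estimates_1993 = {
    1: (0.31, 0.08),
    2: (0.30, 0.07),
    3: (0.03, 0.21),
    4: (0.03, 0.21),
    5: (0.40, 0.09),
}

CRITICAL = 1.96  # assumed two-sided 5 percent cutoff, not taken from the source

# t-statistic: estimated coefficient divided by its standard error.
t_stats = {k: b / se for k, (b, se) in estimates_1993.items()}
not_significant = [k for k, t in t_stats.items() if abs(t) < CRITICAL]

for k, t in t_stats.items():
    print(f"predictor {k}: t = {t:.2f}")
print("not significant:", not_significant)  # → [3, 4]
```

Only predictor variables 3 and 4 fall below the assumed cutoff in this equation, and Table B-1 shows larger ratios for both in the 1989 equation.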

Normality The normality of the standardized residuals was evaluated through use of Q-Q plots, histograms, and box plots of the standardized residuals. While some skewness of the distribution of standardized residuals may be acceptable, extreme skewness can change the regression fit so that a relatively small number of counties have more influence on the estimation of the regression coefficients. In addition, extreme skewness can indicate the need for a transformation of the variables, which might, in turn, reveal the need for additional predictor variables.
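The coordinates of a Q-Q plot and a numeric skewness summary can be computed with the Python standard library alone. The following minimal sketch uses synthetic residuals; the function names, the plotting positions (i − 0.5)/n, and the moment-based skewness measure are assumptions chosen for illustration and are not described in the source.

```python
import random
import statistics
from statistics import NormalDist


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation in standard deviation units."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - m) / sd) ** 3 for v in values)


def qq_points(values):
    """Pairs (theoretical normal quantile, ordered sample value) for a Q-Q plot."""
    n = len(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), v)
        for i, v in enumerate(sorted(values), start=1)
    ]


rng = random.Random(1)
# Synthetic residuals: one symmetric set and one with a long right tail.
symmetric = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(2000)]

skew_symmetric = sample_skewness(symmetric)
skew_skewed = sample_skewness(skewed)
points = qq_points(skewed)

print(f"skewness, symmetric residuals: {skew_symmetric:.2f}")
print(f"skewness, skewed residuals:    {skew_skewed:.2f}")
```

Plotting the pairs returned by `qq_points` gives a roughly straight line for the symmetric residuals and a line that bends upward at the right for the skewed ones.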

Homogeneous Variances The homogeneity of the variance of the standardized residuals was assessed using several statistics and graphical displays. The statistics included Spearman's rank correlation coefficient of absolute standardized residuals with the predicted values and also with the CPS sample size, and a robust regression of the log absolute standardized residuals on CPS sample size. The graphical displays included scatterplots of absolute standardized residuals versus model predictor variables; box plots of absolute standardized residuals for categories of counties; plots of the median absolute deviation of the standardized residuals within each category, by category; plots of absolute standardized residuals versus log CPS sample size; and plots of standardized residuals to the two-thirds power (the Wilson-Hilferty transformation) versus log CPS sample size.
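The rank correlation statistic described above can be sketched as follows. The example computes Spearman's coefficient between absolute residuals and sample size for two synthetic data sets, one with constant spread and one whose spread grows with sample size. The data, the function names, and the form of the heteroscedasticity are hypothetical; only the choice of statistic comes from the text.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


rng = random.Random(2)
sample_size = [rng.randint(5, 200) for _ in range(1000)]
# Hypothetical residuals: constant spread versus spread growing with sample size.
abs_homogeneous = [abs(rng.gauss(0.0, 1.0)) for _ in sample_size]
abs_heterogeneous = [abs(rng.gauss(0.0, 0.5 + 0.01 * n)) for n in sample_size]

rho_homogeneous = spearman(abs_homogeneous, sample_size)
rho_heterogeneous = spearman(abs_heterogeneous, sample_size)
print(f"constant spread: rho = {rho_homogeneous:+.2f}")
print(f"growing spread:  rho = {rho_heterogeneous:+.2f}")
```

Because Spearman's coefficient depends only on ranks, it is unchanged if the absolute residuals are first raised to the two-thirds power or the sample sizes are replaced by their logarithms.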

Outliers The assumption of the absence of outliers was evaluated through examination of plots of the distributions of the standardized residuals and plots of standardized residuals against the predictor variables and through analysis of patterns in the distribution of the 30 largest absolute standardized residuals for the various characteristics used to categorize the counties.² Any patterns observed among the 30 largest absolute standardized residuals for a characteristic may suggest that a predictor variable should be added to a model.
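The tabulation of the 30 largest absolute standardized residuals by county characteristic can be sketched as below. The category labels and the synthetic residuals are hypothetical, and one category is deliberately given a wider spread so that a pattern appears. The sketch shows only the bookkeeping, not the panel's data.

```python
import random
from collections import Counter


def top_k_by_category(records, k=30):
    """Count categories among the k records with the largest |residual|.

    Each record is a (category, standardized_residual) pair.
    """
    largest = sorted(records, key=lambda r: abs(r[1]), reverse=True)[:k]
    return Counter(category for category, _ in largest)


rng = random.Random(3)
records = []
# Hypothetical categories; "other_metro" is given twice the residual spread.
for category, spread in (("central_metro", 1.0), ("other_metro", 2.0), ("nonmetro", 1.0)):
    records += [(category, rng.gauss(0.0, spread)) for _ in range(300)]

counts = top_k_by_category(records, k=30)
overall = Counter(category for category, _ in records)
for category in overall:
    share_top = counts[category] / 30
    share_all = overall[category] / len(records)
    print(f"{category}: {share_top:.0%} of top 30 vs {share_all:.0%} of all counties")
```

A category whose share of the largest residuals is well above its share of all counties is the kind of pattern that may point to a missing predictor variable.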

FINDINGS

Linearity There is no evidence of any strong nonlinearity between the predictor variables and the dependent variable in any of the 13 models. Thus, there is no reason to suggest a transformation of the dependent variable in any of the models, nor is there reason to include any higher order polynomial terms as additional predictor variables.

Constancy The regression coefficients for the 7 single-equation models for 1989 and 1993 are shown in Table B-1. All of these models have some coefficients that differ substantially between 1989 and 1993.

Inclusion or Exclusion of Predictor Variables All of the models with fixed state effects have a large fraction of state effects that are not significant at the 5 percent level. In addition, several other models, especially for 1993, had one or two predictor variables with regression coefficients that were not significant, but that was typically for only 1 of the 2 years that were analyzed. Therefore, except for the models with fixed state effects, there was little evidence of predictor variables that should be excluded from an equation. For the fixed state effects models, an examination of the extent to which the state effects cluster and could be estimated in groups might make it possible to reduce the number of coefficients that need to be estimated.

²	All the outlier statistics examined are based on the residuals from a least squares model fit, so they may miss influential outliers. It would be useful to look for outliers from a robust fit of the models. It would also be useful to compare the predictions from models with extreme outliers removed.

With respect to the need to include additional predictor variables in a model, nonrandom patterns of the distributions of the standardized residuals—especially a difference in the median standardized residual from 0 for the residuals in a county category—were observed for several characteristics: percent Hispanic population, location in a metropolitan area outside the central county, and population size. The models with the fewest nonrandom patterns of the distributions of the standardized residuals were the bivariate log rate, bivariate rate, and rate models.

Normality Many of the models had distributions of the standardized residuals that were both asymmetric and long-tailed, especially to the side to which the distribution was skewed. It was difficult to distinguish between skewness and the presence of outliers. Often, movement from a log number dependent variable to a log rate dependent variable reduced an outlier problem, but it introduced a skewness problem. The rate models and the hybrid log rate-number model seemed to have both problems and to be particularly problematic in this respect. In contrast, the log number models behaved relatively well on this criterion.

Homogeneous Variances All of the models exhibited nonconstant variances of the standardized residuals. One would expect the standardized residual variance to remain constant over the distribution of CPS sample size; however, for these models, it increased with increasing sample size. Most of the models also had some variance heterogeneity as a function of the predicted value (number or proportion of poor school-age children).

Outliers The rate models and the hybrid log rate-number model exhibited both skewness and long-tailed error distributions. For all models, large urban counties, particularly those with large percentages of Hispanics, and counties that are in metropolitan areas but not the central county had somewhat more outliers than other counties. The bivariate log rate, bivariate log number, and the log rate models had fewer outliers that demonstrated these patterns.

Additional Analysis Analysis that focused on a regression coefficient that is assumed to be constant in the single-equation formulation and is variable in the bivariate formulation demonstrated strong heterogeneity, thereby supporting the bivariate approach (see Appendix A). Also, Akaike's information criterion (AIC) confirmed the superiority of using the population under age 18 as a predictor variable in the log number model instead of the population under age 21.
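For reference, Akaike's information criterion is AIC = 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood; the model with the smaller value is preferred. The sketch below compares two one-predictor Gaussian regressions on synthetic data. The variable names `x18` and `x21`, the data-generating process, and the single-predictor setup are hypothetical stand-ins and do not reproduce the panel's comparison of the log number models.

```python
import math
import random


def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def gaussian_aic(x, y):
    """AIC = 2k - 2 ln L for a Gaussian line fit with ML variance RSS / n."""
    n = len(y)
    intercept, slope = fit_line(x, y)
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    log_lik = -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    k = 3  # intercept, slope, and error variance
    return 2.0 * k - 2.0 * log_lik


rng = random.Random(4)
# Hypothetical predictors: x21 is a noisier version of x18.
x18 = [rng.gauss(0.0, 1.0) for _ in range(500)]
x21 = [v + rng.gauss(0.0, 0.5) for v in x18]
y = [1.0 + 0.8 * v + rng.gauss(0.0, 0.3) for v in x18]

aic_18 = gaussian_aic(x18, y)
aic_21 = gaussian_aic(x21, y)
print(f"AIC with x18: {aic_18:.1f}")
print(f"AIC with x21: {aic_21:.1f}")
```

Because both fits here estimate the same number of parameters, the comparison reduces to which predictor yields the higher maximized likelihood.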

SUMMARY

Analysis of the regression output for the 13 county models for the most part supports the assumptions of the models; it does not strongly support one model over another. All of the models exhibit a few common problems. First, they all behave somewhat differently for larger urban counties, especially those with large percentages of Hispanics, than for rural counties. Second, all models show evidence of some variance heterogeneity, particularly with respect to CPS sample size and often with respect to the predicted value (number or proportion of poor school-age children). The rate models and the hybrid log rate-number model exhibit more problems with skewness and outliers than other model formulations. The bivariate approach appears promising because of the heterogeneity in the regression coefficient mentioned above, the lack of patterns in the analysis of the standardized residuals, and the correlation observed between corresponding residuals in the CPS and census regression equations. Finally, according to the internal evaluation, none of the alternative models is clearly superior to the log number model, and the use of the predictor variable for the population under age 18 instead of under age 21 is supported for the log number model.