9 Differential Validity and Differential Prediction

This chapter addresses the important question of whether the General Aptitude Test Battery (GATB) functions in the same way for different specified groups. Investigations of group differences in the correlations of a test with a criterion measure are commonly referred to as differential validity studies. Such studies can take a variety of forms, including investigations of the possibility that validity coefficients may differ as a function of the setting (e.g., from one job to another or from one location to another) or the group (e.g., demographic group or groups formed on the basis of prior work experience). Investigations of differential prediction, which cover an equally broad range, focus on prediction equations rather than correlation coefficients. A differential prediction study may be used to investigate whether differences in setting or differences among demographic groups (e.g., racial or ethnic groups or gender) affect the predictive meaning of the test scores.

We are not concerned here with setting. Our investigation is limited to the possibility that the GATB functions differently for different population groups, and specifically that correlations of GATB scores with on-the-job criterion measures may differ by racial or ethnic group or gender, or that predictions of criterion performance from GATB scores may differ for employees on a given job who are of different racial or ethnic status or gender.

Although questions about differences in correlations and about differential prediction could be raised for groups formed on the basis of a wide range of characteristics, these questions are of particular importance for groups that are known to differ in average test performance. Some of the
policy issues regarding the use of tests for selection that are raised by the existence of group differences in average test performance were discussed in the report of the National Research Council's Committee on Ability Testing, from which we quote (Wigdor and Garner, 1982:71-72):

If group differences on tests used for selection do not reflect actual differences in practice, in college or on the job, then using the test for selection may unfairly exclude a disproportionately large number of members of the group with the lower average test scores. Furthermore, even when the groups differ in average performance on the job or in college as well as in average performance on the test, the possible adverse impact on the lower-scoring group should be considered in evaluating the use of the test.

Because the differences in average test scores for some groups are relatively large, and because reliance on the scores without regard to group membership can have substantial adverse impact, "it is important to determine the degree to which the differences reflect differences in performance . . . on the job" (Wigdor and Garner, 1982:73). That is, the results of differential prediction studies are needed.

Studies have been conducted by David J. Synk, David Swarthout, and William Goode, among others, comparing the predictive validities of GATB scores obtained for black employees and white employees (e.g., U.S. Department of Labor, 1987), and for men and women (U.S. Department of Labor, 1984a). Although these comparisons of correlation coefficients for different groups are related to the issue of differential prediction, they do not provide a direct answer to the question of whether group differences in average test scores are reflected in differences in job performance.
It is possible, for example, for the correlations between a test and a criterion measure to be identical for two groups when there are substantial differences in the prediction equations for the two groups. Thus, the use of a single prediction equation could lead to predictions that systematically over- or underestimate the job performance of members of one of the groups, even though the validity coefficients are the same. Conversely, it is possible for two groups to have the same prediction equations and the same variability of actual criterion scores about their predicted values, and yet have different validity coefficients. Prediction equations are usually based on a linear regression model and are influenced by means and standard deviations of the test and criterion measure as well as the correlation. Thus the equations for two groups may differ as the result of differences in means or standard deviations as well as differences in correlations. Although differential prediction is the more important of the two topics, differences in correlations between scores on the GATB and scores on a criterion measure are also of interest. This is so because there is a
common expectation that a test, which may be known to have a useful degree of validity for majority-group employees, may have no useful degree of validity for minority-group employees. Therefore, the results of the committee's investigations of differences in correlations between the GATB and criterion measures are briefly reviewed before turning to a consideration of differential prediction.

GROUP DIFFERENCES IN CORRELATIONS

David J. Synk and David Swarthout compared the validity coefficients obtained for black and for nonminority employees in 113 Specific Aptitude Test Battery validation studies conducted since 1972 for which there were at least 25 people in each of the two groups (U.S. Department of Labor, 1987). For almost all of the 113 studies the criterion measure was based on supervisor ratings, typically "the sum of the scores from two administrations of the Standard Descriptive Rating Scale" (p. 2). The weighted average of the validity coefficients across studies was reported separately by group for each of the nine aptitude scores. Also reported were the weighted average validities for the appropriate composites for each of the five job families. The latter results are of greatest interest here because it is the composites that would be used in the proposed VG-GATB Referral System.

The weighted average job family correlations reported in Table 4 of the Synk and Swarthout report are reproduced in Table 9-1. Also shown are the number of studies and the number of employees on which each of the weighted average correlations is based.

TABLE 9-1 Weighted Average Job Family Correlations for Black and Nonminority Employees

                         Blacks                   Nonminorities
Job      Number of               Average                   Average
Family   Studies        N        Correlation      N        Correlation
I          5              196      -.01             624      .05
II         1               44       .11              81      .07
III        1               66       .19             291      .27
IV        62            3,886       .15           9,938      .19
V         44            3,662       .12           4,834      .20

SOURCE: Based on U.S. Department of Labor. 1987.
Comparison of Black and Nonminority Validities for the General Aptitude Test Battery. USES Test Research Report No. 51. Prepared by David J. Synk and David Swarthout, Northern Test Development Field Center, Detroit, Mich., for Division of Planning and Operations, Employment and Training Administration. Washington, D.C.: U.S. Department of Labor, Table 4.
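As a rough check on the averages in Table 9-1, two independent correlations can be compared with Fisher's z transformation. The sketch below is an illustration only: it treats each weighted average as if it were a single correlation based on the pooled N, which is an approximation, and it is not necessarily the critical ratio test used in the source report. The function name and the dictionary layout are hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Approximate z statistic for the difference between two
    independent correlations (Fisher's z transformation)."""
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Standard error of the difference between two Fisher z values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z2 - z1) / se


# Values from Table 9-1: (r for blacks, N, r for nonminorities, N).
# Assumption: each weighted average is treated as one pooled correlation.
families = {
    "IV": (0.15, 3886, 0.19, 9938),
    "V": (0.12, 3662, 0.20, 4834),
}

z_by_family = {name: fisher_z_difference(*vals) for name, vals in families.items()}

for name, z in z_by_family.items():
    # |z| > 1.96 corresponds to a two-sided .05 significance level.
    print(f"Job Family {name}: z = {z:.2f}")
```

Under these simplifying assumptions, both z values exceed 1.96, which is in line with the significant differences reported for Job Families IV and V.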
As the table shows, the average correlation for black employees is smaller than the corresponding average correlation for nonminority employees for all but Job Family II, in which case the results are based on only one study with a relatively small sample of employees. The difference in the average correlations for black and nonminority employees is statistically significant according to the critical ratio test reported by Synk and Swarthout for Job Families IV and V.

Synk and Swarthout did not present more detailed information about the distributions of the validity coefficients for the two groups within each job family. However, the Northern Test Development Field Center of the U.S. Department of Labor made data available to the committee that we used to compute correlations between the job-specific GATB composite and criterion measures. These correlations were computed separately for each job with at least 50 black and 50 nonminority employees with GATB scores and scores on the criterion measure. The data files overlap with those used by Synk and Swarthout, differing mainly in the number of studies, since only studies that included at least 50 people in each group were used in the present analyses. As before, the criterion measure is based on supervisor ratings in most cases, usually the Standard Descriptive Rating Scale.

A total of 72 studies had at least 50 black and 50 nonminority employees. The 72 studies included a total of 6,290 black and 11,923 nonminority employees, for an average of about 87 black and 166 nonminority employees per study. The number of black and nonminority employees per study ranged from 50 to 321 and from 56 to 761, respectively.

The correlation between the GATB composite and the criterion measure was larger for the sample of nonminority employees than for the sample of black employees in 48 of the 72 studies.
The average correlation (weighted for sample size) of the job-appropriate GATB composite with the criterion measure was .19 for nonminority employees. The corresponding weighted average for black employees was .12. Thus, the finding of Synk and Swarthout that the average correlation is smaller for blacks than for nonminorities is confirmed in our analysis.

A more detailed comparison of the distributions of correlations between the GATB composite and the criterion measure for the two groups is shown in the stem-and-leaf chart in Table 9-2. The stem-and-leaf chart can be read like a bar chart. The numbers in the center between the brackets give the first digit (i.e., tenths) of the correlation. The numbers to the left give the hundredths digit for each of the 72 correlations based on black employees, and the numbers to the right give the hundredths digit of the 72 correlations based on nonminority employees. For example, in one study the correlation for black employees was .42. That study
is depicted by the leaf of 2 to the left of the [.4]. The 1 to the right of the [.4] represents a study where the correlation for nonminority employees was .41.

TABLE 9-2 Stem-and-Leaf Chart of the Correlations of the Job-Appropriate GATB Composite with the Criterion Measure for Black and Nonminority Employees (stem = .1; leaf = .01)

Leaf for Blacks    Stem     Leaf for Nonminorities
                   [.5]     1
                   [.4]
              2    [.4]     1
             97    [.3]     57778
           4430    [.3]     00023
        8876665    [.2]     55567788
         442200    [.2]     022222333114444
 99888776666655    [.1]     55557788889
      433222110    [.1]     022234444
       98877665    [.0]     5557789
   443311111000    [.0]     1133444
            300    [-.0]    44
            877    [-.0]    7
              0    [-.1]
             55    [-.1]

Median, blacks = .13
Median, nonminorities = .185

As the table shows, there is a general tendency for the distribution of correlations to be higher for nonminorities than for blacks. The difference in medians (.185 versus .13) is similar to the difference in sample-size-weighted means (.19 versus .12). The 25th and 75th percentiles are .11 and .25 for the distribution of correlations based on nonminority employees; the corresponding figures for black employees are .03 and .21. The greater spread in the correlations for blacks compared with nonminorities is to be expected because the average number of black employees per study (87) is smaller than that for nonminorities (166). Hence, the correlations based on data for blacks have greater variability due to sampling error. Nonetheless, for a quarter of the studies, the correlation for blacks is .03 or less.

The above results give only a global picture for one of the minority groups of interest. However, the results raise serious questions about the degree of validity of the job family composites for blacks, especially in Job Families IV and V for which the results are based on a sizable number of studies and large samples of black employees. Not only are the average
validity coefficients lower for blacks than for nonminorities, but the level of the correlation for blacks is also quite low.

Comparisons of validity coefficients for other racial or ethnic groups would be of value but data are not presently available. Comparisons of validity coefficients for men and women, however, have been reported by Swarthout, Synk, and Goode (U.S. Department of Labor, 1984a).

Swarthout, Synk, and Goode analyzed the results of 122 Specific Aptitude Test Battery validation studies conducted since 1972 in which there were at least 25 men or 25 women. Only 37 of these studies had at least 25 male and 25 female employees. For those 37 studies the weighted average validity of the nine aptitude scores for men and women was reported. Except for manual dexterity, for which the average validity was .05 higher for women than for men (.14 versus .09), the average validities for men and women did not differ by more than .02 on the remaining eight aptitudes.

Unfortunately for present purposes, the comparisons of the validities of the job family composites were reported for all studies that had the minimum number of 25 men or 25 women. Thus the averages for men and women are based on overlapping but not identical sets of studies. Since the available studies in Job Families I, II, and III were all single-sex studies, only the results from the Swarthout, Synk, and Goode research for Job Families IV and V are summarized in Table 9-3. As the table (which was taken from Table 6 of the Swarthout, Synk, and Goode research) shows, the weighted average validity for women is quite similar to the corresponding value for men in both job families.

Although caution is needed in interpreting these results because the averages for men and women are not based on identical sets of studies, there does not seem to be any indication that the GATB composites for Job Families IV and V are any less valid for women than for men.
TABLE 9-3 Weighted Average Job Family Correlations for Male and Female Employees

                   Men                              Women
Job      Number of            Average     Number of            Average
Family   Studies      N       Validity    Studies      N       Validity
IV         51       8,793       .24         37       7,101       .25
V          23       2,365       .20         41       6,262       .22

SOURCE: U.S. Department of Labor. 1984. The Effect of Sex on General Aptitude Test Battery Validity and Test Scores. USES Test Research Report No. 49. Prepared by Northern Test Development Field Center, Detroit, Mich., for Division of Counseling and Test Development, Employment and Training Administration. Washington, D.C.: U.S. Department of Labor, Table 6.

It might be noted, however, that the average validities reported here are higher for men and women
than the averages that were presented earlier for blacks and nonminorities. Recall that the average weighted validities reported by Synk and Swarthout for blacks in Job Families IV and V were only .15 (based on 62 studies) and .12 (based on 44 studies), respectively (U.S. Department of Labor, 1987).

DIFFERENTIAL PREDICTION

As has already been noted, differences in validity coefficients are related to differential prediction, but the two are not identical and the latter concept is more relevant to determining if predictions based on test scores are biased against or in favor of members of a particular group. According to Standards for Educational and Psychological Testing (American Educational Research Association et al., 1985:12):

There is differential prediction, and there may be selection bias, if different algorithms (e.g., regression lines) are derived for different groups and if the predictions lead to decisions regarding people from the individual groups that are systematically different from those decisions obtained from the algorithm based on the pooled groups.

The Standards (p. 12) go on to discuss differential prediction in terms of selection bias:

[In the case of] simple regression analysis for selection using one predictor, selection bias is investigated by judging whether the regressions differ among identifiable groups in the population. If different regression slopes, intercepts, or standard errors of estimate are found among different groups, selection decisions will be biased when the same interpretation is made of a given score without regard to the group from which a person comes. Differing regression slopes or intercepts are taken to indicate that a test is differentially predictive of the groups at hand.

Since the available reports comparing validities do not provide direct evidence regarding the possibility of differential prediction, the committee conducted analyses for this report.
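The Standards' reference to slopes and intercepts can be made concrete with the usual least-squares formulas, slope = r(SD_y / SD_x) and intercept = mean_y - slope × mean_x. The following minimal sketch uses hypothetical means and standard deviations (they are not GATB data) to show two groups with identical correlations and identical slopes but different intercepts, so that a single equation would systematically over- or underpredict for one of the groups.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the least-squares regression of y on x,
    computed from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical summary statistics: same correlation, same standard
# deviations, same criterion mean, but different test-score means.
group_a = regression_line(mean_x=100.0, sd_x=20.0, mean_y=50.0, sd_y=10.0, r=0.20)
group_b = regression_line(mean_x=85.0, sd_x=20.0, mean_y=50.0, sd_y=10.0, r=0.20)

test_score = 90.0
pred_a = group_a[1] + group_a[0] * test_score
pred_b = group_b[1] + group_b[0] * test_score

print("slopes:", group_a[0], group_b[0])
print("intercepts:", group_a[1], group_b[1])
print("predictions at a test score of 90:", pred_a, pred_b)
```

In this toy case the validity coefficients are equal, yet applying group A's equation to a member of group B would underpredict that person's criterion score; equal correlations therefore do not by themselves rule out differential prediction.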
Data for these analyses were provided by the Northern Test Development Field Center of the U.S. Department of Labor. The data tape that was provided contained studies used in the Synk and Swarthout comparison of validities for black and nonminority employees (U.S. Department of Labor, 1987). Although the data tape contains a variety of other information, only one criterion measure and one test-based predictor were used in the analyses reported here. The criterion measure is the same as the one used by Synk and Swarthout. Thus, with the exception of a few studies, the criterion measure is based on supervisor ratings, usually the Standard Descriptive Rating Scale. The predictor is the job family
composite appropriate for the job family to which each study is assigned. Group membership was indicated by a variable that identified the individual as black, Native American, Asian, Hispanic, or nonminority. Only individuals identified as either black or nonminority were included in the analyses.

For each of the 72 Specific Aptitude Test Battery validation studies in the data file that had data for 50 or more black and 50 or more nonminority individuals, the following statistics were computed separately for each group and for the total combined group: the mean and standard deviation of the job family composite test score and criterion measure, the correlation between the composite test score and the criterion measure, the slope and intercept of the regression of the criterion measure on the composite test score, and the standard error of prediction. Within each study the regression equations were compared by testing the significance of the difference between the slopes, and if the slopes were not significantly different, the significance of the difference between the intercepts of the regression equations.

Standard Errors of Prediction

The standard error of prediction is based on the spread of the observed scores on the criterion measure around the criterion scores that are predicted from the test scores using the regression line. A larger standard error of prediction indicates that there is more spread around the regression line, and hence the prediction is less precise. If the standard error of prediction was consistently larger for one group than for another, then one could conclude that the errors of prediction were greater for the group with the larger standard error, and hence that the predictor is less useful for that group.

The standard error of prediction was larger for blacks than for nonminorities in 40 of the 72 studies, whereas the converse was true in the remaining 32 studies.
Since the standard error of prediction increases as the correlation decreases, one might have expected more of a tendency for the standard error of prediction to be larger for blacks than for nonminorities due to the previously discussed difference in correlations. However, the standard error of prediction also depends on the standard deviation of the criterion scores. Indeed, when the correlations are as low as those typically found between the GATB composite and the criterion measure, the standard error of prediction is dominated by the standard deviation of the criterion scores. Thus, the fact that the standard error of prediction is larger for blacks than for nonminorities only slightly more often (56 percent of the studies) than it is smaller (44 percent) is not inconsistent with the typical difference in correlations.
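The dominance of the criterion standard deviation can be seen from the large-sample form of the standard error of prediction, SD_y × sqrt(1 - r^2). The sketch below is illustrative: it ignores the small-sample degrees-of-freedom correction, uses the weighted average correlations of .12 and .19 reported earlier, and assumes hypothetical criterion standard deviations.

```python
import math


def standard_error_of_prediction(sd_y, r):
    """Large-sample standard error of prediction for a simple regression
    (no degrees-of-freedom correction)."""
    return sd_y * math.sqrt(1.0 - r * r)


# With equal criterion standard deviations (set to 1.0 for convenience),
# the two average correlations give nearly identical standard errors.
see_black = standard_error_of_prediction(1.0, 0.12)
see_nonminority = standard_error_of_prediction(1.0, 0.19)
print(f"r = .12: {see_black:.4f}   r = .19: {see_nonminority:.4f}")

# Hypothetical case: a criterion standard deviation only 2 percent smaller
# is enough to reverse the ordering despite the lower correlation.
see_black_smaller_sd = standard_error_of_prediction(0.98, 0.12)
print(f"r = .12, SD_y = 0.98: {see_black_smaller_sd:.4f}")
```

Because sqrt(1 - r^2) stays close to 1 when r is small, modest between-group differences in the spread of criterion scores can easily outweigh the effect of the difference in correlations.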
Slopes

The slopes of the regression of the criterion scores on the job family GATB composite scores were significantly different at the .05 level in only 2 of the 72 studies. The number of significant differences in slopes at the .10 level was 6 of the 72; in 4 of the latter 6 studies, the slope was greater for nonminorities than for blacks, whereas the converse was true in the other 2 studies. Although these results suggest that slope differences are relatively rare, it should be noted that the test for differences in slopes for the two groups in an individual study has relatively little power for the typical sample sizes of the studies.

A more sensitive comparison of the slopes is provided by considering the full distribution of the 72 t-ratios computed to test the difference between the slopes obtained for the two groups on a study-by-study basis. A positive t-ratio indicates that the slope for nonminority employees is greater than the slope for black employees, albeit not necessarily significantly greater. Conversely, a negative t-ratio indicates that the slope for black employees is greater than that for nonminority employees.

The distribution of the t-ratios for the tests of differences between slopes for the two groups is shown in the stem-and-leaf chart in Table 9-4. If the pairs of slopes differed only due to sampling error in the 72 studies, positive and negative t-ratios would be equally likely and the mean of the 72 t-ratios would differ from zero only by chance. As can be seen, positive t-ratios outnumber negative ones almost two to one (47 versus 25). The mean of the 72 t-ratios is .30, a value that is significantly greater than zero.
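The claim that positive and negative t-ratios would be equally likely under sampling error alone suggests a simple sign test. The sketch below is an illustrative check that is not reported in the chapter: it computes an exact two-sided binomial probability for a 47-to-25 split, treating the 72 studies as independent and counting any t-ratio of zero as positive (both assumptions). The function name is hypothetical.

```python
from math import comb


def two_sided_sign_test(n_positive, n_total):
    """Exact two-sided binomial p-value for a split at least as uneven
    as the one observed, under a 50/50 null hypothesis."""
    k = max(n_positive, n_total - n_positive)
    # Upper-tail probability of k or more successes out of n_total.
    upper_tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2.0 * upper_tail)


p_value = two_sided_sign_test(47, 72)
print(f"two-sided sign-test p-value for 47 of 72: {p_value:.4f}")
```

A p-value below .05 from this check points in the same direction as the reported test on the mean t-ratio, although the two tests are not equivalent.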
TABLE 9-4 Stem-and-Leaf Chart of the t-Ratios for the Tests of Differences Between the Slope of the Regression Based on Data for Black and Nonminority Employees in 72 Studies (stem = 1; leaf = .1)

Stem   Leaf               Count
  2    01                   2
  1    5579                 4
  1    001112222333444     15
  0    555666667888999     15
  0    01122233444         11
 -0    0012334444          10
 -0    56667                5
 -1    1233444              7
 -1    589                  3

NOTE: Median = .45.

Thus, there is a tendency for the slope to be greater for
nonminorities than for blacks, but the differences are generally not large enough to be detected reliably in an individual study because of relatively small samples of people in each group. The tendency for the slope to be somewhat greater for nonminorities than for blacks is consistent with the finding that, on average, the correlation between the GATB composite and the criterion measure is higher for nonminorities than for blacks.

When slopes are unequal, then the difference between the predictions based on the equations for the two groups will vary depending on the value of the score on the GATB composite. The practical implication of the difference in slopes depends on the relationship of the two regression lines. When the regression line for nonminorities not only has a steeper slope but also is above the regression line for blacks throughout the range of GATB scores obtained by blacks, then blacks will be predicted to have higher criterion scores if the equation for nonminorities is used than if the equation based on the data for blacks is used. However, the difference will be greater for blacks with relatively high GATB scores than for blacks with relatively low scores. Other combinations are, of course, possible when the slopes differ. However, as we show below, the above pattern is most common.

Intercepts

For the 70 studies in which the slopes were not significantly different at the .05 level, a pooled within-group slope was used and the difference in intercepts for the two groups was tested. In 26 of the 70 studies the intercepts were significantly different at the .05 level. Even with a significance level of .01, 20 of the studies had significantly different intercepts. In all 20 of the latter cases, the intercept for the nonminority employees was greater than the intercept for the black employees.
This was also the case for five of the six studies in which the difference was significant at the .05 level but not at the .01 level. Thus, in only 1 of the 26 studies in which the intercepts were significantly different was the intercept greater for black than for nonminority employees.

To get a sense of the magnitude of the difference in intercepts, the intercept for black employees was subtracted from the intercept for nonminority employees and the difference was divided by the standard deviation of the criterion scores based on the sample of black employees. The latter step was taken, in part, to account for differences in the criterion scale from one study to another and, in part, to express the difference in a metric that is defined by the spread of the scores for one of the groups. The distribution of these standardized differences in intercepts is shown in the stem-and-leaf chart in Table 9-5. (Note that all 72
studies are included in the distribution, even though 2 of the studies had significant differences in the slopes, suggesting that a pooled, within-group slope is not entirely appropriate.)

TABLE 9-5 Stem-and-Leaf Chart of Standardized Differences in Intercepts of Regression Lines for Black and Nonminority Employees (stem = .1; leaf = .01)

Stem   Leaf           Count
 .8    11               2
 .7
 .6
 .5    0122499          7
 .4    2233499          7
 .3    012233457        9
 .2    13455678999     11
 .1    1111455999      10
 .0    55777999        12
-.0    0123349          7
-.1    126              3
-.2                     0
-.3    04               2

NOTE: Median = .235.

As the table shows, the difference is positive more often than it is negative, with a median value roughly equal to one-quarter of the standard deviation of criterion scores for black employees. Values of these standardized differences in intercepts that are greater than zero indicate that the performance that would be predicted for a given test score would be higher if the equation with the pooled, within-group slope but the intercept for nonminority employees were used than if the equation with the intercept for black employees were used. With positive values the nonminority equation would tend to overestimate the criterion performance of black employees. The converse is true for standardized differences that are less than zero.

Predictions Based on the Total Group

In practice, if a single regression equation were to be used to predict the criterion performance of applicants, presumably it would not be either of the within-group equations that was used to test the differences in intercepts. Rather, a total-group equation based on the combined groups would be used. Therefore, the regression equation based on the combined group of black and nonminority employees was estimated for each study.
The potential impact of using a total-group regression equation to predict the criterion performance of black employees was evaluated by computing the predicted scores that would be obtained using the total-group equation and comparing those predictions to the values that would be obtained using the corresponding equation based on black employees only. More specifically, at each of three score values on the GATB job family composite, two scores were obtained: the predicted criterion score based on the total-group equation and the predicted criterion score based on the equation for black employees only. The latter predicted value was subtracted from the former and, as before, the difference was divided by the standard deviation of the criterion scores for black employees to take into account between-study differences in the metric of the criterion measure. The three levels of GATB job family composite score that were used were (1) the mean for black employees in the study minus one standard deviation for those employees, (2) the mean for black employees, and (3) the mean plus one standard deviation.

The distributions of these standardized differences in predicted scores are shown in the stem-and-leaf charts in Tables 9-6, 9-7, and 9-8, one for each of the predictor score levels used in the calculations.

TABLE 9-6 Stem-and-Leaf Chart of Standardized Difference in Predicted Criterion Scores Based on the Total-Group and Black-Only Regression Equations: GATB Composite Score = Black Mean Minus One Standard Deviation

Stem                Leaf                         Count
 .3                 01568                          5
 .2                 01112236                       8
 .1                 01244457889                   11
 .0                 0112222333455566677889999     25
-.0                 00111111223567788             17
-.1                 0015                           4
[stem illegible]    3
[stem illegible]    2

NOTE: Median = .05.
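The comparison described above can be sketched in a few lines. The following is a minimal illustration, not the committee's actual program: the function names are hypothetical, and the scores are toy values chosen so that the two groups share a slope while the nonminority group has a higher intercept.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Least-squares slope and intercept for the regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def standardized_prediction_differences(x_black, y_black, x_other, y_other):
    """Total-group prediction minus black-only prediction, divided by the
    criterion SD for black employees, at three composite score levels.
    Inputs are plain Python lists of test scores (x) and criterion scores (y)."""
    b_total, a_total = fit_line(x_black + x_other, y_black + y_other)
    b_black, a_black = fit_line(x_black, y_black)
    x_mean, x_sd, y_sd = mean(x_black), stdev(x_black), stdev(y_black)
    levels = {
        "mean - 1 SD": x_mean - x_sd,
        "mean": x_mean,
        "mean + 1 SD": x_mean + x_sd,
    }
    return {
        label: ((a_total + b_total * x0) - (a_black + b_black * x0)) / y_sd
        for label, x0 in levels.items()
    }


# Toy data (hypothetical): same slope in both groups, intercept one
# criterion point higher in the second group.
x_black = [1.0, 2.0, 3.0, 4.0, 5.0]
y_black = [1.0, 2.0, 3.0, 4.0, 5.0]
x_other = [1.0, 2.0, 3.0, 4.0, 5.0]
y_other = [2.0, 3.0, 4.0, 5.0, 6.0]

diffs = standardized_prediction_differences(x_black, y_black, x_other, y_other)
for label, value in diffs.items():
    print(f"{label}: {value:.3f}")
```

With these toy values the difference is positive and identical at all three score levels, because the slopes are equal; when the total-group slope exceeds the black-only slope, the difference grows with the composite score.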
Analogous to the above intercept comparisons, a positive number indicates that the predicted criterion performance of a black employee with the selected GATB composite score is higher when the total-group equation is used than when the equation for black employees only is used. In this case the total-group equation is said to overpredict or to provide a prediction that is biased in favor of black employees with that GATB composite score. Conversely, with negative numbers the total-group equation would be said to underpredict or to yield
predictions that are biased against black employees with that GATB composite score.

TABLE 9-7 Stem-and-Leaf Chart of Standardized Difference in Predicted Criterion Scores Based on the Total-Group and Black-Only Regression Equations: GATB Composite Score = Black Mean

Stem   Leaf                         Count
 .4    011                            3
 .3    012358                         6
 .2    222345588                      9
 .1    0022223333334555666778999     25
 .0    012222334555666788            18
-.0    12223446                       8
-.1    138                            3

NOTE: Median = .13.

TABLE 9-8 Stem-and-Leaf Chart of Standardized Difference in Predicted Criterion Scores Based on the Total-Group and Black-Only Regression Equations: GATB Composite Score = Black Mean Plus One Standard Deviation

Stem   Leaf              Count
 .6    133                 3
 .5    011                 3
 .4    2567                4
 .3    01235588            8
 .2    00123456788999     14
 .1    00244567788889     14
 .0    1122233345677      13
-.0    0222445667         10
-.1    27                  2
-.2                        0
-.3    4                   1

NOTE: Median = .18.

Although there is substantial variation from study to study, a large amount of which would be expected simply on the basis of sampling variability, there is some tendency for the standardized difference in predicted criterion scores to be positive. The tendency is weakest at the lowest predictor score value (median = .05) and strongest at the highest predictor score value (median = .18). The latter difference is a consequence of the total-group slope typically being slightly greater than the
slope for black employees only, and it is consistent with the finding noted above that there is a tendency for the slope based on data for nonminorities to be somewhat greater than the slope based on data for blacks.

The above results suggest that the use of a total-group regression equation generally would not give predictions that were biased against black applicants. If the total-group equation does give systematically different predictions than would be provided by the equation based on black employees only, it is somewhat more likely to overpredict than to underpredict. These results are generally consistent with results that have been reported for other tests. As was noted by Wigdor and Garner (1982:77), for example:

Predictions based on a single equation (either the one for whites or for a combined group of blacks and whites) generally yield predictions that are quite similar to, or somewhat higher than, predictions from an equation based only on data from blacks. In other words, the results do not support the notion that the traditional use of test scores in a prediction yields predictions for blacks that systematically underestimate their performance.

In considering the implications of these results, it is important to note that the criterion measure in most cases consisted of supervisor ratings. Any interpretation of the results depends on the adequacy of the criterion measure, including the lack of bias. In addition, it is important to recall that the correlation of the GATB composite with criterion performance is generally low for black employees (weighted averages of only .15 and .12 for Job Families IV and V, according to the summary reported by U.S. Department of Labor, 1987).
Given the low correlation and the substantial difference in mean scores of blacks and whites on the GATB, use of the test for selection of black applicants without taking the applicant's race into account would yield very modest gains in average criterion scores but would have substantial adverse impact. It is within this context that the differential prediction results need to be evaluated.

Performance Evaluation and the Issue of Bias

It is often demonstrated in the psychological literature that supervisor ratings are fallible indicators of job performance (e.g., Alexander and Wilkins, 1982; Hunter, 1983). In order to combat some of the weaknesses of the genre, a specially developed rating form, the Standard Descriptive Rating Scale, is used for most of the GATB criterion-related validity studies. Raters are told that the information is being elicited only for research purposes, not for any operational decisions.

Nevertheless, the possibility of racial, ethnic, or gender bias contaminating this kind of criterion measure is an issue deserving attention.
Although common sense suggests that evaluations of the performance of blacks or women might well be depressed to some degree by prejudice, it is difficult to quantify this sort of intangible (and perhaps unconscious) effect. Two recent surveys draw together the efforts to date. Kraiger and Ford (1985) and Ford et al. (1986) provide meta-analyses of the presence of race effects in various types of performance measures.

The first review (Kraiger and Ford, 1985) examines the relation between race and (subjective) performance ratings. A total of 74 studies were located, 14 of them using black as well as white raters. The analysis reveals the existence of a suggestive rater-ratee interaction: white raters rated the average white ratee higher than 64 percent of the black ratees, and black raters rated the average black ratee higher than 67 percent of the white ratees.

For white raters, there was sufficient variability and a sufficient number of studies to evaluate the effect of moderator variables. (Although there was more variability for black raters, there were too few studies to perform the moderator analysis.) The authors found a significant (p < .10) inverse correlation between the percentage of blacks in a sample and the difference in the average rating: the higher the percentage of blacks, the smaller the difference. The three remaining moderator variables had nonsignificant effects. Rater training (which may or may not have discussed race) had no impact. The purpose for obtaining the ratings, either for real, administrative reasons or for research only, had no impact. Although there appeared to be a tendency for behaviorally based rating scales to show a greater difference between blacks and whites than did trait scales, it was not significant.
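The "higher than 64 percent" and "higher than 67 percent" figures above can be restated as standardized mean differences. The following is a minimal sketch, assuming that ratings within each group are normally distributed with equal variances; that assumption, and the function names, are ours for illustration and are not taken from Kraiger and Ford (1985).

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def d_from_percent_exceeded(proportion):
    """Standardized mean difference implied when the average member of one
    group scores above `proportion` of the other group.

    Assumes normal distributions with equal variances (an illustrative
    assumption, not something stated in the source).
    """
    return STANDARD_NORMAL.inv_cdf(proportion)


def percent_exceeded_from_d(d):
    """Inverse conversion: proportion of the comparison group that falls
    below the mean of the higher-rated group."""
    return STANDARD_NORMAL.cdf(d)


# Figures quoted in the text for white raters and black raters.
d_white_raters = d_from_percent_exceeded(0.64)
d_black_raters = d_from_percent_exceeded(0.67)
print(f"white raters: d is about {d_white_raters:.2f}")
print(f"black raters: d is about {d_black_raters:.2f}")
```

Under these assumptions, both figures correspond to differences of somewhat less than half a standard deviation in ratings, in opposite directions for the two rater groups.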
Because the 1985 study was limited to subjective ratings, the authors could not attempt to estimate the relative contributions of ratee performance and rater bias to the differences in ratings found for blacks and whites. The second paper (Ford et al., 1986) represents a preliminary attempt to address the issue of the extent to which race differences in assessments of job performance are the product of meaningful performance differences or the product of rater bias. Ideally, one would have a perfect criterion, one without limitation or bias, that would provide a perfectly accurate measure of job performance for blacks and whites. In the absence of such an ultimate criterion, the authors seek to advance our understanding by looking at the extent of racial differences for objective and subjective ratings of performance.

Ford and colleagues identified 53 studies, published and unpublished, that reported at least one objective performance measure and one subjective rating for a sample of black and white workers. Comparisons are reported for three types or aspects of performance: absenteeism and
tardiness; cognitive performance; and direct performance such as units produced, accidents, or customer complaints. The meta-analysis cumulated correlations between race and objective indices of performance and subjective ratings of performance in order to compute mean effect sizes and variances across studies.

For the purposes of the committee's study, Ford and colleagues (1986) make a number of interesting observations. First, they report a relatively high degree of consistency in overall effect sizes across multiple criterion measures; in other words, there are similar magnitudes of difference between blacks and whites no matter what kind of performance is measured or what kind of measurement is used (the effect size ranges from .11 to .34).

Second, contrary to conventional wisdom, they found that the effect sizes for objective and subjective performance criteria were virtually identical. They report (1986:Table 1) for the total sample a mean effect size of the correlation (corrected for unequal sample sizes and attenuation) between ratee race and performance measure of .21 for objective criteria and .20 for subjective criteria. One conclusion that the authors draw from this is that the race effects found in subjective ratings cannot be attributed solely to rater bias.

Interestingly, the biggest reported differences in measured performance between blacks and whites are associated not with the type of criterion measure (objective or subjective) but with the type of performance measured. The biggest mean effect sizes with both objective and subjective measures were for cognitive performance: .34 and .23, respectively. In comparison, the effect sizes for direct performance were .16 for objective measures and .22 for subjective measures.
Note that although race differences are smaller when measures of direct performance are used than when cognitive performance measures are used, all measures of on-the-job performance produce much smaller differences in scores between blacks and whites than do predictor tests such as the GATB, on which blacks are typically found to score about one standard deviation below whites. We will return to this subject in Chapter 13.

The studies reported here point up the need for more attention to the matter of performance differences between blacks and whites and the extent to which measured differences reflect meaningful differences in employee performance or are the consequence of bias in the measurement technique. With regard to the immediate purpose of evaluating the GATB validation research, the possibility of bias in the criterion measure adds further grounds for caution in interpreting the validity of the GATB for minority applicants. The U.S. Employment Service's long-term research agenda should include the task of exploring the influence of bias on supervisor ratings.
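The comparison above between race differences on criterion measures (reported by Ford et al. as correlations between race and performance) and the roughly one-standard-deviation difference on predictor tests becomes easier to see when both are placed on the same scale. The sketch below converts each reported correlation to a standardized mean difference using the textbook conversion for two groups of equal size. Treating the groups as equal in size is our simplifying assumption, the dictionary labels are ours, and because the reported correlations were corrected for attenuation the result is only a rough illustration.

```python
from math import sqrt


def d_from_point_biserial(r):
    """Convert a correlation between group membership and a score into a
    standardized mean difference, assuming two groups of equal size."""
    return 2 * r / sqrt(1 - r * r)


# Mean effect sizes (correlations) quoted in the text from Ford et al. (1986).
reported_correlations = {
    "objective criteria, total sample": 0.21,
    "subjective criteria, total sample": 0.20,
    "cognitive performance, objective": 0.34,
    "cognitive performance, subjective": 0.23,
    "direct performance, objective": 0.16,
    "direct performance, subjective": 0.22,
}

PREDICTOR_DIFFERENCE_SD = 1.0  # "about one standard deviation" on the GATB

criterion_differences = {
    label: d_from_point_biserial(r) for label, r in reported_correlations.items()
}

for label, d in criterion_differences.items():
    print(f"{label}: r = {r_text}" if False else f"{label}: d is about {d:.2f}")
```

With these inputs, every converted criterion difference falls below the one-standard-deviation predictor difference, which is the pattern the text describes.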
CONCLUSIONS

Differential Validity by Race

1. Analysis of the 72 GATB validity studies that had at least 50 black and 50 nonminority employees indicates that correlations are lower for blacks than for whites. The average correlation of the GATB composite with the criterion measure was .12 for black employees and .19 for nonminority employees. Moreover, for a quarter of the studies, the correlation for blacks is .03 or less. These results raise serious questions about the degree of validity of the job family composites for black applicants. Not only are the average validity coefficients lower for blacks than for nonminorities, but also the level of the correlation for blacks is quite low.

Differential Prediction by Race

2. Are group differences in average test scores reflected in differences in performance? Analysis of the same set of 72 validity studies shows that use of a single prediction equation relating GATB scores to performance criteria for the total group of applicants would not give predictions that were biased against black applicants. That is, the test scores would not systematically underestimate their performance. A total-group equation is somewhat more likely to overpredict than to underpredict the performance of black applicants.

Criterion Bias

3. The results of our differential prediction analysis could be qualified by inadequacies in the criterion measure, including racial or ethnic bias. Supervisor ratings are susceptible to bias. There is some evidence that supervisors tend to rate employees of their own race higher than they rate employees of another race. Real performance differences could thus be confounded with spurious differences in the performance measure used to judge the accuracy of prediction of GATB scores. This is an issue that should be part of the U.S. Employment Service's long-term research agenda.
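The comparison behind Conclusion 2, between predictions from a total-group regression equation and from a black-only equation, can be illustrated with a short sketch. The data below are invented purely for illustration, and dividing the difference by the subgroup's criterion standard deviation is our assumption about how the difference is standardized; neither is taken from the GATB studies.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def standardized_difference(total_x, total_y, sub_x, sub_y, at_score):
    """Total-group prediction minus subgroup-only prediction at `at_score`,
    divided by the subgroup criterion SD (assumed standardization).
    Positive values mean the total-group equation overpredicts."""
    a_total, b_total = fit_line(total_x, total_y)
    a_sub, b_sub = fit_line(sub_x, sub_y)
    predicted_total = a_total + b_total * at_score
    predicted_sub = a_sub + b_sub * at_score
    return (predicted_total - predicted_sub) / stdev(sub_y)


# Hypothetical test scores (x) and supervisor ratings (y) for one study.
sub_x = [80, 90, 100, 110, 120]
sub_y = [2.8, 3.1, 2.9, 3.3, 3.2]
other_x = [95, 105, 115, 125, 135]
other_y = [3.0, 3.4, 3.5, 3.6, 4.0]
total_x, total_y = sub_x + other_x, sub_y + other_y

# Evaluate at the subgroup mean and at the mean plus one standard deviation,
# the two score points used in Tables 9-7 and 9-8.
at_mean = mean(sub_x)
at_plus_sd = at_mean + stdev(sub_x)
diff_at_mean = standardized_difference(total_x, total_y, sub_x, sub_y, at_mean)
diff_at_plus_sd = standardized_difference(total_x, total_y, sub_x, sub_y, at_plus_sd)
print(f"difference at subgroup mean: {diff_at_mean:.2f}")
print(f"difference at mean + 1 SD:  {diff_at_plus_sd:.2f}")
```

In this invented example the total-group slope is steeper than the subgroup slope, so the difference is positive and grows at higher scores. That mirrors the direction of the pattern in the tables but says nothing about its size in the actual studies.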
PART IV

ASSESSMENT OF THE VG-GATB PROGRAM

Part IV contains the committee's appraisal of the VG-GATB Referral System and consideration of the potential effects of its widespread use throughout the Public Employment Service. Chapter 10 lays out the plan for the VG-GATB system as it is envisioned by the U.S. Employment Service for its local office operations; discusses the claims that have been made for the system; and assesses the evidence available on its implementation from a small number of pilot studies. Chapter 11 discusses the likely effects of widespread adoption of the VG-GATB system on the specific groups involved: employers, job seekers, in particular minority job seekers, people with handicapping conditions, and veterans. Chapter 12 is the committee's assessment of the claims of potential economic benefits that have been made for VG-GATB referral, including both gains for individual firms and gains for the economy as a whole.