Recommendations for Referral and Score Reporting

A particular charge of the committee is to review the use of within-group scoring in the VG-GATB Referral System. This method of scoring transforms raw scores into percentile scores referenced to particular subpopulations (black, Hispanic, and other). It was adopted to prevent the test-based referral system from adversely affecting the employment opportunities of minority applicants. The adjustments made by computing percentile scores within the specified subpopulations have the effect of erasing average group differences in reported test performance.

There are several steps in the production of within-group percentile scores. First, the raw test scores for each applicant are converted into five job family scores, based on predetermined weightings of the cognitive, perceptual, and psychomotor composites. Then each of the applicant's five job family scores is converted to a percentile score, which shows the applicant's ranking with respect to others in the same ethnic or racial subgroup on a scale of 1 to 100. That ranking is derived from norm groups constructed from samples of blacks, Hispanics, and majority-group job incumbents who took the test in a number of General Aptitude Test Battery (GATB) validity studies. In the VG-GATB system, applicants are referred to jobs in order of their percentile scores, and the scores are reported to employers without designations of the applicant's group identity. Hence a black applicant with a Job Family IV within-group score of 70 percent will have the same referral status as a white ("other") applicant with a within-group score of 70 percent, although their raw scores would be 283 and 327, respectively.
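The percentile step described above can be illustrated with a minimal sketch. The norm-group scores below are hypothetical numbers chosen only so that raw scores of 283 and 327 both land at the 70th percentile of their own group; they are not USES norms. The percentile definition used here (share of the norm group scoring at or below the applicant, clamped to 1 through 100) is also an assumption, since the text does not spell out the exact formula.

```python
from bisect import bisect_right

# HYPOTHETICAL norm groups for one job family (illustrative values, not USES norms).
NORMS = {
    "black": sorted([230, 245, 255, 262, 270, 276, 283, 295, 310, 330]),
    "other": sorted([270, 285, 295, 305, 312, 320, 327, 340, 355, 380]),
}

def percentile_in_norm_group(raw_score, norm_scores):
    """Percentile rank on a 1-100 scale: share of the sorted norm group at or below raw_score."""
    at_or_below = bisect_right(norm_scores, raw_score)
    pct = round(100 * at_or_below / len(norm_scores))
    return min(100, max(1, pct))  # assumed clamping to the 1-100 reporting scale

def within_group_score(raw_score, group):
    """Convert a raw job family score to a percentile within the applicant's own norm group."""
    return percentile_in_norm_group(raw_score, NORMS[group])

black_applicant = within_group_score(283, "black")
other_applicant = within_group_score(327, "other")
print(black_applicant, other_applicant)
```

Under these assumed norms the two applicants receive the same reported score and therefore the same referral status, even though their raw scores differ by 44 points.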
CONCLUSIONS AND RECOMMENDATIONS

Within-group scoring is without question race-conscious. It is an example of what some commentators describe as an inclusionary or benign racial classification, because it was adopted by the U.S. Employment Service (USES) in order to enrich the employment opportunities of black and Hispanic job seekers (while at the same time promoting the overall quality of applicants referred to an employer). Others, chief among them the former Assistant Attorney General for Civil Rights, Wm. Bradford Reynolds, view within-group scoring as intentional racial discrimination, an abridgment of the equal protection clause of the Constitution and illegal under Title VII of the Civil Rights Act of 1964.

In its interim report (Wigdor and Hartigan, 1988) the committee concluded that, as an instrument of public policy, the "within-group referral procedure is an effective way to balance the conflicting goals of productivity and racial equity," at least as far as the individual employer is concerned. Nevertheless, the committee refrained from endorsing the way within-group percentile scores are being used in the VG-GATB Referral System because of concerns about its legal status, about the representativeness of the norm groups used in score conversions, and about potential misunderstanding by employers and applicants in interpreting the reported scores. The sole use of group-based percentile scores, in the absence of any information about the applicant's self-reported group membership or about the size of the adjustments made to minority scores, would encourage two kinds of misinterpretation on the part of employers:

1. The employer could easily assume that all individuals with the same reported score achieved the same raw score on the GATB.
2. The employer might also be led to assume that all candidates with the same percentile score on the test would have the same expected performance on the job.
We could have added a third reservation, for, if the VG-GATB Referral System became a very important route to employment, policy makers would have to anticipate that at least some applicants might claim minority status at the local Job Service office in order to get the benefit of preferential score adjustments and make no such claim at the workplace, so that the meaning of the reported score would be interpreted with reference to the majority group.

Despite these reservations, we conclude this chapter with the recommendation that score adjustments, possibly within-group percentile score adjustments, continue to play a role, albeit a somewhat different role, in the VG-GATB Referral System for reasons that emerge from our technical analyses of GATB data as well as considerations of social policy.
The analysis in the committee's interim report was based on theoretical comparisons of within-group scoring and a number of alternative referral and reporting options. It was taken as given that referral would be based on a test in which minority average scores were substantially lower than majority average scores. The assumptions that allowed the theoretical comparisons were chosen to match, as best we knew, the circumstances of Employment Service referrals. The comparisons also depended on assumptions about the validity of the test and its predictive behavior for different racial groups.

We are now in a position to look again at alternative score reporting and referral models, but at this time many of the earlier assumptions can be replaced by empirical statements. Evidence presented earlier in this report establishes that the average scores of black Job Service clients are substantially lower than those of majority clients, although the difference varies somewhat by job family.

Our earlier assumption that the GATB does not predict differently for different racial groups needs some qualification in light of the analyses presented in Chapter 9. There is evidence that the GATB has somewhat lower correlations with supervisor ratings of job performance for blacks compared with whites. Nevertheless, the use of a regression equation based on the combined group of black and nonminority workers would generally not give predictions that are biased against blacks. Insofar as the total-group equation gives systematically different predictions, it is somewhat more likely to overpredict the performance of blacks than to underpredict. The degree of overprediction is slight at the lower score ranges, and somewhat larger at higher score levels.

We have now made independent estimates of GATB validities (presented in Chapter 8), taking account of recent (post-1972) validity studies.
The modest relationship between GATB scores and ratings of performance on the job (our estimate is an average corrected validity of .30, with about 90 percent of the jobs studied falling in the range of .20 to .40) is one important factor for policy makers to consider in assessing various referral alternatives.

PERSPECTIVES ON TEST FAIRNESS

What makes the use of a test fair? Like most Americans, testing specialists have wrestled with questions of equity and fairness in the past two decades. A number of models for the fair use of tests have been proposed in the psychometric literature. The following discussion of fairness draws on this literature as well as more popular sources to build a framework for the analysis of score reporting and referral methods.
To illustrate various perspectives on fairness, we take as given the conditions that would apply in the proposed VG-GATB system:

1. Applicants meeting other criteria set by the employer will be referred in order of their scores on the test.
2. The test is modestly predictive of job performance, so expected performance increases with test score.
3. The applicants represent several population subgroups.
4. There are substantial subgroup differences in average scores on the test.

When can the use of a test be said to be fair to the various subgroups? The perspectives offered by psychometricians are derived from quantitative analysis of the joint distributions of group status, test score, and job performance (as indicated by a criterion measure such as supervisor ratings of performance). Since only group status and test scores are known for applicants, information about future job performance must be extrapolated from validity studies of job incumbents who have taken the test.

The many definitions of fairness that have grown out of concern about the use of employment tests can be distilled for our purposes into two general approaches: fairness in predicting job performance from test score and fairness in selection, given job performance.

Fairness in Predicting Job Performance from Test Score

It can be argued that selection is fair if the predicted distribution of job performance for people with a given test score does not vary by population subgroup. We expect a white person with a test score of 70 to perform about the same as a black or Hispanic person with a test score of 70. In this conception of fairness, the focus is on prediction and whether the test predicts differently for different groups. If there is no evidence of differential prediction by group, then knowing any individual's test score is sufficient to predict job performance; the employer can make the same inferences about future job performance for all applicants.
If, however, a test is found to predict differentially (as the GATB appears to for white and black applicants), then information about group status would be necessary to make appropriate inferences from test scores. In this definition, fairness consists of the evenhandedness with which the test predicts the future job performance of various subgroups. If a given test score can be associated with the same level of future job performance for black and white applicants, that is to say, if there is no predictive bias, then the test is fair and, to the extent that one feels that selection should be based solely on predicted performance, the selection
system is fair. Note that this definition of test fairness does not address group differences in average test scores or the legal problem of adverse impact.

This definition is the classical one (Cleary, 1968) and the conception of fairness most widely accepted in the psychometric literature, at least as a minimum requirement (e.g., Petersen and Novick, 1976; American Educational Research Association et al., 1985). When testing professionals refer to test bias, it is differential prediction that they have in mind (contrary to certain popular usage, in which the claim of bias refers to group differences in average scores). The general approach also appears in the fair pay literature. In that context, fairness requires that the formula best predicting pay as a function of legally compensable factors (qualifications, experience, seniority) be the same for all groups.

Because of the existence of substantial group differences in average test scores, particularly differences between black and majority-group job applicants, many now find this definition of fairness insufficient, at least as it pertains to allotting employment opportunity. A test may be fair in predicting performance, but nevertheless predict performance rather poorly. When that is so, many able workers will be rejected by the test, including a disproportionately large number of able minority workers.

Fairness in Selection, Given Job Performance

An alternative approach to fairness focuses not on prediction equations, but on realized job performance (e.g., Darlington, 1971; Cole, 1973). Selection can be considered "performance fair" if people with a given level of performance on the job have the same distribution of test scores, no matter what population subgroup they belong to. In that case, a rule that selects workers in order of test score will select the same proportion of good workers in each population subgroup.
The question asked from this perspective is, Do workers of equal job proficiency in the several groups have the same chance of selection?

At first glance, it would seem that if the use of a test is fair in the first sense, it would also be fair in the second. But it is possible to satisfy both definitions of fairness only if prediction of job performance from test score is perfect, or if all groups have the same joint distribution of test score and performance. Neither of these conditions is met in the GATB. Tests are at best only moderately good predictors of job performance. Human performance is far too complex to expect anything approaching perfect prediction. One of the consequences of prediction error is that some people who could perform well on the job but who score in the lower ranges on the test are screened out, whereas
some others who do well on the test, and hence are selected, will perform inadequately on the job.

FIGURE 13-1 Effects of imperfect prediction when there are subpopulation differences in average test scores. [Figure: overlapping group ellipses plotted against test score (low to high), with cutoff lines dividing the plane into Sectors A, B, C, and D.]

So long as there are average group differences in test scores (and these are likely to manifest themselves whenever racially or ethnically identifiable subgroups live in circumstances of comparative disadvantage), the effects of imperfect prediction will fall more heavily on these disadvantaged minorities than on other social groups. Figure 13-1 shows why the effects of imperfect prediction fall disproportionately on groups that have lower average test scores than the majority group. It should be remembered, however, that the phenomenon is not the result of some racial or ethnic bias inherent in the test; the impact is the same for all low-scoring individuals, regardless of group identity. Not only do low scorers have a greater likelihood
of being erroneously rejected, but high scorers also have a greater likelihood of being erroneously accepted.

In the figure the horizontal line labeled "criterion cutoff" distinguishes adequate from unsatisfactory performance on the job. The vertical line labeled "test cutoff" represents the score below which no applicant will be selected. Ellipses representing the joint distribution of job and test performance for majority and minority groups are superimposed, one upon the other. Note that the white group has higher job performance and test scores on average, although there is also a good deal of overlap between the two groups. The intersection of the criterion cutoff and test cutoff creates four sectors: Sector A = successful performance on both test and criterion; Sector B = successful test performance, unsuccessful job performance; Sector C = unsuccessful performance on both test and criterion; and Sector D = successful job performance and unsuccessful test performance. Sectors B and D represent prediction error.

Because the average test and performance scores are higher for the majority group than for the minority group, more of the majority ellipse falls in Sector A (successful performance on both test and criterion). Conversely, more of the minority ellipse falls in Sector C (unsuccessful performance on both test and criterion).

Now observe Sectors B and D. A larger segment of the majority ellipse than the minority ellipse can be seen to fall in B, which means that proportionally greater numbers of majority applicants will be selected but will perform unsuccessfully. And a larger segment of the minority ellipse falls in Sector D, which means that minority applicants who could have performed adequately on the job will be screened out in greater numbers. It is the Sector B and D effects that violate the conception of fairness that we have called "performance fair."
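The sector argument can be checked with a small simulation. This is a rough sketch under stated assumptions, not the committee's analysis: both groups share one regression line (no predictive bias), scores are normally distributed, the validity is set to .3, the minority test mean is 0.9 standard deviations lower, and both cutoffs sit at the majority average. All function names and cutoff values are illustrative choices.

```python
import random

def simulate_group(mean_test, n, r, rng):
    """Draw (test, performance) pairs from ONE shared regression line: perf = r * test + noise."""
    noise_sd = (1 - r * r) ** 0.5
    pairs = []
    for _ in range(n):
        test = rng.gauss(mean_test, 1.0)
        perf = r * test + rng.gauss(0.0, noise_sd)
        pairs.append((test, perf))
    return pairs

def sector_rates(pairs, test_cutoff, criterion_cutoff):
    """Return (share of good workers failing the test, share of poor workers passing it)."""
    good = [t for t, p in pairs if p >= criterion_cutoff]
    poor = [t for t, p in pairs if p < criterion_cutoff]
    false_reject = sum(t < test_cutoff for t in good) / len(good)   # Sector D among good workers
    false_accept = sum(t >= test_cutoff for t in poor) / len(poor)  # Sector B among poor workers
    return false_reject, false_accept

rng = random.Random(0)   # fixed seed so the run is repeatable
R = 0.3                  # assumed validity
majority = simulate_group(0.0, 50_000, R, rng)
minority = simulate_group(-0.9, 50_000, R, rng)   # assumed 0.9 SD gap in test means

maj_fr, maj_fa = sector_rates(majority, test_cutoff=0.0, criterion_cutoff=0.0)
min_fr, min_fa = sector_rates(minority, test_cutoff=0.0, criterion_cutoff=0.0)
print(f"good workers screened out:  majority {maj_fr:.2f}, minority {min_fr:.2f}")
print(f"poor workers screened in:   majority {maj_fa:.2f}, minority {min_fa:.2f}")
```

Because the same regression line generates both groups, any gap between the printed rates comes from the difference in test means combined with imperfect prediction, which is the point the sector description makes.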
These effects occur despite the absence of any predictive bias in the test itself.

Richard T. Seymour, representing the Lawyers' Committee for Civil Rights Under Law at a meeting of the committee and its liaison group, made a forceful statement of this view of fairness as a function of performance (Seymour, 1988). His analysis, which is based on GATB validity data for 47 jobs, illustrates the effects of rejection errors and acceptance errors: many more of the successful black job incumbents in the validity studies would not have been referred had the test scores been the basis of referral; conversely, of the marginal job incumbents (those who received low supervisor ratings), a greater proportion of whites than blacks would have been referred had test scores been used. These effects of prediction error led him to conclude that the GATB produces "an extreme degree of racial unfairness" (Seymour, 1988):
The evidence is overwhelming that tests work differently for blacks and for whites, and that they both systematically under-predict black job performance and over-predict white job performance. [Reliance on cognitive ability tests] can only be justified as an affirmative-action program for whites, to ensure that whites are represented in desirable jobs at rates beyond the natural limits of their abilities.

As a consequence, he strongly recommends against further use of the VG-GATB Referral System.

Mr. Seymour seems not to acknowledge the two types of fairness analysis we have described when he claims (erroneously) that the GATB underpredicts black job performance and overpredicts white performance. We must reemphasize the point that the effects he describes are not inherently bound up with race or ethnicity, but rather with high and low scores. Nevertheless, the undoubted effect of imperfect prediction when social groups have different average test scores is to place the greater burden of prediction error on the shoulders of the lower-scoring group. Is this fair? In the final analysis, we think not. But there are complexities to the question that require explication.

An Example Comparing Different Concepts of Fairness

As a more concrete way of illustrating the effects pictured in Figure 13-1, we present the results of a GATB validity study on carpenters that included 91 whites and 45 blacks. The individuals in the study were already on the job. They took the GATB test and were rated by their supervisor. Arbitrary cutoffs were used to divide the groups into high and low test scorers and high and low performers on the job.
The frequency counts showing joint distributions of job and test performance for each group are shown in the table below:

Frequency Counts Showing the Joint Distributions of Test Performance and Job Performance for 91 White and 45 Black Workers

                    Test Performance
                    Whites (N = 91)      Blacks (N = 45)
Job Performance     Fail      Pass       Fail      Pass
Good                11        60         8         8
Poor                11        9          24        5

There are three different ways to convert these frequency counts to percentages, and each presents a different perspective on fairness. The first method evaluates predictive fairness. The raw data are converted to percentages so that the columns sum to 100, as shown in the table below.
Column Percentages Computed to Elucidate the Conception of Predictive Fairness

                    Test Performance
                    Whites                 Blacks
Job Performance     Fail (%)   Pass (%)    Fail (%)   Pass (%)
Good                50         87          25         62
Poor                50         13          75         38
                    (100)      (100)       (100)      (100)

Now we can see that 50 percent of white carpenters (11 of 22) who fail the test do well on the job, whereas only 25 percent of black carpenters (8 of 32) do so. And whereas only 13 percent of whites who pass the test do poorly on the job, the figure for blacks is 38 percent. When analyzed this way, the data reveal that more white test failers than black ones would do satisfactory work if given the chance, and more blacks than whites are passing the test and proving to be unsatisfactory workers. Thus the test overpredicts black job performance and is predictively unfair to whites.

The second method of converting the frequency counts illustrates performance fairness. It creates percentages in such a way that the row percentages sum to 100, as shown in the table below.

Row Percentages Computed to Elucidate the Conception of Performance-Based Fairness

                    Test Performance
                    Whites                          Blacks
Job Performance     Fail (%)   Pass (%)             Fail (%)   Pass (%)
Good                15         85        (100%)     50         50        (100%)
Poor                55         45        (100%)     83         17        (100%)

Look first at good workers who fail the test and would therefore never have been referred to the employer had a test-based system been in place (Sector D in Figure 13-1). The numbers are 15 percent for white carpenters (11 of 71) and 50 percent for black carpenters (8 of 16). For the poor workers, 45 percent of white workers who are poor performers (9 of 20) pass the test and thus are among those who would have been referred for employment (Sector B in Figure 13-1). By comparison, only 17 percent of blacks (5 of 29) who are poor workers passed the test.
Viewed this way, the percentages say that good black workers will be disproportionately screened out in a test-based referral system, and unsatisfactory white workers disproportionately screened in. The test is performance-biased against black workers.
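The first two conversions can be reproduced directly from the carpenter frequency counts. The sketch below is only an illustration of the arithmetic; the data layout and function names are assumptions made for this example, while the counts themselves are those reported for the study.

```python
# Frequency counts from the carpenter study, keyed by (job performance, test result).
COUNTS = {
    "white": {("good", "fail"): 11, ("good", "pass"): 60,
              ("poor", "fail"): 11, ("poor", "pass"): 9},
    "black": {("good", "fail"): 8, ("good", "pass"): 8,
              ("poor", "fail"): 24, ("poor", "pass"): 5},
}

def column_percent(cells, job, test):
    """Predictive view: among workers with this test result, the share with this job rating."""
    column_total = sum(n for (_, t), n in cells.items() if t == test)
    return 100 * cells[(job, test)] / column_total

def row_percent(cells, job, test):
    """Performance view: among workers with this job rating, the share with this test result."""
    row_total = sum(n for (j, _), n in cells.items() if j == job)
    return 100 * cells[(job, test)] / row_total

for group, cells in COUNTS.items():
    print(group)
    print("  test failers who are good workers: %.0f%%" % column_percent(cells, "good", "fail"))
    print("  test passers who are poor workers: %.0f%%" % column_percent(cells, "poor", "pass"))
    print("  good workers who fail the test:    %.0f%%" % row_percent(cells, "good", "fail"))
    print("  poor workers who pass the test:    %.0f%%" % row_percent(cells, "poor", "pass"))
```

The same four cells per group support opposite conclusions depending on whether the denominator is the test column or the job-performance row, which is why the two conceptions of fairness disagree.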
There is a third way to look at the frequency data, and that is to compute percentages within each racial group. The effect is to show what the numbers in each cell would be for blacks and for whites if the sample size was 100 for each group, as shown in the table below.

Proportional Percentages of White and Black Workers in Each Test Performance by Job Performance Category

                    Test Performance
                    Whites                 Blacks
Job Performance     Fail (%)   Pass (%)    Fail (%)   Pass (%)
Good                12         66          18         18
Poor                12         10          53         11
                    (100%)                 (100%)

This presentation of the data also tells an important story. First, group differences in test performance and job performance are a reality. Black carpenters score substantially lower on the test, so any system of top-down referral will find proportionally more blacks below the cutoff score than whites, 71 percent compared with 24 percent. Black carpenters also perform poorly on the job in substantially greater proportions, or, put the other way, a larger percentage of whites perform satisfactorily on the job, 78 percent compared with 36 percent of black carpenters. (This numerical demonstration assumes that the supervisor ratings of performance are themselves valid.)

Second, the proportion of correct classifications is reasonably similar for the two groups; 78 percent of white carpenters were correctly classified compared with 71 percent of blacks. But the damaging prediction errors fall more heavily on the black carpenters. Of the 36 percent who performed well on the job, 18 percent, fully one-half, would not have been referred for employment under a straight rank-ordering of applicants.

Each way of looking at the data provides insights about the effects of using a test to screen job applicants. Which truth is the most important truth?
At this point in our history, it is certain that the use of the GATB without some sort of score adjustments would systematically screen out blacks, some of whom could have performed satisfactorily on the job. Fair test use would seem to require at the very least that the inadequacies of the technology should not fall more heavily on the social groups already burdened by the effects of past and present discrimination.
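The within-group summary figures quoted in the third view (share below the cutoff, share performing well, share correctly classified, and share of good workers screened out) follow from the same counts. The sketch below recomputes them; the tuple layout and the `summarize` helper are illustrative names, not part of the study.

```python
# Counts per group in the order (good & fail, good & pass, poor & fail, poor & pass).
CELLS = {"white": (11, 60, 11, 9), "black": (8, 8, 24, 5)}

def summarize(good_fail, good_pass, poor_fail, poor_pass):
    """Express key cell combinations as percentages of the whole group."""
    n = good_fail + good_pass + poor_fail + poor_pass

    def pct(count):
        return 100 * count / n

    return {
        "below_cutoff": pct(good_fail + poor_fail),          # would not be referred
        "good_on_job": pct(good_fail + good_pass),           # rated satisfactory
        "correctly_classified": pct(good_pass + poor_fail),  # test and rating agree
        "good_but_screened_out": pct(good_fail),             # able workers the test rejects
    }

SUMMARY = {group: summarize(*cells) for group, cells in CELLS.items()}
for group, stats in SUMMARY.items():
    print(group, {name: round(value) for name, value in stats.items()})
```

For the black carpenters the screened-out good workers are half of all good workers, while overall classification accuracy differs only modestly between the groups.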
EQUITY AND EFFICIENCY: COMPARISON OF FOUR REFERRAL MODELS

The question of fair use of the GATB is not one that can be settled by psychometric considerations alone, but neither can referral policy be decided on the basis of equity concerns alone. If there is a strong federal commitment to helping blacks, women, and certain other minority groups move into the economic mainstream, there is also a compelling interest in improving productivity and strengthening the competitive position of the country in the world market. The underlying principle of the VG-GATB system is to make the maximization of performance the basis of the person-job match. It is a productivity-oriented referral procedure that, through the addition of score adjustments, has been made responsive to equal employment opportunity policy.

In our interim report, we evaluated six possible referral rules for their effect on estimated job performance and on the proportion of minority-group members who would be referred. In the following discussion we look at four rules, including one new variant, that most clearly illustrate the available policy options. Two of the rules use linear adjustments to minority scores, different for each group, to increase minority referral rates.

The four rules presented for consideration are: (1) raw-score, top-down referral; (2) within-group percentile score, top-down referral; (3) performance-based score, top-down referral; and (4) minimum competency referral.

Raw-score, top-down referral is referral made from the total group of applicants in order of unmodified test score. This rule complements the conception of fairness as lack of differential prediction. If the predicted job performance for a given test score is the same for all population groups, then the set of applicants with highest expected productivity is obtained by referring in order of test score.
However, given current average group score differences, the rule would produce substantial adverse impact on the lower-scoring groups. The question that policy makers must ask of the VG-GATB system is whether the gains in expected performance are sufficient to justify this impact.

Within-group percentile score, top-down referral is referral in which a percentile score is computed for each applicant by comparing the raw score for that applicant with the scores obtained by a norm group of the same racial or ethnic identity. (Equivalently, a different linear transformation is applied to the raw test score for the different groups so that the mean and the variance of test scores are the same for all groups. In the simplest case, the quantity m is added to each minority score, where m is the difference between majority and minority means.) Referral is made from the total group of applicants in order of modified test score. Given
average GATB validities of about .3, this referral rule would eliminate the disproportionate rates of false rejections of able black workers and false acceptances of inadequate white workers described in the discussion of performance fairness. If GATB validities were substantially higher, the rule would overcompensate.

Performance-based score, top-down referral, a variant of the previous rule, is referral by test scores that are adjusted for group membership in such a way that the distribution of test scores at a given level of performance is the same for all groups. In the simplest case, a different linear transformation is applied to the raw test score for the different groups so that the mean and the variance of test scores for a given performance score are the same for all groups. That is, the score adjustment adds $(1 - r^2)m$ to each minority score, where r is the correlation between test score and job performance, and m is the difference between majority- and minority-group means. Although conceptually attractive, this approach requires that the test's validity for the job be estimated.

Although this rule will be seen (Tables 13-1, 13-2) to function essentially the same as the within-group percentile rule when test validities are modest (as they currently are for the GATB), we treat it separately because it is more specifically a remedy for imperfect prediction. Were there a test with perfect or nearly perfect predictive power, this rule would function like the raw-score, top-down rule.

Minimum competency referral is the system used before the introduction of VG-GATB procedures in which applicants who score above some minimum cutoff score, set perhaps by the employer, are referred at random.
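The two linear score adjustments described for the within-group and performance-based rules can be compared numerically. The sketch below uses the Scenario A values stated later in the chapter (r = .2, m = .9 majority standard deviations) and adds one assumption of its own: test scores are treated as normally distributed in both groups, with unit standard deviation, so that the share of the minority group above the majority's 20 percent cutoff can be computed. The function names are illustrative.

```python
from statistics import NormalDist

def within_group_shift(m):
    """Within-group rule, simplest case: add the full mean difference m."""
    return m

def performance_based_shift(m, r):
    """Performance-based rule, simplest case: add (1 - r^2) * m."""
    return (1 - r ** 2) * m

def percent_minority_referred(minority_mean, minority_sd=1.0, majority_rate=0.20):
    """Percent of the minority group above the cutoff that refers `majority_rate` of a
    standard normal majority group. ASSUMES normal score distributions."""
    cutoff = NormalDist().inv_cdf(1 - majority_rate)
    return 100 * (1 - NormalDist(minority_mean, minority_sd).cdf(cutoff))

M = 0.9  # difference between majority and minority means (Scenario A)
R = 0.2  # correlation between test score and job performance (Scenario A)

for rule, shift in [("within-group", within_group_shift(M)),
                    ("performance-based", performance_based_shift(M, R))]:
    adjusted_mean = -M + shift
    referred = percent_minority_referred(adjusted_mean)
    print(f"{rule}: adjusted minority mean {adjusted_mean:.2f}, about {referred:.0f}% referred")
```

With a validity of .2 the factor $(1 - r^2)$ is .96, so the two adjustments differ by only about .04 standard deviations; as r approaches 1 the performance-based shift shrinks toward zero and the rule approaches raw-score referral.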
We may view selection under this rule as being determined by an adjusted test score that is obtained from the original by randomly reassigning test score values above the cutoff to all examinees who initially score above the cutoff and randomly reassigning test score values below the cutoff to all examinees who score initially below the cutoff.

Analysis of the Referral Rules

There are two statistical computations used to assess both the gains in expected performance from using the GATB and its adverse impact on groups that tend to score low on the test. The first is the correlation between the test score and job performance. The gain in expected performance, under certain assumptions about the distribution of test scores and job performance measures when workers are selected in order of their test scores, is proportional to the correlation between test score and job performance (Brogden's formula, see Chapter 12, this volume). The second is the proportion of minorities referred. These proportions are
determined by the distributions of test scores for the different population groups.

Our analysis compares the four referral rules by computing the correlations of test scores with job performance, as well as the means and standard deviations of the test scores for minority and majority groups. The majority-group scores are assumed to have a mean of 0 and a standard deviation of 1.

We present the analysis in two scenarios. The first assumes that the GATB predicts job performance equally well for black and white groups. The second assumes that predictions of job performance from GATB scores differ somewhat, with the effect that use of a single prediction equation will slightly overpredict the performance of blacks. Our analysis of the subset of GATB validity studies that report results separately by race indicates that such differential prediction may exist, but the small sample sizes and the possibility of bias in the criterion (supervisor ratings) make us reluctant to place too much emphasis on the results.

For convenience in comparing the performance of the various referral rules with and without the effects of differential prediction, both analyses are based on the small number of studies that report validities separately by race. The average validity coefficient for this set of 72 studies is substantially lower than the mean value of .3 for all 755 GATB validity studies used elsewhere in the report.

Scenario A: No Differential Prediction, Applicant Groups Are Like the Norm Group (Table 13-1)

1. The raw test score and job performance have correlation .2 for the majority group and for the minority group, and the same regression line predicts mean job performance for a given test score in both groups. The uncorrected correlation of .2 is based on the set of 72 validity studies that contain at least 50 black and 50 nonminority workers (see Chapter 9).
Note that the average value for these studies is at the lower end of the range of validities (.2 to .4) found for the entire set of 755 studies undertaken by USES.
2. The mean test score for the minority group is .9 standard deviations below the mean test score for the majority group, and the minority group and majority group have equal test score standard deviations. These assumptions are taken from the USES norm groups for Job Families IV and V, which include almost all jobs typically filled through the Employment Service. (The norm-group differences in average test scores between "other" and black are 1 standard deviation for Job Family IV and 0.8 standard deviation for Job Family V; since jobs are divided nearly evenly between the two families, an overall figure of .9 is assumed.)
3. Thirty percent of the applicant group is minority. This figure corresponds to recent Employment Service national registrations. This fraction may vary markedly from job to job and from locality to locality.

TABLE 13-1 Scenario A: No Differential Prediction. Correlations with Job Performance, Minority Group Means, and Standard Deviations,(a) and Percentage of Minority Group Referred When 20 Percent of the Majority Group Are Referred, for Different Referral Rules

Referral Rule                               Correlation   Minority Mean   Minority Standard Deviation   Percentage of Minority Group Referred
Raw score                                   .22           -0.90           1.0                           [illegible]
Within-group                                .20           0.00            1.0                           20
Performance-based                           .20           -0.04           1.0                           19
Minimum competency ([illegible] cutoff)     .13           -0.58           1.2                           [illegible]
Minimum competency ([illegible] cutoff)     .09           -0.49           1.1                           8

(a) Taking majority-group means to be 0 and standard deviations to be 1.

As Table 13-1 illustrates, the two referral models that incorporate score adjustments dramatically increase the percentage of minority applicants referred and yet show only small decreases in predictive power compared with the raw-score, top-down referral rule. The minimum competency model does not compare favorably on either dimension of interest, expected performance or minority presence in the referral pool.

Scenario B: Differential Prediction, Applicant Groups Are Like Worker Groups (Table 13-2)

1. The correlation between test score and job performance for the majority group is .20. The correlation between test score and job performance for the minority group is .15. The difference between expected job performance of majority and minority groups is 0.20 standard deviations (in units of majority job performance) when the test score is at the majority average. These assumptions are drawn from the analysis in Chapter 9 of 72 GATB studies that contained at least 50 black and 50 nonminority workers. (The correlations have not been corrected for reliability or restriction of range.)
First, the correlation of test score and job performance is somewhat lower for blacks than for whites. Second, at any given test score, blacks with that test score have lower average job performance than whites with the same test score. This effect appears in the regression
equations as differences in the regression intercepts. The correlations obtained from the 72 studies are substantially lower than those obtained in the 755 studies, and they are based on relatively few workers, so that we cannot confidently extend the corresponding regression equations to all workers. Nevertheless, it seems prudent to consider these assumptions as an alternative scenario allowing for differential prediction.

TABLE 13-2 Scenario B: Differential Prediction. Correlations with Job Performance, Minority Group Means, and Standard Deviations,a and Percentages of Minority Group Referred When 20 Percent of the Majority Group Are Referred, for Different Referral Rules

Referral Rule        Correlation   Minority   Minority    Percentage of
                                   Mean       Standard    Minority Group
                                              Deviation   Referred
Raw score            .23           -0.90      0.9         3
Within-group         .18            0.00      0.9         18
Performance-based    .19           -0.05      0.9         16

a Taking majority-group means to be 0 and standard deviations to be 1.

2. The mean test score for the minority group is 0.9 standard deviations below the mean test score for the majority group, and the minority-group standard deviation is 0.9 that of the majority group. These assumptions are based on 72 GATB studies, each of which contained at least 50 black and 50 nonminority workers, all from Job Families IV and V; the median difference in average test scores (with the majority-group standard deviation in each study set equal to 1) is 0.88; the median ratio of minority standard deviation to majority standard deviation is .90. These assumptions disagree with the USES norm groups only in the minority-group standard deviations being .9 rather than 1. (It might be argued that the lower minority standard deviations are due to restriction in range operating differently for the minority group and the majority group.
The same argument would justify correcting the observed minority correlation by 10 percent to .165; but, in the absence of data on plausible applicant groups, we here accept the numbers as given.) The median majority-group standard deviation for the 72 jobs agrees quite closely with the standard deviations used in the USES norm group, so that within-group scoring is equivalent to increasing minority scores by 0.9 majority-group standard deviations.
3. Thirty percent of the applicant group is minority, in agreement with the observed proportion in the 72 studies.

From Table 13-2, we see the same general pattern of large gains in the percentage of minority applicants who would be referred under the models that incorporate score adjustments at some cost to expected performance. The principal difference between the two scenarios lies in
the effect on the validity coefficient of using the referral rules that incorporate score adjustments. For the example with no differential prediction, using the within-group and performance-based rules causes a drop of 10 percent in the correlation; for the model with differential prediction, using these rules causes a drop of about 20 percent in the correlation. The larger reduction in correlation observed under the assumption of differential prediction is due principally to the difference in mean scores of the two groups, not to the lower validities found in this data set.

The effect on productivity of the drop in correlation from .22 to .20 in Scenario A or from .23 to .18 in Scenario B depends on a number of unconsidered variables, including selection ratio. When the selection ratio is high, say 1 in 3, such a decline in correlation is trivial. However, if a great deal of choice is available, say 10 or 20 viable candidates for every job opening, such a drop in correlation will result in a much larger difference in expected productivity. In that case, the additional 10 percent reduction in correlation due to differential prediction makes it of considerable practical interest to know which of the two scenarios is more likely to be applicable.

Effects of the Referral Rules

Raw-Score, Top-Down Referral

This rule results in the highest expected performance in the referred group and the lowest minority-group proportion referred. The effect is extreme when the referral ratio is low. If the referral ratio is 1 in 5 for the majority group and the applicant group is 30 percent minority, only 4 percent of the minority applicants will be referred, one-seventh of the majority-group rate. Yet, when the validity is modest, as it is here, many of the minority applicants excluded would have performed better than many of the majority workers included.
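The minority referral rates quoted for raw-score, top-down referral can be approximated with a short calculation. The sketch below is illustrative only: it assumes test scores are normally distributed within each group (an assumption not stated in this passage), expresses scores in majority-group standard deviation units, and uses a hypothetical helper name, minority_referral_rate.

```python
from statistics import NormalDist


def minority_referral_rate(majority_rate, gap_sd, minority_sd=1.0):
    """Share of minority applicants above the raw-score cutoff.

    Assumption: majority scores ~ N(0, 1) and minority scores
    ~ N(-gap_sd, minority_sd), both in majority-group SD units.
    """
    # Cutoff that refers the top `majority_rate` of the majority group.
    cutoff = NormalDist().inv_cdf(1.0 - majority_rate)
    minority = NormalDist(mu=-gap_sd, sigma=minority_sd)
    return 1.0 - minority.cdf(cutoff)


# Scenario A assumptions: gap of 0.9 SD, equal standard deviations.
rate_a = minority_referral_rate(0.20, 0.9)
# Scenario B assumptions: gap of 0.9 SD, minority SD of 0.9.
rate_b = minority_referral_rate(0.20, 0.9, 0.9)
print(f"Scenario A: {rate_a:.1%}  Scenario B: {rate_b:.1%}")
```

Under these assumptions the rate comes out near 4 percent for Scenario A and near 3 percent for Scenario B, in line with the figure in the text and the raw-score row of Table 13-2.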
Conclusion This rule has an adverse impact on minority applicants that, in our judgment, is out of all proportion to the gains in expected job performance (as measured by supervisor ratings). Within-Group Percentile Score, Top-Down Referral This rule achieves the highest proportions of minority referrals, with some loss in correlation with job performance. When the applicant group has the same distribution of minority and majority scores as the norm group, the proportion of minorities referred is the same as the
proportion of minorities in the applicant group, and the goal of eliminating adverse impact is achieved. For the 72 worker groups in the differential validity studies, the rule appears to refer nearly proportionately overall, although there are substantial deviations in particular worker groups.

Conclusion

The within-group referral rule is race-conscious. There is negligible difference between this rule and the performance-based rule that is designed to refer workers at the same level of job performance in the same proportion for each group. Thus the within-group referral rule is an effective way to achieve proportionate referrals of workers at the same level of job performance. Admittedly it will increase somewhat the rate of false acceptances, but the loss in overall expected job performance is small.

Performance-Based Score, Top-Down Referral

As Tables 13-1 and 13-2 illustrate, this rule is very similar in application to the previous one. When the correlation expressing validity between test score and job performance is .3, and the minority-group average test score is 1 standard deviation less than the majority-group average test score, then this score adjustment adds approximately (1 - .09) = 0.91 standard deviation to each minority score. Referral is then in order of the adjusted score. In comparison, the within-group percentile score adjustment would add 1 standard deviation to each minority score. The difference between the two rules is negligible for the modest validities observed for the GATB.

Conclusion

The slight drop in correlation that occurs for each of the score adjustment strategies suggests that the choice between the raw score, top-down rule and either of the rules that incorporates a score adjustment cannot be based principally on efficiency grounds, at least not for the range of (corrected) validities of .2 to .4 that we have calculated for Employment Service jobs.
In choosing between within-group and performance-based score adjustments, there is no reason to prefer one to the other by its correlation with job performance. In terms of legal admissibility, both are race-conscious, both would virtually eliminate the adverse impact of the GATB on black and Hispanic applicants, and both can be seen as counteracting prediction error for minority groups. Because the performance-based score adjustment is responsive to changes in test validities (with high validities, smaller score adjustments would be made and the proportion of minorities referred would be reduced), policy makers at the Department of Labor will want to consider whether the performance-based referral rule might be legally more defensible. For the same reason, if policy makers choose this rule, special caution should be exercised to ensure that test validities are not overstated.
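The comparison of the two score adjustments can be made concrete with a small sketch. The worked figure in the text, (1 - .09) = 0.91 for a validity of .3 and a gap of 1 standard deviation, suggests an adjustment of (1 - r^2) times the group gap; that general form is an inference from the example rather than a formula stated in the text, and the function names are hypothetical.

```python
def performance_based_adjustment(validity, gap_sd):
    """Points (in majority SD units) added to each minority score.

    Assumption: the adjustment is (1 - validity**2) * gap_sd, the form
    implied by the worked example (validity .3, gap 1 SD -> 0.91 SD).
    """
    return (1.0 - validity ** 2) * gap_sd


def within_group_adjustment(gap_sd):
    """Within-group scoring adds the full gap to each minority score."""
    return gap_sd


# Compare the two adjustments across the .2 to .4 validity range.
for r in (0.2, 0.3, 0.4):
    perf = performance_based_adjustment(r, 1.0)
    print(f"validity {r:.1f}: performance-based {perf:.2f}, "
          f"within-group {within_group_adjustment(1.0):.2f}")
```

On this reading the two adjustments differ by only r^2 times the gap, which is why the difference is small at modest validities and grows as validity rises.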
Minimum Competency Referral

Note from Table 13-2 that this referral rule is inferior to both of the score adjustment strategies because it reduces the correlation quite dramatically without much increasing minority referrals. Minimum competency referral is the only alternative to raw-score, top-down referral examined here that is not race-conscious, but it might well open the Employment Service to a Title VII challenge, since it would produce markedly unequal referral rates for majority and minority applicants.

The So-Called Golden Rule Procedure

Thus far we have explored two basic conceptions of fairness in testing, one of which focuses on prediction and the other on job performance. We have also examined four systems for assembling the pool of applicants to be referred to an employer. The first, raw-score, top-down referral, corresponds to the idea that the use of a test is fair if a given score predicts about the same level of criterion performance regardless of group identity. Two of the referral models, those involving score adjustments, complement the conception of fairness that focuses on realized job performance. Minimum competency referral, the fourth option discussed, does not advance either idea of fairness and is the least attractive model in terms of maximizing expected performance.

There is another approach to the issue of fairness in testing that has gained some currency in the past few years. It is based on the premise that a test can be considered fair only if it produces the same distribution of scores for all population subgroups. Called the Golden Rule procedure (after the insurance company, not the maxim), it is a strategy for selecting test items with the goal of eliminating group differences in test scores.
Using this procedure, items are field-tested with minority- and majority- group members and, whenever possible, items are selected that show the least difference in the proportions correct obtained by each group. Because items are selected explicitly to reduce the difference between minority- and majority-group performance, this procedure should in theory yield tests in which the overall difference between minority and majority performance is reduced. Since 1985 the Golden Rule procedure has been used in assembling the tests used in the licensing of life, property, and casualty insurance agents in the State of Illinois. It is the consequence of a lawsuit against the State of Illinois and the Educational Testing Service filed by the Golden Rule Insurance Company on behalf of five black examinees who failed the licensing examination in 1976.
In an out-of-court settlement, the parties agreed to assembling tests using the following system. Racial status of candidates taking the test would be recorded and used on an ongoing basis with test performance data to sort items into two types: Type I items are those in which (1) the difference in proportions correct between black and white examinees is not significantly greater than 0.15, and (2) the overall proportion correct is not significantly less than 0.40 (thus eliminating the very difficult items). Type II items are those that fail to meet the above criteria for between-group differences or item difficulty. Tests were to be assembled by selecting Type I items whenever possible. Whenever Type I items were not available or when their use was inappropriate according to generally accepted principles of test construction, Type II items could be used. In either case, items with the smallest between-group differences were to be used first.

For a number of reasons, theoretical and practical, the Golden Rule procedure has been greeted with skepticism by the psychometric profession. And in fact, its first systematic application has not been promising. As part of the consent decree, an advisory committee of academic and insurance professionals was established to advise the State of Illinois on the use and effects of the new tests for insurance agents. Statistics on the performance of black and white examinees were studied by the advisory committee over a three-year period. A member of the GATB committee who served on the Golden Rule advisory committee reports that the general consensus was that the procedure resulted in only a modest decrease in test performance differences between black and white examinees.

The reasons for the failure of the procedure to bring substantial reductions in group differences in test scores stem in part from the requirements of a large-scale testing program.
The item pool was not stable over time, in part because test forms were periodically made public as part of the settlement. New items were constantly being introduced and old ones retired. Consequently, the number of test items on which statistics were available was never large compared with the number of different test forms needed both to maintain test security and to have alternate forms available for those who wished to retake the exam. This problem was exacerbated by the very detailed content specifications for the insurance agent test. In some content areas, the number of items available did not permit much selection based on between-group differences. Obtaining large numbers of effectively interchangeable items is always a difficult practical problem.

But the problems with the procedure are more than logistical. A number of scholars have pointed out that the two
rules of exclusion (the overall minimum correct response rate, 0.40, and the differential correct response rate, 0.15) work at cross purposes, with the result that the procedure will not necessarily reduce the between-group difference in means. This is so because the items with the smallest between-group difference in proportion correct are the very easy and the very difficult items. The minimum 0.40 rule eliminates the difficult items (Linn and Drasgow, 1987; Marco, 1988). Moreover, even without the minimum 0.40 rule, the reductions in group differences in item scores would not come close to eliminating the degree of adverse impact associated with top-down, total-group selection (Marco, 1988). In other words, if the policy goal is to eliminate adverse impact, the Golden Rule procedure, although also race-conscious, is not nearly as effective as either of the score adjustment strategies discussed above.

The Golden Rule procedure's effects on the quality of tests, however, would be detrimental. The construct validity of a test would be altered if items were selected primarily on a basis other than optimal measurement. Moreover, the predictive value of the test would be reduced for majority and minority examinees. Test reliability would also be reduced. Items of middle difficulty and items most closely associated with total score would tend to be eliminated more than easy items. As a result, the reliability of the test might be increased for lower-scoring examinees, but for middle- and high-scoring examinees, the opposite result is more likely (Marco, 1988).

We do not see the Golden Rule procedure as a viable alternative for the Department of Labor to consider. For technical and practical reasons it does not rival score adjustment strategies. Moreover, the losses in test validity incurred are not offset by the marginally improved legal attractions it offers.
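The Type I and Type II sorting described in the settlement can be sketched in a few lines. This is a simplified illustration, not the procedure as actually administered: it compares point estimates with the 0.15 and 0.40 thresholds and omits the significance tests the settlement refers to, and the item data, class, and function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    p_black: float    # proportion correct among black examinees
    p_white: float    # proportion correct among white examinees
    p_overall: float  # proportion correct among all examinees


def group_gap(item):
    """Between-group difference in proportion correct."""
    return item.p_white - item.p_black


def item_type(item, max_gap=0.15, min_correct=0.40):
    """Return 1 for Type I, 2 for Type II (point estimates only)."""
    if group_gap(item) <= max_gap and item.p_overall >= min_correct:
        return 1
    return 2


def selection_order(items):
    """Type I items first; within each type, smallest gap first."""
    return sorted(items, key=lambda it: (item_type(it), group_gap(it)))


# Hypothetical item statistics, for illustration only.
ITEMS = [
    Item("A", 0.60, 0.70, 0.67),
    Item("B", 0.55, 0.80, 0.72),
    Item("C", 0.30, 0.35, 0.33),
    Item("D", 0.50, 0.54, 0.53),
]
order = [it.item_id for it in selection_order(ITEMS)]
print(order)
```

Item C in this made-up pool shows the tension the critics describe: it has a small between-group gap but is excluded from Type I because it is too difficult.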
An Alternative Referral Rule

From the perspective of fairness to all Employment Service applicants, the major drawback of the two rules that require score adjustments is that white applicants will be referred to employers in somewhat smaller numbers than they otherwise would have been. In other words, increasing the referral rates of racial and ethnic minorities will produce a concomitant reduction in the referral chances of some white applicants with higher raw test scores and somewhat greater predicted success on the job.

In order to avoid that diminution in the prospects of majority-group applicants while at the same time enhancing the competitive position of minority applicants, the committee recommends the consideration of a referral rule that combines the essential features of both the raw-score,
top-down and the within-group score, top-down rules. To achieve both kinds of fairness, all applicants who would have been chosen by a straight ranking of unadjusted scores will be referred, and, in addition, all applicants whose adjusted scores qualify them will also be referred. Thus, no job seeker will be denied an opportunity that would have been available under either fairness model. Since the score adjustment is commensurate with the effects on minority groups of imperfect prediction and since no group is greatly damaged by the combined-rules approach, the legal objections raised by the Assistant Attorney General for Civil Rights to the VG-GATB testing program may be assuaged.

Although we recommend the Combined Rules Referral Plan to the serious consideration of the Department of Labor and other federal authorities in the fair employment practices area, we cannot claim that it is a panacea for the legal stalemate in which many employers find themselves. It is a compromise and as such may fail to satisfy advocates on either side of the fairness question. Depending on an employer's selection decisions, the total procedure could produce some degree of adverse impact on minority groups, although of far lesser severity than would a referral system based on unadjusted scores. At the same time, majority job seekers could claim that enrichment of the referral pool by definition dilutes their chances for selection. Policy makers at the Department of Labor will need to consider the potential legal risks of this referral strategy just as they do the risks of other referral plans.

On a practical level, if there is a burden imposed by the Combined Rules Referral Plan, it is that the local Job Service office must deal with a somewhat larger number of people to fill a job order and the employer must consider more applicants than is absolutely necessary under either rule alone.
There is some concern that this necessity might make the strategy impractical for small, low-volume offices.

[Footnote 1: Although we phrase our recommendation in terms of within-group score adjustments, performance-based adjustments could be substituted with virtually identical results. Our slight preference for the within-group strategy is that it is easier to put into practice.]

Operationalizing the Combined Rules Referral Plan

For illustrative purposes, the plan is presented as it might work in a local office that has a sufficiently large number of otherwise qualified job seekers on hand to allow selectivity. The thrust of the plan is to increase the flexibility of the employer by referring either more high scorers or more minority applicants than would otherwise have been seen.

An employer sends a job order for 10 job openings and asks to see 20 applicants. Twenty becomes the base number. The referral group is
assembled in two stages. First, a list of all otherwise eligible candidates in the files is compiled on the basis of rank-ordered, total-group scores. The top 20 scorers are identified; they will be placed in the referral group. Second, the same list of candidates is reordered with minority scores converted to within-group percentile scores. Again, the top 20 scorers are identified for placement in the referral group. Thus an applicant is placed in the referral group by having a high total-group percentile score, a high within-group percentile score, or both. There will be a good deal of overlap between the stage-one and stage-two selections, so the total referral group will be less than double the baseline figure. Under the Combined Rules Referral Plan no applicant is excluded who would have been referred if the Employment Service had made the baseline 20 referrals on just total-group or just within-group percentile scores.

TABLE 13-3 Applicants Referred Under Total-Group, Within-Group, and a Combined Rules Referral Plan

                      Percentile Score             Referral Method
Applicant   Race   Total-Group   Within-Group   Total-Group   Within-Group   Combined
                                                Score                        Rules
1           W      71                           X             X              X
2           W      65                           X             -              X
3           W      63                           X             -              X
4           B      60            82             X             X              X
5           W      58                           -             -              -
6           W      57                           -             -              -
7           W      54                           -             -              -
8           B      51            73             -             X              X
9           B      48            70             -             X              X
10          B      38            60             -             -              -

NOTE: X = Referred; - = Not referred.

To illustrate, Table 13-3 describes a situation in which the employer has two job openings and has asked for a referral ratio of 2:1. The baseline referral figure is 4. On the basis of file search there are 10 applicants who meet the employer's initial requirements (education, minimum cutoff score, and so on). The 10 are listed in order of total-group percentile score. A total-group referral procedure would refer the first four candidates listed. The within-group method would in this example refer three black applicants, two of whom had lower total-group scores than competing majority candidates.
With this set of scores, the combined rules would result in a referral group augmented by two for a total of six applicants who will be referred to the employer.
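The two-stage assembly of the referral group can be sketched with the scores from Table 13-3. The code is a minimal illustration, not Employment Service software: the function names are hypothetical, majority applicants are assumed to keep their total-group percentile in the second ranking (as the description of stage two implies), and ties at the cutoff are not handled.

```python
# (applicant, race, total-group percentile, within-group percentile)
# Within-group percentiles are listed only for minority applicants.
APPLICANTS = [
    (1, "W", 71, None), (2, "W", 65, None), (3, "W", 63, None),
    (4, "B", 60, 82), (5, "W", 58, None), (6, "W", 57, None),
    (7, "W", 54, None), (8, "B", 51, 73), (9, "B", 48, 70),
    (10, "B", 38, 60),
]


def top_n(scores, n):
    """Applicant numbers of the n highest scorers (no tie handling)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])


def combined_rules(applicants, base_number):
    """Return (stage-one set, stage-two set, combined referral group)."""
    total = {a: t for a, _, t, _ in applicants}
    # Stage two: minority scores are replaced by within-group percentiles.
    adjusted = {a: (w if w is not None else t) for a, _, t, w in applicants}
    by_total = top_n(total, base_number)
    by_within = top_n(adjusted, base_number)
    return by_total, by_within, by_total | by_within


by_total, by_within, referred = combined_rules(APPLICANTS, base_number=4)
print(sorted(by_total), sorted(by_within), sorted(referred))
```

With a base number of 4, the union of the two stages contains six applicants, matching the referral group described in the text.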
Not the least of the attractions of the Combined Rules Referral Plan, in the committee's judgment, is that it places responsibility for the composition of the work force with the employer. It gives the employer the flexibility to emphasize predicted performance, racial and ethnic representativeness, or a combination of these policies according to the job in question, the affirmative action posture of the firm, or other situational factors. The Job Service is not placed in the position of appearing to relieve the employer of these decisions, an implication that some employers seem to have drawn from the VG-GATB system of referral based only on within-group scores.

Norm Groups for Within-Group Scoring

If any referral plan that incorporates the within-group score adjustment strategy is adopted, USES will need to undertake the construction of more satisfactory norm groups on which to base the score adjustments. In practice, there will be considerable variation in the applicant groups for different jobs in different localities. There is evidence from the data supporting the within-group percentile tables, from employer representatives in the committee's liaison group, and from some applicant data obtained by the committee, of noticeable differences between the national norm group currently used by the Employment Service for score conversions and applicant groups.

Differences in means or standard deviations of the applicant groups from the norm group could cause quite different referral rates and validities of the within-group score for particular jobs. If, for example, an employer set qualifications for a job that are correlated with test score, then the applicants for the job would be expected to have a smaller standard deviation in test score than the norm group, and the differences between majority-group and minority-group mean score would be expected to be lower.
The effect of using within-group scoring based on national norms would be to refer minorities in larger fractions than in the applicant pool, and to significantly reduce the validity of the test, because of overestimates of standard deviations. It obviously is not practical for the Employment Service to devise a different additive factor for every job in every locality. But we do recommend that norm groups be developed by job family and, if possible, by smaller, more homogeneous clusters of jobs. In addition, the score adjustment factor should be computed differently than is currently done. Currently the adjustment factor is computed as the difference between the mean scores in a given job family composite of all majority- and minority-group workers in the national norm group. The correct factor is the mean score difference between majority-group and
minority-group applicants for the same job, averaged over all jobs. Similarly, standard deviations should be computed for applicants to a particular job, and then averaged over jobs. The current computation does not properly allow for differences between jobs. Suppose, for example, that there are two jobs, and applicants for the jobs scored as follows:

Job   Minority          Majority
1     7, 12, 17, 20     18, 22
2     15, 19            19, 23, 25, 25

The Employment Service calculation pools the scores for all jobs to obtain a difference of 7 between majority- and minority-group average scores. The difference between average scores for each job is 6.

In order to assess the effect of the current within-group referral norm groups on actual jobs, we used 72 jobs from David Synk and David Swarthout's research (U.S. Department of Labor, 1987). The differences between minority and nonminority mean test scores expressed in majority standard deviations showed wide variation, with a median of 0.85 and quartiles of 0.65 and 1.10. (The quartiles would be 0.74 and 0.96 if the variation were due only to sampling error; thus there is evidence of substantial real variation in the standardized population differences.) We applied the within-group referral rule to the incumbents in each job, with a selection ratio set so that 50 percent of the nonminority workers would be accepted. The median acceptance rate for minority workers was 55 percent. There is thus some evidence that the referral rule accepts minority workers at a slightly higher rate than nonminority workers. However, these are workers on the job, not applicants, and if there were greater differences between mean scores for applicants than for workers, the referral rates for minority and nonminority workers might be about the same.
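The two-job example above can be checked directly. The sketch below reads the flattened score table as Job 1 minority 7, 12, 17, 20 and majority 18, 22, and Job 2 minority 15, 19 and majority 19, 23, 25, 25; that reading is a reconstruction, chosen because it reproduces the pooled difference of 7 and the per-job difference of 6 given in the text. The function names are hypothetical.

```python
from statistics import mean

# Scores as reconstructed from the two-job example (see lead-in).
JOBS = {
    1: {"minority": [7, 12, 17, 20], "majority": [18, 22]},
    2: {"minority": [15, 19], "majority": [19, 23, 25, 25]},
}


def pooled_difference(jobs):
    """Current method: pool all scores across jobs, then difference the means."""
    minority = [s for job in jobs.values() for s in job["minority"]]
    majority = [s for job in jobs.values() for s in job["majority"]]
    return mean(majority) - mean(minority)


def per_job_difference(jobs):
    """Recommended method: difference the means within each job, then average."""
    return mean(mean(job["majority"]) - mean(job["minority"])
                for job in jobs.values())


print(pooled_difference(JOBS), per_job_difference(JOBS))
```

The pooled figure is larger because minority applicants are concentrated in the lower-scoring job, so pooling mixes the between-job difference into the between-group difference.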
THE PROBLEM OF REPORTING SCORES The general principle that should guide policy on reporting test scores is that the employer and the applicants should be given sufficient information to make correct inferences about a candidate's likely job performance from the test score. This information should include one or more scores, a description of the method of computing the scores, and information about the validity of the test. We have suggested the possibility of using two scores in creating the group of applicants to be referred on a job order, a total-group percentile score and a within-group percentile score. For score reporting purposes we again find merit in a combination of scores because neither the
total-group nor the within-group percentile score is an entirely satisfactory means of communicating information about job applicants.

Reporting Within-Group Percentile Scores

In the VG-GATB Referral System as it now operates, the Employment Service reports the candidate's within-group percentile score to the employer with an explanation of the scoring method, but without information about which adjustment, if any, has been made to the score.

The within-group percentile scores reported to the employer are potentially misleading. The purpose of the scoring method is to indicate an individual's predicted job performance with reference to other applicants within his or her own ethnic or racial group. But employers may mistakenly infer that two applicants with the same percentile score did equally well on the test, no matter what their racial or ethnic identity. Employers are not given the conversion tables and so have no way of determining the correspondence between scores obtained within different groups. On one hand, this could lead employers to underestimate the magnitude of group differences in raw scores (for example, on certain GATB composites a raw score that places an applicant at the 50th percentile among blacks would place an applicant at the 16th percentile among whites). On the other hand, it could lead employers to underestimate the amount of overlap in test scores that exists between the groups.

The within-group percentile scores have been reported to applicants without their being informed that the percentile scores are based on different norm groups for different racial and ethnic groups. That practice is deceptive.

Reporting Total-Group Percentile Scores

Reporting total-group percentile scores is also potentially misleading, because the employer has no information about the levels of job performance that can be expected from a particular score.
It is tempting for the employer to infer that a person at the 16th percentile of whatever norm group on the test score will be at the 16th percentile of the norm group in job performance; Employment Service literature promoting the VG-GATB Referral System indicates that the most able workers within each ethnic group are being referred. But the correspondence between test score percentile and job performance percentile depends on the correlation between test score and job performance. For example, if that correlation is .3, a person at the 16th percentile on the test score is expected to be at the 38th percentile on job performance. Finally,
providing a score referenced to the total group without qualifying its relevance to a particular job could have a harmful effect on minority applicants, who, on the average, score lower on the GATB. They will appear to be unqualified for the job, but their scores may have only a modest relationship to performance on the job.

Expectancy Reporting

There are methods of reporting information to employers that directly incorporate the degree of predictability of job performance from test score. One such method uses expectancies specifying the probability that a worker with a given test score will be above average in job performance. Whereas percentile scores show where an applicant is located on the test with reference to all other applicants in the relevant population, an expectancy score tells the likelihood of above-average performance given the validity of the test. The real value of this approach to scoring is that it gives the employer a much more realistic basis for comparing candidates than is possible with raw scores or percentile scores. When a test has only modest validity for predicting job performance, score differences that look enormous when expressed as percentiles are shown to predict a much closer likelihood of above-average performance on the job. Suppose we take the average GATB validity of .3. As Table 13-4 shows, extreme scores on the test distribution correspond to modest scores on the expectancy distribution, reflecting the modest predictability of job performance from test score.

TABLE 13-4 Total Group Percentiles and Corresponding Expectancy of Above-Average Job Performance (Test Score and Job Performance Are Jointly Normal with Correlation .3)

Percentiles   Expectancies of Above-Average Performance (%)
2.5           27
16.0          38
50.0          50
84.0          62
97.5          73
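The expectancies in Table 13-4 can be reproduced from the joint-normal assumption stated in the table title. In the sketch below, a test percentile is converted to a standard score z, and the chance of above-average performance is taken as the normal probability of r z / sqrt(1 - r^2), which follows from the conditional distribution of a bivariate normal. The function name expectancy is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def expectancy(percentile, validity):
    """Probability of above-average job performance at a test percentile.

    Assumes test score and job performance are jointly normal with
    correlation `validity`, as stated in the title of Table 13-4.
    """
    z = NormalDist().inv_cdf(percentile / 100.0)
    # Performance given z is normal with mean validity * z and
    # standard deviation sqrt(1 - validity**2).
    return NormalDist().cdf(validity * z / sqrt(1.0 - validity ** 2))


for p in (2.5, 16.0, 50.0, 84.0, 97.5):
    print(f"percentile {p:5.1f}: expectancy {100 * expectancy(p, 0.3):.0f}")
```

With a validity of .3, the five percentiles round to expectancies of 27, 38, 50, 62, and 73, the values given in the table.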
Proposed Protocol for Reporting Scores

In the committee's judgment, a combination of percentile and expectancy scores will provide job applicants and prospective employers with
the best picture of the applicant's comparative suitability for the job. Our proposal is that two scores be reported for each applicant:
1. A within-group percentile score with the corresponding norm group identified.
2. An expectancy score (derived from the total-group score) equal to the probability that an applicant will have above-average job performance.

The first score indicates how the applicant fared on the test in comparison with others in the same ethnic or racial group. This information is particularly useful to employers who are actively working to increase the representation of minority groups in their work force. The second score gives the employer a better means of comparing applicants against the criterion of job performance. And in general it will show applicants and employers alike that low scorers on the test have a reasonable chance of being above-average workers.

Examples of such a reporting protocol using a validity of .3 would look as follows:

Name               Within-Group Percentile          Total-Group Expectancy Score: Chance of
                   Computed for "Black" Group*      Being Better-Than-Average Worker
Grace Birley       16                               25
James Jones        50                               40
Shelton Pike       84                               50

Name               Within-Group Percentile          Total-Group Expectancy Score: Chance of
                   Computed for "Other" Group*      Being Better-Than-Average Worker
Nancy Rathouse     16                               40
William Cole       50                               50
Theresa Brewer     84                               60

Name               Within-Group Percentile          Total-Group Expectancy Score: Chance of
                   Computed for "Hispanic" Group*   Being Better-Than-Average Worker
Juan Gomez         16                               33
Chester Alverez    50                               44
Olivia Gerber      84                               56

*GATB subpopulation norms exist for "black," "Hispanic," and "other" groups.

CONCLUSIONS

Fair Use of the GATB

1. Use of GATB scores in strict top-down, rank-ordered fashion is fair in the sense that a given test score predicts about the same level of job
performance for majority-group and minority-group applicants. However, it would have severe adverse impact on minority job seekers.
2. This adverse effect on minority job seekers cannot be justified on the grounds of efficiency, for at the levels of validity typical of the GATB, the efficiency losses from adjusting minority scores are slight.
3. Although the GATB does not appear to be inherently biased against minority-group test takers, the undoubted effect of imperfect prediction when social groups have different average test scores is to place the greater burden of measurement error on the shoulders of the lower-scoring group. Since black, Hispanic, and Native American minority groups have lower group means on the GATB, able workers in these groups will experience higher rejection rates than workers having the same level of job performance in the majority group when referral is based on a rank-ordering of all test scores.
4. In the judgment of the committee, fair test use requires at the very least that the inadequacies of the technology should not fall more heavily on the social groups already burdened by the effects of past and present discrimination.
5. The so-called Golden Rule procedure, a strategy for reducing group differences in test scores through the selection of test items, does not appear to be defensible technically and does not provide the intended practical remedy.
6. The committee therefore concludes that, for purposes of referral, equity and productivity will be best served by a policy of adjusting the GATB test scores of black, Hispanic, and Native American job seekers served by the Employment Service system.

Referral Rules

7. Raw-score, top-down referral gives the highest expected performance in the referred group and the lowest proportion of minority-group members referred.
At the levels of validity we find for the GATB, this referral method has an adverse impact on minority applicants that is out of all proportion to the productivity gains.
8. Within-group score, top-down referral achieves the highest proportions of minority referrals, with slight overall losses in estimated job performance. Given present GATB validities, this score adjustment strategy is an efficient way of referring workers at a given level of job performance in about the same proportion, whatever their racial or ethnic group.
9. Performance-referenced score, top-down referral (adjustments to minority scores based on the predictive validity of the test) produces results virtually identical to within-group score, top-down referral at the
validities observed for the GATB. It demonstrates similarly slight losses in efficiency and large gains in the proportion of minorities referred. However, this method is responsive to changes in test validities; with high validities, smaller score adjustments would be made and the proportion of minorities referred would be reduced. This may make it legally the more acceptable of the score adjustment strategies.
10. Both score adjustment strategies are race-conscious; both would virtually eliminate the adverse impact of the GATB on black and Hispanic subpopulations, and both adjustments would be commensurate with the far less than perfect relation between the GATB test score and job performance.
11. Minimum competency referral results in significant losses in expected job performance and would still produce markedly unequal referral rates for majority and minority applicants.

Reporting Test Scores

12. The test scores reported to employers and job seekers should allow them to make the most accurate possible judgments about likely job performance.
13. Neither the within-group percentile scores currently reported under the VG-GATB Referral System nor total-group percentile scores convey sufficient information, and both are potentially misleading.

RECOMMENDATIONS

If the Department of Labor continues to promote a test-based referral system for filling job orders, we recommend the following alterations to the current VG-GATB Referral Program.

Referral Rule

1. The committee recommends the continued use of score adjustments for black and Hispanic applicants in choosing which applicants to refer to an employer, because the effects of imperfect prediction fall more heavily on minority applicants as a group due to their lower mean test scores.
We endorse the adoption of score adjustments that give approximately equal chances of referral to able minority applicants and able majority applicants: for example, within-group percentile scores, performance-based scores, or other adjustments. Given current GATB validities, such adjustments are necessary to ensure that able black and Hispanic workers will not experience higher rejection rates than workers of the same level of job performance in the
majority group. Referral in order of within-group percentile scores is one effective way to balance the dual goals of productivity and racial equity, given the modest levels of GATB validities. Should these validities increase dramatically as testing technology improves, the performance-based rule would warrant consideration.
2. We also recommend that USES study the feasibility of what we call a Combined Rules Referral Plan, under which the referral group is composed of all those who would have been referred by the total-group or by the within-group ranking method.

Score Reporting

3. The committee recommends that two scores be reported to employers and applicants:
a. A within-group percentile score with the corresponding norm group identified.
b. An expectancy score (derived from the total-group percentile score) equal to the probability that an applicant will have above-average job performance.
This combination of scores indicates how well an applicant performed on the test with reference to others of the same subpopulation, information that is useful to employers who are actively seeking to increase the representation of minorities in their work force under an affirmative action program. The expectancy score shows that even low scorers have a reasonable chance of success on the job and will help employers avoid placing totally unwarranted weight on small score differences.

Norm Groups

4. If the within-group score adjustment strategy is chosen, we recommend that USES undertake research to develop more adequate norming tables. The data on Native Americans is particularly weak, but all of the norming samples are idiosyncratic convenience samples. As a consequence, there is reason to doubt that the particular constant factors added to minority scores are the most appropriate ones.
5.
An attempt should be made to develop norms for homogeneous groups of jobs, at the least by job family, but if possible by more cohesive clusters of jobs in Job Families IV and V.
6. The adjustment factor that should be computed is the mean score difference between majority-group and minority-group applicants for the same job, averaged over all jobs.
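
The expectancy score in the score-reporting recommendation can be illustrated with a short calculation. The sketch below is not the committee's stated procedure: it assumes that test score and job performance follow a bivariate normal distribution with correlation equal to the validity (.3, as in the example reporting protocol), that all subgroups have the same score spread, and that subgroup means sit at hypothetical offsets from the total-group mean. The function names and the offsets are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def expectancy(z_total: float, validity: float) -> float:
    """Chance of above-average job performance given a total-group z-score.

    Assumes test score and job performance are bivariate normal with
    correlation `validity` (an illustrative assumption, not from the text).
    """
    return STD_NORMAL.cdf(validity * z_total / sqrt(1.0 - validity ** 2))


def total_group_z(within_pct: float, group_mean_z: float) -> float:
    """Map a within-group percentile to a total-group z-score.

    `group_mean_z` is a hypothetical offset of the subgroup mean from the
    total-group mean, in SD units; the subgroup SD is assumed to equal 1.
    """
    return group_mean_z + STD_NORMAL.inv_cdf(within_pct / 100.0)


# Hypothetical offsets, chosen only so the output roughly tracks the
# example reporting tables; they are not norming data.
offsets = {"other": 0.0, "Hispanic": -0.5, "black": -1.0}

for group, offset in offsets.items():
    for pct in (16, 50, 84):
        chance = expectancy(total_group_z(pct, offset), validity=0.3)
        print(f"{group:9s} within-group {pct:2d} -> expectancy {100 * chance:.0f}")
```

Under these assumptions the computed chances fall within a few points of the example values, which suggests the tabled figures are rounded illustrations. The calculation also makes the committee's point concrete: at a validity of .3, a one-standard-deviation difference in test score shifts the chance of above-average performance only modestly.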