Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery


Suggested Citation:"1 The Policy Context." National Research Council. 1989. Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery. Washington, DC: The National Academies Press. doi: 10.17226/1338.



BACKGROUND AND CONTEXT

1 The Policy Context

Productivity has been one of the more worrisome public issues of this decade. Faltering American competitiveness vis-a-vis Japan, the deterioration of the steel industry and of manufacturing more generally, the decline in real income and the emergence of the two-earner family, the ballooning federal deficit and trade imbalances: such topics have become a staple of the popular press as well as the more rarefied domain of economic analysis.

As part of the larger public discussion, the quality of the American work force has come under increasing scrutiny. Numerous articles and reports have described the decline in American public education and the failure of contemporary schools to prepare pupils to enter the labor market. There have been unflattering comparisons to the productivity of Japanese and Korean workers in the auto, electronics, and appliance industries. And there has been increased interest in finding better ways of selecting and using workers. There has been a resurgence of testing: testing to screen out applicants who are bad risks (drug testing, lie detector testing, honesty testing, health projections) and, to a lesser extent, ability or knowledge testing to identify the better prospects.

THE USES EMPLOYMENT TESTING PROGRAM

In response to the troubled state of the nation's economic health, the U.S. Employment Service (USES), a unit of the U.S. Department of Labor, developed a new role for its General Aptitude Test Battery (GATB), a test of cognitive, perceptual, and motor skills used in state Job Service offices since 1947. Less than ~ percent of the approximately 9.5 million Job Service registrants who received some reportable service each year were given the GATB in the early 1980s, and then mainly for purposes of vocational counseling. It was felt that far more extensive use of the test battery to fill job orders might be justified. Recent developments in statistical methods, specifically meta-analysis, seemed to hold out scientific promise that the GATB could be used to identify good workers for a wide range of jobs. From a policy perspective, the anticipated contributions to productivity were attractive.

The rationale for using general ability tests for employment screening is that ability tests can help employers identify good (more productive) workers. This proposition is based on a number of assumptions:

First, that there is a wide range of potential job performance in the people likely to be candidates for a particular type of job;
Second, that ability tests predict future job performance with a useful degree of accuracy;
Third, that higher scorers on the test are better performers in the long term (that is, if everyone could be trained to proficiency in a short period of time, the advantages of selecting high-ability workers would be fleeting).

If these things hold true, then selection of high-scoring applicants can be presumed to enhance work force efficiency and therefore contribute to the overall productivity of the firm.

Although mental measurement specialists generally recognize that cognitive ability tests can measure only some of the attributes that contribute to successful job performance, they consider such tests to be, in the present state of the art, the most informative general predictor of proficiency for most jobs.
(Note that we are talking about general employment screening, for which a single instrument is used to predict success in a range of jobs. For jobs that require extensive prior training and highly developed skills and knowledge, such as electronics specialist, jet engine mechanic, and lawyer, custom-designed instruments would be more informative than a general cognitive test.)

None of this was new in 1980 when the U.S. Employment Service began to envision a larger role for the GATB. But a catalytic innovation had occurred in one corner of the psychological measurement field during the 1970s. Traditional psychometric theory held that the validity of a given test is dependent on situational factors (the norming sample, geographic location, organizational climate) because the correlations between a test and the criterion of interest (e.g., job performance) were observed to vary from study to study. Thus, the theory went, a test valid in one setting might not be valid in another, and a new investigation of its validity would be required for each substantially new setting. This is the view that has informed federal equal employment opportunity policy and is officially recognized in the interagency Uniform Guidelines on Employee Selection Procedures (29 CFR Part 1607 [1985]).

In the mid-1970s, a number of analysts, preeminent among them Frank Schmidt and John Hunter, began to challenge this theory of situational specificity, arguing that the observed differences in a given test's validity from one setting to another were not real, but rather were artificial differences introduced by sampling error, unreliability in the criterion measures, and other weaknesses of the individual validity studies. The application of meta-analytic techniques for combining data from large numbers of studies, statistical techniques that had proved useful in many other scientific areas, led Hunter and Schmidt to conclude both that the results of individual validity studies substantially understate the predictive power of cognitive tests and that the validity of such tests can be generalized to new situations, even to new jobs. Convinced that this evidence establishes the importance of g, or general intelligence, to all types of job performance, some proponents of validity generalization, as this type of meta-analysis is called, have come to argue that a well-developed ability test can be used for selecting applicants for virtually all jobs. If this held true of the GATB, it would enable USES to encourage the states to start using the test much more widely in the Employment Service.

The Department of Labor contracted with John Hunter in 1980 to conduct validity generalization studies of the GATB, using the hundreds of individual validity studies that had been conducted since 1947.
Hunter carried out four technical studies in 1981 in which he explicates his analysis of GATB validities and presents a dollar estimate of the economic gains that could accrue from using the GATB for personnel selection. The results were published as USES technical reports in 1983 (U.S. Department of Labor, 1983b,c,d,e). The reports advocate the generalizability of GATB validities to all 12,000 jobs in the U.S. economy. In the author's view, use of the GATB could be extended from the 500 jobs for which specific validation studies had been conducted to every job for which the Employment Service might be asked by employers to refer candidates. Moreover, Hunter maintains that substantial, one could say dramatic, economic gains would accrue from using test scores, from the highest score on down in rank-ordered fashion, to select the applicants to be referred to employers. By his calculations (U.S. Department of Labor, 1983e), optimal use of the GATB in 1980 (i.e., top-down selection) to refer the approximately 4 million people placed by the Employment Service that year would have yielded gains of $79.36 billion. In comparison, total corporate tax revenues at all levels were $59.2 billion in 1982 (Sueyoshi, 1988).
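
The validity generalization argument described above, that much of the study-to-study variation in validity coefficients reflects sampling error and unreliable criterion measures, can be illustrated with a small numerical sketch. The code below is a minimal "bare-bones" calculation in one common textbook form; it is not claimed to be the procedure used in the USES technical reports. The function name, the study data, and the criterion reliability of 0.60 are all hypothetical values chosen for illustration.

```python
from math import sqrt


def bare_bones_meta(studies, criterion_reliability=1.0):
    """Summarize validity coefficients from several studies.

    studies: list of (sample_size, observed_correlation) pairs.
    criterion_reliability: assumed reliability of the job-performance
        measure (hypothetical; 1.0 means no correction is applied).
    """
    total_n = sum(n for n, _ in studies)

    # Sample-size-weighted mean of the observed correlations.
    mean_r = sum(n * r for n, r in studies) / total_n

    # Sample-size-weighted variance of the observed correlations.
    observed_var = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n

    # Variance expected from sampling error alone, given the mean
    # correlation and the average sample size.
    mean_n = total_n / len(studies)
    sampling_var = (1 - mean_r ** 2) ** 2 / (mean_n - 1)

    # Variance left over after sampling error is removed (floored at zero).
    residual_var = max(observed_var - sampling_var, 0.0)

    # Correction for attenuation due to an unreliable criterion measure.
    corrected_mean_r = mean_r / sqrt(criterion_reliability)

    return {
        "mean_r": mean_r,
        "observed_var": observed_var,
        "sampling_var": sampling_var,
        "residual_var": residual_var,
        "corrected_mean_r": corrected_mean_r,
    }


# Hypothetical validity studies: (sample size, observed correlation).
example_studies = [(50, 0.10), (100, 0.25), (200, 0.30), (50, 0.45)]
summary = bare_bones_meta(example_studies, criterion_reliability=0.60)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

With these made-up inputs, the observed variance of the correlations is smaller than the variance expected from sampling error alone, so no residual variation is left to attribute to the setting, and the corrected mean validity is larger than the observed mean. That is the pattern of reasoning behind both claims in the text: that individual studies understate validity and that apparent situational differences may be artifacts.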

Not surprisingly, USES officials were encouraged by these findings. They decided to implement a new test-based referral system, which we call the VG-GATB Referral System in this report, on an experimental basis. With USES approval, North Carolina began a pilot project in fall 1981. By the end of 1986, some 38 states had experimented with VG-GATB referral in at least one local office. Six states (Maine, Michigan, New Hampshire, North Carolina, Utah, and Virginia) introduced the test-based referral plan in all local offices, although only Virginia replaced the earlier system of having placement counselors make referral decisions. Most states did not dramatically alter local office procedures, but simply supplemented existing procedures using the VG-GATB system, with its requirements for extra testing and file search, to fill job orders if requested by employers. Out of 1,800 local Job Service offices nationwide, approximately 400 introduced VG-GATB referral, typically in conjunction with the earlier system. Of these, 84 are located in North Carolina, 80 in Michigan, 38 in Virginia, 35 in Maine, and 25 in Utah.

Within-Group Scoring of the VG-GATB

An integral part of the VG-GATB Referral System, as USES presented it to state-level Job Service officials, was the conversion of scores on the test battery to percentile ranks within the population categories of "black," "Hispanic," and "other" (which includes all those not in the first two categories). This was a carefully considered policy decision. Following the findings of the technical reports, USES designed the system to rank-order candidates by test score and to refer them from the top down in order to get the maximum economic benefit. There are, however, significant group differences in average test scores, which have been demonstrated with virtually all standardized tests. Blacks as a group score well below the majority group, and Hispanics fall roughly in between as a rule.
As a consequence of these average group differences, strict top-down referral would adversely affect the employment chances of black and Hispanic applicants. To counteract this effect, the experimental referral system stipulated that raw scores be converted into group-based percentile ranks. USES provided the conversion tables for making the score adjustments. The resulting percentile scores reflect an applicant's standing with reference to his or her own racial or ethnic group, thus effectively erasing average group differences in test scores. A black applicant with a percentile score of 50 has the same ranking for referral as a white candidate with a percentile score of 50, although their raw test scores (percentage correct) would be very different. For example, in the category of semiskilled jobs, blacks at the 50th percentile have raw scores of 276; Hispanics, 295; and others, 308. The meaning of these raw-score differences for estimated job performance is not self-evident. But to lend some perspective, a raw score of 308 within the black group is at the 84th percentile of that group.

By combining this method of scoring the GATB with top-down selection of the applicants to be referred to prospective employers, USES sought to further two policies that are considered very important by the federal government: the enhancement of national productivity (by serving the individual employer's interest in hiring the most able workers available) and the promotion of federal equal employment opportunity and affirmative action goals. Without some sort of compensatory scoring system, in the agency's view, referral of candidates on the basis of GATB test scores from the top down would reduce the employment opportunities of minority-group job candidates, thwarting the governmental interest in bringing minority workers into the economic mainstream and creating possible legal problems for both the Employment Service and the employers it serves. But if top-down selection were completely abandoned, in the agency's view, work-force efficiency would suffer.

The Justice Department Challenge to Within-Group Scoring

Some years into the experiment with the VG-GATB Referral System, the Justice Department challenged the program because of the way test scores are derived. In a letter to the Director of the U.S. Employment Service, dated November 10, 1986, Wm. Bradford Reynolds, then Assistant Attorney General for Civil Rights, strongly urged that all states that had adopted the validity generalization procedure be notified to cease and desist immediately. Mr.
Reynolds adjudged the VG-GATB system to be an unlawful and unconstitutional violation of an applicant's rights to be free from racial discrimination because the within-group scoring procedure not only classifies Employment Service registrants by race or national origin, but also "requires job service offices to prefer some and disadvantage other individuals based on their membership in racial or ethnic groups. Such a procedure constitutes intentional racial discrimination."

The important point of difference between the two agencies was their judgment of the legality of extending race-conscious preferential treatment to some groups in society as a means of combating discrimination. Neither agency disputes the fact that there is a powerful legacy of discrimination to overcome. The question is means, not ends. The Department of Labor adopted a race-conscious scoring mechanism in order to avoid discrimination against minority-group members and to promote equal employment opportunity. Within-group scoring was thought of as an extension of a referral policy negotiated in 1972 among the Department of Labor, the Equal Employment Opportunity Commission, and the Department of Justice. The 1972 referral ratio policy stipulated that referrals of tested minority applicants should be in proportion to their presence in the applicant pool in all cases in which the tests had not been validated for minority applicants to the job in question (which in 1972 was virtually all 500 jobs for which testing was used). The Department of Justice viewed preferential consideration for one racial group as discrimination against all others, on the grounds that it illegally advances the interests of one classification of people at the expense of others.

Officials of the Labor and Justice Departments agreed to a continuation of the status quo (no further extension of the VG-GATB Referral System and at the same time no cease-and-desist orders) until a thorough study of the GATB validity base, validity generalization, scoring and referral policy, and the potential impact of the referral system could be carried out by a body of experts. This volume reports the results of the agreed-upon study.

OTHER POLICY ISSUES

Although the immediate reason for this study stems from the divergent views of two federal agencies about the legality of score adjustments, there are more general questions that should also receive careful policy review, questions about the nature of cognitive tests and about the wisdom of allowing any one procedure to dominate federal and state efforts to promote economic well-being by bringing suitable workers and jobs together.

Validity Generalization and the Reemergence of g

Development of the theory of validity generalization has coincided with, indeed encouraged, a revival of interest in the concept of g, or general intelligence. To make any sense, the idea that test validities observed for some jobs can be generalized to all other jobs depends on the complementary idea that the test measures some attribute that underlies performance in all jobs.
This common underlying factor is usually thought of in terms of general intelligence, although some commentators, wary of the connotations of genetic determinism that surround the concept of intelligence, prefer to speak of a general mental factor or cognitive factor. (In his studies of the GATB, John Hunter identifies two such factors: general cognitive ability, which he describes as very similar to the classical concept of general intelligence, and psychomotor ability. The first general factor he finds linked to performance in all jobs, the correlation increasing with the cognitive complexity of the job; the psychomotor factor is principally associated with performance in one particular family of cognitively less complex jobs. A third factor, perceptual ability, he found to be almost perfectly predictable from and causally dependent on the first two [Department of Labor, 1983b].)

Early IQ Testing

The idea of intelligence is closely bound up with the history of psychological testing in this century. The American adaptation of Alfred Binet's intelligence scale and introduction of the intelligence quotient in 1911, followed closely by the introduction of group intelligence testing with the Army Alpha in World War I, forged the link. From the beginning, ambitious claims have been made for such tests by those who saw them as a grand device for sorting people into the appropriate slot in society. In addressing Section H (Education) of the American Association for the Advancement of Science in 1912, E.L. Thorndike (1913:141) predicted:

It will not be long before the members of this section will remember with amusement the time when education waited for the expensive test of actual trial to tell how well a boy or girl would succeed with a given trade, with the work of college and professional school, or with the general task of leading a decent, law-abiding, humane life.

Like many who would follow him, Thorndike (p. 142) read very expansive meanings into psychological tests:

Tables of correlations seem dull, dry, unimpressive things beside the insights of poets and proverb makers, but only to those who miss their meaning. In the end they will contribute tenfold more to man's mastery of himself. History records no career, war, or revolution that can compare in significance with the fact that the correlation between intellect and morality is approximately .3, a fact to which perhaps a fourth of the world's progress is due.
Thorndike was not alone among the early testing enthusiasts, either in his grand expectations for mental measurement, in his readiness to measure morality, progress, and man's mastery of himself, or in his facile assumptions about the congruence of intellect and high moral character. In hindsight it is clear that many of the advocates of early testing allowed their scientific judgment to be influenced by contemporary racial and ethnic biases and by unexamined assumptions about the proper social order.

Historian of science Daniel Kevles has documented the mutual attraction of the eugenics and mental measurement movements in the early twentieth century. Eugenicists, the early students of human genetics, asserted that the new science proved that mental defectiveness and criminality, immorality, and other deviant behaviors are fundamentally hereditary. Intelligence testing seemed a perfect tool for identifying those whose inferior genetic endowment would adulterate the gene pool. In this context, IQ tests could quickly turn into weapons of racial and class prejudice (Kevles, 1985).

But proponents of psychological testing were also responding to genuine social needs. Recall that the nation was undergoing massive growth at the turn of the century. In the three decades between 1890 and 1920, the population increased by some 68 percent and the high school population grew by more than 711 percent (Tyack, 1974). The need for new institutions of social organization was urgent. To many psychologists, educators, employers, and citizens in general, intelligence tests seemed to offer a scientific tool to bring order to the classroom and the workplace.

The Army Alpha

The Army experiment with group intelligence testing of recruits during World War I illustrates both the promise of the technology and its dark underside. Robert M. Yerkes led a team of prominent psychologists, among them Lewis Terman and Carl Brigham, in this first major effort to apply social science to the practical problems of taking the nation to war. As originally designed, the Binet and similar intelligence tests would have been of little use in situations in which large numbers of people were to be tested because they could be administered only to individuals and, theoretically, only by a psychologist. They were used primarily to assess mental retardation, not for mass screening. By redesigning a version of the Stanford-Binet intelligence scale to allow its administration in a group setting, Yerkes and his colleagues put testing on the map.
The Army Alpha, a paper-and-pencil test of general intellectual skills, made up of multiple-choice and true/false questions, and its oral analog, the Beta, were administered to almost 2 million recruits from June 1917 until the war ended, a noteworthy bureaucratic and logistic feat. One of the ironies of the story is that the Army testing program was largely experimental; it produced massive amounts of data but had little actual effect on selection or placement of recruits. Nevertheless, by transforming the intelligence scale into a test that could be administered to groups of people, and by using it to assess the intellectual skills of normal adults, the Army testing project legitimized the use of standardized, group-administered tests as a tool for making selection and placement decisions about individuals in mass society. Through the diligent promotion by Yerkes and others who had been associated with the Army testing project, the myth was established in the postwar period that it had been a great practical success (Kevles, 1968; Reed, 1987).

After the war, Yerkes spent a number of years at the National Academy of Sciences analyzing the test data on a randomly drawn sample of 160,000 recruits. The massive study published by the Academy (Yerkes, 1921) provides, among other things, a stark example of the dangers inherent in group testing and group comparisons.

One of the most prominent themes to come out of the study involved the correlation between ethnic background and test scores. Yerkes' analysis showed that native whites scored highest on the Army Alpha. Of the immigrants, the highest scores were found for groups from northern and western Europe and the lowest for those from southern and eastern Europe. These findings fed the nativist sentiment of the period. They were picked up by anti-immigrationists as scientific corroboration that southern and eastern European immigrants, being intellectually inferior, would bring with them crime, unemployment, and delinquency. Yerkes' analysis also showed that test scores on the Army Alpha correlated highly with length of residence in the United States and with years of schooling. Yet these findings failed to impress. Yerkes, Carl Brigham, and other psychologists who had participated in the Army testing project supported eugenics and immigration restriction with hereditarian arguments based on the Academy study.

Critics of Intelligence Testing

The claims made for the Army Alpha and the hereditarian interpretation of test results did not go entirely unchallenged. Walter Lippmann published a trenchant series of articles in The New Republic in 1922-1923, in which he mocked Yerkes' conclusion from the Army data that the average mental age of Americans is about 14. Lippmann, who had read widely in the social sciences, criticized both the technical and the social assumptions of intelligence testing.
He objected particularly to the claim that the Army test or any other tests measured hereditary intelligence, comparing it to phrenology, palmistry, and "other Babu sciences." Intelligence, he pointed out, is not some concrete, readily measurable entity, but rather an extremely complex concept that no one had yet succeeded in defining (Lippmann, 1922). He summed up his discomfort with the psychometric vision of man by saying (Lippmann, 1923:146):

I admit it. I hate the impudence of a claim that in fifty minutes you can judge and classify a human being's predestined fitness for life. I hate the pretentiousness of that claim. I hate the abuse of scientific method which it involves. I hate the sense of superiority which it creates, and the sense of inferiority which it imposes.

By the end of the decade, and in a somewhat less public forum, Carl Brigham came to similar conclusions about the arrogance of testers in the first flush of excitement over the new technology and its social uses. Brigham's Study of American Intelligence, published in 1923, had been an extremely influential popular exposition of the hereditary and racially determined nature of intelligence. Time and further study allowed him to disentangle his science from his social prejudices. His recantation came in a scholarly review of the status of intelligence testing of immigrant groups (Brigham, 1930:165):

This review has summarized some of the more recent test findings which show that comparative studies of various national and racial groups may not be made with existing tests, and which show, in particular, that one of the most pretentious of these comparative racial studies, the writer's own, was without foundation.

Relevance to Current Policy

Sixty years and more have passed since the advent of group intelligence testing. The statistical underpinnings of psychometrics have become much more sophisticated. And in recent years there have been interesting advances in psychobiology and, to a lesser extent, in cognitive psychology that shed some light on intellective or cognitive functioning. But most specialists would still agree with Lippmann's assessment of the concept of intelligence: it is a very complicated notion that no one has been able to define very well. Even if we can show correlational relationships between a test of verbal and mathematical skills such as the GATB and supervisor ratings of job performance, we are still a very long way from being able to claim that what we are measuring is an unambiguous, unitary capacity that is the essential ingredient in successful job performance.

Moreover, we cannot escape the connotations that have surrounded the concept of intelligence since the early days.
Most psychologists are much more circumspect about drawing causal relationships between test scores and such things as character, criminal tendencies, or degeneracy than they were in the 1920s. The more simplistic hereditarian notions have long since gone out of vogue, at least from the academic literature. The basic texts used to train recent generations of students in the intricacies of psychological testing, those of Cronbach (1984) and Anastasi (1976), emphasize the contingent nature of what we call intelligence and the complex interplay of heredity and environment at all stages of human development. But in common usage such refinements can easily be lost, and there is very real danger that the renewed popularity of g and its promotion along with validity generalization could become a tool of racial and ethnic prejudice, generating feelings of superiority in some groups and inferiority in others.

The potential for social damage from overstating the claims of testing is as great today as it was in the 1920s, because the United States remains a country of many identifiable ethnic and racial subpopulations, some of which are relatively disadvantaged economically and educationally. The target groups have changed to some extent: southern and eastern European groups have long since been assimilated into the lingual and cultural majority group and have disappeared as objects of social disrespect. But blacks, Hispanics, and Native Americans, all groups of people who perform less well on average on written tests of verbal and mathematical abilities, and who are economically and socially disadvantaged, are vulnerable to being stereotyped as of inferior intelligence. Hence an important policy concern raised by the VG-GATB Referral System is that it may foster social division by encouraging Employment Service staff and clients to draw improper inferences about the potential contribution of minority-group members, indeed of any low-scoring job seeker, in the workplace or to society as a whole.

Should There Be Diverse Routes to Employment?

A second question for policy makers to consider is whether governmental endorsement of a test-based referral system, to the exclusion of other procedures, would be in the best interests of Employment Service clients or of the economy. If the VG-GATB Referral System is found legally defensible, it is not unreasonable to anticipate that the GATB could come to dominate entry to many kinds of jobs. Many employers would be drawn to use the Employment Service as a way of reducing their legal vulnerability to equal employment opportunity suits; some would also be attracted by the savings resulting from shifting the costs of testing and test validation to the government (although small companies can afford neither in any case). The implications of such a development need to be carefully weighed.
There are, for example, possible social costs that should not be ignored. A universal system of referral based on GATB test scores implies that people with the lowest scores might well be perpetually unemployed. Although the number of people who are unemployed would not increase, the dominance of a single sorting device could have the effect of perpetually subjecting the same individuals to the ill effects of unemployment. Is the GATB of sufficient utility to justify such an effect? Are there ways to prevent that from happening? Would the government's sponsorship of the VG-GATB system, with its promised legal umbrella, tend to cause employers to eliminate their own testing programs, which have been tailored to their own specific needs?

Would this overload the states' capacity to respond and place an unexpected burden on their treasuries?

We value pluralism in our society. Is it therefore wise for the government to focus on just one characteristic, even if it is what we know how to measure best, when we also know that successful person-job matches can be effected in other ways? Is a simple sorting mechanism fair to individuals of whatever color or ethnic derivation who do not do well on cognitive tests but are, nevertheless, capable of successful participation in the work force?

THE INTERSECTIONS OF POLICY AND SCIENCE

Whether within-group score adjustments can or should be any part of the system used by the Employment Service to bring employers and job seekers together will not be decided solely, or even primarily, on the basis of scientific evidence. Likewise, the question of government sponsorship of a particular test that could come to dominate certain segments of the labor market is not simply a matter of the quality of the test instrument. Nevertheless, there are important aspects of the question of appropriate and equitable use of standardized tests that can be clarified through scientific analysis. And many of the claims made about the VG-GATB Referral System in general lend themselves to scientific investigation.

In the remainder of this report, we evaluate the claims made for the General Aptitude Test Battery, for validity generalization, and for the economic benefits of employment testing. We assess the likely impact of widespread implementation of the VG-GATB Referral System with and without score adjustments. We discuss possible alternatives to the within-group scoring system, including the so-called Golden Rule procedure for reducing group differences in ability test scores. We offer to the Department of Labor recommendations for policy alternatives that seem justified by the scientific evidence.
And finally, we propose a research agenda for the agency to consider should it, through the U.S. Employment Service, decide to continue to promote a more extensive role for the GATB.
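
As a concrete companion to the within-group scoring procedure described earlier in this chapter, the following sketch converts raw scores to percentile ranks within each group and then orders applicants from the top down. It is a rough illustration under stated assumptions, not the USES conversion tables. The group medians (276, 295, 308) are the semiskilled-job figures quoted in the chapter. The normal shape of each score distribution and the common standard deviation of 32 are assumptions, chosen so that a raw score of 308 falls near the 84th percentile of the group whose median is 276. The function names and the applicants are hypothetical.

```python
from statistics import NormalDist

# Medians come from the chapter's semiskilled-job example. The normal shape
# and the standard deviation of 32 are illustrative assumptions only.
GROUP_NORMS = {
    "black": NormalDist(mu=276, sigma=32),
    "Hispanic": NormalDist(mu=295, sigma=32),
    "other": NormalDist(mu=308, sigma=32),
}


def within_group_percentile(raw_score, group):
    """Return the percentile rank of raw_score within the given group."""
    return 100.0 * GROUP_NORMS[group].cdf(raw_score)


def referral_order(applicants):
    """Rank applicants top-down by within-group percentile.

    applicants: iterable of (name, group, raw_score) tuples.
    Returns a list of (name, group, raw_score, percentile) tuples,
    highest percentile first.
    """
    scored = [
        (name, group, raw, within_group_percentile(raw, group))
        for name, group, raw in applicants
    ]
    return sorted(scored, key=lambda row: row[3], reverse=True)


# Hypothetical applicants for illustration.
pool = [
    ("Applicant A", "other", 308),
    ("Applicant B", "black", 300),
    ("Applicant C", "Hispanic", 295),
]
for name, group, raw, pct in referral_order(pool):
    print(f"{name}: raw score {raw}, within-group percentile {pct:.0f}")
```

Under these assumptions, an applicant with a raw score of 300 in the group whose median is 276 is ranked ahead of an applicant with a raw score of 308 in the group whose median is 308. That is the effect the chapter describes: ranking depends on standing within one's own group, not on the raw score.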
