Read "Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery" at NAP.edu

Suggested Citation:"4 The GATB: Its Character and Psychometric Properties." National Research Council. 1989. Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery. Washington, DC: The National Academies Press. doi: 10.17226/1338.

4 The GATB: Its Character and Psychometric Properties

The General Aptitude Test Battery (GATB) has been in use for more than 40 years, and for most of that time it has remained virtually unchanged. Through the years it has been used in state Employment Service offices for vocational counseling and referral and in addition has been made available for testing and counseling to high schools and technical schools, labor union apprenticeship programs, public and private vocational rehabilitation services, and other authorized agencies.

The obvious first task for the committee was to sift through the years of research and experience with the GATB to assess its suitability as the centerpiece of the proposed VG-GATB Referral System. We looked carefully at the development and norming of the instrument, its psychometric properties, and evidence that it actually measures the aptitudes it claims to measure. We also looked with some care at four other widely used tests of vocational aptitudes in order to get a sense of the relative quality of the GATB.

This chapter describes the test and summarizes our analysis of its psychometric properties (a more detailed discussion appears in Jaeger, Linn, and Tesh, Appendix A). Chapter 5 addresses the two shortcomings that the committee feels must be dealt with if the GATB is to assume a central role in the Employment Service system of matching people to jobs: namely, the highly speeded nature of the test, which makes it vulnerable to coaching, and the paucity of available test forms, which makes it vulnerable to compromise.

DEVELOPMENT OF THE GATB

In the period 1942-1945, the U.S. Employment Service (USES) decided to develop a "general" aptitude battery that could be used for screening for many occupations. Drawing on the approximately 100 occupation-specific tests developed since 1934, USES staff identified a small number of basic aptitudes that appeared to have relevance for many jobs (U.S. Department of Labor, 1970:17):

1. Intelligence (G), defined as general learning ability;
2. Verbal aptitude (V), the ability to understand the meanings of words and language;
3. Numerical aptitude (N), the ability to perform arithmetic operations quickly and accurately;
4. Spatial aptitude (S), the ability to think visually of geometric forms and to comprehend the two-dimensional representation of three-dimensional objects;
5. Form perception (P), the ability to perceive pertinent detail in objects or in pictorial or graphic material;
6. Clerical perception (Q), the ability to perceive pertinent detail in verbal or tabular material, a measure of speed of perception that is required in many industrial jobs even when the job does not have verbal or numerical content;
7. Motor coordination (K), the ability to coordinate eyes and hands or fingers rapidly and accurately in making precise movements;
8. Finger dexterity (F), the ability to move fingers and manipulate small objects with the fingers rapidly and accurately; and
9. Manual dexterity (M), the ability to move the hands easily and skillfully.

Four of the nine aptitudes (clerical perception, motor coordination, finger dexterity, and manual dexterity) involve speed of work as a major component. From the USES inventory of job-specific tests, those providing the best measure of each of the nine basic aptitudes (based on several statistical criteria) were selected for inclusion in the new General Aptitude Test Battery, which became operational in 1947.
The operational edition of the GATB, B-1002, was produced in two forms, A and B. Form A was reserved for the use of Employment Service offices; Form B was used for validation research and for retesting and was made available to other authorized users for vocational counseling and screening. It was not until 1983 that two additional forms, Forms C and D, of GATB edition B-1002 were introduced.

THE STRUCTURE OF THE GATB

The General Aptitude Test Battery consists of 12 separately timed subtests, which are combined to form nine aptitude scores. Eight of the subtests are paper-and-pencil tests, and the remainder are apparatus tests. Two of the paper-and-pencil subtests (name comparison and mark making), as well as all four subtests that require manipulation of objects, are intended to measure aptitudes that involve speed of work as a major component. Each subtest is scored as number correct, with no correction for guessing.

The following descriptions of the subtests in Forms A and B of the GATB are based on material in Section III of the Manual for the USES GATB (U.S. Department of Labor, 1970:15-16). Examples of various item types are drawn from a pamphlet published by the Utah Department of Employment Security.

Subtest 1: Name Comparison

This subtest contains two columns of 150 names. The examinee inspects each pair of names, one from each column, and indicates whether the names are the same or different. There is a time limit of 6 minutes, or 2.40 seconds per item. This is a measure of the aptitude of clerical perception, Q.

Sample Item: Which pairs of names are the same (S) and which are different (D)?

1. W. W. Jason ........... W. W. Jason
2. Johnson & Johnson ..... Johnson & Johnsen
3. Harold Jones Co. ...... Harold Jones and Co.

Subtest 2: Computation

This subtest consists of arithmetic exercises requiring addition, subtraction, multiplication, or division of whole numbers. The items are presented in multiple-choice format with four alternative numerical answers and one "none of these." There are 50 items to be answered in 6 minutes, or 7.20 seconds per item. This is one of two measures of numerical aptitude, N.

Sample Item: Add

(1) 766 + 11

(A) 677 (B) 755 (C) 777 (D) 656 (E) none of these

Subtest 3: Three-Dimensional Space

This subtest consists of a series of exercises, each containing a stimulus figure and four drawings of three-dimensional objects. The stimulus figure is pictured as a flat piece of metal that is to be bent, rolled, or both. Dotted lines indicate where the stimulus figure is to be bent. The examinee indicates which one of the four drawings of three-dimensional objects can be made from the stimulus figure. There are 40 items with four options each, to be completed in 6 minutes, or 9.00 seconds per item. This subtest is one of three measures of intelligence, G, and the only measure of spatial aptitude, S.

Sample Item: At the left in the drawing below is a flat piece of metal. Which object to the right can be made from this piece of metal?

[Drawing not reproduced: a flat stimulus figure and four drawings of three-dimensional objects.]

Subtest 4: Vocabulary

Each item in this subtest consists of four words. The examinee indicates which two of the four words have either the same or opposite meanings. There are 60 items, each having six response alternatives (all possible pairs from four). The time limit is 6 minutes, or 6.00 seconds per item. This subtest is one of three measures of intelligence, G, and the only measure of verbal aptitude, V.

Sample Items:

Which two words have the same meaning? (a) open (b) happy (c) glad (d) green

Which two words have the opposite meaning? (a) old (b) dry (c) cold (d) young

Subtest 5: Tool Matching

This subtest consists of a series of exercises containing a stimulus drawing and four black-and-white drawings of simple shop tools. Different parts of the tools are black or white. The examinee indicates which of the four black-and-white drawings is the same as the stimulus drawing. There are 49 items with a time limit of 5 minutes, or 6.12 seconds per item. This is one of two measures of form perception, P.

Sample Item: At the left in the drawing below is a tool. Which object to the right is identical? Variations exist only in the distribution of black and white in each drawing.

[Drawing not reproduced: a stimulus tool and four black-and-white drawings of the same tool.]

Subtest 6: Arithmetic Reasoning

This subtest consists of a number of arithmetic problems expressed verbally. There are five alternative answers for each item, with the fifth being "none of these." There are 25 items with a time limit of 7 minutes, or 16.80 seconds per item. This subtest is one of three measures of intelligence, G, and one of two measures of numerical aptitude, N.

Sample Item: A man works 8 hours a day, 40 hours a week. He earns $1.40 an hour. How much does he earn each week?

(A) $40.00 (B) $44.60 (C) $50.60 (D) $56.00 (E) none of these

Subtest 7: Form Matching

This subtest presents two groups of variously shaped line drawings. The examinee indicates which figure in the second group is exactly the same size and shape as each figure in the first or stimulus group. Total test time is 6 minutes, or 6.00 seconds per item. This subtest is one of two measures of form perception, P.

Sample Item: For questions 9 through 12 find the lettered figure exactly like the numbered figure.

[Drawing not reproduced: a group of numbered line figures and a group of lettered line figures.]

(The actual test would have 25 or more items within a group.)

Subtest 8: Mark Making

This subtest consists of a series of small empty boxes in which the examinee is to make the same three pencil marks, working as rapidly as possible. The marks to be made are short lines, two vertical and the third a horizontal line beneath them. There are 130 boxes to be completed in 60 seconds, or 0.46 seconds per item. This subtest is the only measure of motor coordination, K.

Subtest 9: Place

The equipment used for Subtests 9 and 10 consists of a rectangular pegboard divided into two sections, each containing 48 holes. The upper section contains 48 cylindrical pegs. In Subtest 9, the examinee moves the pegs from the holes in the upper part of the board and inserts them in the corresponding holes in the lower part of the board, moving two pegs simultaneously, one in each hand. This performance (moving 48 pegs) is done three times, with the examinee working rapidly to move as many of the pegs as possible during the time allowed for each of the three trials, 15 seconds or 0.31 second per peg. The score is the number of pegs moved, summed over the three trials. There is no correction for dropped pegs. This test is one of two measures of manual dexterity, M.
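
The per-item times quoted in the subtest descriptions above are simply the time limit divided by the number of items (or pegs). The following Python sketch, offered as an illustration and not as part of the GATB materials, recomputes them for the subtests described so far whose item counts are stated. The names `SUBTESTS`, `seconds_per_item`, and `timing_table` are hypothetical, and Subtest 7 is left out because its item count is not given.

```python
# Time limit in seconds and number of items (or pegs per trial) for each
# subtest, as stated in the descriptions above. Subtest 7 is omitted
# because its item count is not given.
SUBTESTS = {
    "1 Name Comparison": (360, 150),
    "2 Computation": (360, 50),
    "3 Three-Dimensional Space": (360, 40),
    "4 Vocabulary": (360, 60),
    "5 Tool Matching": (300, 49),
    "6 Arithmetic Reasoning": (420, 25),
    "8 Mark Making": (60, 130),
    "9 Place (one trial)": (15, 48),
}


def seconds_per_item(time_limit_s: float, n_items: int) -> float:
    """Average time available per item if every item is attempted."""
    return time_limit_s / n_items


def timing_table(subtests=SUBTESTS):
    """Return {subtest name: seconds per item, rounded to two decimals}."""
    return {
        name: round(seconds_per_item(limit, n), 2)
        for name, (limit, n) in subtests.items()
    }


if __name__ == "__main__":
    for name, spi in timing_table().items():
        print(f"{name:<28}{spi:6.2f} seconds per item")
```

Laying the figures side by side makes the range plain: from under half a second per mark in Subtest 8 to almost 17 seconds per problem in Subtest 6.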

Subtest 10: Turn

For Subtest 10, the lower section of the board contains the 48 cylindrical pegs. The pegs, which are painted in two colors (one end red and the other end white), all show the same color. The examinee moves a wooden peg from a hole, turns the peg over so that the opposite end is up, and returns the peg to the hole from which it was taken, using only the preferred hand. The examinee works rapidly to turn and replace as many of the 48 cylindrical pegs as possible during the time allowed, 30 seconds. Three trials are given for this test. The score is the number of pegs the test taker attempted to turn, summed over the three trials. The time allowed is 0.63 second per peg and there is no correction for errors. This subtest is one of two measures of manual dexterity, M.

[Figure not reproduced: pegboard layout for right-handed examinees, with numbered hole positions and the place where the examinee stands.]

Subtest 11: Assemble

The equipment used for Subtests 11 and 12 consists of a small rectangular board (finger dexterity board) containing 50 holes and a rod to one side, and a supply of small metal rivets and washers. In Subtest 11, the examinee takes a small metal rivet from a hole in the upper part of the board with the preferred hand and at the same time removes a small metal washer from a vertical rod with the other hand; the examinee puts the washer on the rivet and inserts the assembled piece into the corresponding hole in the lower part of the board using only the preferred hand. The examinee works rapidly to move and assemble as many rivets and washers as possible during the time allowed. There is one scored trial of 90 seconds, or 1.80 seconds per rivet. The score is the number of rivets moved; there is no correction for dropped rivets or for moving rivets without washers. This subtest is one of two measures of finger dexterity, F.

Subtest 12: Disassemble

The equipment for this subtest is the same as that described for Subtest 11. The examinee removes the small metal rivet of the assembly from a hole in the lower part of the board, slides the washer to the bottom of the board, puts the washer on the rod with one hand and the rivet into the corresponding hole in the upper part of the board with the other (preferred) hand. The examinee works rapidly to move and replace as many rivets and washers as possible during the time allowed. There is one timed trial of 60 seconds, or 1.20 seconds per rivet.
The score is the number of rivets moved; there is no correction for dropped rivets or washers. This subtest is one of two measures of finger dexterity, F.

[Figures not reproduced: finger dexterity board layouts for Subtest 11 and Subtest 12, with numbered hole positions and the place where the examinee sits.]

HOW GATB SCORES ARE DERIVED

There are more than 750 items on the GATB altogether. But an applicant's score is not simply the sum of the correct answers on each subtest. The generation of GATB scores from subtest scores involves a number of conversion procedures intended to provide the scores with meaning and to suitably standardize and weight subtest scores in the various forms of the test. This section describes the mechanics of producing GATB scores under traditional procedures and under the new VG-GATB Referral System (U.S. Department of Labor, 1970, 1984c). It also looks briefly at the development of GATB norms and the equating of test forms, both of which influence the conversions made.

Obtaining GATB Scores

There are three steps in obtaining GATB scores under the traditional procedures:

1. The first step is to calculate the number of items correct for each of the 12 subtests. There is no penalty for wrong answers.
2. The second step is to convert each raw score so that it is referenced to the norming population. The specific conversion depends on which aptitude the subtest score will be used for (arithmetic reasoning has a different value for G, intelligence, than for N, numerical aptitude), the form of the GATB that was administered, and the type of answer sheet used. There is a conversion table for each subtest for each form of the GATB. Three of the subtests are components of two different aptitudes and hence have two conversion tables for each form. Each raw score will go through two or three transformations in becoming an aptitude score.
3. The third step is to sum the converted scores into aptitude scores.

The conversion tables used to produce aptitude scores are designed to accomplish three things: first, to put all aptitude scores on a single measurement scale having a mean of 100 and a standard deviation of 20 in the norming population; second, to make scores on all operational forms of the test comparable with one another (so that a score of 109 on the verbal subtest in Form A means the same as a score of 109 on the verbal subtest in Form B); and third, to weight the components of an aptitude score when it consists of more than one subtest.
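
The three traditional steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the USES procedure: the subtest-to-aptitude mapping is taken from the subtest descriptions in this chapter, but the conversion tables (`TOY_TABLES`) contain invented numbers, the function names are hypothetical, and the dependence on answer-sheet type is ignored.

```python
from typing import Dict, List, Tuple

# Subtests contributing to each aptitude, per the subtest descriptions.
# Subtests 3, 4, and 6 each feed two aptitudes.
APTITUDE_SUBTESTS: Dict[str, List[int]] = {
    "G": [3, 4, 6],
    "V": [4],
    "N": [2, 6],
    "S": [3],
    "P": [5, 7],
    "Q": [1],
    "K": [8],
    "F": [11, 12],
    "M": [9, 10],
}

# Hypothetical conversion tables keyed by (form, aptitude, subtest).
# Each maps a raw number-correct score to a converted score.
# The numbers are invented placeholders, NOT actual GATB values.
ConversionTable = Dict[int, int]
TOY_TABLES: Dict[Tuple[str, str, int], ConversionTable] = {
    ("A", "N", 2): {20: 48, 21: 50, 22: 52},
    ("A", "N", 6): {10: 46, 11: 49, 12: 52},
}


def number_correct(responses: List[str], key: List[str]) -> int:
    """Step 1: count correct answers; wrong answers carry no penalty."""
    return sum(r == k for r, k in zip(responses, key))


def aptitude_score(
    form: str,
    aptitude: str,
    raw_scores: Dict[int, int],
    tables: Dict[Tuple[str, str, int], ConversionTable],
) -> int:
    """Steps 2 and 3: convert each contributing raw score, then sum."""
    total = 0
    for subtest in APTITUDE_SUBTESTS[aptitude]:
        table = tables[(form, aptitude, subtest)]  # step 2: table lookup
        total += table[raw_scores[subtest]]
    return total  # step 3: sum of converted scores


if __name__ == "__main__":
    raw = {2: 21, 6: 11}  # raw scores on Subtests 2 and 6
    print(aptitude_score("A", "N", raw, TOY_TABLES))
```

With these placeholder tables, raw scores of 21 on Subtest 2 and 11 on Subtest 6 give a Form A numerical aptitude score of 50 + 49 = 99. Keying each table by form, aptitude, and subtest mirrors the point made above that the same raw score converts differently depending on the aptitude it feeds and the form administered.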
The new VG-GATB Referral System, in which all jobs are clustered into one of five job families, and in which percentile scores are computed on the basis of group identity (black, Hispanic, other), requires two further steps:

4. The conversion of aptitude scores to "B" scores. There are two aspects to the process: the aptitudes are reduced to three composites, namely a cognitive composite (G + V + N), a perceptual composite (S + P + Q), and a psychomotor composite (K + F + M); and the composites are accorded different relative weights for each of the five job families according to their importance in predicting job performance in each family. There is a conversion table for each of the three composites, and each table has conversions for each of the five job families, for a total of 15 B scores. USES Test Research Report No. 45 (U.S. Department of Labor, 1983b) describes regression equations relating the three composite scores to job performance in each of the five job families. Regression coefficients were used to formulate relative weights for the aptitude composites in forming B scores.
5. The final step is to calculate percentile scores from the B scores. For each of the five job families, the three B (composite) scores are summed. Each of these five numbers is then converted into a percentile score for the appropriate population group (black, Hispanic, other).

Test batteries usually require score conversions of some sort both to standardize the scale of measurement and to provide scores with meaning. However, the amount of manipulation that GATB scores undergo is of some concern to the committee. Each of the conversion tables is based on a set of judgments and analyses that we have not been able to fully reconstruct, despite a careful review of the GATB technical manuals. It is, therefore, difficult to comprehend the links between the raw scores and the within-group percentile scores. The several layers of computations have gradually accumulated over time. Exactly the same job family scores could be obtained by taking suitable linear combinations of the subtest scores. And indeed, predictors of almost the same validity as the job family scores would be obtained by an unweighted sum of the subtest scores.

GATB Norms

The purpose of norms is to show an individual's relative standing in some appropriate reference group. Norms for the GATB are based on what USES calls the General Working Population Sample, a subset of 4,000 of a total of 8,000 workers for whom complete GATB data were available in 1952.
The sample of 4,000 was chosen to be representative of the work force as it appeared in the 1940 census, with one exception: the base population was restricted to employed workers ages 18 to 54 and included no farmers, foremen, proprietors, managers, or officials. The five occupational groups defined by the Bureau of the Census (professional and semiprofessional; clerical, sales; craftsmen; operatives; laborers, except farm and mine) were represented in the standardization sample in proportion to their presence in the census. The sample was also stratified on the basis of sex, age, and (less successfully) geographic location.

This General Working Population Sample is the reference population in which the GATB aptitudes are standardized to have a mean of 100 and a standard deviation of 20. A study conducted in 1966 with test data from 23,428 workers indicated that the norms had remained stable to that point. We have not seen more recent information on the General Working Population Sample norms; the significant structural changes in the economy since then, including the continued decline of manufacturing and emergence of new high-technology occupations, suggest the need for renewed attention to the GATB norming sample.

Norms for Within-Group Scoring

The GATB General Working Population Sample was not stratified by racial or ethnic group identity. As a consequence, implementation of the VG-GATB Referral System required additional normative data to permit within-group percentile scoring by job family (steps 4 and 5, above). The current norms are based on the 8,310 blacks, 2,102 Hispanics, and 18,359 "others" in an expanded data base of 143 validity studies conducted since 1972. Native American norms were produced in 1986. The samples used were 472 Native American employed workers from GATB validity studies and 1,349 Native American applicants prior to 1985.

From the scant information available, the committee concludes that an improved normative base is required if group-based score adjustments continue to be used in the VG-GATB Referral System. The current norm groups are by no means nationally representative samples. Nor is it possible to evaluate how similar they might be to groups of applicants for particular jobs. Thus within-group percentile score conversions that produce the same distribution of percentile scores for all racial or ethnic groups in the norm group may provide quite different distributions among groups of applicants for particular jobs. Since most of the data are based on validity studies from only two job families (Job Families IV and V), special caution is appropriate for the remaining three job families.
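
Steps 4 and 5 and the role of the within-group norms can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the composites follow the definitions given above, but the job-family weights and norm samples are invented, the operational system uses conversion tables rather than the direct weighting shown here, the percentile is defined as the percentage of the norm sample scoring at or below the applicant (one of several conventions), and all function names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def composites(aptitudes: Dict[str, float]) -> Dict[str, float]:
    """Reduce the nine aptitude scores to the three composites."""
    return {
        "cognitive": aptitudes["G"] + aptitudes["V"] + aptitudes["N"],
        "perceptual": aptitudes["S"] + aptitudes["P"] + aptitudes["Q"],
        "psychomotor": aptitudes["K"] + aptitudes["F"] + aptitudes["M"],
    }


def job_family_score(comp: Dict[str, float], weights: Dict[str, float]) -> float:
    """Step 4, simplified: weight each composite for one job family and sum."""
    return sum(weights[name] * value for name, value in comp.items())


def within_group_percentile(score: float, norm_scores: Sequence[float]) -> float:
    """Step 5: percentage of the group's norm sample scoring at or below `score`."""
    ordered = sorted(norm_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


if __name__ == "__main__":
    applicant = dict.fromkeys("GVNSPQKFM", 100)  # every aptitude at the norm mean
    toy_weights = {"cognitive": 2, "perceptual": 1, "psychomotor": 1}  # invented
    score = job_family_score(composites(applicant), toy_weights)

    # Invented norm samples for two groups; real norm groups are far larger.
    toy_norms = {
        "group_1": [1000, 1100, 1200, 1300, 1400],
        "group_2": [1050, 1150, 1250, 1350, 1450],
    }
    for group, sample in toy_norms.items():
        print(group, score, within_group_percentile(score, sample))
```

In this toy example the same job family score of 1200 falls at the 60th percentile of one norm sample and the 40th percentile of the other. The reported percentile therefore depends entirely on the norm sample used, which is why the composition and representativeness of the norm groups matter.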
Equating Alternate Forms of the GATB

As the Standards for Educational and Psychological Testing adopted by the major professional organizations point out (American Educational Research Association et al., 1985:31), alternate forms of a test would, in the ideal case, be interchangeable in use: "it should be a matter of indifference to anyone taking the test or to anyone using the results whether Form A or Form B of the test was used." However, even if considerable care is taken to make two forms of a test as similar as possible in terms of content and format, the forms cannot be expected to be precisely equal in difficulty. Consequently, the use of simple number-right scores without regard to form would place the people taking the more difficult of the two forms at a disadvantage. To take the unintended differences in difficulty into account, "it is usually necessary to convert the scores of one form to the units of the other, a process called test equating" (p. 31).

There are a number of data collection designs and analytical techniques that can be used to equate forms of a test. Detailed descriptions of the various approaches can be found in Angoff (1971) and in Petersen et al. (1989). Regardless of the approach, however, there are two major issues that need to be considered in judging the adequacy of the equating: (1) the degree to which the forms measure the same characteristic or construct and (2) the magnitude of the errors in the equating due to the procedure and sampling.

Our review of the evidence concerning the equating of GATB forms revealed a mixed picture (see Jaeger et al., Appendix A, for a detailed discussion; U.S. Department of Labor, 1984b). Alternate Forms A, C, and D of the first eight subtests of the GATB, which define all the aptitudes except finger dexterity and manual dexterity, have adequate intercorrelations and sufficiently similar patterns of correlations with the other subtest scores to treat the scores as interchangeable after equating. The procedures used to equate the alternate forms for the first eight subtests are reasonable and, if appropriately applied, should yield equated scores with relatively small standard errors of equating. Missing details concerning Form B preclude judgment at this time. In the future, better documentation of the details of the equating analyses needs to be provided to enable an independent check on the equating.

Forms C and D of Subtests 9 through 12 of the GATB do not correlate as highly with Form A as would be desirable to consider the forms to be interchangeable after equating.
Furthermore, the pattern of intercorrelations among the subtests for Form D does not appear to be sufficiently similar to the pattern for Form C to conclude that the forms measure essentially the same characteristics. Finally, the equating procedures and sample sizes used for Subtests 9 through 12 produce results that are subject to larger errors of equating than the procedures and sample sizes used to equate Subtests 1 through 8. For these reasons, scores from the alternate forms that are based on Subtests 9 through 12 should not be considered to be interchangeable.

RELIABILITY OF THE GATB APTITUDE SCORES

Aptitude tests such as the GATB are intended to measure stable characteristics of individuals, rather than transient or ephemeral qualities. Such tests must measure these characteristics consistently, if they are to be useful. Reliability is the term used to describe the degree to which a test measures consistently. The psychometric literature includes a variety of methods for estimating test reliability. These methods differ in their sensitivity to various sources of measurement error, in their applicability to different types of tests, and in their usefulness for particular purposes. When tests are used to assess aptitudes or other traits that are expected to be stable across weeks, months, or years, the most appropriate reliability estimation procedures will reflect the stability of measurements across time.

The committee conducted a careful review of studies of the temporal stability of the GATB. The time period between test administrations ranged from one day to four years, and the studies have involved samples of examinees that varied widely in age and level of education. Estimates of the temporal stability of GATB aptitude scores have also been computed for examinees of different races.

Whether the stability coefficients of GATB aptitude scores are sufficiently large is a matter of interpretation. Certainly, the stabilities of the cognitive aptitudes (G, V, N) compare very well with those of corresponding aptitudes in other batteries and with those of many other tests used as a basis for selection and classification decisions concerning individuals. The gradualness of the degradation of these aptitudes' stability coefficients as a function of time interval is also impressive. Figure 4-1 is a scatter diagram that illustrates the stability coefficients as a function of the time interval between test administrations for the G aptitude.

FIGURE 4-1 Stability coefficient for G versus time between test administrations. [Scatter diagram not reproduced; the horizontal axis is the time interval in days and the vertical axis is the stability coefficient.]

The stability coefficients of the GATB perceptual aptitudes are somewhat smaller than those of the cognitive aptitudes, but again, compare well to those of corresponding aptitudes in other test batteries. The stability coefficients of psychomotor aptitudes F and M are substantially smaller than those of other aptitudes assessed by the GATB and, if these aptitude scores were to be used individually for making selection or classification decisions, would be regarded as unacceptably small. However, this is probably not a problem for the VG-GATB system, since referral decisions are based on composites of aptitude scores. Although direct estimates of the stability of the operational GATB aptitude composites (such as KFM) are not available, the estimated stability coefficient over a time interval of two weeks or less for a unit-weighted composite of abilities K, F, and M is 0.81. This value is sufficiently large not to preclude interpretation of scores for individual examinees. (Additional information on the reliability of the GATB can be found in Appendix A.)

CONSTRUCT VALIDITY ISSUES

In this section we report on evidence that bears on USES claims that the subtests of the GATB measure the aptitudes with which they are identified in the GATB Manual (U.S. Department of Labor, 1970) and nothing more. In particular, the committee conducted an exhaustive review of the literature on convergent validity, which reports the strength of relationships between subtests of the GATB and corresponding subtests of other test batteries. Evidence of strong positive relationships between measures purportedly of the same construct is supportive of construct validity claims for all related measurement instruments.
Thus the claim that the subtests of the GATB measure the aptitudes attributed to them (e.g., intelligence, verbal aptitude, spatial aptitude) would be enhanced by data of this sort and weakened if small to moderate correlations between corresponding subtests were to be found. (A de- tailed discussion of convergent validity findings can be found in Appendix A.) Chapter 14 of Section III of the GATB Manual (U.S. Department of Labor, 1970), entitled "Correlations with Other Tests," is a primary source of convergent validity evidence. That chapter contains correlation matrices resulting from studies of the GATB and a variety of other aptitude tests and vocational interest measures. Results for 64 studies are reported. Since the publication of the GATB Manual, correlations between various GATB aptitudes or subtests and corresponding subtests

of other test batteries have been provided in studies by Briscoe et al. (1981), Cassel and Reier (1971), Cooley (1965), Dong et al. (1986), Hakstian and Bennett (1978), Howe (1975), Kettner (1976), Kish (1970), Knapp et al. (1977), Moore and Davies (1984), O'Malley and Bachman (1976), and Sakolosky (1970). The sizes and compositions of examinee samples used in these studies are diverse, as are the aptitude batteries with which GATB subtests and aptitudes were correlated. They range from 40 ninth-grade students who completed both the GATB and the Differential Aptitude Test Battery (DAT), to 1,355 Australian army enlistees who completed the GATB and the Australian Army General Classification Test. However, in 8 of 13 studies (many of which considered several independent samples of examinees), the samples consisted of high school students.

Distributions of convergent validity coefficients for the GATB cognitive and perceptual aptitudes are summarized in Table 4-1 and, for ease of visual comparison, are depicted in Figure 4-2. As can be seen, the distributions for the cognitive aptitudes of the GATB (G, V, and N) provide moderately strong support for claims that these aptitudes are appropriately named and measured, with median coefficients of .75, .72, and .68, respectively. The results are based on more than 50 studies of each aptitude. Corresponding results for the perceptual aptitudes of the GATB (S, P, and Q) are somewhat less convincing. Data for the psychomotor aptitudes are so meager (because the GATB is one of very few tests that attempt to measure them) that judgment on their convergent validity must be withheld.

TABLE 4-1 Summary Statistics for Distributions of Convergent Validity Coefficients for the Cognitive GATB Aptitudes (G, V, and N), the Perceptual GATB Aptitudes (S, P, and Q), and the Psychomotor Aptitudes (K, F, and M)

Aptitude  Number of Studies  Minimum  First Quartile  Median  Third Quartile  Maximum
G         51                 .45      .67             .75     .79             .89
V         59                 .22      .69             .72     .78             .85
N         53                 .43      .61             .68     .75             .85
S         19                 .30      .58             .62     .70             .73
P          8                 .38      .44             .47     .57             .65
Q         16                 .24      .38             .50     .60             .76
K          1                 .58      .58             .58     .58             .58
F          2                 .37      .37             .39     .41             .41
M          1                 .50      .50             .50     .50             .50

Although the median convergent validity coefficient observed for the

spatial aptitude (S) was respectably large, the corresponding median values for the form perception (P) and clerical perception (Q) aptitudes were smaller than would be desired. The three-dimensional-space subtest is said to measure both intelligence and spatial aptitude and might therefore require greater reasoning ability and inferential skill than is typical of measures of spatial aptitude found in other batteries. The name comparison subtest of the GATB appears to tap only a subset of the skills typically associated with clerical perception.

FIGURE 4-2 Distributions of convergent validity coefficients for GATB cognitive aptitudes (G, V, and N) and GATB perceptual aptitudes (S, P, and Q). The number of studies (n) on which the results are based is indicated for each aptitude. [The figure plots, on a scale from 0.0 to 1.0, the minimum, first quartile, median, third quartile, and maximum for General Aptitude (G), Verbal Aptitude (V), Numerical Aptitude (N), Spatial Aptitude (S), Form Perception (P), and Clerical Perception (Q).]

COMPARISON WITH THE ASVAB AND OTHER TEST BATTERIES

The GATB is one of a number of test batteries used in this country for vocational counseling or employee selection and classification. In order to gauge the relative quality of the GATB, the committee reviewed four of the more widely used of these tests: the Armed Services Vocational Aptitude Battery (ASVAB), the Differential Aptitude Test, the Employee Aptitude Survey, and the Wonderlic Personnel Test. For purposes of this report, we limit our discussion largely to the ASVAB testing program, since it provides the closest parallel to the way the VG-GATB would function and might reasonably be considered an appropriate model should the Employment Service proceed with test-based referral as a major component of its employment program.

The ASVAB is the cognitive abilities test battery used to select and classify applicants for military service in the enlisted ranks. It is administered annually to approximately 1 million applicants for military service, as well as to an equal number of students in the tenth through twelfth grades and postsecondary students. (The latter administrations provide Service recruiters with the names of prospects and provide the schools with a vocational aptitude test battery for their students at no cost.)

The ASVAB is the most recent in a series of tests, beginning with the Army General Classification Test of the World War II era, used for initial screening of potential entrants into military service, for purposes of classification and assignment, or for both. Introduced in the late 1960s for use in the DOD Student Testing Program, the ASVAB was officially adopted in 1976 as the DOD enlistment screening and classification battery.
In the 13 years of its operational use, new forms of the ASVAB have been introduced at about four-year intervals. ASVAB Forms 5, 6, and 7 made up the first operational test battery; Form 5 was designated for use in the student testing program and the latter two in the Enlistment Testing Program. For enlistment processing, Forms 6 and 7 were replaced by Forms 8, 9, and 10 in 1980; by Forms 11, 12, and 13 in 1984; and by Forms 15, 16, and 17 in 1989. (In 1984, Form 14 replaced Form 5 as the current form for school administrations.) The three forms introduced in 1980 included certain significant changes in the test battery, including the deletion of the spatial abilities subtest. The 1984 and 1989 batteries were developed to be parallel to their predecessor. Among the reasons for this cycle of new forms is the need to maintain the integrity of the test battery in the all-volunteer environment. The pressures on military recruiters to meet enlistment quotas must be balanced by close attention to test security.

ASVAB Test Parts

The ASVAB includes 10 separately timed subtests and takes about three hours to administer. There are eight power subtests (tests for which speed of work has no influence on an examinee's score) and two speeded subtests. The test parts are:

1. General science (GS);
2. Arithmetic reasoning (AR);
3. Word knowledge (WK);
4. Paragraph comprehension (PC);
5. Numerical operations (NO) (speeded);
6. Coding speed (CS) (speeded);
7. Auto and shop information (AS);
8. Mathematical knowledge (MK);
9. Mechanical comprehension (MC); and
10. Electronics information (EI).

Four of the subtests (AR, WK, PC, and MK) make up the Armed Forces Qualification Test (AFQT). The AFQT, which is considered a general measure of trainability, is used to determine eligibility for enlistment. In addition, each Service has developed its own set of aptitude composites from the ASVAB subtests, which are used to qualify applicants for various career fields. For example, the Army uses a selector composite termed "combat," which includes the ASVAB Subtests AR + CS + AS + MC.

Speededness of the ASVAB

The eight power subtests of the ASVAB appear not to be speeded. This is documented in the ASVAB Technical Supplement (U.S. Department of Defense, 1984b), which presents a study showing the proportions of eleventh- and twelfth-grade students omitting the last item for each of the eight ASVAB power subtests. Higher omit rates were generally shown by the younger students and for the arithmetic reasoning and word knowledge subtests. However, none of these omit rates was particularly high. On average, about 7 percent of twelfth-grade students omitted the last item of the eight subtests. This evidence permits the assertion that the ASVAB subtests so labeled are indeed predominantly power tests.
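The last-item omit rate used as evidence above is simple to compute. The following is a minimal sketch, not the procedure documented in the ASVAB Technical Supplement: the response layout, the use of None to mark an omitted item, the function name last_item_omit_rate, and the toy data are all hypothetical and serve only to illustrate the calculation.

```python
from typing import Optional, Sequence


def last_item_omit_rate(responses: Sequence[Sequence[Optional[int]]]) -> float:
    """Return the proportion of examinees who left the final item blank.

    Each inner sequence holds one examinee's item scores for a subtest;
    None marks an omitted item (an assumed convention for this sketch).
    """
    if not responses:
        raise ValueError("need at least one examinee")
    omitted = sum(1 for row in responses if row[-1] is None)
    return omitted / len(responses)


# Hypothetical toy data: four examinees, four items each.
toy = [
    [1, 0, 1, 1],
    [1, 1, 0, None],
    [0, 1, 1, 0],
    [1, 1, None, None],
]

print(f"{last_item_omit_rate(toy):.2f}")  # prints 0.50 (2 of 4 omitted the last item)
```

A low rate on this index is consistent with a power test, in which most examinees reach the final item; a high rate would suggest that time limits affect scores.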

ASVAB Normative Data

Until 1980, the aptitude levels of military recruits were established with reference to a normative base representing all males serving in the Armed Services during 1944 (Uhlaner and Bolanovich, 1952). In 1980, the Department of Defense, in cooperation with the Department of Labor, undertook a study called Profile of American Youth to assess the vocational aptitudes of a nationally representative sample of youth and to develop current norms for the ASVAB. Subsequent forms of the ASVAB have been calibrated to this 1980 Youth Population, making it the only vocational aptitude battery with nationally representative norms.

The 1980 Youth Population norms were based on a sample of 9,173 people between the ages of 18 and 23 who were part of the nationally representative National Longitudinal Survey of Youth Labor Force Behavior. The sample included 4,550 men and 4,623 women and contained youth from rural as well as urban areas and from all major census regions. Certain groups (blacks, Hispanics, and economically disadvantaged whites) were oversampled to allow more precise analysis than would otherwise be possible (U.S. Department of Defense, 1982).

ASVAB Reliabilities

Reliability data are available for the form of the ASVAB administered to high school students both for the individual subtests and for the aptitude composites. The reliability estimates reported are alternate-form reliability coefficients. This approach combines the measure of temporal stability previously presented for the GATB with the administration of two forms of the same test so that the risk of distortion due to memory effects can be avoided. The alternate-form reliabilities for subtests from ASVAB Forms 8, 9, and 10 range from .57 to .90 with a median of .79 (U.S. Department of Defense, 1984b).
As would be expected, the reliabilities for the aptitude composites are higher; the academic composites ranged from .88 to .94, and the mechanical and crafts composites ranged from .84 to .95 (U.S. Department of Defense, 1984a). In comparison, the alternate-form reliabilities for the GATB cognitive aptitudes are close to .90 and for the perceptual aptitudes are in the low .80s (U.S. Department of Labor, 1986).

ASVAB Validities

The ASVAB Test Manual (U.S. Department of Defense, 1984c) presents tables of validity coefficients for military training, separately for eight career fields. In all, 11 validity coefficients were provided by the Army, 47 by the Navy, 50 by the Marines, and 70 by the Air Force. Those Services reporting validities that were corrected for restriction of range computed the corrections using the 1980 Profile of American Youth Population. Validities were reported for both the AFQT and the aptitude or selector composite used to place recruits in each of the eight career fields.

There are difficulties in trying to interpret these data. The training criterion is problematic when self-paced instruction is used or when courses are graded pass/fail rather than along a numerical continuum. In addition, training criteria are dependent on the detail of records maintained by the particular training school, which differs by occupational specialty and by Service.

There are also difficulties in trying to summarize the data, largely because of differences in what each Service reported. Both the Army and the Navy reported uncorrected and corrected validities for the AFQT and the selector composites. The Air Force reported only selector validities, uncorrected, whereas the Marine Corps reported only corrected validities, but for both AFQT and the selector composites.

Nevertheless, enough data are presented to make an estimate of ASVAB validities. As is true of the GATB, there is a broad range of observed validities; there are examples of marginal predictive power and a few cases of dramatically high prediction: the Navy selector composite for cryptologic technician produces uncorrected validities of .60. Over all combinations, we estimate the weighted mean validity of the AFQT for training to be .33 (uncorrected) and for the selector composites to be .37 (uncorrected).
These correlations are at the same general level of predictive efficiency as the mean validities we estimate for the GATB against a training criterion and, as might be expected, somewhat higher than the validities for a performance criterion (supervisor ratings) (see Chapter 8).

One trend in the military data that is pertinent in the context of this study of the GATB and validity generalization is that there is a tendency for the more job-specific selector composites to produce slightly higher validities than the AFQT. Of the studies that reported both AFQT and selector validities, the mean uncorrected selector validities were higher than the AFQT validities in 11 comparisons, were equal in 3 comparisons, and were lower in 5.

This pattern in the relative validities of selector composites and the AFQT is confirmed in more extensive reports of the Service data. Wilbourn and colleagues' report (1984) on the relationships of ASVAB Forms 8, 9, and 10 to Air Force technical school grades shows comparatively high mean validities for both AFQT and selector composites, with the selector composites producing slightly higher coefficients in three of the four aptitude areas (Table 4-2).

TABLE 4-2 Uncorrected Weighted Validities for Training by Selector Composite and AFQT for Four Air Force Career Fields

Selector Composite  Composite Components  N   n      rcomp  rAFQT  n̄
Mechanical          GS, AS, MC            19  9,185  .43    .39    483
Administrative      WK, PC, NO, CS         7  3,170  .21    .43    453
General             WK, AR, PC            16  9,183  .43    .41    574
Electronic          GS, AR, MK, EI        26  6,166  .48    .35    237

NOTE: N = number of studies; n = number of examinees; n̄ = average number of examinees; rcomp = uncorrected weighted validity of selector composite for training; rAFQT = uncorrected weighted validity of the Armed Forces Qualification Test for training; GS = general science; AR = arithmetic reasoning; WK = word knowledge; PC = paragraph comprehension; NO = numerical operations (speeded); CS = coding speed (speeded); AS = auto and shop information; MK = mathematical knowledge; MC = mechanical comprehension; EI = electronics information.

SOURCE: Wilbourn, James M., Lonnie D. Valentine, Jr., and Malcolm J. Ree. 1984. Relationships of the Armed Services Vocational Aptitude Battery (ASVAB) Forms 8, 9, and 10 to Air Force Technical School Final Grades. AFHRL Technical Paper 84-08. Working paper. Manpower and Personnel Division, Brooks Air Force Base, Texas.

Army data are reported in McLaughlin et al. (1984). The Army reports validities for training and for Skill Qualification Tests (job knowledge tests), based on 92 school classes and 112 groups of test takers, each group with 100 cases or more. The uncorrected validities of the selector composite and the general composite (equivalent to the AFQT) in nine occupational areas are shown in Table 4-3. In six occupational areas, the selector composite validity was higher, in two areas the mean weighted AFQT validity was higher, and in the remaining occupational area, the values were the same.
There is some indication that the speeded nature of certain ASVAB subtests is what causes the break in the pattern of relative validities. According to McLaughlin et al. (1984), the reason lies in the lower validities of the two ASVAB speeded subtests (numerical operations, coding speed) compared with the higher validity of the two quantitative subtests (arithmetic reasoning, mathematics knowledge). As Tables 4-2 and 4-3 show, for both the Air Force and the Army, the administrative or clerical composite includes both speeded subtests but no test of mathematics. The AFQT, which then included a half-weighted numerical operations subtest plus a full-weighted arithmetic reasoning subtest, has higher validity than composites with more of the speed factor and less of the quantitative factor.
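The comparisons drawn from Tables 4-2 and 4-3 can be checked directly against the tabled values. The sketch below is not part of the committee's analysis: it transcribes the number of examinees and the two validity columns from each table, tallies how often the selector composite is higher than, equal to, or lower than the general measure, and pools the coefficients. The helper names (tally, weighted_mean) are illustrative, and weighting each coefficient by its number of examinees is an assumption; the chapter does not state the weighting scheme behind its own estimates.

```python
# Rows are (composite, number of examinees, r_comp, r_general_or_AFQT),
# transcribed from Table 4-2 (Air Force) and Table 4-3 (Army).
AIR_FORCE = [
    ("Mechanical", 9185, 0.43, 0.39),
    ("Administrative", 3170, 0.21, 0.43),
    ("General", 9183, 0.43, 0.41),
    ("Electronic", 6166, 0.48, 0.35),
]
ARMY = [
    ("Clerical", 10368, 0.27, 0.39),
    ("Combat", 14266, 0.33, 0.31),
    ("Electronics", 5533, 0.29, 0.26),
    ("Field artillery", 5602, 0.36, 0.34),
    ("General maintenance", 2571, 0.26, 0.23),
    ("Mechanical maintenance", 7073, 0.30, 0.27),
    ("Operators/food", 8704, 0.30, 0.30),
    ("Surveillance/communications", 3729, 0.26, 0.34),
    ("Skilled technical", 7061, 0.33, 0.32),
]


def tally(rows):
    """Count composites whose validity is higher than, equal to, or lower than the general measure."""
    higher = sum(1 for _, _, r_comp, r_gen in rows if r_comp > r_gen)
    equal = sum(1 for _, _, r_comp, r_gen in rows if r_comp == r_gen)
    lower = sum(1 for _, _, r_comp, r_gen in rows if r_comp < r_gen)
    return higher, equal, lower


def weighted_mean(rows, col):
    """Pool the coefficients in column `col`, weighting by number of examinees (an assumed scheme)."""
    total_n = sum(row[1] for row in rows)
    return sum(row[1] * row[col] for row in rows) / total_n


print(tally(AIR_FORCE))  # (3, 0, 1)
print(tally(ARMY))       # (6, 1, 2)
print(round(weighted_mean(AIR_FORCE, 2), 3), round(weighted_mean(AIR_FORCE, 3), 3))
```

The two exceptions in the Army tally and the single exception in the Air Force tally are the clerical, surveillance/communications, and administrative composites, each of which includes both speeded subtests. The pooled means are illustrative arithmetic on the tabled values, not figures reported by either Service.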

TABLE 4-3 Uncorrected Weighted Validities for Training and SQT by Selector Composite and General Composite for Nine Army Career Fields

Selector Composite           Composite Components  N   n       rComp  rGeneral  n̄
Clerical                     CS, NO, WK, PC        16  10,368  .27    .39       648
Combat                       CS, AR, MC, AS         8  14,266  .33    .31       1,783
Electronics                  AR, EI, GS, MK        10   5,533  .29    .26       553
Field artillery              GS, AR, MC, MK         2   5,602  .36    .34       2,801
General maintenance          GS, AS, MK, EI        14   2,571  .26    .23       184
Mechanical maintenance       NO, EI, MC, AS        18   7,073  .30    .27       393
Operators/food               NO, WK, PC, MC, AS    11   8,704  .30    .30       791
Surveillance/communications  NO, CS, WK, PC, AS     5   3,729  .26    .34       746
Skilled technical            WK, PC, MK, MC, GS    14   7,061  .33    .32       504

NOTE: rGeneral = uncorrected weighted validity of the General Composite for Training and Skill Qualification Test. See note in Table 4-2 for identification of other components.

SOURCE: Based on McLaughlin, Donald H., Paul G. Rossmeissl, Lauress L. Wise, David A. Brandt, and Ming-mei Wang. 1984. Validation of Current and Alternative Armed Services Vocational Aptitude Battery (ASVAB) Area Composites: Based on Training and Skill Qualification Test (SQT) Information on Fiscal Year 1981 and 1982 Enlisted Accessions. Technical Report 651. Alexandria, Va.: U.S. Army Research Institute for the Behavioral and Social Sciences.

The reason the Air Force validities are higher than those for the Army is not clear. McLaughlin et al. (1984) suggested that the Army's adoption in the 1970s of criterion-referenced assessment for technical training courses (i.e., pass/fail), and the simultaneous conversion of many courses into a self-paced mode, led to a large reduction in the psychometric quality of available training measures for validation purposes. However, despite any difference in overall validities for these two Services, the appropriate selector composite is a slightly but generally better predictor than the general composite, or AFQT.

CONCLUSIONS

GATB Properties

1. In terms of the stability of scores over time and stability between parallel forms of the test, the GATB exhibits acceptable reliabilities. The reliabilities of the cognitive aptitudes are particularly high and compare well with those of other tests used for selection and classification. The stability coefficients of the perceptual aptitudes are somewhat smaller, but well within the acceptable range. The reliabilities of the individual psychomotor subtests are low, although not so low for the psychomotor composite as to preclude its use.

2. Our review of a very large number of convergent validity studies provides moderately strong support for claims that the subtests of the GATB measure the cognitive constructs they purport to measure. The evidence for the perceptual aptitudes is mixed; the spatial aptitude test bears a respectably large relationship to similarly named subtests in other batteries, but evidence for the form perception and clerical perception subtests is less convincing. Since most aptitude test batteries do not have equivalent psychomotor subtests, this type of analysis is not useful in trying to establish that the K, F, and M subtests are appropriate measures of a psychomotor construct.

3. Only four operational forms of the GATB have been introduced in its 42-year history: Forms A and B were introduced in 1947 and were replaced by Forms C and D in 1983. So long as the GATB was used primarily as a counseling tool, this lack of new forms was probably no serious problem. If, however, the VG-GATB Referral System becomes a regular part of Employment Service operations and the GATB takes on an important gatekeeping function, then the frequent production of new forms, similar to the program for developing new forms of the Armed Services Vocational Aptitude Battery, will be essential to maintain the integrity of the GATB.

4. The scoring system for the VG-GATB seems unduly complex. It involves so many conversions, the exact nature of which is not fully documented, that the link between raw scores and the final within-group percentile scores is clouded.

5. The norms for the GATB are problematic. The General Working Population Sample, developed in the early 1950s to be representative of the work force as it appeared in the 1940 census, is at this point a very dated reference population. There have been enormous structural changes in the economy and the work force in the intervening years.
The more recent norms, developed for the computation of within-group percentile scores, are based on convenience samples that can claim to be neither nationally representative nor scientifically drawn from populations of those who would be applicants for homogeneous clusters of jobs.

6. Our review of the available evidence regarding test equating indicates that Subtests 1 through 8 of GATB Forms A, C, and D (evidence is lacking on Form B) are sufficiently related to one another that the scores can be considered interchangeable after equating. The scores from the alternate forms of the psychomotor subtests (Subtests 9 through 12), however, should not be considered interchangeable.

Comparison with Other Test Batteries

7. On two dimensions of central importance (predictive validity for training criteria and test reliability), the GATB compares quite well with

the other test batteries we reviewed. For example, we estimate the mean uncorrected validity of the Armed Forces Qualification Test for a training criterion to be about .33 across all Services (although some Services report substantially higher validities); for the GATB, the corresponding figure for predicting training criteria would be about .35 overall and .30 for studies since 1972. With the exception of one subtest (arithmetic reasoning), GATB reliabilities are also about the same as those of other test batteries.

8. However, if the GATB is to take on a much more important role in Employment Service operations (if, in other words, it takes on a major gatekeeping function like that exercised by the ASVAB), then it will need to be supported by a similar program of research and documentation. The areas in which the GATB program does not compare well with the best of the other batteries (test security, the production of new forms, equating procedures, the strength of its normative data, the integrity of its power tests) will take on heightened significance.

RECOMMENDATIONS

1. If the VG-GATB Referral System becomes a regular part of Employment Service operations, we recommend a research and development program that allows the introduction of new forms of the GATB at frequent intervals. The Department of Defense program of form development and equating research in support of the ASVAB provides an appropriate model.

2. Test equating will become far more important should the GATB become a central part of the Employment Service job referral system, because such use will necessitate the regular production of new forms of the test. The committee recommends both better documentation of equating procedures and special attention to creating psychometrically parallel forms of the apparatus-based subtests.

3. The USES long-term research agenda should include consideration of a simplified scoring system for the GATB.

4. The USES long-term research agenda should give attention to strengthening the normative basis of GATB scores. The General Working Population Sample should be updated to represent today's jobs and workers. In addition, more appropriate samples need to be drawn to support any score adjustment mechanisms adopted.

5. More reliable measurement of the psychomotor aptitudes deserves a place on the GATB research agenda.

