
Suggested Citation:"5 Problematic Features of the GATB: Test Administration, Speedness, and Coachability." National Research Council. 1989. Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery. Washington, DC: The National Academies Press. doi: 10.17226/1338.


Problematic Features of the GATB: Test Administration, Speededness, and Coachability

In this chapter we examine a number of characteristics of the GATB and the way it is administered that need immediate attention if the test is transformed from a counseling tool into the centerpiece of the U.S. Employment Service (USES) referral system. The difficulties we see range from easily cured problems with the current test administration procedures to some fundamental design features that must be revised if the General Aptitude Test Battery is to take on the ambitious role envisioned in the VG-GATB Referral System.

TEST ADMINISTRATION PRACTICES

Several features of USES-prescribed test administration procedures and the use of the National Computer Systems (NCS) answer sheet appear to be potential threats to the construct validity of the test. If these features affect members of various racial or ethnic groups to differing degrees, they could also be sources of test bias. Each of these issues warrants further investigation.

Instructions to Examinees

The GATB test booklet for each pencil-and-paper subtest instructs examinees to "work as quickly as you can without making mistakes." This instruction implies that examinees will be penalized for making errors when the subtests are scored. In fact, number-right scoring is used for all pencil-and-paper GATB subtests, with no penalties for incorrect guessing or other sources of incorrect answers.

When asked how test administrators responded to questions concerning the type of scoring used with the GATB, the committee was told by USES representatives that honest answers were given. Thus, test-wise examinees who ask about scoring rules have an advantage that is not shared by examinees who do not raise this question. Use of an instruction that misleads examinees about the scoring procedures employed is inconsistent with the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1985). It unnecessarily adds a source of error variance to observed test scores that will reduce measurement reliability. In addition, to the extent that test-wise examinees are differentially distributed across racial and ethnic groups, the inconsistency between test instructions and scoring procedures is a source of test bias that could be readily eliminated.

Our review of the GATB Manual (U.S. Department of Labor, 1970) and the contents of the GATB subtests has raised additional concerns about the vulnerability of the test battery to guessing. Consider Subtest 1, name comparison, a speeded test of clerical perception. Examinees are given 6 minutes to indicate whether the two names in each of 150 pairs of names are exactly the same or different. The GATB Manual indicates that the General Working Population Sample of 4,000 examinees was administered Form A with an IBM answer sheet. The mean score for name comparison was just under 47 items correct with a standard deviation of 17, meaning that it is a highly speeded test. Let us hypothesize with the available statistics for Form A and an IBM answer sheet. If all scores were normally distributed, then scores at the 95th percentile for name comparison would be 75 items correct.
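The 95th-percentile figure for name comparison can be checked with a short calculation. The sketch below is only illustrative: it assumes a mean of exactly 47 (the Manual's figure is "just under 47"), a standard deviation of 17, and the normality hypothesized in the text. The constant and variable names are ours.

```python
from statistics import NormalDist

# Assumed values: mean rounded up to 47 (reported as "just under 47"), SD of 17.
MEAN_ITEMS_CORRECT = 47.0
SD_ITEMS_CORRECT = 17.0

# Hypothesized normal distribution of name comparison scores.
scores = NormalDist(mu=MEAN_ITEMS_CORRECT, sigma=SD_ITEMS_CORRECT)

# Score at the 95th percentile under that hypothesis.
p95_score = scores.inv_cdf(0.95)
print(round(p95_score))  # prints 75
```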
On the basis of these statistics and assumptions, the optimal strategy for an examinee completing the name comparison subtest has two phases. The first would be to randomly mark one of the two bubbles for each of the 150 items as rapidly as possible, without reading the items in order to consider the stimulus names. Assuming an examinee could fill in 150 bubbles within 6 minutes, the second phase of the optimal strategy would then be to begin again with the first item, determine the correct answer, and change the answer already marked if necessary; the examinee would continue working through the subtest in this way until time was called. On one form of the GATB, the actual proportion of items with a correct answer of "exactly the same" was 0.493 (74 of 150 items). Since for roughly half the items on the subtest the correct answer was "exactly the same," an expected score of 75 items correct would result from marking all answers the same way. This "chance" score is higher than the 98th percentile of the GATB General Working Population Sample on the name comparison subtest. Scores could be improved further if the test taker were aware that short runs (3 to 4 items) on the name comparison subtest were identically scored (either "exactly the same" or "different"). In any case, this modified random marking strategy would yield a very high score simply because the subtest is very long and highly speeded.

Our analysis of individual item functioning demonstrates the potential effects of guessing in increasing GATB subtest scores. Table 5-1 presents a worksheet showing the score increase that could be expected for each of the would-be power tests, i.e., those where speed of work does not seem to be a defensible part of the construct (Subtests 2, 3, 4, and 6). The total number of items for each of the subtests can be compared with the number of items that would be included if the test were actually constructed as a power test. The power test limits were set such that 90 percent of the majority group would complete the test. Column 5 shows the typical chance score (added to one's regular score) that could be earned by randomly marking the remaining items. The gain due to chance is also shown as an effect size in standard deviation units (column 8). The effects are large, roughly 1 standard deviation. Thus, assuming a normal distribution, a person scoring at the 50th percentile could increase his or her score to the 84th percentile by guessing on the unfinished portion of the test.

TABLE 5-1 Worksheet on Chance Scores and Coaching for Power Subtests

Column key: (1) Total Items; (2) Power Items(a); (3) Remaining Items, (1-2); (4) Item Options; (5) Chance Score, (1-2)/(4); (6) Average Score on the Test; (7) Standard Deviation; (8) Effect Size, (5)/(7).

Subtest                               (1)   (2)   (3)   (4)   (5)    (6)    (7)   (8)
Subtest 2 (computation)               50    18    32    5     6.4    20     4.8   1.33
Subtest 3 (three-dimensional space)   40    17    23    4     5.75   15.4   6     0.96
Subtest 4 (vocabulary)                60    18    42    6     7      21     8.3   0.84
Subtest 6 (arithmetic reasoning)      25    9     16    5     3.2    9.4    2.9   1.10

(a) Ninety percent of majority examinees would complete this many.
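The worksheet arithmetic in Table 5-1 can be reproduced in a few lines. This is a sketch under stated assumptions rather than the committee's own computation: the item-option counts are taken as assumptions consistent with the worksheet's ratios (remaining items divided by chance score), and the function and variable names are illustrative.

```python
from statistics import NormalDist

# (total items, power-test items, item options, standard deviation).
# Option counts are assumptions inferred from the worksheet's own ratios.
WORKSHEET = {
    "Subtest 2 (computation)": (50, 18, 5, 4.8),
    "Subtest 3 (three-dimensional space)": (40, 17, 4, 6.0),
    "Subtest 4 (vocabulary)": (60, 18, 6, 8.3),
    "Subtest 6 (arithmetic reasoning)": (25, 9, 5, 2.9),
}


def chance_gain(total, power, options, sd):
    """Return (chance score, effect size) from randomly marking unreached items."""
    remaining = total - power          # column 3: items beyond the power-test limit
    chance = remaining / options       # column 5: expected number right by chance
    return chance, chance / sd         # column 8: gain in standard deviation units


results = {name: chance_gain(*row) for name, row in WORKSHEET.items()}
for name, (chance, effect) in results.items():
    print(f"{name}: chance score {chance:.2f}, effect size {effect:.2f}")

# Percentile reached by a median examinee who gains 1 SD, assuming normality.
percentile_after_gain = NormalDist().cdf(1.0) * 100
print(round(percentile_after_gain))  # prints 84
```

An effect size near 1 moves a median examinee to about the 84th percentile, which is the shift described in the text.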

It is possible that the current test could be improved by using a penalty for guessing on the straight speed tests and a correction for guessing on the would-be power tests. As a matter of professional ethics it is essential that the examinees be informed of whatever scoring procedure is to be used and told clearly what test-taking strategies it is in their interests to use. The above analysis documents how vulnerable the current test is to attempts to beat the system. It is not clear what combination of shortened test and change in directions would be best to be fair to all examinees and to ensure the construct validity of each subtest. It would take both conceptual analysis and empirical work to arrive at the best solution. In considering alternatives, one would also have to ask how much the test could be changed without destroying the relevance of existing validity studies.

The National Computer Systems Answer Sheet

When USES first adapted the GATB to a separate, optically scanned answer sheet (the IBM 805 sheet), the test developers noted that "an attempt was made to devise answer sheets which would result in maximum clarity for the examinees and would facilitate the administration of the tests" (U.S. Department of Labor, 1970:2). Unfortunately, this objective is far less evident in the design of the currently used NCS answer sheet. The NCS answer sheet is in the form of a folded 12-inch by 17-inch, two-sided sheet that contains an area for examinee identification, a section for basic demographic information on the examinee, and a section for listing the form of the GATB that the examinee is attempting. In addition, the sheet has separate areas for recording answers to seven of the eight GATB pencil-and-paper subtests. Several features of the NCS answer sheet call on the test-wiseness of examinees.
The bubbles on the NCS answer sheet are very large, and examinees are told to completely darken the bubbles that correspond to their answers to each question. Following this instruction precisely is a time-consuming task that is most likely to be interpreted literally by examinees with the least experience in using optically scannable test answer sheets. Since all of the GATB subtests are speeded (as described above and discussed below), this deficiency will affect the test scores of examinees who follow the instruction most closely. For some subtests, such as the name comparison test, the design of the NCS answer sheet might add a significant psychomotor component to the abilities required to perform well.

THE INFLUENCE OF SPEED OF WORK

Due in large part to the early work and influence of Charles Spearman (Hart and Spearman, 1914; Spearman, 1927:chap. 14), pioneers in the field of educational and psychological testing theorized that measures of speed of work and measures of quality of work were interchangeable indicators of a common construct. It was not until World War II, close to the time that the GATB was under development, that researchers such as Baxter (1941) and Davidson and Carroll (1945) reported the results of factor analytic studies showing different structures for the same tests administered under time-constrained and unlimited-time conditions. The distinctiveness of speed of work and accuracy of work has since been corroborated by Boag and Neild (1962), Daly and Stahmann (1968), Flaugher and Pike (1970), Kendall (1964), Mollenkopf (1960), Terranova (1972), and Wesman (1960), among others.

A test for which speed of work has no influence on an examinee's score (i.e., a test in which every examinee is given all the time needed to attempt every test item) is called a pure power test. According to Gulliksen (1950a:230), a pure speed test is one that is so easy that no examinee makes an error and one so long that no examinee finishes the test in the time allowed.

Commonly used aptitude tests rarely, if ever, fit the definition of a pure power test or a pure speed test. Many such tests, including the subtests of the GATB, combine elements of speed of work and quality of work to a largely unknown degree. However, scores on the GATB appear to depend on speed of work to a far greater extent than is true of more modern aptitude batteries. All of the GATB subtests, whether intended to be tests of speed of work or power tests, have time limits that are extremely short. It is therefore likely that most examinees' scores on these subtests are influenced substantially by the speed at which they work. The subtests were initially designed to insure that very few, if any, examinees would complete each test . . . .
The speed requirements of the tests have been increased since their initial design through the use of separate answer sheets and, more recently, through use of the NCS answer sheet. The NCS answer sheet imposes sufficient additional burden on examinees that the 1970 Manual contains a table of positive scoring adjustments to accommodate its use (see U.S. Department of Labor, 1970:43, Table 7-7).

Figures 5-1 and 5-2 illustrate the speeded nature of the GATB subtests. Subtest 5, tool matching, shown in Figure 5-1, was selected as an example of a speeded test, for which the ability to work quickly is logically a part of the intended construct. In contrast, Subtest 6, arithmetic reasoning, represents a construct that might be more accurately measured in an untimed or power test situation. (A power test is defined operationally as one where 90 percent of examinees have sufficient time to complete all of the test items.) The data were obtained for 7,418 white applicants, 6,827 black applicants, and 1,466 Hispanic applicants from two test centers in 1988. The percentage of test takers attempting each item and getting each item right is plotted. The steeply declining curves, drawn for whites and blacks only, demonstrate the speeded nature of the tests.

FIGURE 5-1 Percentages attempting and number of items correct for whites and blacks on Subtest 5, tool matching (speeded). [Plot not reproduced. Horizontal axis: number of items, 1 to 49; vertical axis: 0 to 100; series: White Attempted, Black Attempted, White Correct, Black Correct.]

For example, in Figure 5-1, nearly 100 percent of both groups attempted the first 16 questions; then there is a sharp decrease in the number of examinees reaching each subsequent question such that by the midpoint of the test only 66 percent of whites and 53 percent of blacks are still taking the test. In pure speed tests the content of test questions is relatively easy, making it only a matter of how fast one works whether an item will be correct or incorrect. As would be expected in such a test, the percentage-correct curves in Figure 5-1 closely parallel the percentage-attempted curves, with some unaccounted-for difficulty at items 9 and 21.

Figure 5-2 also shows a strong overriding influence of speed. To satisfy the definition of a power test for the white group, the test would end at item 8. By the midpoint of the test, only 50 percent of whites and 27 percent of blacks are still taking the test. Although items 6 and 8 are relatively difficult even for examinees who reach them, the percentage correct on the majority of items follows the pattern delimited by the speeded nature of the test.

FIGURE 5-2 Percentages attempting and number of items correct for whites and blacks on Subtest 6, arithmetic reasoning (power test). [Plot not reproduced. Horizontal axis: number of items, 1 to 25; vertical axis: 0 to 100; series: White Attempted, Black Attempted, White Correct, Black Correct.]

The use of speeded subtests to measure constructs that do not include speed as an attribute is a potentially serious construct validity issue. First, the meaning of the constructs measured is likely to be different from the conventional meaning attached to those constructs. For example, do two tests that require correct interpretation of arithmetic problems stated in words and correct application of basic arithmetic operations to the solution of those problems measure the same aptitude, if one is highly speeded and the other is not? The research cited above suggests that the two tests would measure different constructs.

Second, if the speed component of the tests does not assess the abilities of members of different racial or ethnic groups in the same way, the tests might be differentially valid for members of these groups. Helmstadter and Ortmeyer (1953:280) noted:

Although any test may rationally be considered as largely speed or largely power, the relative importance of these two components is not independent of the group being measured, and a test which samples depth of ability for one group may be measuring only a speed component for a second . . . .

As an example of the way this problem might be evidenced for the GATB, Subtest 7, form matching, requires examinees to pair elements of two large sets of variously shaped two-dimensional line drawings. A total of 60 items is to be completed in 6 minutes. Within this time, examinees must not only find pairs of line drawings that are identical in size and shape, but must then find and darken the correct answer bubble on the NCS answer sheet from a set of 10 answer bubbles with labels consisting of single or double capitalized letters (e.g., GUI). The labeling of physically corresponding answer bubbles differs from one item to the next. Since the subtest is tightly timed, identification of the correct answer bubble from the relatively long list presented on the answer sheet might become a significant component of the skill assessed. One could, by inspection, confidently advance the argument that the subtest measures not only form perception, but also the speed of list processing and skill in decoding complex answer sheet formats. The latter skill is dependent on previous experience with tests. Since the extensiveness of such experience will differ for members of different racial or ethnic groups, the subtest might be differentially valid as a measure of form perception for white and black examinees.

Third, the severe time limits of the GATB subtests might produce an adverse psychological reaction in examinees as they progress through the examination and might thereby reduce the construct validity of the subtests. Having attempted a relatively small proportion of items on each subtest, examinees might well become progressively discouraged and thus progressively less able to exhibit their best performance. With the use of separate, optically scanned answer sheets, the most vulnerable examinees are those least experienced with standardized tests, a group in which minority examinees will be overrepresented.

These arguments on the racial or socioeconomic correlates of the effects of test speededness are admittedly speculative. Dubin and colleagues (1969) found few such correlates in a study with test-experienced high school students. However, they cited research by Boger (1952), Eagleson (1937), Katzenmeyer (1962), Klineberg (1928), and Vane and Kessler (1964) that indicated positive effects of extra practice and test familiarity in reducing test performance differences between blacks and whites.

ITEM-BIAS ANALYSES

Statistical procedures, referred to as item-bias indices, are used to evaluate whether items within a test are differentially more difficult for members of a particular subgroup taking the test. Two caveats govern the interpretation of item-bias statistics. First, these indices are measures of internal bias. Bias is defined as differential validity whereby individuals of equal ability but from different groups have different success rates on test items. To establish that individuals have equal ability, the various item-bias methods rely on total test score (or some transformation of total score). Thus internal bias statistics are circular to some extent and cannot detect systematic bias. Systematic or pervasive bias could only be detected using an external criterion, as is done in predictive validity studies. What internal bias procedures are able to reveal are individual test questions that measure differently for one group compared with another. They provide information akin to factor analysis but at the item level. A large bias index signals that an item is relatively more difficult for one group.

The second caveat has to do with the meaning of bias as signaled by these statistics. The analytic procedures were designed to detect irrelevant difficulty, that is, some aspect of test questions that would prevent examinees who know the concept from demonstrating that they know it. An example of irrelevant difficulty would be a high level of reading skill required on a math test, thus obscuring perhaps the true level of mathematics achievement for one group compared with another. However, the statistics actually work by measuring multidimensionality in a test. For example, if physics and chemistry questions were combined into one science test, one subset of questions would probably produce many bias flags unless group differences in both subject areas were uniform. Thus many authors of item-bias procedures have cautioned that significant results are not automatically an indication of bias against a particular group. In fact, the statistical indices are often called measures of differential item functioning to prevent misinterpretation of the results. If each of the dimensions of the test is defensible and appropriate for the intended measurement, then the so-called bias indices have merely revealed differences in group performance.

In order to explore at least partially how the GATB functions and whether it functions differently for different racial or ethnic groups, the committee undertook an analysis of actual answer sheets for a sample of Employment Service applicants. Standard statistical procedures were used to examine characteristics of GATB items within each subtest. These analyses were conducted separately for 6,827 black and 7,418 white test takers from a Michigan test center and for 873 whites and 1,466 Hispanics from a Texas test center. The proportion answering each item correctly, the proportion attempting each item, and point-biserial correlations were calculated. The proportion attempted can index test speed whereas the proportion correct can index item and test difficulty. Point-biserial correlations show the degree of relationship between performance on an individual item and total score on the subtest, reflecting both speed and difficulty.

Proportion Attempted

Inspection of the proportion-attempted statistics shows the same pattern in all seven of the GATB paper-and-pencil subtests. Figures 5-1 and 5-2 give proportion attempted and proportion correct for tool matching and arithmetic reasoning, respectively. Virtually 100 percent of examinees attempt the first item and fewer than 1 percent finish each subtest. Subtests 1 (name comparison), 5 (tool matching), and 7 (form matching) are speeded tests; it is therefore not surprising that many examinees are unable to complete these tests. However, the number of items is far greater than is usual even for speeded tests. For example, Subtest 1 has 150 items, yet by item 75, only 9 percent of the Texas whites are still taking the test. Even smaller percentages of the other groups can be found at later items. Subtest 7 is 60 items long, but only 1 percent of the whites in the Texas sample make it to item 42.

The effect of unrealistic time limits is also apparent on the tests intended to be unspeeded. Power tests, for which examinees have sufficient time to show what they know, are ordinarily defined by a 90 percent completion rate. Subtest 2 (computation), comprised of 50 items, should be complete by item 17 to be a power test for the sample of Michigan whites. Subtest 3 (three-dimensional space) would have to finish with item 17 instead of 40, Subtest 4 (vocabulary) with item 16 instead of 60, and Subtest 6 (arithmetic reasoning) with item 9 rather than 25. Thus these subtests are more than twice as long, for the given time limits, as is appropriate for power tests.
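The 90 percent completion rule lends itself to a simple calculation. The sketch below is a hypothetical illustration, not the committee's procedure: it assumes examinees work through items in order without skipping, so that each examinee can be summarized by the last item reached, and the sample values are invented rather than GATB data.

```python
def proportion_attempted(last_item_reached, n_items):
    """Proportion of examinees reaching each item, assuming items are taken in order."""
    n_examinees = len(last_item_reached)
    return [
        sum(1 for last in last_item_reached if last >= item) / n_examinees
        for item in range(1, n_items + 1)
    ]


def power_test_length(proportions, completion_rate=0.90):
    """Number of leading items reached by at least `completion_rate` of examinees."""
    length = 0
    for proportion in proportions:
        if proportion < completion_rate:
            break
        length += 1
    return length


# Invented example: last item reached by each of 10 examinees on a 12-item test.
last_items = [12, 12, 11, 10, 10, 9, 9, 8, 8, 5]
proportions = proportion_attempted(last_items, n_items=12)
print(power_test_length(proportions))  # prints 8
```

In this invented sample, 9 of 10 examinees reach item 8 but only 7 of 10 reach item 9, so a power-test version of the 12-item test would stop at item 8.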
The committee also conducted item-bias analyses using the Mantel-Haenszel procedure, whereby majority and minority examinees are matched on total score before examining differential performance on individual test items. In this case examinees were matched on total scores on a shortened test, defined as a power test or 90 percent completion test for the white group. These analyses consistently produced bias flags for a series of items in the middle of each test, suggesting that blacks were at a relative disadvantage in the range of the test at which the influence of time limits was most keenly felt.

Proportion Correct

Data on the proportion correct for each test item are difficult to interpret because of the pervasive effects of speed. For every group and test the proportion correct begins at item 1 with nearly 100 percent and trails off to 0 percent somewhere in the middle. Consistent with direct inspection of test content, this pattern in the statistics indicates that the items are arranged by difficulty, with very easy items first, then becoming increasingly more difficult. Even for the tests for which speed is not part of the construct, however, there is a very close correspondence between the proportion attempting an item and the proportion getting it correct. If examinees get to an item, they nearly always answer it correctly. Therefore, it is impossible to use these data to determine the actual difficulty of the items unconfounded by the effects of speed.

Point-Biserial Correlations

Point-biserial correlations have different meanings in speeded and unspeeded conditions. Subtest 1, name comparison, is primarily a speeded test. The items are all very similar in nature. The items in this test were inversely correlated with total test score on the basis of their location in the test rather than on the basis of their similarity-of-item content. That is, items at the beginning of the test correlated zero with total test score because all examinees got them right; these early items thus contribute nothing to the final ranking of examinees on total score. As an examinee progresses through this test, the effects of time limits begin to be felt and there is a gradual crescendo of point-biserial values. Examinees who work the fastest through the test (presuming they are not answering randomly) have higher test scores and get items right. There are, therefore, very high item-total correlations at the limits of good performance. These limits are somewhere in the middle of the test because it is so speeded. Eventually the peak in the point-biserial correlations trails off, presumably because some of the few remaining examinees are choosing speed rather than accuracy in order to answer more questions.

The pattern of point-biserial correlations in the so-called unspeeded GATB tests also reflects the influence of speed on total score. Examinees who get further in the test have higher test scores and are still doing well on the items they attempt. The highest point-biserial values tend to occur at the point at which half of the examinees are still attempting the items.

Available data are also pertinent to an entirely different topic. Earlier in the chapter we hypothesized possible strategies of random response to improve test scores. How test-wise are GATB test takers about the advantage of marking uncompleted items when time runs out? Although the rise and fall of point-biserial correlations suggests that a few examinees might be marking a few items randomly at the limit of their performance in order to obtain higher scores, the long strings of near-zero attempts for the later items suggest that the great majority of examinees are not following this strategy. These test-taking habits would be likely to change substantially if examinees were coached in such effective ways to improve their scores, a likely prospect if the VG-GATB Referral System becomes important.

Because the influence of speed so dominates all these GATB subtests, it is not possible to use point-biserial correlations to judge the homogeneity of items in measuring the intended construct. Hence, internal consistency estimates of reliability, based on point biserials, would be misleading.

PRACTICE EFFECTS AND COACHING EFFECTS

Because of the speededness of the GATB, the test is very vulnerable to practice effects and coaching. If the test comes to be widely used for referral, USES policy makers must be prepared for the growth of coaching schools of the kind that now provide coaching for the Scholastic Aptitude Test and tests for admission to professional schools. USES must also expect the publication of manuals to optimize GATB scores, such as those already available for the Armed Services Vocational Aptitude Battery (ASVAB).

Effects of Practice on GATB Scores

Practice effects are attributable to several influences. If examinees are retested with the same form of an examination, their scores might increase because they remember their initial responses to items and can therefore use the same answers without considering the items in detail, or because they become wiser and more efficient test takers as a result of completing the examination once. If examinees are retested with an alternate form of an examination, specific memory effects will not be present, and gains in score are attributable only to the effects of practice.

Data on the effects of practice on the GATB cognitive (G, V, and N), perceptual (S, P, and Q), and psychomotor (K, F, and M) aptitudes are reported in Figures 5-3 and 5-4, which are based on studies detailed in Appendix B, Tables B-1 to B-6.
As the figures show, the estimated size of the effects of retesting on the GATB were greatest when the same form of the test was repeated.) Figure 5-3 summarizes the erects of retesting ban estimated effect size is the difference between the mean score when examinees were tested initially and the mean score when examinees were retested, divided by the standard deviation of scores when examinees were tested initially. Thus an estimated effect size of 0.5 indicates that the mean score when examinees were retested was half a standard deviation unit higher than when the examinees were tested initially. With an effect size of O.S, an examinee who outscored 50 percent of the other examinees when tested initially would, when retested, outscore 69 percent of the other examinees in the initial-testing normal-score distribution.

TEST ADMINISTRATION, SPEEDEDNESS, AND COACHABlLl 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 A-=- I I I I I , , I I I I I I General Aptitude Verbal Aptitude n = 8 Ad} Numerical Aptitude n = 8 {L it Spatial Aptitude I n=8 Form Perception Motor Coordination OK n= 10 {a} Clerical Perception . ~ Finger Dexterity _ F _ n= 10 13;74 L I I I I I I I I I I I I ~ 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 Manual Dexterity LEGEND: | M Minimum ~ ~Maximum - First '. Third Quartile Median Quartile FIGURE 5-3 Practice effects when the same test form was used both times. Distributions of estimated effect sizes (initial testing to retesting) are expressed in standard deviations of initial aptitude distributions.

2 ANALYSIS OF THE GENERAL APTITUDE TEST BATTERY 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 .0 1.1 1.2 1 .3 1 1 1 1 1 1 1 1 1 1 1 1 General Aptitude n= 16 ~ Verbal Aptitude {I} Numer~calAptitude { I} Spatial Aptitude n= 16 - [} Form Perception n= 16 {I Clerical Perception { 4} Motor Coordination n= 14 I} Finger Dexterity n= 10 Manual Dexterity n= 10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 Minimum {I First Quartile Median Maximum - Third Quartile FIGURE 5-4 Practice effects when a different test form is used each time. Distributions of estimated effect sizes (initial testing to retesting) are expressed in standard deviations of initial aptitude distributions.

When the same form was used for retesting, mean scores on the cognitive aptitudes increased by a third of a standard deviation from initial testing to retesting. Gains on the perceptual aptitudes averaged half a standard deviation to three-fourths of a standard deviation, and gains on the psychomotor aptitudes were even larger, approaching a whole standard deviation for the manual dexterity aptitude. As Figure 5-4 shows, however, a large component of these gains can be attributed to memory effects, since the corresponding gains were much smaller when an alternate form of the GATB was used for retesting. For the cognitive aptitudes, gains from practice alone averaged about a fifth of a standard deviation. For the perceptual and psychomotor aptitudes, gains due to practice were appreciably larger, averaging about a third of a standard deviation.

These results suggest that examinees should not be retested using the same form of the GATB, since their retest results are likely to be spuriously high due to memory effects. In addition, the results suggest that practice effects on the GATB are large enough, even when an alternate form of the battery is used for retesting, to conclude that many retested examinees will be advantaged substantially by the experience of having completed the GATB once. We do not know if these findings have changed over the 20 years since these studies were completed. These estimated effects of practice on the GATB can be regarded as lower bounds on gains that might be realized through intensive coaching.
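The comparison of same-form and alternate-form gains can be sketched as simple arithmetic. The sketch below assumes the two components are additive, so that the memory component is the same-form gain minus the alternate-form gain; that additivity is an illustrative assumption, not a result reported in the text. The inputs are the rounded averages quoted above, and the function name is hypothetical.

```python
def memory_component(same_form_gain, alternate_form_gain):
    """Part of the same-form gain not explained by practice alone.

    Assumption: memory and practice effects add, so the alternate-form
    gain stands in for practice and the remainder is memory.
    """
    return same_form_gain - alternate_form_gain


# Cognitive aptitudes: about 1/3 SD (same form) and 1/5 SD (alternate form).
cognitive_same, cognitive_alt = 1 / 3, 1 / 5
cognitive_memory = memory_component(cognitive_same, cognitive_alt)
print(round(cognitive_memory, 2))                   # 0.13
print(round(cognitive_memory / cognitive_same, 2))  # 0.4

# Perceptual aptitudes: 1/2 to 3/4 SD (same form), about 1/3 SD (alternate form).
perceptual_alt = 1 / 3
for perceptual_same in (0.5, 0.75):
    print(round(memory_component(perceptual_same, perceptual_alt), 2))
# prints 0.17, then 0.42
```

Under this assumption, roughly 40 percent of the same-form gain on the cognitive aptitudes would be attributable to memory rather than practice.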
Effects of Coaching on GATB Scores

If the use of the GATB were to be extended to the point that earning high scores on the GATB had a substantial relationship with employability, as would be the case if the VG-GATB Referral System were to be implemented widely, it is likely that commercial coaching schools, such as those presently in operation for the widely used higher education admissions tests, would be developed. The coachability of the GATB would then be a major equity issue, since those who could not afford to attend commercial coaching schools would be at a disadvantage.

Little direct information on the coachability of the GATB subtests is currently available. Rotman (1963) conducted a study with mentally retarded young adult males in which he provided an average of 4.55 days of instruction and practice on the GATB subtests that compose the psychomotor aptitudes K, F, and M (motor coordination, finger dexterity, and manual dexterity). A group of 40 instructed subjects showed average gains in mean scores, expressed in units of estimated effect sizes, of 0.94 for K, 0.43 for F, and 1.23 for M. In comparison, a control group of 40 subjects who were retested with no intervening instruction showed average effect sizes of 0.52 for K, 0.04 for F, and 0.38 for M. Practice effects alone added substantially to the average scores of the control subjects on two of the three psychomotor aptitudes. Coaching added even more to mean scores for all three psychomotor aptitudes. Although the generalizability of these results to nonretarded examinees is questionable, the potential coachability of the GATB subtests that compose the psychomotor aptitudes is clearly indicated.

TEST SECURITY

If the Department of Labor decides to continue and expand the VG-GATB Referral System, USES will have to develop new test security procedures like those that surround the Scholastic Aptitude Test, the American College Testing Program, the Armed Services Vocational Aptitude Battery, and other major testing programs.

So long as the GATB was used primarily for vocational counseling, the issue of security was not pressing. But if it is to be used to make important decisions affecting the job prospects of large numbers of Americans, then it is essential that no applicants have access to the test questions ahead of time. This will require much tighter test administration procedures and strict control of every test booklet. State and local Employment Service personnel will require more extensive training in test administration procedures, and administrators will have to be selected with greater care.

The need for test security will make it imperative that no operational GATB forms be made available to private vocational counselors, labor union apprenticeship programs, or high school guidance counselors. With the development of additional forms on a regular cycle, the use of retired forms for these other purposes may be appropriate, although the demonstrated effects of practice with parallel forms (Figure 5-4) suggest the need for caution.

Most important, the new role envisioned for the VG-GATB will require a sustained test development program to produce more forms with greater frequency. The present GATB is administered from just two alternative forms, C and D, which replaced the 35- or 40-year-old Forms A and B. By contrast, three new forms of the ASVAB are introduced on a four-year cycle. There is much accumulated wisdom on the subject of test security in the Department of Defense Directorate for Accession Policy and in the private companies that administer large test batteries. USES would benefit from reviewing their protocols as a preliminary to drawing up provisions for maintaining the security of the GATB.
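Returning to the Rotman (1963) effect sizes quoted in the coaching discussion, the gain attributable to instruction beyond practice can be approximated by subtracting the control-group effect size from the instructed-group effect size for each aptitude. This is a minimal sketch under the assumption that the two groups were otherwise comparable; the subtraction is illustrative and is not an analysis reported in the study as described here.

```python
# Effect sizes quoted in the text for Rotman (1963), in SD units.
instructed = {"K": 0.94, "F": 0.43, "M": 1.23}  # instruction plus practice
control = {"K": 0.52, "F": 0.04, "M": 0.38}     # retesting only

# Assumption: the difference approximates the contribution of instruction.
coaching_increment = {
    aptitude: round(instructed[aptitude] - control[aptitude], 2)
    for aptitude in instructed
}

for aptitude, gain in coaching_increment.items():
    print(aptitude, gain)
# K 0.42
# F 0.39
# M 0.85
```

On this reading, instruction adds roughly 0.4 standard deviation for K and F and more than 0.8 for M, over and above practice alone.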

CONCLUSIONS

Test Administration Practices

1. The instructions to examinees, if followed, do not allow them to maximize their GATB scores. No guidance is given about guessing on items the examinee does not know. This practice is inconsistent with accepted professional standards.

Speededness

2. Most of the GATB tests are highly speeded. This raises the issue of a potential distortion of the construct purportedly measured and could have effects on predictive validity. To compound the problem, the test answer sheet bubbles are very large and examinees are told to darken them completely, penalizing the conscientious. When used with highly speeded tests such as the GATB, the combined effects of the instructions given to examinees and the answer sheet format add a validity-reducing, psychomotor component to tests of other constructs. The excessive speededness of the GATB makes it very vulnerable to coaching.

Alternate Forms and Test Security

3. The paucity of new forms and insufficient attention to test security speak against any widespread operationalization of the VG-GATB without major changes in procedures. At the present time, there are only two alternate forms of the GATB; there have been just four in its 40 years of existence, although two new forms are under development. In contrast, the major college testing programs develop new forms annually, and the Department of Defense develops three new forms of the Armed Services Vocational Aptitude Battery at about four-year intervals. In addition, test security has not been a primary concern so long as the GATB was used largely as a counseling tool; it appears to be fairly easy for anyone to become a certified GATB user and obtain access to a copy of the test battery.

Item Bias

4. There is minimal evidence on which to decide whether the items in the GATB are biased against minorities. On the basis of internal analysis, there appears to be no idiosyncratic item functioning due to item content, although there could be bias overall. There is a modicum of evidence that test speed affects black examinees differently from other examinees.

Practice Effects and Coaching

5. GATB scores will be significantly improved by practice. A major reason for this is the speededness of the test parts. Experience with other large-scale testing programs indicates that the GATB would be vulnerable to coaching. This is a severe impediment to widespread operationalization of the GATB.

The GATB's speededness, its consequent susceptibility to practice effects and coaching, the small number of alternate forms, and low test security in combination present a substantial obstacle to a broad expansion of the VG-GATB Referral System.

RECOMMENDATIONS

Test Security

If the GATB is to be used in a widespread, nationwide testing program, we recommend the adoption of formal test security procedures. There are several components of test security to be considered in implementing a large testing program.

1. There are currently two alternate forms of the GATB operationally available and two under development. This is far too few for a nationwide testing program. Alternate forms need to be developed with the same care as the initial forms, and on a regular basis. Form-to-form equating will be necessary. This requires the attention to procedures and normative groups described in the preceding chapter.

2. Access to operational test forms must be severely limited to only those Department of Labor and Employment Service personnel involved in the testing program and to those providing technical review. Strict test access procedures must be implemented.

3. Separate but parallel forms of the GATB should be made available for counseling and guidance purposes.

Test Speededness

4. A research and development project should be put in place to reduce the speededness of the GATB. A highly speeded test, one that no one can hope to complete, is eminently coachable. For example, scores can be improved by teaching test takers to fill in all remaining blanks in the last minute of the test period. If this characteristic of the GATB is not altered, the test will not retain its validity when given a widely recognized gatekeeping function.
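The guessing example in recommendation 4 can be made concrete. Under number-right scoring, each blindly filled-in item is correct with probability one over the number of answer options, so the expected gain is the number of otherwise unanswered items divided by the number of options. The sketch below is illustrative only: the function names are hypothetical, and the item counts and the four-option format are assumed values, not figures taken from the GATB.

```python
from math import sqrt


def expected_guessing_gain(n_unanswered, n_options):
    """Expected extra number-right points from blind guessing."""
    return n_unanswered / n_options


def guessing_gain_sd(n_unanswered, n_options):
    """Standard deviation of that gain (binomial, independent guesses)."""
    p = 1 / n_options
    return sqrt(n_unanswered * p * (1 - p))


# Hypothetical case: 20 items left blank, 4 answer options per item.
print(expected_guessing_gain(20, 4))      # 5.0
print(round(guessing_gain_sd(20, 4), 2))  # 1.94
```

Because a highly speeded test leaves many items unreached, even this trivial strategy yields a gain in raw score that has nothing to do with the aptitude being measured.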

PART III VALIDITY GENERALIZATION AND GATB VALIDITIES

Part III is the heart of the committee's assessment of the scientific claims made to justify the Department of Labor's proposed plan for the widespread use of the General Aptitude Test Battery to screen applicants for private- and public-sector jobs. Chapter 6 is an overview of the theory of validity generalization, which is a type of meta-analysis that is proposed for extrapolating the estimated validities of a test for performance on jobs that have been studied to others that have not.

The committee then addresses the research supported by the Department of Labor to apply validity generalization to the GATB. Chapter 7 covers the first two parts of the analysis: reduction of the nine GATB aptitudes to (effectively) two general factors, cognitive and psychomotor ability, and the clustering of all jobs in the U.S. economy into five job families. Chapter 8 presents the department's validity generalization analysis of 515 GATB studies and compares those results with the committee's own analysis of a larger data set that includes 264 more recent studies. Chapter 9 addresses the question of whether the GATB functions in the same way for different demographic groups. It looks at the possibility that correlations of GATB scores with on-the-job performance measures differ by racial or ethnic group or gender, and the possibility that predictions of criterion performance from GATB scores differ by group.

