Problematic Features of the GATB:
Test Administration, Speededness, and
Coachability
In this chapter we examine a number of characteristics of the GATB
and the way it is administered that need immediate attention if the test is
transformed from a counseling tool into the centerpiece of the U.S.
Employment Service (USES) referral system. The difficulties we see
range from easily cured problems with the current test administration
procedures to some fundamental design features that must be revised if
the General Aptitude Test Battery is to take on the ambitious role
envisioned in the VG-GATB Referral System.
TEST ADMINISTRATION PRACTICES
Several features of USES-prescribed test administration procedures
and the use of the National Computer Systems (NCS) answer sheet
appear to be potential threats to the construct validity of the test. If these
features affect members of various racial or ethnic groups to differing
degrees, they could also be sources of test bias. Each of these issues
warrants further investigation.
Instructions to Examinees
The GATB test booklet for each pencil-and-paper subtest instructs
examinees to "work as quickly as you can without making mistakes."
This instruction implies that examinees will be penalized for making
errors when the subtests are scored. In fact, number-right scoring is used
ANALYSIS OF THE GENERAL APTITUDE TEST BATTERY
for all pencil-and-paper GATB subtests, with no penalties for incorrect
guessing or other sources of incorrect answers.
When asked how test administrators responded to questions concerning
the type of scoring used with the GATB, the committee was told by
USES representatives that honest answers were given. Thus, test-wise
examinees who ask about scoring rules have an advantage that is not
shared by examinees who do not raise this question. Use of an instruction
that misleads examinees about the scoring procedures employed is
inconsistent with the Standards for Educational and Psychological Testing
(American Educational Research Association et al., 1985). It unnecessarily
adds a source of error variance to observed test scores that will
reduce measurement reliability. In addition, to the extent that test-wise
examinees are differentially distributed across racial and ethnic groups,
the inconsistency between test instructions and scoring procedures is a
source of test bias that could be readily eliminated.
Our review of the GATB Manual (U.S. Department of Labor, 1970)
and the contents of the GATB subtests has raised additional concerns
about the vulnerability of the test battery to guessing. Consider Subtest 1,
name comparison, a speeded test of clerical perception. Examinees are
given 6 minutes to indicate whether the two names in each of 150 pairs of
names are exactly the same or different. The GATB Manual indicates that
the General Working Population Sample of 4,000 examinees was administered
Form A with an IBM answer sheet. The mean score for name
comparison was just under 47 items correct with a standard deviation of
17, meaning that it is a highly speeded test.
Let us hypothesize with the available statistics for Form A and an IBM
answer sheet. If all scores were normally distributed, then scores at the
95th percentile for name comparison would be 75 items correct. On the
basis of these statistics and assumptions, the optimal strategy for an
examinee completing the name comparison subtest has two phases. The
first would be to randomly mark one of the two bubbles for each of the 150
items as rapidly as possible, without reading the items in order to consider
the stimulus names. Assuming an examinee could fill in 150 bubbles
within 6 minutes, the second phase of the optimal strategy would then be
to begin again with the first item, determine the correct answer, and
change the answer already marked if necessary; the examinee would
continue working through the subtest in this way until time was called.
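The arithmetic behind this hypothetical can be checked with a short sketch. It assumes, as the text does, a normal distribution with the reported mean (rounded here to 47) and standard deviation (17), and it treats random marking of a two-option item as a fair coin flip. The function names are illustrative only and do not come from the GATB Manual.

```python
from statistics import NormalDist

MEAN_SCORE = 47   # reported mean is "just under 47"; rounded here (assumption)
SD_SCORE = 17     # reported standard deviation
N_ITEMS = 150     # items on Subtest 1, name comparison
N_OPTIONS = 2     # "exactly the same" or "different"


def percentile_score(percentile: float) -> float:
    """Score at the given percentile under the normality assumption."""
    return NormalDist(MEAN_SCORE, SD_SCORE).inv_cdf(percentile)


def expected_random_score(n_items: int, n_options: int) -> float:
    """Expected number correct when every item is marked at random."""
    return n_items / n_options


if __name__ == "__main__":
    print(f"95th percentile score: {percentile_score(0.95):.1f}")
    print(f"expected score from random marking: "
          f"{expected_random_score(N_ITEMS, N_OPTIONS):.1f}")
```

Under these assumptions the two quantities come out nearly equal, which is why the first phase of the strategy alone would place an examinee near the top of the normative distribution.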
On one form of the GATB, the actual proportion of items with a correct
answer of "exactly the same" was 0.493 (74 of 150 items). Since for half
the items on the subtest the correct answer was "exactly the same," an
expected score of 75 items correct would result from marking all answers
the same way. This "chance" score is higher than the 98th percentile of
the GATB General Working Population Sample on the name comparison
subtest. Scores could be improved further if the test taker were aware that
short runs (3 to 4 items) on the name comparison subtest were identically
scored (either "exactly the same" or "different"). In any case, this
modified random marking strategy would yield a very high score simply
because the subtest is very long and highly speeded.
TABLE 5-1 Worksheet on Chance Scores and Coaching for Power Subtests
                                      (1)    (2)      (3)        (4)      (5)        (6)       (7)        (8)
                                      Total  Power    Remaining  Item     Chance     Average   Standard   Effect
                                      Items  Itemsᵃ   Items      Options  Score      Score on  Deviation  Size
                                                      (1-2)               (1-2)/(4)  the Test             (5)/(7)
Subtest 2 (computation)               50     18       32         5        6.4        20        4.8        1.33
Subtest 3 (three-dimensional space)   40     17       23         4        5.75       15.4      6          0.96
Subtest 4 (vocabulary)                60     18       42         6        7          21        8.3        0.84
Subtest 6 (arithmetic reasoning)      25     9        16         5        3.2        9.4       2.9        1.10
ᵃNinety percent of majority examinees would complete this many.
Our analysis of individual item functioning demonstrates the potential
effects of guessing in increasing GATB subtest scores. Table 5-1 presents
a worksheet showing the score increase that could be expected for each
of the would-be power tests, i.e., those where speed of work does not
seem to be a defensible part of the construct (Subtests 2, 3, 4, and 6). The
total number of items for each of the subtests can be compared with the
number of items that would be included if the test were actually
constructed as a power test. The power test limits were set such that 90
percent of the majority group would complete the test.
Column 5 shows the typical chance score (added to one's regular score)
that could be earned by randomly marking the remaining items. The gain
due to chance is also shown as an effect size in standard deviation units
(column 8). The effects are large, roughly 1 standard deviation. Thus,
assuming a normal distribution, a person scoring at the 50th percentile
could increase his or her score to the 84th percentile by guessing on the
unfinished portion of the test.
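The worksheet logic can be sketched in a few lines. The totals, power-test lengths, and standard deviations below are taken from Table 5-1; the item-option counts are an assumption, inferred from the worksheet's own formulas (remaining items divided by chance score) where the extracted table is incomplete. The final function restates the 50th-to-84th percentile claim under the normality assumption the text makes.

```python
from statistics import NormalDist

# (total items, power items, item options, standard deviation) per subtest.
# Option counts are inferred from the worksheet formulas (assumption).
WORKSHEET = {
    "Subtest 2 (computation)": (50, 18, 5, 4.8),
    "Subtest 3 (three-dimensional space)": (40, 17, 4, 6.0),
    "Subtest 4 (vocabulary)": (60, 18, 6, 8.3),
    "Subtest 6 (arithmetic reasoning)": (25, 9, 5, 2.9),
}


def chance_gain(total: int, power: int, options: int) -> float:
    """Expected extra items correct from randomly marking the unreached items."""
    return (total - power) / options


def effect_size(total: int, power: int, options: int, sd: float) -> float:
    """Chance gain expressed in standard deviation units."""
    return chance_gain(total, power, options) / sd


def percentile_after_gain(effect: float, start_percentile: float = 0.5) -> float:
    """Percentile reached after a gain of `effect` SDs, assuming normal scores."""
    z_start = NormalDist().inv_cdf(start_percentile)
    return NormalDist().cdf(z_start + effect)


if __name__ == "__main__":
    for name, (total, power, options, sd) in WORKSHEET.items():
        gain = chance_gain(total, power, options)
        es = effect_size(total, power, options, sd)
        print(f"{name}: chance gain {gain:.2f}, effect size {es:.2f}")
    print(f"50th percentile plus 1 SD: {percentile_after_gain(1.0):.0%}")
```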
It is possible that the current test could be improved by using a penalty
for guessing on the straight speed tests and a correction for guessing on the
would-be power tests. As a matter of professional ethics it is essential that
the examinees be informed of whatever scoring procedure is to be used and
told clearly what test-taking strategies it is in their interests to use. The
above analysis documents how vulnerable the current test is to attempts to
beat the system. It is not clear what combination of shortened test and
change in directions would be best to be fair to all examinees and to ensure
the construct validity of each subtest. It would take both conceptual
analysis and empirical work to arrive at the best solution. In considering
alternatives, one would also have to ask how much the test could be
changed without destroying the relevance of existing validity studies.
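The text does not say which correction for guessing it has in mind. As an illustration only, the sketch below assumes the classical formula-scoring rule, in which each wrong answer costs 1/(k - 1) of a point on a k-option item, so that purely random marking earns an expected score of zero. The function name and the example figures are hypothetical.

```python
def formula_score(n_right: float, n_wrong: float, n_options: int) -> float:
    """Rights minus a fraction of wrongs; omitted items are not counted."""
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return n_right - n_wrong / (n_options - 1)


if __name__ == "__main__":
    # Hypothetical examinee: 18 five-option items answered correctly, then the
    # remaining 32 items marked at random (on average 6.4 right, 25.6 wrong).
    expected_right_by_chance = 32 / 5
    expected_wrong_by_chance = 32 - expected_right_by_chance
    corrected = formula_score(18, 0, 5) + formula_score(
        expected_right_by_chance, expected_wrong_by_chance, 5
    )
    print(f"number-right score: {18 + expected_right_by_chance:.1f}")
    print(f"formula score: {corrected:.1f}")
```

Under number-right scoring the random marks add to the score; under this assumed correction they are expected to add nothing, which removes the incentive the chapter describes.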
The National Computer Systems Answer Sheet
When USES first adapted the GATB to a separate, optically scanned
answer sheet (the IBM 805 sheet), the test developers noted that "an
attempt was made to devise answer sheets which would result in
maximum clarity for the examinees and would facilitate the administration
of the tests" (U.S. Department of Labor, 1970:2). Unfortunately, this
objective is far less evident in the design of the currently used NCS
answer sheet. The NCS answer sheet is in the form of a folded 12-inch by
17-inch, two-sided sheet that contains an area for examinee identification,
a section for basic demographic information on the examinee, and a
section for listing the form of the GATB that the examinee is attempting.
In addition, the sheet has separate areas for recording answers to seven of
the eight GATB pencil-and-paper subtests.
Several features of the NCS answer sheet call on the test-wiseness of
examinees. The bubbles on the NCS answer sheet are very large, and
examinees are told to completely darken the bubbles that correspond to their
answers to each question. Following this instruction precisely is
time-consuming, and the instruction is most likely to be interpreted literally
by examinees with the least experience in using optically scannable test
answer sheets.
Since all of the GATB subtests are speeded (as described above and
discussed below), this deficiency will affect the test scores of examinees who
follow the instruction most closely. For some subtests, such as the name
comparison test, the design of the NCS answer sheet might add a significant
psychomotor component to the abilities required to perform well.
THE INFLUENCE OF SPEED OF WORK
Due in large part to the early work and influence of Charles Spearman
(Hart and Spearman, 1914; Spearman, 1927:chap. 14), pioneers in the
field of educational and psychological testing theorized that measures of
speed of work and measures of quality of work were interchangeable
indicators of a common construct. It was not until World War II, close to
the time that the GATB was under development, that researchers such as
Baxter (1941) and Davidson and Carroll (1945) reported the results of
factor analytic studies showing different structures for the same tests
administered under time-constrained and unlimited-time conditions. The
distinctiveness of speed of work and accuracy of work has since been
corroborated by Boag and Neild (1962), Daly and Stahmann (1968),
Flaugher and Pike (1970), Kendall (1964), Mollenkopf (1960), Terranova
(1972), and Wesman (1960), among others.
A test for which speed of work has no influence on an examinee's score
(i.e., a test in which every examinee is given all the time needed to
attempt every test item) is called a pure power test. According to
Gulliksen (1950a:230), a pure speed test is one that is so easy that no
examinee makes an error and so long that no examinee finishes the
test in the time allowed. Commonly used aptitude tests rarely, if ever, fit
the definition of a pure power test or a pure speed test. Many such tests,
including the subtests of the GATB, combine elements of speed of work
and quality of work to a largely unknown degree. However, scores on the
GATB appear to depend on speed of work to a far greater extent than is
true of more modern aptitude batteries.
All of the GATB subtests, whether intended to be tests of speed of
work or power tests, have time limits that are extremely short. It is
therefore likely that most examinees' scores on these subtests are
influenced substantially by the speed at which they work. The subtests
were initially designed to insure that very few, if any, examinees
would complete each test . . . . The speed requirements of the tests
have been increased since their initial design through the use of separate
answer sheets and, more recently, through use of the NCS answer
sheet. The NCS answer sheet imposes sufficient additional burden on
examinees that the 1970 Manual contains a table of positive scoring
adjustments to accommodate its use (see U.S. Department of Labor,
1970:43, Table 7-7).
Figures 5-1 and 5-2 illustrate the speeded nature of the GATB
subtests. Subtest 5, tool matching, shown in Figure 5-1, was selected as
an example of a speeded test, for which the ability to work quickly is
logically a part of the intended construct. In contrast, Subtest 6,
arithmetic reasoning, represents a construct that might be more accurately
measured in an untimed or power test situation. (A power test is defined
operationally as one where 90 percent of examinees have sufficient time to
complete all of the test items.) The data were obtained for 7,418 white
applicants, 6,827 black applicants, and 1,466 Hispanic applicants from
two test centers in 1988. The percentage of test takers attempting each
item and getting each item right is plotted.
[Figure 5-1 plot: horizontal axis, NUMBER OF ITEMS (1 to 49); vertical axis scaled 0 to 100; series: White Attempted, Black Attempted, White Correct, Black Correct.]
FIGURE 5-1 Percentages attempting and number of items correct for whites
and blacks on Subtest 5, tool matching (speeded).
The steeply declining curves, drawn for whites and blacks only,
demonstrate the speeded nature of the tests. For example, in Figure 5-1,
nearly 100 percent of both groups attempted the first 16 questions; then
there is a sharp decrease in the number of examinees reaching each
subsequent question such that by the midpoint of the test only 66 percent
of whites and 53 percent of blacks are still taking the test. In pure speed
tests the content of test questions is relatively easy, making it only a
matter of how fast one works whether an item will be correct or incorrect.
As would be expected in such a test, the percentage-correct curves in
Figure 5-1 closely parallel the percentage-attempted curves, with some
unaccounted-for difficulty at items 9 and 21.
Figure 5-2 also shows a strong overriding influence of speed. To satisfy
the definition of a power test for the white group, the test would end at
item 8. By the midpoint of the test, only 50 percent of whites and 27
percent of blacks are still taking the test. Although items 6 and 8 are
relatively difficult even for examinees who reach them, the percentage
correct on the majority of items follows the pattern delimited by the
speeded nature of the test.
The use of speeded subtests to measure constructs that do not include
speed as an attribute is a potentially serious construct validity issue. First,
the meaning of the constructs measured is likely to be different from the
conventional meaning attached to those constructs. For example, do two
tests that require correct interpretation of arithmetic problems stated in
words and correct application of basic arithmetic operations to the
solution of those problems measure the same aptitude, if one is highly
speeded and the other is not? The research cited above suggests that the
two tests would measure different constructs.
Second, if the speed component of the tests does not assess the abilities
of members of different racial or ethnic groups in the same way, the tests
might be differentially valid for members of these groups. Helmstadter
and Ortmeyer (1953:280) noted:
Although any test may rationally be considered as largely speed or largely power,
the relative importance of these two components is not independent of the group
being measured, and a test which samples depth of ability for one group may be
measuring only a speed component for a second ....
[Figure 5-2 plot: horizontal axis, NUMBER OF ITEMS (1 to 25); vertical axis scaled 0 to 100; series: White Attempted, Black Attempted, White Correct, Black Correct.]
FIGURE 5-2 Percentages attempting and number of items correct for whites
and blacks on Subtest 6, arithmetic reasoning (power test).
As an example of the way this problem might be evidenced for the
GATB, Subtest 7, form matching, requires examinees to pair elements
of two large sets of variously shaped two-dimensional line drawings. A
total of 60 items is to be completed in 6 minutes. Within this time,
examinees must not only find pairs of line drawings that are identical in
size and shape, but must then find and darken the correct answer bubble
on the NCS answer sheet from a set of 10 answer bubbles with labels
consisting of single or double capitalized letters (e.g., GUI). The labeling
of physically corresponding answer bubbles differs from one item to the
next. Since the subtest is tightly timed, identification of the correct
answer bubble from the relatively long list presented on the answer
sheet might become a significant component of the skill assessed. One
could, by inspection, confidently advance the argument that the subtest
measures not only form perception, but also the speed of list processing
and skill in decoding complex answer sheet formats. The latter skill is
dependent on previous experience with tests. Since the extensiveness
of such experience will differ for members of different racial or ethnic
groups, the subtest might be differentially valid as a measure of form
perception for white and black examinees.
Third, the severe time limits of the GATB subtests might produce an
adverse psychological reaction in examinees as they progress through the
examination and might thereby reduce the construct validity of the
subtests. Having attempted a relatively small proportion of items on each
subtest, examinees might well become progressively discouraged and
thus progressively less able to exhibit their best performance. With the
use of separate, optically scanned answer sheets, the most vulnerable
examinees are those least experienced with standardized tests, a group in
which minority examinees will be overrepresented.
These arguments on the racial or socioeconomic correlates of the
effects of test speededness are admittedly speculative. Dubin and
colleagues (1969) found few such correlates in a study with test-experienced
high school students. However, they cited research by Boger (1952),
Eagleson (1937), Katzenmeyer (1962), Klineberg (1928), and Vane and
Kessler (1964) that indicated positive effects of extra practice and test
familiarity in reducing test performance differences between blacks and
whites.
ITEM-BIAS ANALYSES
Statistical procedures, referred to as item-bias indices, are used to
evaluate whether items within a test are differentially more difficult for
members of a particular subgroup taking the test.
Two caveats govern the interpretation of item-bias statistics. First, these
indices are measures of internal bias. Bias is defined as differential validity
whereby individuals of equal ability but from different groups have different
success rates on test items. To establish that individuals have equal ability,
the various item-bias methods rely on total test score (or some
transformation of total score). Thus internal bias statistics are circular to some extent
and cannot detect systematic bias. Systematic or pervasive bias could only
be detected using an external criterion, as is done in predictive validity
studies. What internal bias procedures are able to reveal are individual test
questions that measure differently for one group compared with another.
They provide information akin to factor analysis but at the item level. A large
bias index signals that an item is relatively more difficult for one group.
The second caveat has to do with the meaning of bias as signaled by these
statistics. The analytic procedures were designed to detect irrelevant
difficulty, that is, some aspect of test questions that would prevent examinees
who know the concept from demonstrating that they know it. An example of
irrelevant difficulty would be a high level of reading skill required on a math
test, thus obscuring perhaps the true level of mathematics achievement for
one group compared with another. However, the statistics actually work by
measuring multidimensionality in a test. For example, if physics and
chemistry questions were combined into one science test, one subset of questions
would probably produce many bias flags unless group differences in both
subject areas were uniform. Thus many authors of item-bias procedures
have cautioned that significant results are not automatically an indication of
bias against a particular group. In fact, the statistical indices are often called
measures of differential item functioning to prevent misinterpretation of the
results. If each of the dimensions of the test is defensible and appropriate for
the intended measurement, then the so-called bias indices have merely
revealed differences in group performance.
In order to explore at least partially how the GATB functions and whether
it functions differently for different racial or ethnic groups, the committee
undertook an analysis of actual answer sheets for a sample of Employment
Service applicants. Standard statistical procedures were used to examine
characteristics of GATB items within each subtest. These analyses were
conducted separately for 6,827 black and 7,418 white test takers from a
Michigan test center and for 873 whites and 1,466 Hispanics from a Texas
test center. The proportion answering each item correctly, the proportion
attempting each item, and point-biserial correlations were calculated. The
proportion attempted can index test speed whereas the proportion correct
can index item and test difficulty. Point-biserial correlations show the degree
of relationship between performance on an individual item and total score on
the subtest, reflecting both speed and difficulty.
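The three item statistics named above can be sketched as follows. The chapter does not give computational details, so this is a minimal illustration under stated assumptions: an unattempted item is coded `None` and scored as incorrect, the point-biserial is computed as an ordinary Pearson correlation between the 0/1 item score and the subtest total, and an item with no variance is reported as 0.0 by convention. All names and the demonstration data are hypothetical.

```python
from math import sqrt
from typing import Optional, Sequence

Response = Optional[int]  # 1 = correct, 0 = incorrect, None = not attempted


def proportion_attempted(item: Sequence[Response]) -> float:
    """Share of examinees who marked any answer to the item."""
    return sum(r is not None for r in item) / len(item)


def proportion_correct(item: Sequence[Response]) -> float:
    """Share of all examinees who answered the item correctly."""
    return sum(r == 1 for r in item) / len(item)


def point_biserial(item: Sequence[Response], totals: Sequence[float]) -> float:
    """Pearson correlation between the 0/1 item score and the total score."""
    x = [1.0 if r == 1 else 0.0 for r in item]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(totals) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in totals)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention for an item everyone gets right (assumption)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, totals))
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical responses: rows are examinees, columns are items 1 to 4.
    responses = [
        [1, 1, 1, 1],
        [1, 1, 1, None],
        [1, 1, 0, None],
        [1, 1, None, None],
        [1, 0, None, None],
    ]
    totals = [sum(r == 1 for r in row) for row in responses]
    for index in range(4):
        item = [row[index] for row in responses]
        print(index + 1, proportion_attempted(item),
              proportion_correct(item), round(point_biserial(item, totals), 3))
```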
Proportion Attempted
Inspection of the proportion-attempted statistics shows the same
pattern in all seven of the GATB paper-and-pencil subtests. Figures 5-1 and
5-2 give proportion attempted and proportion correct for tool matching
and arithmetic reasoning, respectively. Virtually 100 percent of
examinees attempt the first item and fewer than 1 percent finish each subtest.
Subtests 1 (name comparison), 5 (tool matching), and 7 (form matching)
are speeded tests; it is therefore not surprising that many examinees are
unable to complete these tests. However, the number of items is far
greater than is usual even for speeded tests. For example, Subtest 1 has
150 items, yet by item 75, only 9 percent of the Texas whites are still
taking the test. Even smaller percentages of the other groups can be found
at later items. Subtest 7 is 60 items long, but only 1 percent of the whites
in the Texas sample make it to item 42.
The effect of unrealistic time limits is also apparent on the tests intended
to be unspeeded. Power tests, for which examinees have sufficient time to
show what they know, are ordinarily defined by a 90 percent completion
rate. Subtest 2 (computation), composed of 50 items, would have to end at
item 17 to be a power test for the sample of Michigan whites. Subtest 3
(three-dimensional space) would have to finish with item 17 instead of 40,
Subtest 4 (vocabulary) with item 16 instead of 60, and Subtest 6 (arithmetic
reasoning) with item 9 rather than 25. Thus, for the given time limits, these
subtests are more than twice as long as is appropriate for power tests.
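The operational definition used here (a power test is one that 90 percent of examinees complete) translates directly into a cutoff on the proportion-attempted curve. The sketch below is an illustration with made-up proportions, not the committee's data; the function name and the default threshold parameter are assumptions.

```python
from typing import Sequence


def power_test_length(proportion_attempted: Sequence[float],
                      completion_rate: float = 0.90) -> int:
    """Number of leading items reached by at least `completion_rate`
    of examinees (0 if even the first item falls short)."""
    length = 0
    for proportion in proportion_attempted:
        if proportion < completion_rate:
            break
        length += 1
    return length


if __name__ == "__main__":
    # Hypothetical proportions attempting items 1 to 12 of a speeded subtest.
    attempted = [1.00, 1.00, 0.99, 0.98, 0.97, 0.95,
                 0.93, 0.91, 0.90, 0.82, 0.70, 0.55]
    print(power_test_length(attempted))
```

Applied to each subtest's observed curve for a chosen reference group, the returned value is the item at which the test would have to end to count as a power test for that group.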
The committee also conducted item-bias analyses using the
Mantel-Haenszel procedure, whereby majority and minority examinees are
matched on total score before examining differential performance on
individual test items. In this case examinees were matched on total scores
on a shortened test, defined as a power test or 90 percent completion test
for the white group. These analyses consistently produced bias flags for a
series of items in the middle of each test, suggesting that blacks were at
a relative disadvantage in the range of the test at which the influence of
time limits was most keenly felt.
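The chapter does not describe how the Mantel-Haenszel statistic was computed. As a hedged illustration, the sketch below uses the standard Mantel-Haenszel common odds ratio: examinees are grouped by matched total score, a 2-by-2 table of group by right/wrong is formed at each score level, and the tables are pooled. The counts are hypothetical, and "reference" and "focal" stand in for the majority and minority groups.

```python
from typing import Iterable, Tuple

# One 2x2 table per matched total-score level:
# (reference right, reference wrong, focal right, focal wrong)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: Iterable[Stratum]) -> float:
    """Common odds ratio across score levels; values above 1 mean the item
    favors the reference group among examinees with equal total scores."""
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # no examinees at this score level
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these tables")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for one item at two matched score levels.
    tables = [(40, 10, 30, 20), (30, 20, 20, 30)]
    print(round(mantel_haenszel_odds_ratio(tables), 3))
```

An item would be flagged when this ratio departs substantially from 1; the threshold used by the committee is not stated in the text.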
Proportion Correct
Data on the proportion correct for each test item are difficult to
interpret because of the pervasive effects of speed. For every group and
test the proportion correct begins at item 1 with nearly 100 percent and
trails off to 0 percent somewhere in the middle. Consistent with direct
inspection of test content, this pattern in the statistics indicates that the
items are arranged by difficulty, with very easy items first, then becoming
increasingly more difficult. Even for the tests for which speed is not part
of the construct, however, there is a very close correspondence between
the proportion attempting an item and the proportion getting it correct. If
examinees get to an item, they nearly always answer it correctly.
Therefore, it is impossible to use these data to determine the actual
difficulty of the items unconfounded by the effects of speed.
Point-Biserial Correlations
Point-biserial correlations have different meanings in speeded and
unspeeded conditions. Subtest 1, name comparison, is primarily a speeded
test. The items are all very similar in nature. The items in this test were
inversely correlated with total test score on the basis of their location in the
test rather than on the basis of their similarity-of-item content. That is,
items at the beginning of the test correlated zero with total test score
because all examinees got them right; these early items thus contribute
nothing to the final ranking of examinees on total score. As an examinee
progresses through this test, the effects of time limits begin to be felt and
there is a gradual crescendo of point-biserial values. Examinees who work
the fastest through the test (presuming they are not answering randomly)
have higher test scores and get items right. There are, therefore, very high
item-total correlations at the limits of good performance. These limits are
somewhere in the middle of the test because it is so speeded. Eventually
the peak in the point-biserial correlations trails off, presumably because
some of the few remaining examinees are choosing speed rather than
accuracy in order to answer more questions.
The pattern of point-biserial correlations in the so-called unspeeded
GATB tests also reflects the influence of speed on total score. Examinees
who get further in the test have higher test scores and are still doing well
on the items they attempt. The highest point-biserial values tend to occur
at the point at which half of the examinees are still attempting the items.
Available data are also pertinent to an entirely different topic. Earlier in
the chapter we hypothesized possible strategies of random response to
improve test scores. How test-wise are GATB test takers about the
advantage of marking uncompleted items when time runs out? Although
the rise and fall of point-biserial correlations suggests that a few
examinees might be marking a few items randomly at the limit of their
performance in order to obtain higher scores, the long strings of near-zero
attempts for the later items suggest that the great majority of examinees
are not following this strategy. These test-taking habits would be likely to
change substantially if examinees were coached in such effective ways to
improve their scores, a likely prospect if the VG-GATB Referral System
becomes important.
Because the influence of speed so dominates all these GATB subtests, it is
not possible to use point-biserial correlations to judge the homogeneity of
items in measuring the intended construct. Hence, internal consistency
estimates of reliability, based on point biserials, would be misleading.
PRACTICE EFFECTS AND COACHING EFFECTS
Because of the speededness of the GATB, the test is very vulnerable to
practice effects and coaching. If the test comes to be widely used for referral,
USES policy makers must be prepared for the growth of coaching schools of
the kind that now provide coaching for the Scholastic Aptitude Test and tests
for admission to professional schools. USES must also expect the
publication of manuals to optimize GATB scores, such as those already available for
the Armed Services Vocational Aptitude Battery (ASVAB).
Effects of Practice on GATB Scores
Practice effects are attributable to several influences. If examinees are
retested with the same form of an examination, their scores might
increase because they remember their initial responses to items and can
therefore use the same answers without considering the items in detail, or
because they become wiser and more efficient test takers as a result of
completing the examination once. If examinees are retested with an
alternate form of an examination, specific memory effects will not be
present, and gains in score are attributable only to the effects of practice.
Data on the effects of practice on the GATB cognitive (G, V, and N),
perceptual (S, P, and Q), and psychomotor (K, F, and M) aptitudes are
reported in Figures 5-3 and 5-4, which are based on studies detailed in
Appendix B, Tables B-1 to B-6. As the figures show, the estimated sizes of
the effects of retesting on the GATB were greatest when the same form of
the test was repeated.) Figure 5-3 summarizes the erects of retesting
ban estimated effect size is the difference between the mean score when examinees were
tested initially and the mean score when examinees were retested, divided by the standard
deviation of scores when examinees were tested initially. Thus an estimated effect size of 0.5
indicates that the mean score when examinees were retested was half a standard deviation unit
higher than when the examinees were tested initially. With an effect size of O.S, an examinee
who outscored 50 percent of the other examinees when tested initially would, when retested,
outscore 69 percent of the other examinees in the initial-testing normal-score distribution.
OCR for page 111
TEST ADMINISTRATION, SPEEDEDNESS, AND COACHABlLl
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3
A-=- I I I I I , , I I I I I I
General Aptitude
Verbal Aptitude
n = 8
Ad}
Numerical Aptitude
n = 8
{L
it Spatial Aptitude
I n=8
Form Perception
Motor Coordination OK
n= 10
{a} Clerical Perception
. ~
Finger Dexterity _ F _
n= 10
13;74
L I I I I I I I I I I I I ~
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3
Manual Dexterity
LEGEND:
| M
Minimum ~ ~Maximum
-
First '. Third
Quartile Median Quartile
FIGURE 5-3 Practice effects when the same test form was used both times.
Distributions of estimated effect sizes (initial testing to retesting) are expressed in
standard deviations of initial aptitude distributions.
OCR for page 112
2 ANALYSIS OF THE GENERAL APTITUDE TEST BATTERY
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 .0 1.1 1.2 1 .3
1 1 1 1 1 1 1 1 1 1 1 1
General Aptitude
n= 16
~ Verbal Aptitude
{I} Numer~calAptitude
{ I} Spatial Aptitude
n= 16
- [} Form Perception
n= 16
{I Clerical Perception
{ 4} Motor Coordination
n= 14
I} Finger Dexterity
n= 10
Manual Dexterity
n= 10
1 1 1 1 1 1 1 1 1 1 1
1 1 1
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3
Minimum
{I
First
Quartile Median
Maximum
-
Third
Quartile
FIGURE 5-4 Practice effects when a different test form is used each time.
Distributions of estimated effect sizes (initial testing to retesting) are expressed in
standard deviations of initial aptitude distributions.
OCR for page 113
TEST ADMINISTRATION, SPEEDEDNESS, AND COACHABIH7Y 1 1 3
with the same form. For the cognitive aptitudes, mean scores increased
by a third of a standard deviation from initial testing to retesting. Gains on
the perceptual aptitudes averaged half to three-fourths of a standard
deviation, and gains on the psychomotor aptitudes
were even larger, approaching a whole standard deviation for the manual
dexterity aptitude. As Figure 5-4 shows, however, a large component of
these gains can be attributed to memory effects, since the corresponding
gains were much smaller when an alternate form of the GATB was used
for retesting. For the cognitive aptitudes, gains from practice alone
averaged about a fifth of a standard deviation. For the perceptual and
psychomotor aptitudes, gains due to practice were appreciably larger,
averaging about a third of a standard deviation.
These results suggest that examinees should not be retested using the
same form of the GATB, since their retest results are likely to be
spuriously high due to memory effects. In addition, the results suggest
that practice effects on the GATB are large enough, even when an
alternate form of the battery is used for retesting, to conclude that many
retested examinees will be advantaged substantially by the experience of
having completed the GATB once. We do not know if these findings have
changed over the 20 years since these studies were completed. These
estimated effects of practice on the GATB can be regarded as lower
bounds on gains that might be realized through intensive coaching.
Effects of Coaching on GATB Scores
If the use of the GATB were to be extended to the point that earning
high scores on the GATB had a substantial relationship with employabil-
ity, as would be the case if the VG-GATB Referral System were to be
implemented widely, it is likely that commercial coaching schools, such
as those presently in operation for the widely used higher education
admissions tests, would be developed. The coachability of the GATB
would then be a major equity issue, since those who could not afford to
attend commercial coaching schools would be at a disadvantage.
Little direct information on the coachability of the GATB subtests is
currently available. Rotman (1963) conducted a study with mentally
retarded young adult males in which he provided an average of 4.55 days
of instruction and practice on the GATB subtests that compose the
psychomotor aptitudes K, F, and M. A group of 40 instructed subjects
showed average gains in mean scores, expressed in units of estimated
effect sizes, of 0.94 for K, 0.43 for F, and 1.23 for M. In comparison, a
control group of 40 subjects who were retested with no intervening
instruction showed average effect sizes of 0.52 for K, 0.04 for F, and 0.38
for M. Practice effects alone added substantially to the average scores of
the control subjects on two of the three psychomotor aptitudes. Coaching
added even more to mean scores for all three psychomotor aptitudes.
Although the generalizability of these results to nonretarded examinees is
questionable, the potential coachability of the GATB subtests that
compose the psychomotor aptitudes is clearly indicated.
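The comparison between the instructed and control groups can be made explicit by subtracting the practice-only effect sizes from the coached ones. The sketch below uses the Rotman (1963) figures quoted above; treating the difference as the increment due to coaching assumes that practice and coaching effects are additive and that the two groups were comparable, which the source does not establish. The function name is illustrative.

```python
# Effect sizes reported by Rotman (1963) for aptitudes K, F, and M.
coached = {"K": 0.94, "F": 0.43, "M": 1.23}        # instruction plus practice
practice_only = {"K": 0.52, "F": 0.04, "M": 0.38}  # retest with no instruction


def coaching_increment(instructed, control):
    """Difference in effect size between instructed and control groups,
    per aptitude (assumes practice and coaching effects simply add)."""
    return {apt: round(instructed[apt] - control[apt], 2) for apt in instructed}


increment = coaching_increment(coached, practice_only)
print(increment)  # {'K': 0.42, 'F': 0.39, 'M': 0.85}
```

On this rough reading, coaching added roughly 0.4 standard deviation beyond practice for K and F, and more than 0.8 for M.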
TEST SECURITY
If the Department of Labor decides to continue and expand the
VG-GATB Referral System, USES will have to develop new test security
procedures like those that surround the Scholastic Aptitude Test, the
American College Testing Program, the Armed Services Vocational
Aptitude Battery, and other major testing programs.
So long as the GATB was used primarily for vocational counseling,
the issue of security was not pressing. But if it is to be used to make
important decisions affecting the job prospects of large numbers of
Americans, then it is essential that no applicants have access to the test
questions ahead of time. This will require much tighter test administra-
tion procedures and strict control of every test booklet. State and local
Employment Service personnel will require more extensive training in
test administration procedures, and administrators will have to be
selected with greater care.
The need for test security will make it imperative that no operational
GATB forms be made available to private vocational counselors, labor
union apprenticeship programs, or high school guidance counselors. With
the development of additional forms on a regular cycle, the use of retired
forms for these other purposes may be appropriate, although the demon-
strated effects of practice with parallel forms (Figure 5-4) suggest the
need for caution.
Most important, the new role envisioned for the VG-GATB will
require a sustained test development program to produce more forms
with greater frequency. The present GATB is administered from just
two alternative forms, C and D, which replaced the 35- or 40-year-old
Forms A and B. By contrast, three new forms of the ASVAB are
introduced on a four-year cycle.
There is much accumulated wisdom on the subject of test security in
the Department of Defense Directorate for Accession Policy and in the
private companies that administer large test batteries. USES would
benefit from reviewing their protocols as a preliminary to drawing up
provisions for maintaining the security of the GATB.
CONCLUSIONS
Test Administration Practices
1. The instructions to examinees, if followed, do not allow them to
maximize their GATB scores. No guidance is given about guessing on
items the examinee does not know. This practice is inconsistent with
accepted professional standards.
Speededness
2. Most of the GATB tests are highly speeded. This raises the issue of
a potential distortion of the construct purportedly measured and could
have effects on predictive validity.
To compound the problem, the test answer sheet bubbles are very large
and examinees are told to darken them completely, penalizing the
conscientious. When used with highly speeded tests such as the GATB,
the combined effects of the instructions given to examinees and the
answer sheet format add a validity-reducing, psychomotor component to
tests of other constructs.
The excessive speededness of the GATB makes it very vulnerable to
coaching.
Alternate Forms and Test Security
3. The paucity of new forms and insufficient attention to test security
speak against any widespread operationalization of the VG-GATB with-
out major changes in procedures.
At the present time, there are only two alternate forms of the GATB;
there have been just four in its 40 years of existence, although two new
forms are under development. In contrast, the major college testing
programs develop new forms annually, and the Department of Defense
develops three new forms of the Armed Services Vocational Aptitude
Battery at about four-year intervals.
In addition, test security has not been a primary concern so long as the
GATB was used largely as a counseling tool; it appears to be fairly easy
for anyone to become a certified GATB user and obtain access to a
copy of the test battery.
Item Bias
4. There is minimal evidence on which to decide whether the items in
the GATB are biased against minorities. On the basis of internal analysis,
there appears to be no idiosyncratic item functioning due to item content,
although there could be bias overall.
There is a modicum of evidence that test speed affects black examinees
differently from other examinees.
Practice Effects and Coaching
5. GATB scores will be significantly improved by practice. A major
reason for this is the speededness of the test parts. Experience with other
large-scale testing programs indicates that the GATB would be vulnerable
to coaching. This is a severe impediment to widespread operationalization
of the GATB.
The GATB's speededness, its consequent susceptibility to practice
effects and coaching, the small number of alternate forms, and low test
security in combination present a substantial obstacle to a broad expan-
sion of the VG-GATB Referral System.
RECOMMENDATIONS
Test Security
If the GATB is to be used in a widespread, nationwide testing program,
we recommend the adoption of formal test security procedures.
There are several components of test security to be considered in
implementing a large testing program.
1. There are currently two alternate forms of the GATB operationally
available and two under development. This is far too few for a nationwide
testing program. Alternate forms need to be developed with the same care
as the initial forms, and on a regular basis. Form-to-form equating will be
necessary. This requires the attention to procedures and normative
groups as described in the preceding chapter.
2. Access to operational test forms must be severely limited to only
those Department of Labor and Employment Service personnel involved
in the testing program and to those providing technical review. Strict test
access procedures must be implemented.
3. Separate but parallel forms of the GATB should be made available
for counseling and guidance purposes.
Test Speededness
4. A research and development project should be put in place to reduce
the speededness of the GATB. A highly speeded test, one that no one can
hope to complete, is eminently coachable. For example, scores can be
improved by teaching test takers to fill in all remaining blanks in the last
minute of the test period. If this characteristic of the GATB is not altered,
the test will not retain its validity when given a widely recognized
gatekeeping function.
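The coaching strategy mentioned above works because number-right scoring carries no penalty for wrong answers, so every blindly marked item has a positive expected payoff. A minimal sketch of that expectation follows; the item count and number of answer choices are hypothetical values chosen for illustration and are not taken from any GATB subtest.

```python
def expected_guessing_gain(unanswered_items, options_per_item):
    """Expected extra number-right score from marking every unanswered
    item at random, assuming no penalty for wrong answers."""
    if options_per_item < 1:
        raise ValueError("options_per_item must be at least 1")
    # Each random mark is correct with probability 1 / options_per_item.
    return unanswered_items / options_per_item


# Hypothetical case: 20 unreached items with 5 answer choices each.
print(expected_guessing_gain(20, 5))  # 4.0
```

The more items a speeded test leaves unreached, the larger this expected gain, which is one way to see why reducing speededness would also reduce coachability.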
PART III
VALIDITY GENERALIZATION AND
GATB VALIDITIES
Part III is the heart of the committee's assessment of the scientific
claims made to justify the Department of Labor's proposed plan for the
widespread use of the General Aptitude Test Battery to screen applicants
for private- and public-sector jobs. Chapter 6 is an overview of the theory
of validity generalization, which is a type of meta-analysis that is
proposed for extrapolating the estimated validities of a test for perfor-
mance on jobs that have been studied to others that have not.
The committee then addresses the research supported by the Depart-
ment of Labor to apply validity generalization to the GATB. Chapter 7
covers the first two parts of the analysis: reduction of the nine GATB
aptitudes to (effectively) two general factors, cognitive and psychomotor
ability, and the clustering of all jobs in the U.S. economy into five job
families. Chapter 8 presents the department's validity generalization
analysis of 515 GATB studies and compares those results with the
committee's own analysis of a larger data set that includes 264 more
recent studies.
Chapter 9 addresses the question of whether the GATB functions in the
same way for different demographic groups. It looks at the possibility that
correlations of GATB scores with on-the-job performance measures differ
by racial or ethnic group or gender, and the possibility that predictions of
criterion performance from GATB scores differ by group.