| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
A
A Synthesis of Research on Some
Psychometric Properties of the GATB
Richard M. Jaeger, Robert L. Linn, and Anita S. Tesh
This paper provides a detailed evaluation of three topics that bear on
the overall quality of the General Aptitude Test Battery (GATB): its
construct validity as supported by convergent validity evidence; its
reliability; and the interchangeability of its forms. The first section
presents the results of an exhaustive literature search for evidence that
the GATB aptitude composites measure the same characteristics as other
similarly named aptitude tests. The second section brings together the
research on the stability of GATB aptitude scores over time and among
forms of the test battery. And finally, the paper addresses the comparability
of the GATB subtests from one form to another.
CONSTRUCT VALIDITY ISSUES
According to the Standards for Educational and Psychological Testing,
a statement of standards for the development and use of tests that is
adhered to by the major professional societies in the testing field, validity
is of paramount concern in assessing the use of tests (American Educational
Research Association et al., 1985:9):
Validity is the most important consideration in test evaluation. The concept refers
to the appropriateness, meaningfulness, and usefulness of the specific inferences
made from test scores. Test validation is the process of accumulating evidence to
support such inferences.... Although evidence may be accumulated in many
ways, validity always refers to the degree to which that evidence supports the
inferences that are made from the scores. The inferences regarding specific uses
of a test are validated, not the test itself.
We are concerned here with a particular category of validity evidence
involving construct-related evidence. According to the Standards (p. 9),
"evidence classed in the construct-related category focuses primarily on
the test score as a measure of the psychological characteristic of inter-
est." The Standards also provide examples of the types of evidence that
can be used to support construct validity claims (p. 10):
Evidence for the construct interpretation of a test may be obtained from a variety
of sources. Intercorrelations among items may be used to support the assertion
that a test measures primarily a single construct. Substantial relationships of a test
to other measures that are purportedly of the same construct and the weaknesses
of relationships to measures that are purportedly of different constructs support
both the identification of constructs and distinctions among them. Relationships
among different methods of measurement and among various non-test variables
similarly sharpen and elaborate the meaning and interpretation of constructs.
In this section we examine evidence that bears on claims that the
subtests of the GATB measure the aptitudes with which they are
identified in the GATB Manual (U.S. Department of Labor, 1970), and
nothing more. We provide a summary of correlations between subtests or
aptitudes of the GATB and correspondingly labeled subtests or aptitudes
of other test batteries.
Convergent Validity Evidence
The psychometric literature contains a substantial number of studies of
the strength of relationships between subtests of the GATB and corresponding
subtests of other batteries. As noted above, evidence of strong
positive relationships between purported measures of the same construct
is supportive of construct validity claims for all related measurement
instruments. Thus the claim that the subtests of the GATB measure the
aptitudes attributed to them would be enhanced by data of this sort and
weakened if small to moderate correlations between corresponding
subtests were to be found.
Chapter 14 of Section III of the GATB Manual (U.S. Department of
Labor, 1970) is entitled "Correlations with Other Tests." The chapter
contains correlation matrices resulting from studies of the GATB and a
variety of other aptitude tests and vocational interest measures. Results
from 64 studies are reported, including several involving the initial edition
of the GATB (B-1001). In this summary, we restrict attention to studies
involving the current version of the GATB (B-1002, Forms A-D) and
appropriate aptitude tests. Since the publication of the GATB Manual,
correlations between various GATB aptitudes or subtests and corresponding
subtests of other test batteries have been provided in studies by Briscoe et al.
(1981); Cassel and Reier (1971); Cooley (1965); Dong et al. (1986); Hakstian
and Bennett (1978); Howe (1975); Kettner (1976); Kish (1970); Knapp et al.
(1977); Moore and Davies (1984); O'Malley and Bachman (1976); and
Sakalosky (1970). The sizes and compositions of examinee samples used in
these studies are diverse, as are the aptitude batteries with which GATB
subtests and aptitudes were correlated. They range from 40 ninth-grade
students who completed both the GATB and the Differential Aptitude Test
Battery (DAT), to 1,355 Australian army enlisters who completed the
GATB and the Australian Army General Classification Test. However, in
~ of 13 studies (many of which examined several independent samples of
examinees), the samples consisted of high school students.
Three rules were followed in selecting appropriate studies of the
convergent validity of the GATB aptitudes with other, corresponding
aptitude measures. First, only correlations between GATB aptitudes or
subtests and corresponding components of other aptitude batteries were
included. Thus, correlations with self-reports of aptitude or with achieve-
ment measures or performance scores were purposefully omitted. Sec-
ond, only correlations with aptitude battery components having titles
similar to the GATB measure of interest were retained; for example, a
correlation between GATB Aptitude G and the abstract reasoning score
on the DAT was included; a correlation between GATB Aptitude G and
the numerical reasoning score on the DAT was excluded. Third, in studies
that reported correlations between all possible pairs of measures com-
posed of a GATB aptitude and an aptitude from another battery, only the
largest correlation between any GATB aptitude and an aptitude from the
other battery was retained. When rules two and three were applied
simultaneously, a correlation was included only if it reflected a relation-
ship between a GATB aptitude and the appropriate aptitude from another
battery and only if it exceeded the correlations between that GATB
aptitude and any other aptitude assessed by the other battery.
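The combined effect of rules two and three can be sketched in a few lines of Python. This is only an illustration: the correlation values, the scale names, and the `counterpart` mapping are invented for the example, and rule one is assumed to have been applied already (the input holds aptitude-battery scales only).

```python
# Hypothetical correlations between two GATB aptitudes and the scales of one
# other aptitude battery. The numbers are invented for illustration.
correlations = {
    "G": {"abstract reasoning": 0.71, "numerical reasoning": 0.74, "verbal reasoning": 0.69},
    "V": {"abstract reasoning": 0.52, "numerical reasoning": 0.55, "verbal reasoning": 0.78},
}

# Rule two: the similarly titled scale for each GATB aptitude.
# This mapping is an assumption made for the example.
counterpart = {"G": "abstract reasoning", "V": "verbal reasoning"}


def select_coefficients(correlations, counterpart):
    """Keep a coefficient only if it pairs an aptitude with its counterpart
    (rule two) and exceeds every other correlation in that row (rule three)."""
    retained = {}
    for aptitude, row in correlations.items():
        target = counterpart.get(aptitude)
        if target is None or target not in row:
            continue  # no similarly titled scale in the other battery
        r = row[target]
        if all(r > other for scale, other in row.items() if scale != target):
            retained[aptitude] = r
    return retained


print(select_coefficients(correlations, counterpart))
```

In this invented case only the V coefficient survives, because the hypothetical G row correlates more highly with numerical reasoning than with its designated counterpart.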
Data on the convergent validity of the GATB aptitudes were tabulated
for each aptitude (Table A-1). Distributions of convergent validity coefficients
for the three cognitive aptitudes (G, V, and N) and the three
perceptual aptitudes (S, P, and Q) are displayed in pictorial form in Figure
4-2, and in tabular form in Table A-2.
TABLE A-1 Stem-and-Leaf Displays of Convergent Validity
Coefficients for the GATB Aptitudes

a. G, General Intelligence
Stem  Leaf
8     001112244579
7     778888889999
7     0112334555
6     7777899
6     0135
5     66799
4     5

b. V, Verbal Ability
Stem  Leaf
8     5
8     0001133
7     5566788889999
7     00000011222223334444
6     55678889999
6     0024
5     79
2     2

c. N, Numerical Ability
Stem  Leaf
8     5
8     01
7     5556666678
7     0111222224
6     55667778999
6     00222233
5     6777888
5     134
4     3

d. S, Spatial Aptitude
Stem  Leaf
7     01123
6     689
6     0223
5     7899
5     1
4     5
3     0

e. P, Form Perception
Stem  Leaf
6     5
5     8
5     3
4     59
4     44
3     8

f. Q, Clerical Perception
Stem  Leaf
7     6
6     59
6     [illegible]
5     558
5     00
4     77
4     4
3     6
3     23
2     4

g. K, Motor Coordination
Stem  Leaf
5     8

h. F, Finger Dexterity
Stem  Leaf
4     1
3     7

i. M, Manual Dexterity
Stem  Leaf
5     0

NOTE: The stem unit is .1. Therefore, the stem entry 8 followed by a leaf entry of 0
indicates a correlation coefficient of .80; each digit in a sequence of "leaves" indicates a
different correlation coefficient.

TABLE A-2 Summary Statistics for Distributions of Convergent
Validity Coefficients for the Cognitive GATB Aptitudes (G, V, and N)
and the Perceptual GATB Aptitudes (S, P, and Q)

                               Validity Coefficients
          Number               First             Third
Aptitude  of Studies  Minimum  Quartile  Median  Quartile  Maximum
G         51          .45      .67       .75     .79       .89
V         59          .22      .69       .72     .78       .85
N         53          .43      .61       .68     .75       .85
S         19          .30      .58       .62     .70       .73
P         8           .38      .44       .47     .57       .65
Q         16          .24      .38       .50     .60       .76
K         1           .58      .58       .58     .58       .58
F         2           .37      .37       .39     .41       .41
M         1           .50      .50       .50     .50       .50

Convergent Validity of GATB Aptitude G (General Intelligence)
The 51 convergent validity coefficients that were reported for the
GATB-G aptitude ranged from .45 to .89, with a median value of .75.
Since G is a broadly defined construct that is assessed through the
arithmetic, vocabulary, and spatial subtests of the GATB, a median
convergent validity coefficient of .75 does provide adequate evidence of
the convergent validity of the GATB intelligence aptitude. Data on the
convergent validity of GATB-G are presented in Table A-1a.
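As an illustration of how the summary statistics in Table A-2 relate to the stem-and-leaf displays in Table A-1, the following Python sketch expands the display for aptitude P (form perception) into its eight coefficients and computes the quartiles. The interpolation rule used (`method="exclusive"` in the standard library's `statistics.quantiles`) is an assumption; the paper does not say how its quartiles were computed.

```python
from statistics import quantiles

# Stem-and-leaf display for aptitude P (Table A-1e); the stem unit is .1.
display_p = [(6, "5"), (5, "8"), (5, "3"), (4, "59"), (4, "44"), (3, "8")]


def expand(display):
    """Turn (stem, leaves) rows into a sorted list of coefficients."""
    values = [(10 * stem + int(leaf)) / 100
              for stem, leaves in display
              for leaf in leaves]
    return sorted(values)


coefficients = expand(display_p)

# Quartiles by linear interpolation at positions (n + 1) * p (assumed rule).
q1, median, q3 = quantiles(coefficients, n=4, method="exclusive")

print(len(coefficients), min(coefficients), round(q1, 2),
      round(median, 2), round(q3, 2), max(coefficients))
```

Under that assumed rule the sketch gives 8 coefficients with a minimum of .38, quartiles of .44, .47, and .57, and a maximum of .65, which agrees with the P row of Table A-2.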
Convergent Validity of GATB Aptitude V (Verbal Ability)
The 59 convergent validity coefficients that were reported for the
GATB-V aptitude ranged from .22 to .85, with a median value of .72.
Considering the variety of measures with which GATB-V was correlated,
and the less-than-perfect reliabilities of the GATB subtests that contribute
to V and the tests with which it was correlated, a median validity
coefficient of .72 provides adequate evidence of convergent validity for
the GATB verbal ability measure. Although the minimum observed
validity coefficient of .22 is discomforting, it is not at all representative of
validity coefficients in the lowest fourth of the distribution for V; the
next-lowest observed coefficient was .57. Data on the convergent validity
of GATB-V are presented in Table A-1b.
Convergent Validity of GATB Aptitude N (Numerical Ability)
The 53 convergent validity coefficients that were found for the GATB-
N aptitude ranged from .43 to .85, with a median value of .68. A median
convergent validity coefficient of .68 is somewhat smaller than would be
desired for a measure of numerical ability. However, a claim to conver-
gent validation for GATB-N is reasonably well supported by the data at
hand, since three-fourths of the coefficients exceed .61 and a fourth are
larger than .75. It should also be noted that, in several of the studies
reviewed, correlations were provided for GATB subtests rather than
GATB aptitudes. Such correlations will be attenuated by smaller reliabil-
ities than would be found for the GATB aptitudes. Data on the convergent
validity of GATB-N are presented in Table A-1c.
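The attenuating effect of unreliability mentioned above can be made concrete with the classical correction for attenuation, which divides an observed correlation by the square root of the product of the two measures' reliabilities. The paper reports uncorrected coefficients and does not apply this correction; the reliability values below are hypothetical and serve only to show the direction and rough size of the effect.

```python
from math import sqrt


def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / sqrt(rel_x * rel_y)


observed = 0.68  # median convergent validity coefficient reported for GATB-N

# Hypothetical reliabilities: aptitude-level scores on both measures.
print(round(disattenuate(observed, 0.85, 0.85), 2))

# Hypothetical case in which a less reliable subtest stands in for the aptitude.
print(round(disattenuate(observed, 0.75, 0.85), 2))
```

The lower the reliability of either measure, the larger the gap between the observed coefficient and the corrected one, which is why subtest-level correlations understate convergence more than aptitude-level correlations do.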
Convergent Validity of GATB Aptitude S (Spatial Aptitude)
The 19 convergent validity coefficients that were found for the GATB-S
aptitude ranged from .30 to .73, with a median value of .62. The GATB
spatial ability aptitude (S) is somewhat less highly correlated with its
counterpart measures in other test batteries than is the verbal ability
aptitude. A median concurrent validity coefficient of .62 with a range from
.30 to .73 and a fourth of the coefficients below .58 suggests that
somewhat different spatial perception constructs are measured in various
batteries, or that the reliabilities of spatial ability measures are somewhat
lower than those of corresponding verbal ability measures. Although
these data do not cast serious doubt on the construct validity of the spatial
ability aptitude, they are not as supportive as the evidence amassed for
the verbal ability measure. Data on the convergent validity of GATB-S
are presented in Table A-1d.
Convergent Validity of GATB Aptitude P (Form Perception)
The eight convergent validity coefficients that were found for the GATB-P
aptitude ranged from .38 to .65, with a median value of .47. The convergent
validity of the GATB form perception aptitude (P) is thus not well supported
by the evidence compiled in this review. As measured by the GATB, form
perception depends on examinees' abilities to discriminate among detailed
patterns shown on common tools and to match the outlines of two-
dimensional geometric forms represented by line drawings. Both tests are
somewhat speeded, perhaps adding an ability component that is not as
prevalent in the other test batteries used to generate the validity coefficients.
Whatever the basis for these results, it would seem prudent to undertake a
comparative content analysis of the tool matching and form matching
subtests of the GATB and the supposedly corresponding measures in the
test batteries used to generate these convergent validity coefficients. Data
on the convergent validity of GATB-P are presented in Table A-1e.
Convergent Validity of GATB Aptitude Q (Clerical Perception)
The 16 convergent validity coefficients that were reported for the
GATB-Q aptitude ranged from .24 to .76, with a median value of .50. The
literature reviewed provided fewer convergent validity coefficients for the
GATB clerical perception aptitude (Q) than for a number of other GATB
aptitudes. Although many of the validity coefficients found for clerical
perception were larger than those found for the form perception aptitude
(P), evidence supporting the convergent validity of clerical perception
was not as compelling as that found for the three cognitive aptitudes (G,
V, and N). Even when somewhat smaller reliabilities are considered, a
median validity coefficient of .50 (uncorrected for unreliability) suggests
that the GATB name comparison subtest might measure a somewhat
different construct than do the subtests that contribute to clerical percep-
tion measures in other test batteries. Indeed, the description of the
clerical perception aptitude provided in the GATB Manual suggests a
somewhat broader aptitude (including arithmetic perception) than does
the description of the name comparison subtest on which it is based. Data
on the convergent validity of GATB-Q are presented in Table A-1f.
Convergent Validity of the Psychomotor Aptitudes K (Motor
Coordination), F (Finger Dexterity), and M (Manual Dexterity)
Unfortunately, review of the literature since publication of the GATB
Manual produced very few studies of the convergent validity of subtests
underlying the psychomotor aptitudes (K, F, and M) of the GATB. There
was one correlation for K, motor coordination (.58), two for F, finger
dexterity (.37, .41), and one for M, manual dexterity (.50). And the
correlations reported for these aptitudes in the Manual cannot be regarded
as convergent validity coefficients. Data on the convergent validity
of aptitudes K, F, and M are presented in Table A-1g, h, and i,
respectively.
Summary of Convergent Validity Results
Distributions of convergent validity coefficients for the cognitive aptitudes
of the GATB (G, V, and N) provide moderately strong support for
claims that these aptitudes are appropriately named and measured.
Corresponding results for the perceptual aptitudes of the GATB (S, P, and
Q) are less convincing. Data for the psychomotor aptitudes are so meager
that judgment on their convergent validity must be withheld.
Although the median convergent validity coefficient observed for the
spatial aptitude (S) was respectably large, the corresponding median
values for the form perception (P) and clerical perception (Q) aptitudes
were smaller than would be desired. The three-dimensional space
subtest is said to measure both intelligence and spatial aptitude and
might therefore require greater reasoning ability and inferential skill
than is typical of measures of spatial aptitude found in other batteries.
As has already been noted, the name comparison subtest of the GATB
appears to tap only a subset of the skills typically associated with clerical
perception.
Distributions of convergent validity coefficients for the cognitive and
perceptual GATB aptitudes are summarized in Table A-2 and for ease of
visual comparison, in Figure 4-2.
RELIABILITY OF THE GATB APTITUDE SCORES
Aptitude tests such as the GATB are intended to measure stable
characteristics of individuals, rather than transient or ephemeral qualities.
Such tests must measure these characteristics consistently, if they are to
be useful. Reliability is the term used to describe the degree to which a
test measures consistently. The Standards define reliability as follows
(American Educational Research Association et al., 1985:19):
Reliability refers to the degree to which test scores are free from errors of
measurement. A test taker may perform differently on one occasion than on
another for reasons that may or may not be related to the purpose of measure-
ment. A person may try harder, be more fatigued or anxious, have greater
familiarity with the content of questions on one test form than another, or simply
guess correctly on more questions on one occasion than another. For these and
other reasons, a person's score will not be perfectly consistent from one occasion
to the next.... Differences between scores from one form to another or from
one occasion to another may be attributable to what is commonly called errors of
measurement.... Measurement errors reduce the reliability (and therefore the
generalizability) of the score obtained for a person from a single measurement.
Fundamental to the proper evaluation of a test are the identification of major
sources of measurement error, the size of the errors resulting from these sources,
the indication of the degree of reliability to be expected between pairs of scores
under particular circumstances, and the generalizability of scores across items,
forms, raters, administrations, and other measurement facets.
The Standards further state (p. 19) that test developers are primarily
responsible for assessing a test's reliability and for identifying major
sources of measurement error. When a test is composed of subtests, the
reliability of each must be investigated and reported in adequate detail, so
that test users can determine whether the test and subtests are sufficiently
reliable to be used for the purposes intended.
The reliability coefficient of a test is defined technically as the square of
the correlation between a hypothetical true score and the score actually
observed. The reliability coefficient represents the degree to which
differences among test takers' scores represent actual differences in their
abilities, rather than errors of measurement. If a test had a reliability
coefficient of 1.0, all of the differences among test takers' scores (i.e.,
variation in their scores) would represent differences in their abilities, and
none would represent errors of measurement. We would describe such a
test as being "perfectly reliable." If a test had a reliability coefficient of
zero, differences among test takers' scores would be due solely to errors
of measurement, and the test would be termed "totally unreliable." In
practice, tests of human abilities and aptitudes are neither perfectly
reliable nor totally unreliable; some variation in test scores reflects true
differences among test takers' abilities and some reflects errors of
measurement.
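The definition above can be checked with a small simulation. The sketch below is illustrative only: it assumes the classical model in which an observed score is a true score plus independent random error, and the score scale, standard deviations, and sample size are invented. Under that model the share of observed-score variance due to true scores and the squared true-observed correlation should both approximate the same reliability coefficient.

```python
import random
from math import sqrt


def variance(xs):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n, true_sd, error_sd, seed=0):
    """Draw hypothetical true scores and add independent measurement error."""
    rng = random.Random(seed)
    true = [rng.gauss(100, true_sd) for _ in range(n)]
    observed = [t + rng.gauss(0, error_sd) for t in true]
    return true, observed


true, observed = simulate(20000, true_sd=16, error_sd=8)

theoretical = 16 ** 2 / (16 ** 2 + 8 ** 2)           # true variance / observed variance
variance_ratio = variance(true) / variance(observed)  # same ratio, estimated
squared_corr = pearson(true, observed) ** 2           # squared true-observed correlation

print(round(theoretical, 3), round(variance_ratio, 3), round(squared_corr, 3))
```

With these invented settings the theoretical reliability is .80, and both sample estimates should fall close to that value. Setting `error_sd` to zero corresponds to the "perfectly reliable" case described above.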
Because test reliability generally increases as test length increases, and
subtests are shorter than the test they compose, reliability coefficients for
subtests are typically smaller than corresponding coefficients for the test
as a whole. When the adequacy of a test's reliability is judged, it is
therefore important to consider the reliability of every score that is
separately reported and interpreted. In the case of the GATB, reliabilities
of aptitude scores are of central interest.
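One standard way to quantify the relation between test length and reliability is the Spearman-Brown formula, which the paper does not name but which is a common textbook result. The sketch below assumes that added material is parallel to the existing material; the reliability values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor,
    assuming the added (or removed) parts are parallel to the rest."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# A hypothetical subtest with reliability .70, combined with two parallel
# subtests into a composite three times as long.
print(round(spearman_brown(0.70, 3), 3))

# The same hypothetical subtest cut to half its length.
print(round(spearman_brown(0.70, 0.5), 3))
```

Lengthening raises the predicted coefficient and shortening lowers it, which is the pattern the paragraph above describes for subtests relative to the full test.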
The psychometric literature includes a variety of methods for estimat-
ing test reliability. Popular methods differ in their sensitivity to various
sources of measurement error, in their applicability to different types of
tests, and in their usefulness for particular purposes. When tests are used
to assess aptitudes or other traits that are expected to be stable across
weeks, months, or years, the most appropriate reliability estimation
procedures will reflect the stability of measurements across time. Such
reliability estimates are termed stability coefficients or indices of temporal
stability. Stability coefficients are based on the consistency of examinees'
performances during two test administrations and might be spuriously
inflated because of examinees' memories of their initial responses. Risks
of distortion due to memory effects can be avoided if reliability estimates
are based on the administration of two forms of the same test, separated
by the amount of time the aptitudes measured are assumed to remain
stable. Such reliability estimates are termed "equivalent-forms reliability
coefficients." A number of studies of the temporal stability and equivalent-forms
reliability of the GATB aptitude scores are summarized below.
Temporal Stability of the GATB Aptitude Scores
Studies of the temporal stability of GATB aptitude scores have exam-
ined a variety of time periods between test administrations, ranging from
one day to four years. These studies of the consistency of GATB aptitude
scores have used samples of examinees that vary widely in age and level
of education, including employed adults, junior high school students, high
school students, and college students. Estimates of the temporal stability
of GATB aptitude scores have also been computed for examinees of
different races. Table A-3 contains a summary of indices of the temporal
stability of GATB aptitude scores reported by Senior (1952), Showler and
Droege (1969), and the U.S. Department of Labor (1970, 1986). The
estimates presented below were based on sequential administration of
either the same or different GATB test forms.
Temporal Stability by Age
As shown in Table A-3, coefficients of stability for the GATB cognitive
aptitudes G, V, and N consistently exceed .80 when samples are
composed of adults and time intervals between successive test administrations
are no more than three years. For corresponding examinee
samples and time periods, coefficients of stability for the GATB perceptual
aptitudes S, P, and Q are at least .70. Aptitude K showed similar
stability in all but one study. The other GATB psychomotor aptitudes, F
and M, which are measured by subtests that require manipulation of
objects, were found to have somewhat lower coefficients of stability than
did aptitudes measured by pencil-and-paper subtests. Coefficients of
stability for GATB aptitudes F and M were reported to be at least .57
when estimated for samples of adult examinees over time periods up to
three years. The temporal stabilities of the GATB psychomotor aptitudes
are not as well estimated as are those of the cognitive and spatial
aptitudes, since fewer studies have included the GATB subtests that
require manipulation of objects.
All of the GATB aptitude scores are less stable for samples of ninth-
and tenth-graders than for samples of adults. A portion of this instability
might be attributed to the maturation of these younger examinees during
the period between successive administrations of the GATB, and might
therefore reflect valid changes in the relative ordering of the examinees on
the aptitudes assessed, rather than instability due to measurement error.
Temporal Stability by Time Interval
The range of stability coefficients for GATB aptitude scores across
test-retest intervals from one day to four years varied from .51 to .94. For
specific time intervals, stability coefficients also varied greatly across
aptitudes. Stability coefficients tended to be largest for the cognitive
aptitudes (G, V, and N) and smallest for the psychomotor aptitudes (K, F,
and M).
For GATB cognitive aptitudes G, V, and N, an increase of more than
seven weeks in the time interval between successive test administrations
is necessary to reduce the average stability coefficient by .01. For the
GATB perceptual aptitudes S, P, and Q, estimates are, on average,
initially smaller than for the GATB cognitive aptitudes (G, V, and N) and
also vary more widely at specific test-retest time intervals. The relationship
between average stability coefficient and test-retest time interval is
modeled less well (by ordinary least-squares linear regression) for these
aptitudes than for the GATB cognitive aptitudes.

[TABLE A-3 Indices of the temporal stability of GATB aptitude scores: table body not recoverable from the source text]

The GATB psychomotor aptitudes, K, F, and M, have still lower
estimated stability coefficient intercepts of .84, .70, and .76, respectively.
Aptitude K, which is measured with a pencil-and-paper subtest, has an
initial stability coefficient that is in the same range as those of the GATB
perceptual aptitudes. The two psychomotor aptitudes that are measured
with subtests requiring the manipulation of objects, F and M, have
somewhat lower average initial stability coefficients. It also appears to be
the case that the stabilities of these two psychomotor aptitudes degrade at
a faster rate as a function of time interval than is true of GATB aptitudes
that are measured with pencil-and-paper subtests.
Data on coefficients of stability of the GATB aptitudes were summa-
rized by computing simple linear regressions of stability coefficients as a
function of the time interval between the initial administration and the
second administration of the GATB. A scatter diagram that illustrates this
relationship for the GATB-G aptitude is shown in Figure 4-1. From the
data in the figure we can conclude that stability coefficients for the
GATB-G are large (approximating .91, on average) for very small time
intervals and degrade slowly as the time interval is increased. Similar
patterns were observed for the other GATB aptitudes, although initial
values of the stability coefficient were smaller for the spatial aptitudes
than for the cognitive aptitudes, and smaller still for the psychomotor
aptitudes. These data are summarized in Table A-4, which contains initial
values of stability coefficients, the degradation of stability coefficients
(amount by which they decrease from initial values) for a 100-day interval
between the initial administration of the GATB and the second adminis-
tration, and the proportion of variance in GATB stability coefficients that
is explained by a linear regression on the time interval between the initial
and the second GATB administration. The relationship is well explained
by a linear relationship for the cognitive and spatial aptitudes, but not for
the psychomotor aptitudes.
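The linear summaries reported in Table A-4 can be turned into predictions with a short Python sketch. The initial values and 100-day degradation figures below are taken from Table A-4; the function names are invented for this illustration. The extrapolations are only as good as the linear fit, which the text notes is weak for the psychomotor aptitudes.

```python
# (initial value, degradation per 100 days) from Table A-4
TABLE_A4 = {
    "G": (0.9089, 0.0099), "V": (0.8936, 0.0094), "N": (0.8943, 0.0099),
    "S": (0.8323, 0.0082), "P": (0.8074, 0.0153), "Q": (0.8108, 0.0106),
    "K": (0.8390, 0.0088), "F": (0.6994, 0.0069), "M": (0.7572, 0.0068),
}


def predicted_stability(aptitude, days):
    """Stability coefficient predicted by the fitted line at a given interval."""
    initial, per_100_days = TABLE_A4[aptitude]
    return initial - per_100_days * days / 100


def days_to_lose(aptitude, drop):
    """Interval, in days, over which the fitted line falls by `drop`."""
    _, per_100_days = TABLE_A4[aptitude]
    return 100 * drop / per_100_days


print(round(predicted_stability("G", 365), 3))  # G after a one-year interval
print(round(days_to_lose("G", 0.01)))           # days for G to fall by .01
print(round(predicted_stability("P", 365), 3))  # P degrades fastest in Table A-4
```

For aptitude G the fitted line falls by .01 in roughly 101 days, which is consistent with the statement that an increase of more than seven weeks is needed to lower the average coefficient for the cognitive aptitudes by .01.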
TABLE A-4 Initial Values and Degradation of GATB Stability
Coefficients, and Proportion of Variance Explained by Linear
Relationship, by GATB Aptitude

                           Degradation
                           of Stability
          Initial Value    Coefficient     Proportion
GATB      of Stability     (100-day        of Variance
Aptitude  Coefficient      interval)       Explained
G         .9089            .0099           .58
V         .8936            .0094           .64
N         .8943            .0099           .56
S         .8323            .0082           .42
P         .8074            .0153           .49
Q         .8108            .0106           .44
K         .8390            .0088           .11
F         .6994            .0069           .19
M         .7572            .0068           .24

Equivalent Forms Reliability
Stability coefficients that are based on two administrations of the same
test form, such as those discussed above, are subject to spurious inflation
because examinees might remember, and merely duplicate, their initial
responses to test items. To avoid this problem, test reliability is sometimes
estimated by correlating examinees' scores on two different forms
of a test. The forms are designed to be psychometrically parallel; that is,
equivalent in format, in length, and in the distribution of difficulties and
correlations of their items. Parallel test forms are sometimes administered
at the same time and are sometimes administered with an intervening time
interval. In the former case, correlations between examinees' scores are
most sensitive to differences in their performances on the two samples of
items that compose the parallel forms. Such estimates of reliability are
called "coefficients of equivalence." In the latter case, the temporal
instability of examinees' performances also attenuates correlations.
These estimates of reliability are called "test-retest parallel-forms estimates"
and reflect temporal stability as well as equivalence of performance
across parallel test forms.
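Computationally, a coefficient of equivalence is the product-moment correlation between the same examinees' scores on two forms. The sketch below uses invented scores for six hypothetical examinees; the form labels and values are assumptions and do not come from any study summarized here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented aptitude scores for the same six examinees on two parallel forms.
form_a = [95, 102, 110, 88, 120, 105]
form_b = [97, 100, 113, 90, 117, 108]

equivalence = pearson(form_a, form_b)
print(round(equivalence, 2))
```

If the second form were administered after a delay, the same computation would yield a test-retest parallel-forms estimate, and any instability over the interval would lower the coefficient further.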
Four operational forms of the GATB, labeled A, B, C, and D, are
currently in use. Forms A and B have been used since 1947 and are now
restricted to retesting of initially tested examinees and other low-incidence
uses. Forms C and D were normed in 1983 and are currently the
primary operational forms of the GATB. Many of the estimates of
temporal stability discussed above were based on the administration of
parallel forms of the GATB. Results of these studies, as well as other
studies of the equivalence of alternate forms of the GATB, are summa-
rized in Table A-5.
The coefficients of equivalence reported in the table are similar in
pattern to the coefficients of stability described earlier. Form A and B
coefficients tend to be a bit larger than Form A and C coefficients or Form
A and D coefficients. Although conclusions must be tentative because the
number of studies is small, coefficients of equivalence between Forms C and
D appear to be of about the same magnitude as Form A and B coefficients.
As was true of coefficients of stability for the GATB aptitudes, the cognitive
aptitudes (G, V, and N) appear to be the most reliable. In addition, the
coefficients of equivalence of the psychomotor aptitudes that are assessed by
tests that require manipulation of objects (F and M) tend to be smallest. With
the exception of these latter two aptitudes, scores on the GATB aptitudes
appear to be assessed with acceptably large interform equivalence and
stability for time intervals of one year or less.

[TABLE A-5 Coefficients of equivalence for alternate forms of the GATB: table body not recoverable from the source text]

It is also possible to estimate the reliability of a test using data collected
in a single test administration. Such reliability estimates are called
"internal consistency coefficients" since they are effectively correlations
of two or more subtests with each other. Internal consistency estimates
are not appropriate for assessing the reliability of speeded tests because
the consistency actually measured is the consistency of the speed of
response, not the consistency of correct answers or of ability.
No estimates of the internal consistency of the GATB subtests or
aptitudes are provided in the GATB Manual (U.S. Department of Labor,
1970) or in more recent literature that was reviewed in preparing this
appendix. Since all subtests of the GATB are administered under
conditions that impose severe time limits on examinees, estimates of the
internal consistency of the GATB subtests are likely to be spuriously high
(Anastasi and Drake, 1954; Cronbach and Warrington, 1951; Gulliksen,
1950a,b; Helmstadter and Ortmeyer, 1953; Lord and Novick, 1968;
Mollenkopf, 1960; Morrison, 1960; Rindler, 1979; Stafford, 1971;
Wesman, 1960). It is therefore appropriate that Employment Service
publications on the GATB and other psychometric literature be devoid of
estimates of internal consistency for the GATB.
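The inflation of internal consistency estimates under speeded conditions
can be illustrated with a toy calculation. The sketch below is not drawn
from any GATB data: the number of items reached by each examinee is
invented, and the test is assumed to be a pure speed test in which every
item reached is answered correctly. Under those assumptions, an odd-even
split-half coefficient is close to 1.0 even though it reflects nothing but
how far each examinee got before time ran out.

```python
from math import sqrt


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical data: items reached by six examinees on a pure speed test.
# Assumption: every item reached is answered correctly.
items_reached = [9, 12, 15, 20, 23, 28]

# Number-right scores on the odd-numbered and even-numbered items.
odd_scores = [(k + 1) // 2 for k in items_reached]
even_scores = [k // 2 for k in items_reached]

r_half = pearson(odd_scores, even_scores)

# Spearman-Brown step-up from half-test to full-test length.
split_half_reliability = 2 * r_half / (1 + r_half)

print(f"half-test correlation:  {r_half:.3f}")
print(f"split-half reliability: {split_half_reliability:.3f}")
```

Because the two half scores can never differ by more than one point for
any examinee, the coefficient is high by construction rather than because
of any consistency in ability.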
EQUATING ALTERNATE FORMS OF THE GATB
The Standards describe the goal for equated forms of a test (American
Educational Research Association et al., 1985:31):
Ideally, alternate forms of a test are interchangeable in use. That is, it should be
a matter of indifference to anyone taking the test or to anyone using the test results
whether form A or form B of the test was used. Of course, such an ideal cannot
be attained fully in practice. Even minor variations in content from one form to
the next can prevent the forms from being interchangeable since one form may
favor individuals with particular strengths, whereas the other form may favor
those with slightly different strengths.
Although considerable care may be taken to make two forms of a test
as similar as possible in terms of content and format, the forms cannot be
expected to be precisely equal in difficulty. Consequently, the use of
simple number-right scores without regard to form is generally inappro-
priate because such scores would place the people taking the more
difficult of the two forms at a disadvantage. To take the unintended
differences in difficulty into account, it is usually necessary to convert
the scores of one form to the units of the other, a process called test
equating (American Educational Research Association et al., 1985:31).
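One common way to carry out such a conversion is linear equating, which
maps a score on one form to the score on the other form that lies the same
number of standard deviations from the mean. The following is a minimal
sketch under stated assumptions, not the procedure used for the GATB: the
function name and the score lists are hypothetical, and the two groups are
assumed to be randomly equivalent samples from the same population.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Convert score x on form X to the scale of form Y.

    Matches the mean and standard deviation of the two score
    distributions (assumes randomly equivalent groups).
    """
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical number-right scores; form X is the more difficult form.
form_x = [10, 12, 14, 16, 18]
form_y = [14, 17, 20, 23, 26]

for raw in (12, 14, 16):
    converted = linear_equate(raw, form_x, form_y)
    print(f"form X score {raw} -> form Y scale {converted:.1f}")
```

In this invented example a person who earns 14 on the harder form is
credited with the equivalent of 20 on the easier form, which removes the
disadvantage that raw number-right scores would impose.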
There are a number of data collection designs and analytical techniques
that may be used to equate forms of a test. Detailed descriptions of the
various approaches can be found in Angoff (1971) and in Petersen et al.
(1989). Regardless of the approach, however, there are two major issues
that need to be considered in judging the adequacy of the equating: (1) the
degree to which the forms measure the same characteristic or construct
and (2) the magnitude of the errors in the equating due to the procedure
and sampling.
Comparability of Constructs Measured
Although the analytical techniques used in equating could be applied to
scores from any pair of tests, it makes sense to consider scores to be
interchangeable only when forms measure essentially the same characteris-
tics. For example, equating techniques could be used to convert scores on an
achievement test in, say, biology such that the distribution of converted
biology scores had the same mean and standard deviation as scores on a
physics achievement test for some large sample of test takers. Clearly,
however, it would not be a matter of indifference to most test takers
whether they were administered the biology test or the physics test. Those
with greater strengths in biology would prefer the biology test, whereas the
converse would be true for those with greater strengths in physics.
In practice, of course, the differences between forms that are to be
equated are not so obvious or extreme as the difference between a biology
and a physics test. However, subtle differences in test content or the
format of the test items can sometimes lead to important shifts in what the
test forms measure and therefore to conversions that can give an
unintended advantage or disadvantage to particular test takers.
If alternate forms of a test measure the same characteristics, one would
expect scores on the two forms to be highly correlated. In the case of a
battery of tests, such as the GATB, one would also expect the pattern of
correlations among the parts of the tests to be similar for different forms.
Evidence regarding both of these issues is provided in an addendum to the
1970 Manual for USES General Aptitude Test Battery, Section III:
Development, titled Reliability and Comparability: Forms C and D
(U.S. Department of Labor, 1986).
Pairs of three forms of Subtests 1 through 8 of the GATB were
administered to a sample of 3,344 people in 20 participating states. A total
of six subsamples ranging in size from 545 to 567 were administered pairs
of Forms A, C, and D in counterbalanced order. That is, one subsample
(AC) was administered Form A first and then Form C a week later, a
second subsample (CA) was administered Form C first and Form A a
week later. The remaining four subsamples (AD, DA, CD, and DC) were
defined in an analogous fashion.
The correlations between the scores on the alternate forms for each of
Subtests 1 through 8 and for the seven aptitude scores that are based on the
first eight subtests of the GATB provide estimates of the alternate-form
reliabilities of the subtests and of the aptitude scores. The alternate-form
reliabilities of the first three aptitudes (intelligence, verbal, and numerical)
are close to .90, a level that is consistent with alternate-form reliabilities of
aptitude tests of high technical quality that are provided by major test
publishers. The reliabilities of the remaining four aptitudes (spatial, form
perception, clerical, and motor coordination) are lower, close to .80.
The alternate-form reliabilities of the Subtests are in the .80s for
Subtests 1 through 4 (name comparison, computation, three-dimensional
space, and vocabulary) and the high .70s to .80 for Subtests 5, 6, and 8
(tool matching, arithmetic reasoning, and mark making). The lowest
reliabilities were obtained for Subtest 7, form matching, where two of the
correlations, both involving the relationship between Forms A and C, fall
below .70.
In addition to providing alternate-form reliabilities for the first eight
Subtests of the GATB, the addendum to the Manual (U.S. Department of
Labor, 1986) also reports matrices of intercorrelations among the eight
scores for each of the three forms involved in the reliability and
comparability study. Correlations for each pair of subtests show that the
three forms of the GATB are virtually indistinguishable in terms of the
pattern of correlations among the subtests. The similarity in the patterns
of intercorrelations coupled with the reasonably high alternate-form
reliabilities suggests that, with the possible exception of Subtest 7, form
matching, the forms are measuring sufficiently similar characteristics to
equate the forms and use the equated scores interchangeably.
Much less evidence is provided regarding the comparability of Subtests
9 through 12 of the GATB than was provided for Subtests 1 through 8.
The evidence that is available is based on smaller samples of test takers
and provides less support for concluding that the subtest scores can be
treated as being interchangeable after equating.
Subtests 9 through 12 of Form A were administered to 273 test takers
followed by Form C one week later. Another sample of 260 test takers
was administered Forms A and D in the same pattern (Table A-6). The
alternate-form reliabilities for both the subtest scores and for the two
aptitudes that are based on Subtests 9 through 12 are lower than were
obtained for Subtests 1 through 8. The correlations of the Form C with
Form A scores are particularly low.

TABLE A-6 Alternate-Form Reliabilities of Subtests 9 Through 12 of
the GATB and of the Aptitudes Based on Those Subtests, Based on
Samples of Sizes 266 and 273
                          Reliabilities
                          Forms A,C    Forms A,D
Subtest
  9. Place                  .62          .72
  10. Turn                  .52          .56
  11. Assemble              .45          .60
  12. Disassemble           .54          .70
Aptitude
  Manual Dexterity          .65          .74
  Finger Dexterity          .57          .71

In addition to having low alternate-form reliabilities, the degree to
which the alternate forms measure the same skills is brought into question
by the differences in the patterns of correlations among the subtests. The
intercorrelations for each pair of subtest scores are listed in Table A-7.
Two correlations for each pair of subtests are shown for Form A. The first
is based on the subsample that also took Form C; the second correlation
is based on the subsample that also took Form D.
These correlations are subject to greater sampling error than the ones
that were summarized above for Subtests 1 through 8 because of their
smaller sample sizes. In addition, there are no replications for Forms C
and D. Nonetheless, the pattern of correlations for Form C seems to differ
from that of Form D. Indeed, considering the Form C versus Form D
correlation between subtests, one pair of subtests at a time, four of the six
pairs of correlations are significantly different at the .05 level.

TABLE A-7 Correlations Between Subtest Scores by Form
                   Correlations
Subtest Pairs      Form A           Form C    Form D
9 with 10          .66 and .62      .73       .53
9 with 11          .40 and .42      .50       .32
9 with 12          .50 and .52      .44       .49
10 with 11         .44 and .40      .50       .27
10 with 12         .56 and .56      .50       .35
11 with 12         .57 and .46      .50       .49

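The text does not say which test produced the "four of the six" result. One
standard choice for comparing correlations from two independent samples is
Fisher's z transformation, sketched below as an illustration only. The
sample sizes are assumptions taken from the narrative (273 test takers for
the Form C correlations and 260 for the Form D correlations; the caption of
Table A-6 gives 266 rather than 260), and the function name is hypothetical.

```python
from math import atanh, erfc, sqrt


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test that two independent correlations are equal."""
    z = (atanh(r1) - atanh(r2)) / sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    p = erfc(abs(z) / sqrt(2.0))  # two-sided p-value from the normal curve
    return z, p


# Assumed sample sizes (see the lead-in; not confirmed by the source).
N_FORM_C, N_FORM_D = 273, 260

# (Form C, Form D) correlations as listed in Table A-7.
pairs = {
    "9 with 10": (0.73, 0.53),
    "9 with 11": (0.50, 0.32),
    "9 with 12": (0.44, 0.49),
    "10 with 11": (0.50, 0.27),
    "10 with 12": (0.50, 0.35),
    "11 with 12": (0.50, 0.49),
}

significant = []
for name, (r_c, r_d) in pairs.items():
    z, p = fisher_z_test(r_c, N_FORM_C, r_d, N_FORM_D)
    print(f"{name}: z = {z:5.2f}, p = {p:.3f}")
    if p < 0.05:
        significant.append(name)

print(f"{len(significant)} of {len(pairs)} pairs differ at the .05 level")
```

Under these assumptions the sketch flags four of the six pairs, which
agrees with the count reported in the text; the pairs that are not flagged
are 9 with 12 and 11 with 12.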
In summary, the alternate-form reliabilities and patterns of correlations
do not provide sufficient evidence that Subtests 9 through 12 of Forms A,
C, and D measure essentially the same characteristics. Hence, the scores
from different forms for these Subtests should not be considered to be
interchangeable.
Equating Procedure and Sampling Errors
The first GATB equating was to place B-1002 (Form A and Form B) on
the scale developed for B-1001. To equate Form A, four groups of high
school students were administered the old and one of the two new forms,
in counterbalanced order. The representativeness of these groups is
questionable. One group had to be dropped because test scores were not
comparable; the remaining total sample size was 585. The conversion of
Form B was similar, but was based on 412 high school students. These are
very small samples on which to base equating. The equating procedures,
while not specified, appeared to be linear.
The information that is available in the 1984 and 1986 addenda to the
Manual (U.S. Department of Labor, 1984, 1986) is inadequate for a
complete evaluation of the equating of Forms C and D. Thus, the
comments on this aspect of the equating are brief and inconclusive.
The design used to investigate the reliability and comparability of
Subtests 1 through 8 of Forms A, C, and D is a standard equating design.
It is referred to as "Design II: Random groups--both tests administered
to each group, counterbalanced" by Angoff (1971). This design provides
equating results that have much greater precision, i.e., smaller standard
errors of equating, than the more commonly used designs in which only
one of the forms is administered to each group. Given this design, the
sample sizes were consistent with accepted practice for test equating.
Details of the equating of Subtests 1 through 8 of Forms C and D to Form
A are not reported. The data from the reliability and comparability study
could have been used with one of the analytical procedures described by
Angoff (1971) for Design II. However, information about the specific
procedure that was used is not provided. No evidence regarding the
adequacy of the equating (e.g., comparisons of linear and equipercentile
equating results or estimates of standard errors of equating at different score
levels) is presented. Hence, an independent evaluation is not possible.
For Subtests 9 through 12, reservations about equating have already
been expressed because of questions about the comparability of the
characteristics measured by the different forms. The order of administra-
tion was not counterbalanced in the design that was used to investigate
the alternate-form reliability and comparability of Subtests 9 through 12;
consequently, a separate standardization sample was obtained for pur-
poses of equating.
The standardization sample (for Subtests 9 through 12) consisted of a
total of 2,092 persons from 11 states. Form C was administered to 981
people and Form D to 1,111 people in the standardization sample.
Although details of assignment to the two subsamples are not presented,
they were apparently considered to be random samples from the same
population as the Form A standardization sample to which Forms C and
D were equated. Random assignment is a critical part of Angoff's (1971:569)
"Design I: Random groups--one test administered to each group."
Hence, it seems important to understand the basis for considering the
Form C and D standardization samples to be randomly equivalent to the
Form A standardization sample.
The 1986 addendum to the Manual indicates that both linear and
equipercentile equating procedures were obtained for the equating of
Subtests 9 through 12, but because the two procedures produced very
similar results it was decided that the linear equating would be used (U.S.
Department of Labor, 1986). Such a decision is consistent with accepted
practice. However, no information comparing the results of the two
procedures is provided. Hence it is not possible to provide an indepen-
dent evaluation of this decision.
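The kind of comparison that the addendum does not report can be sketched
as follows. Equipercentile equating converts a score on one form to the
score on the other form that has the same percentile rank; when the two
score distributions have the same shape, it gives essentially the same
conversion as linear equating. The code is an illustration only: the
function names, the mid-percentile-rank definition, the interpolation
rule, and the score lists are all assumptions, not the Employment
Service's procedure or data.

```python
from statistics import mean, pstdev


def percentile_rank(x, scores):
    """Percent of scores below x plus half the percent equal to x."""
    below = sum(s < x for s in scores)
    equal = sum(s == x for s in scores)
    return 100.0 * (below + 0.5 * equal) / len(scores)


def equipercentile_equate(x, scores_x, scores_y):
    """Form Y score whose percentile rank matches that of x on form X."""
    target = percentile_rank(x, scores_x)
    ys = sorted(set(scores_y))
    ranks = [percentile_rank(y, scores_y) for y in ys]
    if target <= ranks[0]:
        return float(ys[0])
    if target >= ranks[-1]:
        return float(ys[-1])
    # Interpolate linearly between adjacent observed form Y scores.
    for i in range(len(ys) - 1):
        if ranks[i] <= target <= ranks[i + 1]:
            share = (target - ranks[i]) / (ranks[i + 1] - ranks[i])
            return ys[i] + share * (ys[i + 1] - ys[i])


def linear_equate(x, scores_x, scores_y):
    """Form Y score the same number of SDs from the mean as x is on form X."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))


# Hypothetical scores with identically shaped distributions.
form_x = [10, 12, 14, 16, 18]
form_y = [14, 17, 20, 23, 26]

for raw in (14, 15, 16):
    e = equipercentile_equate(raw, form_x, form_y)
    l = linear_equate(raw, form_x, form_y)
    print(f"form X {raw}: equipercentile {e:.2f}, linear {l:.2f}")
```

Tabulating the two conversions side by side across the score range, as in
the loop above, is the evidence a reader would need in order to judge
whether choosing the linear conversion was justified.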
REFERENCES
American Educational Research Association, American Psychological Association, and
National Council on Measurement in Education
1985 Standards for Educational and Psychological Testing. Washington, D.C.: Ameri-
can Psychological Association.
Anastasi, A., and R. Drake
1954 An empirical comparison of certain techniques for estimating the reliability of
speeded tests. Educational and Psychological Measurement 14:529-540.
Angoff, W.H.
1971 Scales, norms, and equivalent scores. In R.L. Thorndike, ed., Educational
Measurement, 2d ed. Washington, D.C.: American Council on Education.
Briscoe, C.D., W. Muelder, and W. Michael
1981 Concurrent validity of self-estimates of abilities relative to criteria provided by
standardized test measures of the same abilities for a sample of high school
students eligible for participation in the CETA program. Educational and
Psychological Measurement 41:1285-1294.
Cassel, R.N., and G.W. Reier
1971 Comparative analysis of concurrent and predictive validity for the GATB Clerical
Aptitude Test Battery. Journal of Psychology 79:135-140.
Cooley, W.W.
1965 Further relationships with the TALENT battery. Personnel and Guidance Journal
44:295-303.
Cronbach, Lee J., and W.G. Warrington
1951 Time limit tests: estimating their reliability and degree of speeding. Psychometrika
14:167-188.
Dong, H., Y. Sung, and S. Goldman
1986 The validity of the Ball Aptitude Test Battery (BAB) III: relationship to the CAB,
DAT, and GATB. Educational and Psychological Measurement 46:245-250.
Gulliksen, H.A.
1950a The reliability of speeded tests. Psychometrika 15:259-260.
1950b Theory of Mental Tests. New York: Wiley.
Hakstian, A.R., and R. Bennett
1978 Validity studies using the Comprehensive Ability Battery (CAB) II: Relationship
with the DAT and GATB. Educational and Psychological Measurement
38:1003-1015.
Helmstadter, G.C., and D.H. Ortmeyer
1953 Some techniques for determining the relative magnitude of speed and power
components of a test. Educational and Psychological Measurement 12:280-287.
Howe, M.A.
1975 General Aptitude Test Battery Q: an Australian empirical study. Australian
Psychologist 10:32-44.
Kettner, N.
1976 Armed Services Vocational Aptitude Battery (ASVAB Form 5): Comparison with
GATB and DAT Tests. Final report, May 1975-October 1976. Armed Services
Human Resources Laboratory, Brooks Air Force Base, Texas.
Kish, G.B.
1970 Alcoholics' GATB and Shipley profiles and their interrelationships. Journal of
Clinical Psychology 26:482-484.
Knapp, R., L. Knapp, and W. Michael
1977 Stability and concurrent validity of the Career Ability Placement Survey (CAPS)
against the DAT and the GATB. Educational and Psychological Measurement
37:1081-1085.
Lord, F.M., and Melvin R. Novick
1968 Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley.
Mollenkopf, W.G.
1960 Time limits and the behavior of test takers. Educational and Psychological
Measurement 20:223-230.
Moore, R., and J. Davies
1984 Predicting GED scores on the basis of expectancy, valence, intelligence, and
pretest skill levels with the disadvantaged. Educational and Psychological Mea-
surement 44:483-489.
Morrison, E.J.
1960 On test variance and the dimensions of the measurement situation. Educational
and Psychological Measurement 20:231-250.
O'Malley, P., and J. Bachman
1976 Longitudinal evidence for the validity of the Quick Test. Psychological Reports
38:1247-1252.
Petersen, Nancy S., Michael J. Kolen, and H.D. Hoover
1989 Scaling, norming, and equating. Chap. 6 in Robert L. Linn, ed., Educational
Measurement, 3d ed. New York: Macmillan.
Rindler, S.E.
1979 Pitfalls in assessing test speededness. Journal of Educational Measurement
16:261-270.
Sakalosky, J.C.
1970 A Study of the Relationship Between the Differential Aptitude Test Battery and the
General Aptitude Test Battery Scores of Ninth Graders. Master's thesis, Millers-
ville State College.
Senior, N.
1952 An Analysis of the Effect of Four Years of College Training on General Aptitude
Test Battery Scores. Unpublished master's thesis, University of Utah, Provo.
Showler, W.K., and R.C. Droege
1969 Stability of aptitude scores for adults. Educational and Psychological Measure-
ment 29:681-686.
Stafford, R.E.
1971 The speededness quotient: A new descriptive statistic for tests. Journal of
Educational Measurement 8:275-278.
U.S. Department of Labor
1970 Manual for the USTES General Aptitude Test Battery. Section III: Development.
Washington, D.C.: Manpower Administration, U.S. Department of Labor.
1984 Forms C and D of the General Aptitude Test Battery: An Historical Review of
Development. Division of Counseling and Test Development, Employment and
Training Administration, U.S. Department of Labor, Washington, D.C.
1986 Reliability and Comparability: Forms C and D. Addendum to Manual for the
USTES General Aptitude Test Battery. Section III: Development. U.S.
Employment Service, Employment and Training Administration. Washington, D.C.: U.S.
Department of Labor.
Wesman, A.G.
1960 Some effects of speed in test use. Educational and Psychological Measurement
20:267-274.