Suggested Citation:"Appendix A: A Synthesis of Research on Some Psychometric Properties of the GATB." National Research Council. 1989. Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery. Washington, DC: The National Academies Press. doi: 10.17226/1338.

A

A Synthesis of Research on Some Psychometric Properties of the GATB

Richard M. Jaeger, Robert L. Linn, and Anita S. Tesh

This paper provides a detailed evaluation of three topics that bear on the overall quality of the General Aptitude Test Battery (GATB): its construct validity as supported by convergent validity evidence; its reliability; and the interchangeability of its forms. The first section presents the results of an exhaustive literature search for evidence that the GATB aptitude composites measure the same characteristics as other similarly named aptitude tests. The second section brings together the research on the stability of GATB aptitude scores over time and among forms of the test battery. And finally, the paper addresses the comparability of the GATB subtests from one form to another.

CONSTRUCT VALIDITY ISSUES

According to the Standards for Educational and Psychological Testing, a statement of standards for the development and use of tests that is adhered to by the major professional societies in the testing field, validity is of paramount concern in assessing the use of tests (American Educational Research Association et al., 1985:9):

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences.... Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the
inferences that are made from the scores. The inferences regarding specific uses of a test are validated, not the test itself.

We are concerned here with a particular category of validity evidence involving construct-related evidence. According to the Standards (p. 9), "evidence classed in the construct-related category focuses primarily on the test score as a measure of the psychological characteristic of interest." The Standards also provide examples of the types of evidence that can be used to support construct validity claims (p. 10):

Evidence for the construct interpretation of a test may be obtained from a variety of sources. Intercorrelations among items may be used to support the assertion that a test measures primarily a single construct. Substantial relationships of a test to other measures that are purportedly of the same construct and the weaknesses of relationships to measures that are purportedly of different constructs support both the identification of constructs and distinctions among them. Relationships among different methods of measurement and among various non-test variables similarly sharpen and elaborate the meaning and interpretation of constructs.

In this section we examine evidence that bears on claims that the subtests of the GATB measure the aptitudes with which they are identified in the GATB Manual (U.S. Department of Labor, 1970), and nothing more. We provide a summary of correlations between subtests or aptitudes of the GATB and correspondingly labeled subtests or aptitudes of other test batteries.

Convergent Validity Evidence

The psychometric literature contains a substantial number of studies of the strength of relationships between subtests of the GATB and corresponding subtests of other batteries. As noted above, evidence of strong positive relationships between purported measures of the same construct is supportive of construct validity claims for all related measurement instruments. Thus the claim that the subtests of the GATB measure the aptitudes attributed to them would be enhanced by data of this sort and weakened if small to moderate correlations between corresponding subtests were to be found.

Chapter 14 of Section III of the GATB Manual (U.S. Department of Labor, 1970) is entitled "Correlations with Other Tests." The chapter contains correlation matrices resulting from studies of the GATB and a variety of other aptitude tests and vocational interest measures. Results from 64 studies are reported, including several involving the initial edition of the GATB (B-1001). In this summary, we restrict attention to studies involving the current version of the GATB (B-1002, Forms A-D) and appropriate aptitude tests. Since the publication of the GATB Manual,
correlations between various GATB aptitudes or subtests and corresponding subtests of other test batteries have been provided in studies by Briscoe et al. (1981); Cassel and Reier (1971); Cooley (1965); Dong et al. (1986); Hakstian and Bennett (1978); Howe (1975); Kettner (1976); Kish (1970); Knapp et al. (1977); Moore and Davies (1984); O'Malley and Bachman (1976); and Sakalosky (1970).

The sizes and compositions of examinee samples used in these studies are diverse, as are the aptitude batteries with which GATB subtests and aptitudes were correlated. They range from 40 ninth-grade students who completed both the GATB and the Differential Aptitude Test Battery (DAT), to 1,355 Australian army enlistees who completed the GATB and the Australian Army General Classification Test. However, in ~ of 13 studies (many of which examined several independent samples of examinees), the samples consisted of high school students.

Three rules were followed in selecting appropriate studies of the convergent validity of the GATB aptitudes with other, corresponding aptitude measures. First, only correlations between GATB aptitudes or subtests and corresponding components of other aptitude batteries were included. Thus, correlations with self-reports of aptitude or with achievement measures or performance scores were purposefully omitted. Second, only correlations with aptitude battery components having titles similar to the GATB measure of interest were retained; for example, a correlation between GATB Aptitude G and the abstract reasoning score on the DAT was included; a correlation between GATB Aptitude G and the numerical reasoning score on the DAT was excluded. Third, in studies that reported correlations between all possible pairs of measures composed of a GATB aptitude and an aptitude from another battery, only the largest correlation between any GATB aptitude and an aptitude from the other battery was retained.
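
The joint effect of rules two and three can be sketched in code. The Python fragment below is only an illustration under stated assumptions: the record layout, the function name select_convergent, and every correlation value in the sample data are hypothetical and are not taken from the studies reviewed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correlation:
    gatb_aptitude: str   # GATB aptitude, e.g. "G"
    other_measure: str   # component of the other battery
    r: float             # reported correlation
    corresponds: bool    # True when the other measure is the similarly titled one


def select_convergent(correlations):
    """Apply rules two and three together (hypothetical helper).

    For each GATB aptitude, keep the largest correlation with a corresponding
    measure, but only if it exceeds every correlation between that aptitude
    and the non-corresponding measures in the same battery.
    """
    kept = []
    for aptitude in sorted({c.gatb_aptitude for c in correlations}):
        rows = [c for c in correlations if c.gatb_aptitude == aptitude]
        matching = [c for c in rows if c.corresponds]
        if not matching:
            continue
        best = max(matching, key=lambda c: c.r)
        others = [c.r for c in rows if not c.corresponds]
        if all(best.r > r for r in others):
            kept.append(best)
    return kept


# Hypothetical correlations from a single study; not data from the review.
study = [
    Correlation("G", "abstract reasoning", 0.70, True),
    Correlation("G", "numerical reasoning", 0.60, False),
    Correlation("V", "verbal reasoning", 0.55, True),
    Correlation("V", "clerical speed", 0.61, False),
]

retained = select_convergent(study)
print([(c.gatb_aptitude, c.other_measure, c.r) for c in retained])
```

In this made-up study only the G correlation is retained, because the hypothetical V correlation with its corresponding measure is smaller than V's correlation with a differently titled measure.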
When rules two and three were applied simultaneously, a correlation was included only if it reflected a relationship between a GATB aptitude and the appropriate aptitude from another battery and only if it exceeded the correlations between that GATB aptitude and any other aptitude assessed by the other battery.

Data on the convergent validity of the GATB aptitudes were tabulated for each aptitude (Table A-1). Distributions of convergent validity coefficients for the three cognitive aptitudes (G, V, and N) and the three perceptual aptitudes (S, P, and Q) are displayed in pictorial form in Figure 4-2, and in tabular form in Table A-2.

Convergent Validity of GATB Aptitude G (General Intelligence)

The 51 convergent validity coefficients that were reported for the GATB-G aptitude ranged from .45 to .89, with a median value of .75. Since G is a broadly defined construct that is assessed through the

TABLE A-1 Stem-and-Leaf Displays of Convergent Validity Coefficients for the GATB Aptitudes

[Leaf entries for panels a (G, General Intelligence), b (V, Verbal Ability), c (N, Numerical Ability), d (S, Spatial Aptitude), and f (Q, Clerical Perception) are not recoverable from this text; Table A-2 summarizes those distributions.]

e. P, Form Perception
Stem  Leaf
6     5
5     8
5     3
4     59
4     44
3     8

g. K, Motor Coordination
Stem  Leaf
5     8

h. F, Finger Dexterity
Stem  Leaf
4     1
3     7

i. M, Manual Dexterity
Stem  Leaf
5     0

NOTE: The stem unit is .1. Therefore, the stem entry 8 followed by a leaf entry of 0 indicates a correlation coefficient of .80; each digit in a sequence of "leaves" indicates a different correlation coefficient.

TABLE A-2 Summary Statistics for Distributions of Convergent Validity Coefficients for the Cognitive GATB Aptitudes (G, V, and N) and the Perceptual GATB Aptitudes (S, P, and Q)

                               Validity Coefficients
          Number of            First                Third
Aptitude  Studies     Minimum  Quartile   Median    Quartile   Maximum
G         51          .45      .67        .75       .79        .89
V         59          .22      .69        .72       .78        .85
N         53          .43      .61        .68       .75        .85
S         19          .30      .58        .62       .70        .73
P          8          .38      .44        .47       .57        .65
Q         16          .24      .38        .50       .60        .76
K          1          .58      .58        .58       .58        .58
F          2          .37      .37        .39       .41        .41
M          1          .50      .50        .50       .50        .50

arithmetic, vocabulary, and spatial subtests of the GATB, a median convergent validity coefficient of .75 does provide adequate evidence of the convergent validity of the GATB intelligence aptitude. Data on the convergent validity of GATB-G are presented in Table A-1a.

Convergent Validity of GATB Aptitude V (Verbal Ability)

The 59 convergent validity coefficients that were reported for the GATB-V aptitude ranged from .22 to .85, with a median value of .72. Considering the variety of measures with which GATB-V was correlated, and the less-than-perfect reliabilities of the GATB subtests that contribute to V and the tests with which it was correlated, a median validity coefficient of .72 provides adequate evidence of convergent validity for the GATB verbal ability measure. Although the minimum observed validity coefficient of .22 is discomforting, it is not at all representative of validity coefficients in the lowest fourth of the distribution for V; the next-lowest observed coefficient was .57. Data on the convergent validity of GATB-V are presented in Table A-1b.

Convergent Validity of GATB Aptitude N (Numerical Ability)

The 53 convergent validity coefficients that were found for the GATB-N aptitude ranged from .43 to .85, with a median value of .68. A median convergent validity coefficient of .68 is somewhat smaller than would be desired for a measure of numerical ability. However, a claim to convergent validation for GATB-N is reasonably well supported by the data at hand, since three-fourths of the coefficients exceed .61 and a fourth are
larger than .75. It should also be noted that, in several of the studies reviewed, correlations were provided for GATB subtests rather than GATB aptitudes. Such correlations will be attenuated by smaller reliabilities than would be found for the GATB aptitudes. Data on the convergent validity of GATB-N are presented in Table A-1c.

Convergent Validity of GATB Aptitude S (Spatial Aptitude)

The 19 convergent validity coefficients that were found for the GATB-S aptitude ranged from .30 to .73, with a median value of .62. The GATB spatial ability aptitude (S) is somewhat less highly correlated with its counterpart measures in other test batteries than is the verbal ability aptitude. A median convergent validity coefficient of .62 with a range from .30 to .73 and a fourth of the coefficients below .58 suggests that somewhat different spatial perception constructs are measured in various batteries, or that the reliabilities of spatial ability measures are somewhat lower than those of corresponding verbal ability measures. Although these data do not cast serious doubt on the construct validity of the spatial ability aptitude, they are not as supportive as the evidence amassed for the verbal ability measure. Data on the convergent validity of GATB-S are presented in Table A-1d.

Convergent Validity of GATB Aptitude P (Form Perception)

The eight convergent validity coefficients that were found for the GATB-P aptitude ranged from .38 to .65, with a median value of .47. The convergent validity of the GATB form perception aptitude (P) is thus not well supported by the evidence compiled in this review. As measured by the GATB, form perception depends on examinees' abilities to discriminate among detailed patterns shown on common tools and to match the outlines of two-dimensional geometric forms represented by line drawings.
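
The attenuation of correlations by imperfect reliability, noted above for subtest-level correlations, is usually quantified with the classical correction for attenuation. The appendix does not state this formula, so the following is a sketch under the standard classical test theory model. Here r_{XY} is the observed correlation between two measures, r_{XX'} and r_{YY'} are their reliabilities, and r_{T_X T_Y} is the estimated correlation between their true scores. The two reliability values in the worked line are hypothetical and chosen only for illustration.

```latex
\[
  r_{T_X T_Y} \;=\; \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}
\]
% Worked example: observed r_{XY} = .62 (the median reported for S),
% with hypothetical reliabilities r_{XX'} = .85 and r_{YY'} = .80.
\[
  r_{T_X T_Y} \;=\; \frac{.62}{\sqrt{(.85)(.80)}} \;\approx\; .75
\]
```

Under these assumed reliabilities, an observed coefficient in the low .60s would correspond to a true-score correlation of roughly .75, which is why uncorrected coefficients understate the overlap between constructs.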
Both tests are somewhat speeded, perhaps adding an ability component that is not as prevalent in the other test batteries used to generate the validity coefficients. Whatever the basis for these results, it would seem prudent to undertake a comparative content analysis of the tool matching and form matching subtests of the GATB and the supposedly corresponding measures in the test batteries used to generate these convergent validity coefficients. Data on the convergent validity of GATB-P are presented in Table A-1e.

Convergent Validity of GATB Aptitude Q (Clerical Perception)

The 16 convergent validity coefficients that were reported for the GATB-Q aptitude ranged from .24 to .76, with a median value of .50. The
literature reviewed provided fewer convergent validity coefficients for the GATB clerical perception aptitude (Q) than for a number of other GATB aptitudes. Although many of the validity coefficients found for clerical perception were larger than those found for the form perception aptitude (P), evidence supporting the convergent validity of clerical perception was not as compelling as that found for the three cognitive aptitudes (G, V, and N). Even when somewhat smaller reliabilities are considered, a median validity coefficient of .50 (uncorrected for unreliability) suggests that the GATB name comparison subtest might measure a somewhat different construct than do the subtests that contribute to clerical perception measures in other test batteries. Indeed, the description of the clerical perception aptitude provided in the GATB Manual suggests a somewhat broader aptitude (including arithmetic perception) than does the description of the name comparison subtest on which it is based. Data on the convergent validity of GATB-Q are presented in Table A-1f.

Convergent Validity of the Psychomotor Aptitudes K (Motor Coordination), F (Finger Dexterity), and M (Manual Dexterity)

Unfortunately, review of the literature since publication of the GATB Manual produced very few studies of the convergent validity of subtests underlying the psychomotor aptitudes (K, F, and M) of the GATB. There was one correlation for K, motor coordination (.58), two for F, finger dexterity (.37, .41), and one for M, manual dexterity (.50). And the correlations reported for these aptitudes in the Manual cannot be regarded as convergent validity coefficients. Data on the convergent validity of aptitudes K, F, and M are presented in Table A-1g, h, and i, respectively.

Summary of Convergent Validity Results

Distributions of convergent validity coefficients for the cognitive aptitudes of the GATB (G, V, and N) provide moderately strong support for claims that these aptitudes are appropriately named and measured. Corresponding results for the perceptual aptitudes of the GATB (S, P, and Q) are less convincing. Data for the psychomotor aptitudes are so meager that judgment on their convergent validity must be withheld.

Although the median convergent validity coefficient observed for the spatial aptitude (S) was respectably large, the corresponding median values for the form perception (P) and clerical perception (Q) aptitudes were smaller than would be desired. The three-dimensional space subtest is said to measure both intelligence and spatial aptitude and might therefore require greater reasoning ability and inferential skill
than is typical of measures of spatial aptitude found in other batteries. As has already been noted, the name comparison subtest of the GATB appears to tap only a subset of the skills typically associated with clerical perception.

Distributions of convergent validity coefficients for the cognitive and perceptual GATB aptitudes are summarized in Table A-2 and, for ease of visual comparison, in Figure 4-2.

RELIABILITY OF THE GATB APTITUDE SCORES

Aptitude tests such as the GATB are intended to measure stable characteristics of individuals, rather than transient or ephemeral qualities. Such tests must measure these characteristics consistently, if they are to be useful. Reliability is the term used to describe the degree to which a test measures consistently. The Standards define reliability as follows (American Educational Research Association et al., 1985:19):

Reliability refers to the degree to which test scores are free from errors of measurement. A test taker may perform differently on one occasion than on another for reasons that may or may not be related to the purpose of measurement. A person may try harder, be more fatigued or anxious, have greater familiarity with the content of questions on one test form than another, or simply guess correctly on more questions on one occasion than another. For these and other reasons, a person's score will not be perfectly consistent from one occasion to the next.... Differences between scores from one form to another or from one occasion to another may be attributable to what is commonly called errors of measurement.... Measurement errors reduce the reliability (and therefore the generalizability) of the score obtained for a person from a single measurement.

Fundamental to the proper evaluation of a test are the identification of major sources of measurement error, the size of the errors resulting from these sources, the indication of the degree of reliability to be expected between pairs of scores under particular circumstances, and the generalizability of scores across items, forms, raters, administrations, and other measurement facets.

The Standards further state (p. 19) that test developers are primarily responsible for assessing a test's reliability and for identifying major sources of measurement error. When a test is composed of subtests, the reliability of each must be investigated and reported in adequate detail, so that test users can determine whether the test and subtests are sufficiently reliable to be used for the purposes intended.

The reliability coefficient of a test is defined technically as the square of the correlation between a hypothetical true score and the score actually observed. The reliability coefficient represents the degree to which differences among test takers' scores represent actual differences in their abilities, rather than errors of measurement. If a test had a reliability
coefficient of 1.0, all of the differences among test takers' scores (i.e., variation in their scores) would represent differences in their abilities, and none would represent errors of measurement. We would describe such a test as being "perfectly reliable." If a test had a reliability coefficient of zero, differences among test takers' scores would be due solely to errors of measurement, and the test would be termed "totally unreliable." In practice, tests of human abilities and aptitudes are neither perfectly reliable nor totally unreliable; some variation in test scores reflects true differences among test takers' abilities and some reflects errors of measurement.

Because test reliability generally increases as test length increases, and subtests are shorter than the test they compose, reliability coefficients for subtests are typically smaller than corresponding coefficients for the test as a whole. When the adequacy of a test's reliability is judged, it is therefore important to consider the reliability of every score that is separately reported and interpreted. In the case of the GATB, reliabilities of aptitude scores are of central interest.

The psychometric literature includes a variety of methods for estimating test reliability. Popular methods differ in their sensitivity to various sources of measurement error, in their applicability to different types of tests, and in their usefulness for particular purposes. When tests are used to assess aptitudes or other traits that are expected to be stable across weeks, months, or years, the most appropriate reliability estimation procedures will reflect the stability of measurements across time. Such reliability estimates are termed stability coefficients or indices of temporal stability. Stability coefficients are based on the consistency of examinees' performances during two test administrations and might be spuriously inflated because of examinees' memories of their initial responses.
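
The earlier remark that reliability generally increases with test length is commonly formalized with the Spearman-Brown formula. The appendix does not cite that formula, so the Python sketch below rests on the assumption that it is the relationship intended. The function name and the reliability value of .90 are hypothetical and are not GATB estimates.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability after changing test length by length_factor.

    Assumes the parts added or removed are parallel to the original items
    (the classical Spearman-Brown prophecy formula).
    """
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical reliability of a full test; not a GATB estimate.
full_test = 0.90

# A subtest half as long as the full test (length_factor = 0.5).
half_length_subtest = spearman_brown(full_test, 0.5)
print(round(half_length_subtest, 3))
```

Under these assumptions a subtest half the length of a test with reliability .90 is projected to have a reliability of about .82, which illustrates why subtest coefficients are typically smaller than coefficients for the test as a whole.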
Risks of distortion due to memory effects can be avoided if reliability estimates are based on the administration of two forms of the same test, separated by the amount of time the aptitudes measured are assumed to remain stable. Such reliability estimates are termed "equivalent-forms reliability coefficients." A number of studies of the temporal stability and equivalent-forms reliability of the GATB aptitude scores are summarized below.

Temporal Stability of the GATB Aptitude Scores

Studies of the temporal stability of GATB aptitude scores have examined a variety of time periods between test administrations, ranging from one day to four years. These studies of the consistency of GATB aptitude scores have used samples of examinees that vary widely in age and level of education, including employed adults, junior high school students, high school students, and college students. Estimates of the temporal stability
of GATB aptitude scores have also been computed for examinees of different races. Table A-3 contains a summary of indices of the temporal stability of GATB aptitude scores reported by Senior (1952), Showler and Droege (1969), and the U.S. Department of Labor (1970, 1986). The estimates presented below were based on sequential administration of either the same or different GATB test forms.

Temporal Stability by Age

As shown in Table A-3, coefficients of stability for the GATB cognitive aptitudes G, V, and N consistently exceed .80 when samples are composed of adults and time intervals between successive test administrations are no more than three years. For corresponding examinee samples and time periods, coefficients of stability for the GATB perceptual aptitudes S, P, and Q are at least .70. Aptitude K showed similar stability in all but one study. The other GATB psychomotor aptitudes, F and M, which are measured by subtests that require manipulation of objects, were found to have somewhat lower coefficients of stability than did aptitudes measured by pencil-and-paper subtests. Coefficients of stability for GATB aptitudes F and M were reported to be at least .57 when estimated for samples of adult examinees over time periods up to three years. The temporal stabilities of the GATB psychomotor aptitudes are not as well estimated as are those of the cognitive and spatial aptitudes, since fewer studies have included the GATB subtests that require manipulation of objects.

All of the GATB aptitude scores are less stable for samples of ninth- and tenth-graders than for samples of adults. A portion of this instability might be attributed to the maturation of these younger examinees during the period between successive administrations of the GATB, and might therefore reflect valid changes in the relative ordering of the examinees on the aptitudes assessed, rather than instability due to measurement error.

Temporal Stability by Time Interval

Across test-retest intervals from one day to four years, stability coefficients for GATB aptitude scores ranged from .51 to .94. For specific time intervals, stability coefficients also varied greatly across aptitudes. Stability coefficients tended to be largest for the cognitive aptitudes (G, V, and N) and smallest for the psychomotor aptitudes (K, F, and M).

For GATB cognitive aptitudes G, V, and N, an increase of more than seven weeks in the time interval between successive test administrations is necessary to reduce the average stability coefficient by .01. For the

TABLE A-3 [Summary of coefficients of temporal stability for the GATB aptitudes; the table entries are not legible in this text.]

GATB perceptual aptitudes S, P, and Q, estimates are, on average, initially smaller than for the GATB cognitive aptitudes (G, V, and N) and also vary more widely at specific test-retest time intervals. The relationship between average stability coefficient and test-retest time interval is modeled less well (by ordinary least-squares linear regression) for these aptitudes than for the GATB cognitive aptitudes.

The GATB psychomotor aptitudes, K, F, and M, have still lower estimated stability coefficient intercepts of .84, .70, and .76, respectively. Aptitude K, which is measured with a pencil-and-paper subtest, has an initial stability coefficient that is in the same range as those of the GATB perceptual aptitudes. The two psychomotor aptitudes that are measured with subtests requiring the manipulation of objects, F and M, have somewhat lower average initial stability coefficients. It also appears to be the case that the stabilities of these two psychomotor aptitudes degrade at a faster rate as a function of time interval than is true of GATB aptitudes that are measured with pencil-and-paper subtests.

Data on coefficients of stability of the GATB aptitudes were summarized by computing simple linear regressions of stability coefficients as a function of the time interval between the initial administration and the second administration of the GATB. A scatter diagram that illustrates this relationship for the GATB-G aptitude is shown in Figure 4-1. From the data in the figure we can conclude that stability coefficients for the GATB-G are large (approximating .91, on average) for very small time intervals and degrade slowly as the time interval is increased. Similar patterns were observed for the other GATB aptitudes, although initial values of the stability coefficient were smaller for the spatial aptitudes than for the cognitive aptitudes, and smaller still for the psychomotor aptitudes.
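
The regression summaries reported in Table A-4 (an initial value and a decrease per 100-day interval for each aptitude) imply a simple straight-line predictor of the stability coefficient. The Python sketch below assumes exactly that linear form and nothing more. The function name is hypothetical, and because the linear fit accounts for little of the variance for K, F, and M, predictions for those aptitudes are rough at best.

```python
# (initial value, decrease per 100-day interval) as listed in Table A-4
TABLE_A4 = {
    "G": (0.9089, 0.0099),
    "V": (0.8936, 0.0094),
    "N": (0.8943, 0.0099),
    "S": (0.8323, 0.0082),
    "P": (0.8074, 0.0153),
    "Q": (0.8108, 0.0106),
    "K": (0.8390, 0.0088),
    "F": (0.6994, 0.0069),
    "M": (0.7572, 0.0068),
}


def predicted_stability(aptitude, days):
    """Straight-line estimate of the stability coefficient after `days`.

    Hypothetical helper: applies the Table A-4 intercept and per-100-day
    decrease, and should not be extrapolated beyond the intervals studied
    (one day to four years).
    """
    initial, drop_per_100_days = TABLE_A4[aptitude]
    return initial - drop_per_100_days * (days / 100.0)


# Estimated stability one year after the initial administration.
for aptitude in ("G", "P", "M"):
    print(aptitude, round(predicted_stability(aptitude, 365), 3))
```

With these values the estimate for G after one year is about .87, compared with its initial value of about .91.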
These data are summarized in Table A-4, which contains initial values of stability coefficients, the degradation of stability coefficients (amount by which they decrease from initial values) for a 100-day interval between the initial administration of the GATB and the second administration, and the proportion of variance in GATB stability coefficients that is explained by a linear regression on the time interval between the initial and the second GATB administration. The relationship is well explained by a linear relationship for the cognitive and spatial aptitudes, but not for the psychomotor aptitudes.

TABLE A-4 Initial Values and Degradation of GATB Stability Coefficients, and Proportion of Variance Explained by Linear Relationship, by GATB Aptitude

                           Degradation of
          Initial Value    Stability Coefficient   Proportion
GATB      of Stability     (100-day                of Variance
Aptitude  Coefficient      interval)               Explained
G         .9089            .0099                   .58
V         .8936            .0094                   .64
N         .8943            .0099                   .56
S         .8323            .0082                   .42
P         .8074            .0153                   .49
Q         .8108            .0106                   .44
K         .8390            .0088                   .11
F         .6994            .0069                   .19
M         .7572            .0068                   .24

Equivalent Forms Reliability

Stability coefficients that are based on two administrations of the same test form such as those discussed above are subject to spurious inflation because examinees might remember, and merely duplicate, their initial responses to test items. To avoid this problem, test reliability is sometimes estimated by correlating examinees' scores on two different forms of a test. The forms are designed to be psychometrically parallel; that is, equivalent in format, in length, and in the distribution of difficulties and correlations of their items.

Parallel test forms are sometimes administered at the same time and are sometimes administered with an intervening time interval. In the former case, correlations between examinees' scores are most sensitive to differences in their performances on the two samples of items that compose the parallel forms. Such estimates of reliability are called "coefficients of equivalence." In the latter case, the temporal instability of examinees' performances also attenuates correlations. These estimates of reliability are called "test-retest parallel-forms estimates" and reflect temporal stability as well as equivalence of performance across parallel test forms.

Four operational forms of the GATB, labeled A, B, C, and D, are currently in use. Forms A and B have been used since 1947 and are now restricted to retesting of initially tested examinees and other low-incidence uses. Forms C and D were normed in 1983 and are currently the primary operational forms of the GATB. Many of the estimates of temporal stability discussed above were based on the administration of parallel forms of the GATB. Results of these studies, as well as other studies of the equivalence of alternate forms of the GATB, are summarized in Table A-5.
The coefficients of equivalence reported in the table are similar in pattern to the coefficients of stability described earlier. Form A and B coefficients tend to be a bit larger than Form A and C coefficients or Form A and D coefficients. Although conclusions must be tentative because the

TABLE A-5 [Summary of coefficients of equivalence for alternate forms of the GATB; the table entries are not legible in this text.]

number of studies is small, coefficients of equivalence between Forms C and D appear to be of about the same magnitude as Form A and B coefficients. As was true of coefficients of stability for the GATB aptitudes, the cognitive aptitudes (G, V, and N) appear to be the most reliable. In addition, the coefficients of equivalence of the psychomotor aptitudes that are assessed by tests that require manipulation of objects (F and M) tend to be smallest. With the exception of these latter two aptitudes, scores on the GATB aptitudes appear to be assessed with acceptably large interform equivalence and stability for time intervals of one year or less.

It is also possible to estimate the reliability of a test using data collected in a single test administration. Such reliability estimates are called "internal consistency coefficients" since they are effectively correlations of two or more subtests with each other. Internal consistency estimates are not appropriate for assessing the reliability of speeded tests because the consistency actually measured is the consistency of the speed of response, not the consistency of correct answers or of ability.

No estimates of the internal consistency of the GATB subtests or aptitudes are provided in the GATB Manual (U.S. Department of Labor, 1970) or in more recent literature that was reviewed in preparing this appendix. Since all subtests of the GATB are administered under conditions that impose severe time limits on examinees, estimates of the internal consistency of the GATB subtests are likely to be spuriously high (Anastasi and Drake, 1954; Cronbach and Warrington, 1951; Gulliksen, 1950a,b; Helmstadter and Ortmeyer, 1953; Lord and Novick, 1968; Mollenkopf, 1960; Morrison, 1960; Rindler, 1979; Stafford, 1971; Wesman, 1960). It is therefore appropriate that Employment Service publications on the GATB and other psychometric literature be devoid of estimates of internal consistency for the GATB.
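To make the notion of an internal consistency coefficient concrete, the sketch below computes coefficient alpha, one widely used estimate of this kind, from a single administration. It is an illustration only: the function names and the item responses are hypothetical, and the response pattern is deliberately constructed to resemble a speeded test, in which each examinee answers items correctly in order until time runs out.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of examinee rows, one score per item."""
    k = len(item_scores[0])  # number of items
    item_vars = [sample_variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = sample_variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 responses of five examinees to four items. Each examinee
# gets every item right up to the point reached, then nothing afterward, as
# would happen if the only limit on performance were working speed.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

In this constructed example the high coefficient reflects nothing more than how far each examinee got before stopping, which is the sense in which internal consistency estimates for speeded tests can be spuriously high.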
EQUATING ALTERNATE FORMS OF THE GATB

The Standards describe the goal for equated forms of a test (American Educational Research Association et al., 1985:31):

Ideally, alternate forms of a test are interchangeable in use. That is, it should be a matter of indifference to anyone taking the test or to anyone using the test results whether form A or form B of the test was used. Of course, such an ideal cannot be attained fully in practice. Even minor variations in content from one form to the next can prevent the forms from being interchangeable since one form may favor individuals with particular strengths, whereas the other form may favor those with slightly different strengths.

Although considerable care may be taken to make two forms of a test as similar as possible in terms of content and format, the forms cannot be

expected to be precisely equal in difficulty. Consequently, the use of simple number-right scores without regard to form is generally inappropriate because such scores would place the people taking the more difficult of the two forms at a disadvantage. To take the unintended differences in difficulty into account, it is usually necessary to convert the scores of one form to the units of the other, a process called test equating (American Educational Research Association et al., 1985:31).

There are a number of data collection designs and analytical techniques that may be used to equate forms of a test. Detailed descriptions of the various approaches can be found in Angoff (1971) and in Petersen et al. (1989). Regardless of the approach, however, there are two major issues that need to be considered in judging the adequacy of the equating: (1) the degree to which the forms measure the same characteristic or construct and (2) the magnitude of the errors in the equating due to the procedure and sampling.

Comparability of Constructs Measured

Although the analytical techniques used in equating could be applied to scores from any pair of tests, it makes sense to consider scores to be interchangeable only when forms measure essentially the same characteristics. For example, equating techniques could be used to convert scores on an achievement test in, say, biology such that the distribution of converted biology scores had the same mean and standard deviation as scores on a physics achievement test for some large sample of test takers. Clearly, however, it would not be a matter of indifference to most test takers whether they were administered the biology test or the physics test. Those with greater strengths in biology would prefer the biology test, whereas the converse would be true for those with greater strengths in physics.
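A conversion that gives the converted scores the same mean and standard deviation as scores on the target form, as in the biology and physics example above, is usually called linear equating. The sketch below illustrates the arithmetic only; the function name `linear_equate` and the two small score samples are hypothetical, and nothing here is drawn from the GATB equating studies.

```python
from statistics import mean, stdev


def linear_equate(x, scores_x, scores_y):
    """Convert raw score x on form X to the scale of form Y.

    The conversion matches standardized scores on the two forms:
        (y - mean_y) / sd_y = (x - mean_x) / sd_x
    """
    mean_x, mean_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = stdev(scores_x), stdev(scores_y)
    return mean_y + (sd_y / sd_x) * (x - mean_x)


# Hypothetical raw scores from equivalent groups taking each form
form_x_scores = [10, 12, 14, 16, 18]   # mean 14
form_y_scores = [20, 25, 30, 35, 40]   # mean 30, spread 2.5 times as large

converted = linear_equate(16, form_x_scores, form_y_scores)
print(converted)
```

As the surrounding text stresses, the arithmetic can be applied to any pair of tests; whether the converted scores may be used interchangeably depends on whether the forms measure the same characteristics.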
In practice, of course, the differences between forms that are to be equated are not so obvious or extreme as the difference between a biology and a physics test. However, subtle differences in test content or the format of the test items can sometimes lead to important shifts in what the test forms measure and therefore to conversions that can give an unintended advantage or disadvantage to particular test takers.

If alternate forms of a test measure the same characteristics, one would expect scores on the two forms to be highly correlated. In the case of a battery of tests, such as the GATB, one would also expect the pattern of correlations among the parts of the tests to be similar for different forms. Evidence regarding both of these issues is provided in an addendum to the 1970 Manual for USES General Aptitude Test Battery, Section III: Development, titled Reliability and Comparability: Forms C and D (U.S. Department of Labor, 1986).

Pairs of three forms of Subtests 1 through 8 of the GATB were administered to a sample of 3,344 people in 20 participating states. A total of six subsamples ranging in size from 545 to 567 were administered pairs of Forms A, C, and D in counterbalanced order. That is, one subsample (AC) was administered Form A first and then Form C a week later; a second subsample (CA) was administered Form C first and Form A a week later. The remaining four subsamples (AD, DA, CD, and DC) were defined in an analogous fashion.

The correlations between the scores on the alternate forms for each of Subtests 1 through 8 and for the seven aptitude scores that are based on the first eight subtests of the GATB provide estimates of the alternate-form reliabilities of the subtests and of the aptitude scores. The alternate-form reliabilities of the first three aptitudes (intelligence, verbal, and numerical) are close to .90, a level that is consistent with alternate-form reliabilities of aptitude tests of high technical quality that are provided by major test publishers. The reliabilities of the remaining four aptitudes (spatial, form perception, clerical, and motor coordination) are lower, close to .80. The alternate-form reliabilities of the subtests are in the .80s for Subtests 1 through 4 (name comparison, computation, three-dimensional space, and vocabulary) and the high .70s to .80 for Subtests 5, 6, and 8 (tool matching, arithmetic reasoning, and mark making). The lowest reliabilities were obtained for Subtest 7, form matching, where two of the correlations, both involving the relationship between Forms A and C, fall below .70.

In addition to providing alternate-form reliabilities for the first eight subtests of the GATB, the addendum to the Manual (U.S. Department of Labor, 1986) also reports matrices of intercorrelations among the eight scores for each of the three forms involved in the reliability and comparability study.
Correlations for each pair of subtests show that the three forms of the GATB are virtually indistinguishable in terms of the pattern of correlations among the subtests. The similarity in the patterns of intercorrelations, coupled with the reasonably high alternate-form reliabilities, suggests that, with the possible exception of Subtest 7, form matching, the forms are measuring sufficiently similar characteristics to equate the forms and use the equated scores interchangeably.

Much less evidence is provided regarding the comparability of Subtests 9 through 12 of the GATB than was provided for Subtests 1 through 8. The evidence that is available is based on smaller samples of test takers and provides less support for concluding that the subtest scores can be treated as being interchangeable after equating. Subtests 9 through 12 of Form A were administered to 273 test takers followed by Form C one week later. Another sample of 260 test takers was administered Forms A and D in the same pattern (Table A-6). The

TABLE A-6 Alternate-Form Reliabilities of Subtests 9 Through 12 of the GATB and of the Aptitudes Based on Those Subtests, Based on Samples of Sizes 266 and 273

                          Reliabilities
                          Forms A,C    Forms A,D
Subtest
  9. Place                .62          .72
  10. Turn                .52          .56
  11. Assemble            .45          .60
  12. Disassemble         .54          .70
Aptitude
  Manual Dexterity        .65          .74
  Finger Dexterity        .57          .71

alternate-form reliabilities for both the subtest scores and for the two aptitudes that are based on Subtests 9 through 12 are lower than were obtained for Subtests 1 through 8. The correlations of the Form C with Form A scores are particularly low.

In addition to having low alternate-form reliabilities, the degree to which the alternate forms measure the same skills is brought into question by the differences in the patterns of correlations among the subtests. The intercorrelations for each pair of subtest scores are listed in Table A-7. Two correlations for each pair of subtests are shown for Form A. The first is based on the subsample that also took Form C; the second correlation is based on the subsample that also took Form D. These correlations are subject to greater sampling error than the ones that were summarized above for Subtests 1 through 8 because of their smaller sample sizes. In addition, there are no replications for Forms C and D. Nonetheless, the pattern of correlations for Form C seems to differ from that of Form D. Indeed, considering the Form C versus Form D

TABLE A-7 Correlations Between Subtest Scores by Form

                  Correlations
Subtest Pairs     Form A           Form C    Form D
9 with 10         .66 and .62      .73       .53
9 with 11         .40 and .42      .50       .32
9 with 12         .50 and .52      .44       .49
10 with 11        .44 and .40      .50       .27
10 with 12        .56 and .56      .50       .35
11 with 12        .57 and .46      .50       .49

correlation between subtests, one pair of subtests at a time, four of the six pairs of correlations are significantly different at the .05 level.

In summary, the alternate-form reliabilities and patterns of correlations do not provide sufficient evidence that Subtests 9 through 12 of Forms A, C, and D measure essentially the same characteristics. Hence, the scores from different forms for these subtests should not be considered to be interchangeable.

Equating Procedure and Sampling Errors

The first GATB equating was to place B-1002 (Form A and Form B) on the scale developed for B-1001. To equate Form A, four groups of high school students were administered the old and one of the two new forms, in counterbalanced order. The representativeness of these groups is questionable. One group had to be dropped because test scores were not comparable; the remaining total sample size was 585. The conversion of Form B was similar, but was based on 412 high school students. These are very small samples on which to base equating. The equating procedures, while not specified, appeared to be linear.

The information that is available in the 1984 and 1986 addenda to the Manual (U.S. Department of Labor, 1984, 1986) is inadequate for a complete evaluation of the equating of Forms C and D. Thus, the comments on this aspect of the equating are brief and inconclusive.

The design used to investigate the reliability and comparability of Subtests 1 through 8 of Forms A, C, and D is a standard equating design. It is referred to as "Design II: Random groups, both tests administered to each group, counterbalanced" by Angoff (1971). This design provides equating results that have much greater precision, i.e., smaller standard errors of equating, than the more commonly used designs in which only one of the forms is administered to each group. Given this design, the sample sizes were consistent with accepted practice for test equating.
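Returning briefly to the comparison of Form C and Form D correlations in Table A-7: the text does not say which significance test was applied. One common choice for comparing correlations from independent samples is Fisher's r-to-z transformation, and the sketch below applies it under stated assumptions. The sample sizes (273 for Form C, 260 for Form D) are taken from the description of the two samples and are assumed to apply to every correlation; the function name is hypothetical.

```python
from math import atanh, erfc, sqrt


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test of equal population correlations in independent samples."""
    # Difference of Fisher-transformed correlations over its standard error
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    # Two-sided p-value from the standard normal distribution
    p = erfc(abs(z) / sqrt(2))
    return z, p


# Table A-7: (subtest pair, Form C correlation, Form D correlation)
pairs = [
    ("9 with 10", 0.73, 0.53),
    ("9 with 11", 0.50, 0.32),
    ("9 with 12", 0.44, 0.49),
    ("10 with 11", 0.50, 0.27),
    ("10 with 12", 0.50, 0.35),
    ("11 with 12", 0.50, 0.49),
]
N_FORM_C, N_FORM_D = 273, 260  # assumed sample sizes, from the text

results = {label: fisher_z_test(rc, N_FORM_C, rd, N_FORM_D) for label, rc, rd in pairs}
significant = [label for label, (z, p) in results.items() if p < 0.05]
print(significant)
```

Under these assumptions the test flags four of the six pairs (all except 9 with 12 and 11 with 12), which agrees with the count reported above, although that agreement does not establish that this was the procedure actually used.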
Details of the equating of Subtests 1 through 8 of Forms C and D to Form A are not reported. The data from the reliability and comparability study could have been used with one of the analytical procedures described by Angoff (1971) for Design II. However, information about the specific procedure that was used is not provided. No evidence regarding the adequacy of the equating (e.g., comparisons of linear and equipercentile equating results or estimates of standard errors of equating at different score levels) is presented. Hence, an independent evaluation is not possible.

For Subtests 9 through 12, reservations about equating have already been expressed because of questions about the comparability of the characteristics measured by the different forms. The order of administration was not counterbalanced in the design that was used to investigate

the alternate-form reliability and comparability of Subtests 9 through 12; consequently, a separate standardization sample was obtained for purposes of equating.

The standardization sample (for Subtests 9 through 12) consisted of a total of 2,092 persons from 11 states. Form C was administered to 981 people and Form D to 1,111 people in the standardization sample. Although details of assignment to the two subsamples are not presented, they were apparently considered to be random samples from the same population as the Form A standardization sample to which Forms C and D were equated. Random assignment is a critical part of Angoff's (1971:569) "Design I: Random groups, one test administered to each group." Hence, it seems important to understand the basis for considering the Form C and D standardization samples to be randomly equivalent to the Form A standardization sample.

The 1986 addendum to the Manual indicates that both linear and equipercentile equating procedures were obtained for the equating of Subtests 9 through 12, but because the two procedures produced very similar results it was decided that the linear equating would be used (U.S. Department of Labor, 1986). Such a decision is consistent with accepted practice. However, no information comparing the results of the two procedures is provided. Hence it is not possible to provide an independent evaluation of this decision.

REFERENCES

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education
    1985 Standards for Educational and Psychological Testing. Washington, D.C.: American Psychological Association.
Anastasi, A., and R. Drake
    1954 An empirical comparison of certain techniques for estimating the reliability of speeded tests. Educational and Psychological Measurement 14:529-540.
Angoff, W.H.
    1971 Scales, norms, and equivalent scores. In R.L. Thorndike, ed., Educational Measurement, 2d ed. Washington, D.C.: American Council on Education.
Briscoe, C.D., W. Muelder, and W. Michael
    1981 Concurrent validity of self-estimates of abilities relative to criteria provided by standardized test measures of the same abilities for a sample of high school students eligible for participation in the CETA program. Educational and Psychological Measurement 41:1285-1294.
Cassel, R.N., and G.W. Reier
    1971 Comparative analysis of concurrent and predictive validity for the GATB Clerical Aptitude Test Battery. Journal of Psychology 79:135-140.
Cooley, W.W.
    1965 Further relationships with the TALENT battery. Personnel and Guidance Journal 44:295-303.

Cronbach, Lee J., and W.G. Warrington
    1951 Time limit tests: estimating their reliability and degree of speeding. Psychometrika 14:167-188.
Dong, H., Y. Sung, and S. Goldman
    1986 The validity of the Ball Aptitude Test Battery (BAB) III: relationship to the CAB, DAT, and GATB. Educational and Psychological Measurement 46:245-250.
Gulliksen, H.A.
    1950a The reliability of speeded tests. Psychometrika 15:259-260.
    1950b Theory of Mental Tests. New York: Wiley.
Hakstian, A.R., and R. Bennett
    1978 Validity studies using the Comprehensive Ability Battery (CAB) II: Relationship with the DAT and GATB. Educational and Psychological Measurement 38:1003-1015.
Helmstadter, G.C., and D.H. Ortmeyer
    1953 Some techniques for determining the relative magnitude of speed and power components of a test. Educational and Psychological Measurement 12:280-287.
Howe, M.A.
    1975 General Aptitude Test Battery Q: an Australian empirical study. Australian Psychologist 10:32-44.
Kettner, N.
    1976 Armed Services Vocational Aptitude Battery (ASVAB Form 5): Comparison with GATB and DAT Tests. Final report, May 1975-October 1976. Armed Services Human Resources Laboratory, Brooks Air Force Base, Texas.
Kish, G.B.
    1970 Alcoholics' GATB and Shipley profiles and their interrelationships. Journal of Clinical Psychology 26:482-484.
Knapp, R., L. Knapp, and W. Michael
    1977 Stability and concurrent validity of the Career Ability Placement Survey (CAPS) against the DAT and the GATB. Educational and Psychological Measurement 37:1081-1085.
Lord, F.M., and Melvin R. Novick
    1968 Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley.
Mollenkopf, W.G.
    1960 Time limits and the behavior of test takers. Educational and Psychological Measurement 20:223-230.
Moore, R., and J. Davies
    1984 Predicting GED scores on the basis of expectancy, valence, intelligence, and pretest skill levels with the disadvantaged. Educational and Psychological Measurement 44:483-489.
Morrison, E.J.
    1960 On test variance and the dimensions of the measurement situation. Educational and Psychological Measurement 20:231-250.
O'Malley, P., and J. Bachman
    1976 Longitudinal evidence for the validity of the Quick Test. Psychological Reports 38:1247-1252.
Petersen, Nancy S., Michael J. Kolen, and H.D. Hoover
    1989 Scaling, norming, and equating. Chap. 6 in Robert L. Linn, ed., Educational Measurement, 3d ed. New York: Macmillan.
Rindler, S.E.
    1979 Pitfalls in assessing test speededness. Journal of Educational Measurement 16:261-270.

Sakalosky, J.C.
    1970 A Study of the Relationship Between the Differential Aptitude Test Battery and the General Aptitude Test Battery Scores of Ninth Graders. Master's thesis, Millersville State College.
Senior, N.
    1952 An Analysis of the Effect of Four Years of College Training on General Aptitude Test Battery Scores. Unpublished master's thesis, University of Utah, Provo.
Showler, W.K., and R.C. Droege
    1969 Stability of aptitude scores for adults. Educational and Psychological Measurement 29:681-686.
Stafford, R.E.
    1971 The speededness quotient: A new descriptive statistic for tests. Journal of Educational Measurement 8:275-278.
U.S. Department of Labor
    1970 Manual for the USTES General Aptitude Test Battery. Section III: Development. Washington, D.C.: Manpower Administration, U.S. Department of Labor.
    1984 Forms C and D of the General Aptitude Test Battery: An Historical Review of Development. Division of Counseling and Test Development, Employment and Training Administration, U.S. Department of Labor, Washington, D.C.
    1986 Reliability and Comparability: Forms C and D. Addendum to Manual for the USTES General Aptitude Test Battery. Section III: Development. U.S. Employment Service, Employment and Training Administration. Washington, D.C.: U.S. Department of Labor.
Wesman, A.G.
    1960 Some effects of speed in test use. Educational and Psychological Measurement 20:267-274.
