8
GATB Validities
In Chapter 6 we described validity generalization as the prediction of
validities of a test for new jobs, based on meta-analysis of the validities of
the test on studied jobs. This chapter focuses on establishing the
predicted validities of the General Aptitude Test Battery (GATB) for new
jobs. The first step involves compiling the existing validity data, and the
second is a matter of estimating the "true" validity of the test by
correcting the observed validities to account for various kinds of weak-
nesses in existing research (e.g., small sample sizes). As part of its study
of validity generalization for the GATB, the committee has conducted
independent analyses of the existing GATB validity studies. The initial
sections of the chapter compare the results of these analyses with the
work done by John Hunter for the U.S. Employment Service (USES)
based on a smaller and older set of studies (U.S. Department of Labor,
1983b,c,d). In addition, drawing on the discussion of corrections pre-
sented in Chapter 6, the second half of the chapter presents the commit-
tee's estimate of the generalizable validities of the GATB for the kinds of
jobs handled by the Employment Service, an estimate that is rather more
modest than that proposed in the U.S. Employment Service technical
reports.
THE GATB VALIDITY STUDIES
Two sets of GATB validity studies are discussed in this chapter. The
first is comprised of the original 515 validity studies analyzed by Hunter;
they were prepared in the period 1945-1980 with 10 percent 1940s data, 40
percent 1950s data, 40 percent 1960s data, and 10 percent 1970s data. A
larger data tape consisting of 755 studies was made available to the
committee by USES. It included these and an additional set of 264 studies
carried out in the 1970s and 1980s. (The Hunter studies appear as 491
studies in this data tape, because some pairs of studies in the original 515
consisted of validity coefficients for the same set of workers using two
different criteria for job performance; these pairs each appear in a single
study on the data tape.) The original samples from the 515 studies
summed to 38,620 workers, and the samples from the more recent 264
studies summed to 38,521 workers.
Written reports are available for the earlier 515 studies but not for the
more recent 264. It is therefore possible to examine the earlier studies in
some detail to determine their quality and comparability. An examination
of 50 of the written reports selected at random showed very good
agreement between the numbers in the report and the numbers coded into
the data set. It is regrettable that no such reports are available for the
more recent studies, since it leaves no good way to consider the
characteristics of the samples that might explain the very different results
of analysis for the two data sets.
An Illustration of Test Validities
Criterion-related validity is expressed as the product moment correla-
tion between test score and a measure of job performance for a sample of
workers. The degree of correlation is expressed as a coefficient that can
range from -1.0, representing a perfect inverse relationship, to +1.0,
representing a perfect positive relationship. A value of 0.0 indicates that
there is no relationship between the predictor (the GATB) and the
criterion. Figure 8-1 depicts this range of correlations with scatter
diagrams showing the degree of linear relationship. In test validation
research, the relationships between the test score and the performance
measure are usually positive, if not necessarily strong. The lower the
correlation, the less appropriate it is to make fine distinctions among test
scores.
Basic Patterns of Validity Findings
The most striking finding from our analysis of the entire group of 755
validity studies is a distinct diminution of validities in the newer,
post-1972 set. For all three composites, the 264 newer studies show lower
mean validities, the decline being most striking for the perceptual and
psychomotor composites (Table 8-1). (That the standard deviations are
Value of r    Description of linear relationship
+1.00         Perfect, direct relationship
About +.50    Moderate, direct relationship
.00           No relationship (i.e., 0 covariation of X with Y)
About -.50    Moderate, inverse relationship
-1.00         Perfect, inverse relationship
[Scatter diagrams of Y against X, one for each value of r]
FIGURE 8-1 Interpretation of values of correlation (r).
TABLE 8-1 Mean and Standard Deviation of the Validity
Coefficients, Weighted by Study Sample Sizes, Computed for Each
Composite Across the 264 Studies, and Compared with Those Hunter
Reported for the Original 515 Studies
                      GVN           SPQ           KFM
                      515    264    515    264    515    264
Mean                  .25    .21    .25    .17    .25    .13
Standard deviation    .15    .11    .15    .11    .17    .12
TABLE 8-2 Frequency Distribution of Validity Coefficients for Each
GATB Composite Over All 755 Studies
                  Percentage of Studies
Validity
Category          GVN     SPQ     KFM
-.40 to -.49      0.1
-.30 to -.39      0.1
-.20 to -.29      0.1     0.1     0.3
-.10 to -.19      1.0     1.1     1.2
.00 to -.09       3.7     4.1     6.6
.01 to .10       12.4    14.7    16.8
.11 to .20       24.7    22.9    24.0
.21 to .30       27.4    25.1    26.5
.31 to .40       18.1    18.7    12.7
.41 to .50        7.8     8.2     6.4
.51 to .60        3.9     2.5     3.6
.61 to .70        0.8     0.5     1.2
.71 to .80        0.1     0.1
NOTE: In rows with fewer than three entries, the values are listed in the order given in the source text, which does not preserve their column positions.
also lower is readily explainable: the 264 additional studies have a much
larger average sample size (146, as opposed to about 75 in the original
set), resulting in less sampling error.)
To give a better sense of the validity data than is provided by means and
variances, a frequency distribution of validity coefficients for each
composite, over all 755 studies, is shown in Table 8-2. The values
presented are the percentage of studies falling into each validity category.
Clearly, the range of observed validity coefficients is large. The
question before us is to understand the meaning of this variability. In the
next section we examine the effect of factors that might cause variation in
the observed validity coefficients.
Potential Moderators of Validity
A number of study characteristics can be hypothesized as potentially
affecting validity (and, therefore, contributing to the observed variability
across studies). In our analysis of the 755 GATB validity studies, we
looked at 10 characteristics:
1. sample size
2. job family
3. study type: predictive (i.e., tested at time of hire) versus concur-
rent (testing of current employees)
4. criterion type: performance on the job versus performance in
training
5. age: mean age of individuals in the sample
6. experience: mean experience of individuals in the sample
7. education: mean education of individuals in the sample
8. race
9. sex
10. date of study
Each of these characteristics is discussed in turn.
Sample Size
Sampling error appears to be the single factor with the largest
influence on variance in validity from study to study: removing the
influence of sampling error is a major component of any validity
generalization analysis. To get an intuitive feel for the effects of
sampling error, GVN validities were examined separately for the entire
sample, for samples with more than 100 subjects, for samples with more
than 200 subjects, and for samples with more than 300 subjects. As N,
the number of subjects, increases, random sampling error decreases.
Thus we should see much less variation with large samples than with
small samples. The distribution of GVN validity is presented in Table
8-3. It can clearly be seen that there is much more variation with small
samples; as the mean N increases, validity values center much more
closely on the mean.
TABLE 8-3 Percentage of Studies in Each Validity Category, Based
on All 755 Studies
                  Percentage of Studies
Validity          All Studies   >100        >200       >300
Category          (N = 755)     (N = 192)   (N = 81)   (N = 33)
-.20 to -.29       0.1
-.10 to -.19       1.0
.00 to -.09        3.7           2.6         1.2
.01 to .10        12.4          10.9         8.7        6.1
.11 to .20        24.7          33.4        38.2       33.3
.21 to .30        27.4          33.3        40.8       51.5
.31 to .40        18.1          15.6         7.4        3.0
.41 to .50         7.8           4.2         3.7        6.0
.51 to .60         3.9
.61 to .70         0.8
.71 to .80         0.1
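The narrowing spread in Table 8-3 reflects how sampling error shrinks as N grows. As a rough illustration, the sketch below uses a common large-sample approximation for the standard deviation of an observed correlation, (1 - r^2) / sqrt(N - 1). The chapter does not state which formula the committee used, so the formula, the function name `sampling_sd`, and the assumed true validity of .25 are illustrative assumptions only.

```python
import math


def sampling_sd(rho: float, n: int) -> float:
    """Approximate SD of an observed correlation due to sampling error alone.

    Uses the large-sample approximation (1 - rho^2) / sqrt(n - 1). This is an
    assumption made for illustration, not a formula quoted from the chapter.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return (1.0 - rho ** 2) / math.sqrt(n - 1)


# Hypothetical true validity; the sample sizes echo the averages and
# cutoffs mentioned in the text (75, 146, and the 100/200/300 thresholds).
RHO = 0.25
for n in (75, 100, 146, 200, 300):
    print(f"N = {n:3d}: expected sampling SD of r is about {sampling_sd(RHO, n):.3f}")
```

Under this approximation the expected spread falls roughly in proportion to 1 / sqrt(N), which is the pattern the table shows qualitatively.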
TABLE 8-4 Variation of Validities Across Job Families in Old (515)
and Recent (264) Studies
                                GVN           SPQ           KFM
Job Family                      515    264    515    264    515    264
I (set-up/precision)            .34    .16    .35    .14    .19    .08
II (feeding/offbearing)         .13    .19    .15    .16    .35    .21
III (synthesizing)              .30    .27    .21    .21    .13    .12
IV (analyze/compile/compute)    .28    .23    .27    .17    .24    .13
V (copy/compare)                .22    .18    .24    .18    .30    .16
Job Family
In both the original 515 studies and the recent 264 studies, validity
clearly varies across job families. The mean observed validities (for both
the data used by Hunter and the full data set) are presented in Table 8-4
for each of the three test composites.
A notable difference between the old and new studies is in the
diminution of the KFM validities in Job Families IV and V.
Study Type: Predictive Versus Concurrent
Some studies are done using job applicants (predictive validation
strategy), whereas others involve the testing of current employees (con-
current validation strategy). Some have argued for the superiority of the
predictive strategy, based on the assumption that the full range of
applicants will be included in the study and thus that range restriction will
be reduced. This argument presumes that a very rare version of the
predictive validation strategy is used, namely that all applicants are hired
regardless of test score. More realistically, applicants are screened using
either the GATB itself or some other predictor, and thus range restriction
is likely in both predictive and concurrent studies. This point has been
made in the testing literature; in 1968 the GATB data base was examined
by Bemis (1968) and no differences in validity for predictive and concur-
rent studies were found. A comparison of predictive and concurrent
studies was not reported for the original 515 studies.
No consistent difference in validities was found in the present study, as
Table 8-5 shows. For some composite/family combinations, validity is
higher for the predictive studies; for others validity is higher for the
concurrent studies. The predictive/concurrent distinction is too crude to be
of real value: for example, we do not know whether the GATB was or was
not used as the basis for hiring in any or all of the studies labeled
"predictive." Thus study type will not be tested further in this report;
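TABLE 8-5 Variation of Validities by Study Type and Job Family,
for All 755 Studies
          GVN                       SPQ                       KFM
Job
Family    Predictive  Concurrent    Predictive  Concurrent    Predictive  Concurrent
I         .14         .21           .10         .19           .00         .11
II        .15*                      .17*                      .33*
III       .30         .29           .24         .20           .12         .17
IV        .29         .24           .26         .19           .21         .15
V         .20         .20           .26         .26           .27         .22
* Only one value per composite is given for Job Family II; the source text does not preserve whether it belongs to the predictive or the concurrent column.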
variation in validity due to study type will remain one unaccounted-for
source of variance.
Criterion Type: On-the-Job Performance Versus Training Success
It has frequently been reported in the personnel testing literature that
higher validity coefficients are obtained for ability tests when training
success rather than job performance is used as the criterion. This makes
conceptual sense, as there are probably fewer external factors influencing
training success than job performance (e.g., job performance typically
covers a longer time period and is probably more heavily influenced by
supervision, work-group norms, variation in equipment, family problems,
and so on). But it could also be a product of measurement technology:
since training success is usually measured with a paper-and-pencil test,
the similarity of measurement methods might artificially boost the
correlation. Hunter reports substantially larger mean validities for GATB
studies using training success. A summary based on the full data set is
presented in Table 8-6.
Given the magnitude of these differences, the data set is broken down
by both job family and criterion type for validity generalization analyses.
TABLE 8-6 Validities for Training Success and Supervisor Ratings,
by Job Family, for All 755 Studies
          GVN                       SPQ                       KFM
Job
Family    Performance  Training     Performance  Training     Performance  Training
I         .19          .45          .18          .45          .11          .12
II        .15          -            .17          -            .33          -
III       .29          .30          .21          .20          .17          .10
IV        .23          .35          .19          .27          .16          .19
V         .20          .31          .21          .33          .22          .30
Age
Ideally, the effect of age would be examined by computing validity
coefficients separately for individuals in different age categories. How-
ever, the present data base reports validity coefficients for entire samples
and does not report findings by age. What is reported is the mean age for
each sample. Thus we can determine whether validity varies by the mean
age of the sample.
For the 755 studies, the mean "mean age" is 31.8 years, with a standard
deviation of 6.3 years. Correlations (r) between mean age and test validity
are as follows:
r age/GVN validity = -.15
r age/SPQ validity = -.06
r age/KFM validity = .03
Thus the validity of the cognitive composite (GVN) tends to be somewhat
lower for older workers, though not enough to require special consider-
ation in validity generalization analysis. This finding does not seem to
hold for SPQ and KFM.
Relationships between mean age and mean test score are also worthy of
note:
r age/GVN mean = -.28
r age/SPQ mean = -.45
r age/KFM mean = -.52
Thus studies in which the average age is higher tend to have composite
scores that are notably lower, especially on SPQ and KFM.
Since the age-validity relationship is low, age is not treated as a
moderator in the validity generalization analyses, though the age/mean-
test-score relationship certainly merits consideration in examining the
GATB program as a whole.
Experience
As with age, what is reported in the validity studies is the mean
experience for each sample. In this data base, the mean is 5.5 years, with
a standard deviation of 4 years. Note that what is coded is typically job
experience rather than total work experience. Experience and age are
highly related: the correlation between the two is .58.
Correlations (r) between mean experience and test validity are as follows:
r experience/GVN validity = .03
r experience/SPQ validity = .00
r experience/KFM validity = -.16
The pattern is mixed, with less experienced samples producing higher
KFM validities. This parallels the relationship between experience and
test score means:
r experience/GVN mean = .00
r experience/SPQ mean = -.07
r experience/KFM mean = -.32
Less-experienced samples score higher on KFM; in all likelihood this is
age-related.
Experience is not treated as a moderator in the validity generalization
analyses.
Education
The mean years of education across the 755 samples is 11.4 years, with
a standard deviation of 1.5 years. The pattern of correlations between
mean education and test validity is as follows:
r education/GVN validity = .15
r education/SPQ validity = -.10
r education/KFM validity= -.36
Thus GVN validity tends to be higher for more-educated samples, and
KFM validity higher for less-educated samples. In all likelihood, this
effect is caused by the relationship between job family and validity,
namely, higher GVN validity for more complex jobs (requiring more
education) and higher KFM validity for less complex jobs (requiring less
education).
Validity Differences by Race
Validity differences by race are examined in detail in the following
chapter. Suffice it to say here that most of the GATB validity studies do
not report data by race, but analysis of the 72 studies with at least 50
black and 50 nonminority workers indicates that mean validities for
nonminorities are higher than mean validities for blacks for all three
composites.
Validity Differences by Sex
Many studies (345 of 755) are based on mixed-sex samples. However,
410 studies were done on single-sex samples (226 male, 184 female).
Breaking studies down by job family and criterion type (performance
criterion versus training criterion, the two important moderator variables
identified in the earlier analyses), leaves few categories with enough
studies for meaningful comparisons to be made. Nevertheless, in those
TABLE 8-7 Validities for Job Families by Sex: Comparison of the
Mean Observed Validity Across Studies Split by Job Family, Type of
Criterion Measure, and Sex of Sample (Mixed Samples Not Included in
Analyses)
          Performance                           Training
          Male              Female              Male              Female
Job              Number of         Number of           Number of         Number of
Family    Mean   Studies    Mean   Studies      Mean   Studies    Mean   Studies
GVN
I         .28    21         .49    1            -                 -
II        .23    1          .13    16           -                 -
III       .36    12         .37    2            .32    7          .57    1
IV        .24    98         .23    31           .34    38         .38    14
V         .21    46         .20    118          .35    2          .37    2
SPQ
I         .29    21         .40    1            -                 -
II        .22    1          .15    16           -                 -
III       .24    12         .25    2            .19    7          .50    1
IV        .23    98         .22    31           .29    38         .29    14
V         .25    46         .23    118          .45    2          .40    2
KFM
I         .18    21         -.03   1            -                 -
II        .29    1          .34    16           -                 -
III       .13    12         .19    2            .05    7          .48    1
IV        .19    98         .23    31           .21    38         .22    14
V         .25    46         .30    118          .32    2          .46    2
categories in which comparisons can be made (Job Families IV and V
with a performance criterion and Job Family IV with a training criterion),
the results suggest no effect due to sex. Results are summarized in Table
8-7.
Similar conclusions were reached in a USES test research report,
which analyzed validity differences by sex for 122 validity studies for
which validity could be computed separately for males and females
(U.S. Department of Labor, 1984a). That report concluded that there
are no meaningful differences in GATB validities between males and
females.
Date of Study
The committee was concerned about reliance on very old validity
studies in drawing general conclusions about GATB validity, as the
TABLE 8-8 Correlations of Validities with Year of Study
Job       Number of
Family    Studies      GVN     SPQ     KFM
I          23          .22    -.24     .27
II         17          .12    -.08    -.67
III        50         -.25    -.15     .02
IV        235          .03    -.03    -.07
V         217          .00    -.11    -.33
validity studies had been done over a period of four decades. The date of
the study was not coded on the data tape containing the summaries of 755
validity studies. But virtually all written reports were also made available
to the committee. These contained study dates for more than 400 studies
and validity coefficients for 542 independent samples. The date was
extracted from each study and added to the data tape.
In the subsequent analysis, date was treated as a continuous variable.
Date of study was correlated with GATB composite validity within each
job family (Table 8-8). Study date varied from 1945 to 1979, distributed
about 10 percent in the 1940s, 40 percent in the 1950s, 40 percent in the
1960s, and 10 percent in the 1970s.
These findings may be artifactual: if, for example, there was a change
over time in some study characteristic (e.g., job performance criteria
versus training criteria), the true effects of study date would be hidden.
Thus partial correlations were computed controlling for criterion type (job
performance versus training success) and for study type (predictive
versus concurrent), producing the second-order partial correlations
shown in Table 8-9.
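A second-order partial correlation removes the linear influence of two control variables from the relationship of interest. The sketch below shows one standard way to obtain it, by applying the first-order partial correlation formula recursively. The correlations in `example` are hypothetical and are not taken from the GATB data; the function names are illustrative, and the chapter does not say how the committee carried out the computation.

```python
import math


def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def second_order_partial_r(r: dict) -> float:
    """Partial correlation of x and y, controlling for both z and w.

    `r` maps variable pairs (for example "xy") to zero-order correlations.
    """
    # Step 1: remove w from each pairwise relation among x, y, and z.
    r_xy_w = partial_r(r["xy"], r["xw"], r["yw"])
    r_xz_w = partial_r(r["xz"], r["xw"], r["zw"])
    r_yz_w = partial_r(r["yz"], r["yw"], r["zw"])
    # Step 2: remove z from the x-y relation that is already adjusted for w.
    return partial_r(r_xy_w, r_xz_w, r_yz_w)


# Hypothetical zero-order correlations (illustration only):
# x = study date, y = validity, z = criterion type, w = study type.
example = {"xy": -0.20, "xz": 0.30, "yz": -0.25,
           "xw": 0.10, "yw": 0.05, "zw": 0.00}
print(f"second-order partial r = {second_order_partial_r(example):.3f}")
```

When neither control variable is correlated with date or validity, the partial correlation equals the zero-order correlation, which gives a quick sanity check on the recursion.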
Only Job Families IV and V offer large enough numbers of studies to
merit careful attention. In these two job families, there is no evidence of
change over time in the validity of the GVN composite, but there is
evidence of a significant decrease in SPQ and KFM validity over time.
Note that this analysis is based on studies for which written reports
TABLE 8-9 Correlations of Validity with Time, Adjusting for
Criterion Type and Job Type
Job       Number of
Family    Studies      GVN     SPQ     KFM
I          19          .24    -.31     .27
II         13          .00     .00     .00
III        45         -.37    -.20    -.07
IV        228         -.04    -.18    -.23
V         210          .04    -.08    -.33
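TABLE 8-10 Distribution of Studies Over Job Families
Job       Percentage of Studies    Percentage of Studies
Family    (N = 515)                (N = 264)
I          4
II         4
III       12
IV        40
V         40
NOTE: Only two values, 54 and 31, survive in the source text for the N = 264 column; the rows they belong to are not recoverable.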
Using the performance criterion, validity is lower in the recent studies
for all three composites for all families with meaningful sample sizes (I,
IV, and V). With the training criterion, only Family IV has a large sample
size; GVN validity actually is slightly higher for the new studies, and the
drop in SPQ and KFM validities is smaller than for the performance
criterion for the same family.
TABLE 8-11 Validities for the Two Sets of Studies by Job Family
and Type of Criterion
          Performance                         Training
Job       Hunter            New               Hunter           New
Family    Studies (N)       Studies (N)       Studies (N)      Studies (N)
GVN
I         .31 (1,142)       .15 (3,900)       .41 (180)        .54 (64)
II        .14 (1,155)       .19 (200)         -                -
III       .30 (2,424)       .25 (630)         .27 (1,800)      .30 (347)
IV        .27 (12,705)      .21 (19,206)      .34 (4,183)      .36 (3,169)
V         .20 (13,367)      .18 (10,862)      .36 (655)        .00 (106)
SPQ
I         .32               .13               .47              .40
II        .17               .16               -                -
III       .22               .21               .18              .21
IV        .25               .16               .29              .25
V         .23               .18               .38              .01
KFM
I         .20               .07               .11              .16
II        .35               .21               -                -
III       .17               .17               .11              .02
IV        .21               .12               .20              .17
V         .27               .16               .31              .12
What accounts for this drop in validity? The above analyses have
already dealt with two plausible reasons: change over time in the job
families studied (from families for which validities are higher to families
for which validities are lower) and change over time in the type of criteria
used. Both of these factors have been found to moderate GATB validity.
However, since analyses reported here present results within job families
and within criterion types, both explanations have been ruled out.
Another factor is the role of race. As the next chapter describes in
detail, validity for black samples is lower than validity for white samples.
The more recent studies contain a heavy minority representation, since
many of the studies were undertaken explicitly to build a minority data
base. However, even among the recent studies for which separate
validities were available by race, total white N is larger than the total
black N by a factor of about 5, and the black-white validity difference is
substantially smaller than the difference reported here between the earlier
and the more recent studies. Thus the inclusion of more minority samples
is at best a minor contributor to the validity difference between the earlier
and the more recent studies.
Another possible explanation is that the more recent studies exhibit a
larger degree of range restriction, thus suppressing validity. Analysis
reveals exactly the opposite: the more recent studies show less range
restriction (e.g., a slightly larger GATB composite standard deviation).
Some have advanced the argument that the original data base should be
trusted and the new studies discounted. The reasoning used is that the
new studies were done hurriedly in order to gather data on validity for
black workers. In order to obtain minority samples, two things were
done: first, data from many organizations were pooled to increase
minority sample size and, second, organizations not typical of those
usually studied by USES were used because of access to the minority
samples. The second of these arguments does not seem compelling on its
face. But the hypothesis that pooling across employers could lower
validity seemed plausible, because each employer might have an idiosyn-
cratic performance standard (e.g., an employee whose performance is
"average" in one organization may be "above average" in another). This
would make the criterion less reliable, and thus lower validity.
However, the hypothesis was not borne out when tested empirically.
The data tape, containing raw data for 174 studies, included an employer
code. Validities were computed two ways: first, pooling data from all
employers within a job and, second, computing a separate validity
coefficient for each employer within a job. Because many employers
contributed only a single case or a handful of cases, separate validity
coefficients were computed only for employers contributing 10 or more
cases. This reduced the total sample size by about 20 percent. Mean
validities were essentially the same whether pooled across employers or
computed separately for each employer, thus failing to support the
hypothesis that multiple employer samples are an explanation for the
validity drop. The Northern Test Development Field Center has since
conducted similar analyses and also concluded that only a small part of
the decline in validities can be attributed to single- versus multiple-
location studies (U.S. Department of Labor, 19881.
We have not been able to derive convincing explanations for the
decrease in GATB validities from the data available to us. The drop is
especially marked in KFM validities, and one possibility is that jobs on
the whole require less psychomotor skill than previously, but this
scarcely explains the general decline. One can speculate as to whether
there has been some change in the nature of jobs such that the GATB
composite abilities are less valid now than had previously been the case.
However, if there were such a change, one would expect it to be noted
and commented on widely in the personnel testing literature; similar
declines in validity have not been observed with the Armed Services
Vocational Aptitude Battery, the military selection and classification
battery. It is also possible that the explanation lies in some as yet not
identified procedural aspects of the validity studies. In short, the validity
drop remains a mystery, and the differences between the early and recent
studies demand that USES be cautious in projecting validities computed
for old jobs to validities for future jobs.
VALIDITY GENERALIZATION ANALYSES
Having looked at the observed mean validities of two sets of studies,
and having noted a substantial decrease in validity in the more recent set,
we now turn to the issue of correcting the observed validities. In order to
demonstrate the full range of available options, we report three validity
generalization analyses: one correcting only for the effects of sampling
error (what is termed "bare bones" analysis), a second correcting for
criterion unreliability, and a third correcting for range restriction as well
as criterion unreliability. In each example, analyses are reported first for
the sample of studies and then broken down by criterion type (job
performance versus training success) and by job family. The chapter ends
with our conclusions about the most appropriate estimates of the true
validity of the GATB for Employment Service jobs.
Correcting Only for Sampling Error
In this analysis, variance expected due to sampling error is computed:
the variance is a function of the mean observed validity and the mean
TABLE 8-12 Validities Corrected for Sampling Error, Based on 264
Studies
          GVN                           SPQ                           KFM
Job       Mean   Observed  Corrected    Mean   Observed  Corrected    Mean   Observed  Corrected
Family    r      SD        SD           r      SD        SD           r      SD        SD
Overall   .20    .13       .07          .17    .13       .07          .13    .15       .08
Job Performance Criterion
I         .15    .11       .06          .13    .13       .07          .07    .10       .06
II        .19    .13       .07          .16    .14       .08          .21    .18       .10
III       .25    .12       .06          .21    .11       .05          .17    .10       .05
IV        .21    .11       .06          .16    .12       .07          .12    .13       .07
V         .18    .11       .06          .18    .13       .07          .16    .11       .06
Training Criterion
I         .54    .12       .05          .40    .08       .00          .16    .05       .00
II        -
III       .30    .16       .11          .21    .12       .05          .02    .15       .10
IV        .36    .12       .07          .25    .11       .05          .17    .15       .10
V         .00    .16       .12          .01    .15       .10          .12    .11       .04
NOTE: SD = standard deviation.
sample size. What is reported in the tables is the mean observed validity
coefficient, the observed standard deviation (SD), and the corrected
standard deviation. This corrected SD is found by subtracting variance
expected due to sampling error from observed variance: this gives a
corrected variance, the square root of which is the corrected standard
deviation. Thus, within each job family, the mean observed validity
estimates the average true validity of the population of jobs in the family,
and, provided the population validities are normally distributed, 90
percent of validities can be expected to fall above the point defined by
multiplying 1.28 times the corrected standard deviation (1.28 SD units
below the mean is the 10th percentile of a normal distribution) and
subtracting the result from the mean validity.
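The procedure just described can be sketched in a few lines. The sampling-error variance below uses a common approximation, (1 - r^2)^2 / (N - 1), evaluated at the mean validity and mean sample size; the chapter says only that the variance is a function of those two quantities, so this specific formula is an assumption. The inputs are hypothetical, and the output is not expected to reproduce the published tables, since the committee's exact weighting is not given. Function names are illustrative.

```python
import math


def sampling_error_variance(mean_r: float, mean_n: float) -> float:
    """Variance in observed validities expected from sampling error alone.

    Assumed approximation: (1 - mean_r^2)^2 / (mean_n - 1).
    """
    return (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)


def corrected_sd(observed_sd: float, mean_r: float, mean_n: float) -> float:
    """SD of validities after subtracting expected sampling-error variance."""
    corrected_var = observed_sd ** 2 - sampling_error_variance(mean_r, mean_n)
    # Expected sampling error can exceed the observed variance; floor at zero.
    return math.sqrt(max(corrected_var, 0.0))


def credibility_value_90(mean_r: float, sd_corrected: float) -> float:
    """Point above which 90 percent of true validities are expected to fall,
    assuming a normal distribution (1.28 SD units below the mean)."""
    return mean_r - 1.28 * sd_corrected


# Hypothetical inputs for a single job family.
mean_r, observed_sd, mean_n = 0.20, 0.13, 146
sd_c = corrected_sd(observed_sd, mean_r, mean_n)
print(f"corrected SD = {sd_c:.3f}")
print(f"90% credibility value = {credibility_value_90(mean_r, sd_c):.3f}")
```

Flooring the corrected variance at zero is a design choice in this sketch: it treats cases where sampling error accounts for all observed variation as having no remaining variation in true validity.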
Table 8-12 shows that the observed variability is reduced considerably
in virtually all test/job family combinations when the effects of sampling
error are removed. If there were no variation in true validities, we would
expect the standard deviation of the observed validities to be about 0.10,
corresponding to an average sample size of 100; the actual standard
deviations are only a little larger than they would be if all variation was
due to sampling error. Thus correcting for sampling error produces a
marked reduction in the estimated standard deviation of true validities.
TABLE 8-13 Credibility Values for Best Predictors in Each Job
Family, Based on 264 Studies
                   Job       Test      Mean        90% Credibility
Criterion          Family    Choice    Validity    Value
Job performance    I         GVN       .15         .06
                   II        KFM       .21         .12
                   III       GVN       .25         .16
                   IV        GVN       .21         .12
                   V         GVN       .18         .09
Training           I         GVN       .54         .39
                   II
                   III       GVN       .30         .16
                   IV        GVN       .36         .26
                   V         GVN       .12         .05
Credibility values for the preferred test composite for each job family are
shown in Table 8-13. We compute credibility values in each job family
such that 90 percent of the true validities of jobs in that family will be
greater than the given credibility value.
Thus correcting only for sampling error, one finds evidence of modest
validity for the GATB for all job families.
Correcting for Criterion Unreliability
Ideally, a good reliability estimate would be available for each study, in
which case each validity coefficient could be corrected for unreliability.
Unfortunately, reliability data are available only for 285 of the 755
studies. Thus we will revert to the backup strategy of relying on assumed
values. One approach is to use the data from the studies for which
reliability estimates are available and project that similar reliability values
would have been obtained for the rest of the studies.
A problem that researchers in the area of validity generalization have
noted is that some methods of reliability estimation are likely to produce
inflated reliability measures. For example, a "rate-rerate" method, in
which a supervisor is asked to provide a rating of performance on two
occasions, typically about two weeks apart, is likely to produce overes-
timates of reliability, since it is not at all unlikely that the supervisor will
remember the previous rating and rate similarly in order to appear
consistent. Unfortunately, this method is the most commonly used in the
GATB data base, in which it produces a mean reliability value of .86.
More appropriate is an interrater reliability method; unfortunately, only
four studies in the GATB data base use this method.
On the basis of this lack of meaningful reliability data, Hunter assumed
in his validity generalization research for USES that reliability was .60
when job performance was used as the criterion and .80 when training
success was used as the criterion. These values were based on a general
survey of the criterion measurement literature.
These values have met with some skepticism among industrial/organi-
zational psychologists, many of whom believe that the .60 value is too
low, and that interrater reliability is at least on some occasions substan-
tially higher than this. For example, recent research on performance in
military jobs, using job sample tests as the criterion, documents interrater
reliabilities in the .90s (U.S. Department of Defense, 1989). However, no
formal rebuttal of Hunter's position has appeared in print. The .80 for
reliability of training success does not appear controversial.
Operationally, we can correct for the effects of criterion unreliability by
dividing the mean validity coefficient by the square root of the mean
reliability coefficient. Thus, using .60 increases each observed validity by
29 percent and using .80 increases each observed validity by 12 percent.
Given the paucity of data, we recommend the more conservative .80
correction.
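The arithmetic behind these percentages can be sketched as follows. This is a minimal illustration, not the committee's or Hunter's actual computation; the function name `disattenuate` is our own, and the only inputs are the two assumed reliability values discussed above.

```python
import math


def disattenuate(observed_r: float, criterion_reliability: float) -> float:
    """Correct an observed validity for criterion unreliability.

    Divides the observed coefficient by the square root of the
    (assumed) criterion reliability, as described in the text.
    """
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion reliability must lie in (0, 1]")
    return observed_r / math.sqrt(criterion_reliability)


# Multiplier applied to any observed validity under each assumption.
for reliability in (0.60, 0.80):
    factor = disattenuate(1.0, reliability)
    print(f"reliability {reliability:.2f}: multiply observed validity by {factor:.2f}")
```

Under these assumptions the multipliers are about 1.29 and 1.12, which is where the 29 percent and 12 percent figures come from.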
Correcting for Range Restriction
If the test standard deviation is smaller in the study sample than in the
applicant pool, then the validity coefficient for workers will be reduced
due to range restriction and will be an underestimate of the true validity
of the test for applicants. If the standard deviation for the applicant pool
is known, the ratio of study SD to applicant SD is a measure of the degree
of range restriction, and the validity coefficient can be corrected to
produce the value that would result if the full applicant population had
been represented in the study.
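The text does not spell out the correction formula. A minimal sketch, assuming the standard univariate correction for direct range restriction (often called Thorndike's Case II), is given below; the function name and argument names are our own and are not taken from the USES reports.

```python
import math


def correct_range_restriction(observed_r: float, sd_ratio: float) -> float:
    """Correct an observed validity for direct range restriction.

    sd_ratio is the study (restricted) SD divided by the applicant
    (unrestricted) SD. ASSUMPTION: the standard univariate formula
    r_c = r*U / sqrt(1 + r^2 * (U^2 - 1)), with U = 1 / sd_ratio.
    """
    if sd_ratio <= 0.0:
        raise ValueError("sd_ratio must be positive")
    big_u = 1.0 / sd_ratio
    return observed_r * big_u / math.sqrt(1.0 + observed_r**2 * (big_u**2 - 1.0))


# With no restriction (ratio of 1.0) the coefficient is unchanged.
print(correct_range_restriction(0.20, 1.0))
# With a ratio of .80, a small observed correlation grows by close to 1/.80.
print(correct_range_restriction(0.05, 0.80) / 0.05)
```

Under this formula the multiplier approaches the reciprocal of the SD ratio as the observed correlation becomes small.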
In the GATB data base the restricted SD is known for each test;
however, no values for the applicant pool SD are available. Hunter dealt
with this by making two assumptions: (1) for each job, the applicant pool
is the entire U.S. work force and (2) the pooled data from all the studies
in the GATB data base can be taken as a representation of the U.S. work
force. Thus Hunter computed the GVN, SPQ, and KFM SDs across all 515
jobs that he studied. Then, for each sample, he compared the sample SD
with this population SD as the basis for his range-restriction correction.
The notion that the entire work force can be viewed as the applicant
pool for each job is troubling. Intuitively, we tend to think that people
gravitate to jobs for which they are potentially suited: highly educated
people tend not to apply for minimum-wage jobs, and young high school
graduates tend not to apply for middle-management positions. And indeed
there is a large and varied economic literature on educational screening,
self-selection, and market-induced sorting of individuals that speaks
against the notion that the entire work force can be viewed as the
applicant pool for each job (Sueyoshi, 1988).
Some empirical support for the notion that the applicant pool for
individual jobs is more restricted than the applicant pool for the entire
work force can be found by examining test SDs within job families. Using
the logic of Hunter's analysis, if data from all jobs can be pooled to
estimate the applicant population SD, then data from jobs in one family
can be pooled to estimate the applicant SD for that family. Applying this
logic to the GVN subtest produces the following:
GVN SD based on all jobs          53.0
GVN SD based on Job Family I      45.6
                Job Family II     48.6
                Job Family III    49.7
                Job Family IV     49.2
                Job Family V      48.4
Since the mean restricted GVN SD for the 755 studies is 42.2, Hunter's
method would produce a ratio of restricted to unrestricted SDs of .80,
whereas the family-specific ratios would vary from .85 to .93. Thus there
is a suggestion that Hunter's approach may overcorrect. Since the Job
Families IV and V that constitute the principal fraction of Employment
Service jobs include a very wide range of jobs, we might expect the
standard deviation for actual applicant groups to be smaller than that
obtained by acting as if all workers in the job family might apply for each
job.
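The ratios quoted above can be reproduced directly from the pooled SDs. The sketch below uses only the figures reported in the text; the variable and function names are ours.

```python
# Mean restricted (within-study) GVN SD across the 755 studies.
MEAN_RESTRICTED_GVN_SD = 42.2

# Pooled GVN SDs reported in the text.
POOLED_GVN_SD = {
    "All jobs": 53.0,
    "Job Family I": 45.6,
    "Job Family II": 48.6,
    "Job Family III": 49.7,
    "Job Family IV": 49.2,
    "Job Family V": 48.4,
}


def restriction_ratio(restricted_sd: float, unrestricted_sd: float) -> float:
    """Ratio of restricted to unrestricted SD (smaller means more restriction)."""
    return restricted_sd / unrestricted_sd


ratios = {
    label: restriction_ratio(MEAN_RESTRICTED_GVN_SD, sd)
    for label, sd in POOLED_GVN_SD.items()
}

for label, ratio in ratios.items():
    print(f"{label:<15} {ratio:.2f}")
```

The all-jobs ratio rounds to .80, while the family-specific ratios fall between .85 and .93, so pooling within families implies a smaller correction than pooling across the whole data base.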
Empirical data on test SDs in applicant pools for a variety of jobs filled
through the Employment Service are needed to assess whether Hunter's
analysis overcorrects for range restriction. In the absence of applicant
pool data, the conservative correction for restriction of range would be
simply to apply no correction at all.
The effect of Hunter's correction for restriction of range, which
assumes a restriction ratio of .80, is to multiply the observed correlations
by 1.25 when the observed correlations are modest. The combined effect
of his correction for reliability (which assumes average reliabilities of .60)
and restriction of range is to increase the observed correlations by 61
percent for job performance and by 40 percent for training success. The
more conservative correction recommended by the committee, one that
allows for reliability of .80 and no correction for restriction of range in the
worker population, would increase each correlation by 12 percent.
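The combined percentages follow from multiplying the two correction factors. The sketch below is an illustration only: it assumes, as the text does for modest correlations, that the range-restriction correction acts as a simple multiplier equal to the reciprocal of the SD ratio. The function name is our own.

```python
import math


def combined_factor(criterion_reliability: float, sd_ratio: float = 1.0) -> float:
    """Approximate combined multiplier for a modest observed correlation.

    ASSUMPTION: range restriction is treated as a multiplier of
    1 / sd_ratio, which holds only when the observed correlation is
    small. sd_ratio = 1.0 means no range-restriction correction.
    """
    return (1.0 / math.sqrt(criterion_reliability)) * (1.0 / sd_ratio)


scenarios = {
    "Hunter, job performance (.60, ratio .80)": combined_factor(0.60, 0.80),
    "Hunter, training success (.80, ratio .80)": combined_factor(0.80, 0.80),
    "Committee (.80, no range correction)": combined_factor(0.80),
}

for label, factor in scenarios.items():
    print(f"{label}: +{(factor - 1.0) * 100:.0f} percent")
```

Under these assumptions the three scenarios give increases of roughly 61, 40, and 12 percent, matching the figures in the text.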
Thus sizable differences in estimated validities will occur according to
the correction chosen. When the more conservative assumptions are
applied to the 264 recent studies, one is left with a very different sense of
overall GATB validities than that projected by the USES test research
reports drafted by Hunter (Table 8-14). Instead of overall GVN validities
of .47, they are .22. The KFM validities shrink from .35 to .15 in the
recent studies. These differences are not due only to differences in
analytic method. The 264 more recent studies simply produce different
empirical findings, that is, lower validities, than the earlier 515.
TABLE 8-14 Validities Corrected for Reliability, Based on 264
Studies, Compared with Hunter's Validities Using His Larger
Corrections for Reliability and Restriction of Range, Based on 515
Studies
                  GVN           SPQ           KFM           Counts
Job Family        515    264    515    264    515    264    515       264
Overall           .47    .22    .38    .19    .35    .15    38,620    38,521
Job Performance Criterion
I                 .56    .17    .52    .15    .30    .08    1,142     3,900
II                .23    .21    .24    .18    .48    .24    1,155     200
III               .58    .28    .35    .24    .21    .19    2,424     630
IV                .51    .23    .40    .18    .32    .13    12,705    19,206
V                 .40    .20    .35    .20    .43    .18    13,367    10,862
Training Criterion
I                 .65    .60    .53    .45    .09    .18    180       64
II                -      -      -      -      -      -      -         -
III               .50    .33    .26    .24    .13    .02    1,800     347
IV                .57    .40    .44    .28    .31    .19    4,183     3,169
V                 .54    .00    .53    .01    .40    .13    655       106
Optimal Predictors Based on the Recent 264 Studies
The corrected correlations in Table 8-14 may be used to develop
composite predictors of job performance in the different job families
based on the recent 264 studies. These predictors are weighted combina-
tions of GVN, SPQ, and KFM, with the weights chosen to maximize the
correlation between predictor and supervisor ratings. Because the com-
posites GVN, SPQ, and KFM are themselves highly intercorrelated
(Table 7-1), a wide range of weights will give about the same predictive
accuracy. For example, the predictor
2 GVN + KFM
is very nearly optimal for both Job Family IV and Job Family V.
The optimal predictor in Job Family IV has correlation .24 with
supervisor ratings. The optimal predictor in Job Family V has correlation
.25 with supervisor ratings. The comparable correlations produced in
Hunter's analysis are .53 and .50. The differences are partly due to the
lower observed correlations in the recent studies and partly due to our use
of more conservative corrections.
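How the validity of a weighted composite depends on the weights can be sketched as follows. This is an illustration under stated assumptions, not the committee's computation: the component scores are assumed to be standardized, and the intercorrelation of .35 between GVN and KFM is a hypothetical placeholder, not a value from Table 7-1. The validities of .23 and .13 are the Job Family IV job performance values for the 264 studies in Table 8-14.

```python
import math


def composite_validity(weights, validities, intercorrelations):
    """Correlation between a weighted sum of standardized predictors
    and the criterion: w'v / sqrt(w'Rw)."""
    n = len(weights)
    covariance_with_criterion = sum(w * v for w, v in zip(weights, validities))
    composite_variance = sum(
        weights[i] * weights[j] * intercorrelations[i][j]
        for i in range(n)
        for j in range(n)
    )
    return covariance_with_criterion / math.sqrt(composite_variance)


# Job Family IV validities (GVN, KFM) from Table 8-14, 264 studies.
validities_iv = [0.23, 0.13]
# HYPOTHETICAL intercorrelation between GVN and KFM, for illustration only.
hypothetical_r = [[1.0, 0.35], [0.35, 1.0]]

for weights in ([1, 1], [2, 1], [3, 1]):
    r = composite_validity(weights, validities_iv, hypothetical_r)
    print(f"weights {weights}: composite validity {r:.3f}")
```

Running the loop with different weight vectors illustrates the point made above: when the components are positively intercorrelated, fairly different weightings yield similar composite validities.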
FINDINGS: THE GATB DATA BASE
Criterion-Related Validity Prior to 1972
1. Validity studies of the GATB completed prior to 1972 produce a
mean observed correlation of about .25 between cognitive, perceptual, or
psychomotor aptitude scores and supervisor ratings on the job. The mean
observed correlation between cognitive or perceptual scores and training
success is about .35.
Criterion-Related Validity Changes Since 1972
2. There are notable differences in the results of GATB validity studies
conducted prior to 1972 and the later studies. The mean observed
correlation between supervisor ratings and cognitive or perceptual apti-
tude scores declines to .19, and between supervisor ratings and psycho-
motor aptitude scores declines to .13.
CONCLUSIONS ON VALIDITY GENERALIZATION FOR THE GATB
1. The general thesis of the theory of validity generalization, that
validities established for some jobs are generalizable to other unexamined
jobs, is accepted by the committee.
Observed and Adjusted Validities
2. The GATB has modest validities for predicting supervisor ratings of job
performance or training success in the 755 validity studies assembled by
USES over 45 years. The unexplained marked decrease in validity in recent
studies suggests caution in projecting these validities into the future.
3. The average observed validity of GATB aptitude composites for
supervisor ratings over the five job families of USES jobs in recent years
is about 0.22.
4. In the committee's judgment, plausible adjustments for criterion
unreliability might raise the average observed validity of the GATB
aptitude composites from .22 to .25 for recent studies. Corresponding
adjustments for the older studies produce a validity of .35, and the
average corrected validity across all 755 studies is approximately .30,
with about 90 percent of the jobs studied falling in the range of .20 to .40.
These validities are lower than those circulated in USES technical
reports, such as Test Research Report No. 45 (U.S. Department of
Labor, 1983b), which tend to be .5 or higher. The lower estimates are due
to the drop in observed validities in recent studies and to our use of more
conservative analytic assumptions. We have made the correction for
unreliability based on an assumed value of .80; we have made no
correction for restriction of range.
5. In the committee's judgment, two of the three adjustments to
observed GATB validities made in the USES analysis (the adjustment
for restriction of range and that for criterion unreliability) are not well
supported by evidence. We conclude that the corrected validities re-
ported in USES test research reports are inflated.
In particular, we do not accept Hunter's assumption used in correcting
for restriction of range, namely that the applicant pool for a particular job
consists of all workers in all jobs. This assumption causes the observed
correlations to be adjusted upward by 25 percent for small correlations
and by 35 percent for observed validities of .50.
Restriction-of-range estimates should be based on data from applicants
for homogeneous clusters of jobs. Undoubtedly there is an effect due to
restriction of range, but in the absence of data to estimate the effect, no
correction should be made.
6. Reliability corrections are based in part on data in the GATB validity
data base, and so have more empirical support than the corrections for
restriction of range. There remains some question whether a reliability
value of .60, which has the effect of increasing correlations by 29 percent,
is appropriate for supervisor ratings. Given the weakness of the support-
ing data, we believe that a conservative correction, based on an estimated
reliability of .80, would be appropriate.
Validity Variability
7. Validities vary between jobs. Our calculation is that about 90
percent of the jobs in the GATB studies will have true validities between
.2 and .4 for supervisor ratings.
We cannot ascertain how generalizable this distribution is to the
remaining jobs in the population. For those jobs in the population that are
found to be similar to those in the sample, it seems reasonable to expect
roughly the same distribution as in the sample.
8. The GATB is heavily oriented toward the assessment of cognitive
abilities. However, the cognitive composite is not equally predictive of
performance in all jobs. Common sense suggests that psychomotor,
spatial, and perceptual abilities would be very important in certain types
of jobs. But those sorts of abilities are measured much less well. And
GATB research has focused more on selection than on classification, with
a consequent emphasis on general ability rather than differential abilities.
9. Since GATB validities have a wide range of values over different
jobs and have declined over time, introduction of a testing system based
on validity generalization does not eliminate the need for continuing
criterion-related validity research. The concept of validity generalization
does not obviate the need for continuing validity studies for different jobs
and for the same job at different times.