
8 GATB Validities

In Chapter 6 we described validity generalization as the prediction of validities of a test for new jobs, based on meta-analysis of the validities of the test on studied jobs. This chapter focuses on establishing the predicted validities of the General Aptitude Test Battery (GATB) for new jobs. The first step involves compiling the existing validity data, and the second is a matter of estimating the "true" validity of the test by correcting the observed validities to account for various kinds of weaknesses in existing research (e.g., small sample sizes).

As part of its study of validity generalization for the GATB, the committee has conducted independent analyses of the existing GATB validity studies. The initial sections of the chapter compare the results of these analyses with the work done by John Hunter for the U.S. Employment Service (USES) based on a smaller and older set of studies (U.S. Department of Labor, 1983b,c,d). In addition, drawing on the discussion of corrections presented in Chapter 6, the second half of the chapter presents the committee's estimate of the generalizable validities of the GATB for the kinds of jobs handled by the Employment Service, an estimate that is rather more modest than that proposed in the U.S. Employment Service technical reports.

THE GATB VALIDITY STUDIES

Two sets of GATB validity studies are discussed in this chapter. The first comprises the original 515 validity studies analyzed by Hunter;

they were prepared in the period 1945-1980, with 10 percent 1940s data, 40 percent 1950s data, 40 percent 1960s data, and 10 percent 1970s data. A larger data tape consisting of 755 studies was made available to the committee by USES. It included these and an additional set of 264 studies carried out in the 1970s and 1980s. (The Hunter studies appear as 491 studies in this data tape, because some pairs of studies in the original 515 consisted of validity coefficients for the same set of workers using two different criteria for job performance; each of these pairs appears as a single study on the data tape.) The original samples from the 515 studies summed to 38,620 workers, and the samples from the more recent 264 studies summed to 38,521 workers.

Written reports are available for the earlier 515 studies but not for the more recent 264. It is therefore possible to examine the earlier studies in some detail to determine their quality and comparability. An examination of 50 of the written reports selected at random showed very good agreement between the numbers in the report and the numbers coded into the data set. It is regrettable that no such reports are available for the more recent studies, since it leaves no good way to consider the characteristics of the samples that might explain the very different results of analysis for the two data sets.

An Illustration of Test Validities

Criterion-related validity is expressed as the product moment correlation between test score and a measure of job performance for a sample of workers. The degree of correlation is expressed as a coefficient that can range from -1.0, representing a perfect inverse relationship, to +1.0, representing a perfect positive relationship. A value of 0.0 indicates that there is no relationship between the predictor (the GATB) and the criterion. Figure 8-1 depicts this range of correlations with scatter diagrams showing the degree of linear relationship.
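To make the definition concrete, the following is a minimal sketch of how a product moment correlation could be computed for one validity study. The function name and the six-worker data set are hypothetical illustrations, not values from the GATB data base.

```python
import math


def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: test scores and supervisor ratings for six workers.
test_scores = [85, 90, 100, 105, 110, 120]
ratings = [3.0, 2.5, 3.5, 3.0, 4.0, 3.5]
print(round(pearson_r(test_scores, ratings), 2))
```

A result near +1.0 or -1.0 would indicate a nearly perfect linear relationship; values near 0.0 indicate little or no linear relationship.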
In test validation research, the relationships between the test score and the performance measure are usually positive, if not necessarily strong. The lower the correlation, the less appropriate it is to make fine distinctions among test scores.

Basic Patterns of Validity Findings

The most striking finding from our analysis of the entire group of 755 validity studies is a distinct diminution of validities in the newer, post-1972 set. For all three composites, the 264 newer studies show lower mean validities, the decline being most striking for the perceptual and psychomotor composites (Table 8-1). (That the standard deviations are

FIGURE 8-1 Interpretation of values of correlation (r). [Scatter diagrams of Y against X not reproduced.]

Value of r     Description of linear relationship
+1.00          Perfect, direct relationship
About +.50     Moderate, direct relationship
.00            No relationship (i.e., 0 covariation of X with Y)
About -.50     Moderate, inverse relationship
-1.00          Perfect, inverse relationship

TABLE 8-1 Mean and Standard Deviation of the Validity Coefficients, Weighted by Study Sample Sizes, Computed for Each Composite Across the 264 Studies, and Compared with Those Hunter Reported for the Original 515 Studies

                         GVN            SPQ            KFM
                       515   264      515   264      515   264
Mean                   .25   .21      .25   .17      .25   .13
Standard deviation     .15   .11      .15   .11      .17   .12

also lower is readily explainable: the 264 additional studies have a much larger average sample size, 146 as opposed to about 75 in the original set, resulting in less sampling error.)

To give a better sense of the validity data than is provided by means and variances, a frequency distribution of validity coefficients for each composite, over all 755 studies, is shown in Table 8-2. The values presented are the percentage of studies falling into each validity category. Clearly, the range of observed validity coefficients is large. The question before us is to understand the meaning of this variability. In the next section we examine the effect of factors that might cause variation in the observed validity coefficients.

TABLE 8-2 Frequency Distribution of Validity Coefficients for Each GATB Composite Over All 755 Studies

                      Percentage of Studies
Validity Category     GVN     SPQ     KFM
-.40 to -.49          0.1  [column not recoverable]
-.30 to -.39          0.1  [column not recoverable]
-.20 to -.29          0.1     0.1     0.3
-.10 to -.19          1.0     1.1     1.2
 .00 to -.09          3.7     4.1     6.6
 .01 to  .10         12.4    14.7    16.8
 .11 to  .20         24.7    22.9    24.0
 .21 to  .30         27.4    25.1    26.5
 .31 to  .40         18.1    18.7    12.7
 .41 to  .50          7.8     8.2     6.4
 .51 to  .60          3.9     2.5     3.6
 .61 to  .70          0.8     0.5     1.2
 .71 to  .80          0.1, 0.1  [columns not recoverable]

Potential Moderators of Validity

A number of study characteristics can be hypothesized as potentially affecting validity (and, therefore, contributing to the observed variability across studies). In our analysis of the 755 GATB validity studies, we looked at 10 characteristics:

1. sample size
2. job family
3. study type: predictive (i.e., tested at time of hire) versus concurrent (testing of current employees)
4. criterion type: performance on the job versus performance in training

5. age: mean age of individuals in the sample
6. experience: mean experience of individuals in the sample
7. education: mean education of individuals in the sample
8. race
9. sex
10. date of study

Each of these characteristics is discussed in turn.

Sample Size

Sampling error appears to be the single factor with the largest influence on variance in validity from study to study: removing the influence of sampling error is a major component of any validity generalization analysis. To get an intuitive feel for the effects of sampling error, GVN validities were examined separately for the entire sample, for samples with more than 100 subjects, for samples with more than 200 subjects, and for samples with more than 300 subjects. As N, the number of subjects, increases, random sampling error decreases. Thus we should see much less variation with large samples than with small samples. The distribution of GVN validity is presented in Table 8-3. It can clearly be seen that there is much more variation with small samples; as the mean N increases, validity values center much more closely on the mean.

TABLE 8-3 Percentage of Studies in Each Validity Category, Based on All 755 Studies

                      Percentage of Studies
Validity              All Studies   >100        >200       >300
Category              (N = 755)     (N = 192)   (N = 81)   (N = 33)
-.20 to -.29           0.1
-.10 to -.19           1.0
 .00 to -.09           3.7           2.6         1.2
 .01 to  .10          12.4          10.9         8.7        6.1
 .11 to  .20          24.7          33.4        38.2       33.3
 .21 to  .30          27.4          33.3        40.8       51.5
 .31 to  .40          18.1          15.6         7.4        3.0
 .41 to  .50           7.8           4.2         3.7        6.0
 .51 to  .60           3.9
 .61 to  .70           0.8
 .71 to  .80           0.1
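The shrinking spread with larger samples can be illustrated with a short calculation. The sketch below uses a common large-sample approximation for the standard deviation of an observed correlation, (1 - r^2) / sqrt(N - 1); that formula and the assumed population correlation of .20 are illustrative assumptions, not quantities taken from the committee's analysis. The sample sizes 75 and 146 are the approximate average sample sizes of the older and newer study sets mentioned earlier in the chapter.

```python
import math


def sampling_sd(r, n):
    """Approximate SD of an observed correlation due to sampling error alone.

    Assumes the large-sample approximation (1 - r^2) / sqrt(n - 1).
    """
    if n < 2:
        raise ValueError("sample size must be at least 2")
    return (1 - r ** 2) / math.sqrt(n - 1)


# Assumed population correlation of .20 (hypothetical).
for n in (75, 100, 146, 300):
    print(f"N = {n:3d}: expected SD of observed r = {sampling_sd(0.20, n):.3f}")
```

Under these assumptions the expected spread at N = 300 is roughly half that at N = 75, which matches the direction of the pattern in Table 8-3.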

Job Family

In both the original 515 studies and the recent 264 studies, validity clearly varies across job families. The mean observed validities (for both the data used by Hunter and the full data set) are presented in Table 8-4 for each of the three test composites. A notable difference between the old and new studies is in the diminution of the KFM validities in Job Families IV and V.

TABLE 8-4 Variation of Validities Across Job Families in Old (515) and Recent (264) Studies

                                    GVN           SPQ           KFM
Job Family                        515   264     515   264     515   264
I (set-up/precision)              .34   .16     .35   .14     .19   .08
II (feeding/offbearing)           .13   .19     .15   .16     .35   .21
III (synthesizing)                .30   .27     .21   .21     .13   .12
IV (analyze/compile/compute)      .28   .23     .27   .17     .24   .13
V (copy/compare)                  .22   .18     .24   .18     .30   .16

Study Type: Predictive Versus Concurrent

Some studies are done using job applicants (predictive validation strategy), whereas others involve the testing of current employees (concurrent validation strategy). Some have argued for the superiority of the predictive strategy, based on the assumption that the full range of applicants will be included in the study and thus that range restriction will be reduced. This argument presumes that a very rare version of the predictive validation strategy is used, namely that all applicants are hired regardless of test score. More realistically, applicants are screened using either the GATB itself or some other predictor, and thus range restriction is likely in both predictive and concurrent studies. This point has been made in the testing literature; Bemis (1968) examined the GATB data base and found no differences in validity between predictive and concurrent studies. A comparison of predictive and concurrent studies was not reported for the original 515 studies. No consistent difference in validities was found in the present study, as Table 8-5 shows.
For some composite/family combinations, validity is higher for the predictive studies; for others validity is higher for the concurrent studies. The predictive/concurrent distinction is too crude to be of real value: for example, we do not know whether the GATB was or was not used as the basis for hiring in any or all of the studies labeled "predictive." Thus study type will not be tested further in this report;

variation in validity due to study type will remain one unaccounted-for source of variance.

TABLE 8-5 Variation of Validities by Study Type and Job Family, for All 755 Studies

               GVN                      SPQ                      KFM
Job Family     Predictive  Concurrent   Predictive  Concurrent   Predictive  Concurrent
I              .14         .21          .10         .19          .00         .11
II             [one value per composite: GVN .15, SPQ .17, KFM .33; study-type column not recoverable]
III            .30         .29          .24         .20          .12         .17
IV             .29         .24          .26         .19          .21         .15
V              .20         .20          .26         .26          .27         .22

Criterion Type: On-the-Job Performance Versus Training Success

It has frequently been reported in the personnel testing literature that higher validity coefficients are obtained for ability tests when training success rather than job performance is used as the criterion. This makes conceptual sense, as there are probably fewer external factors influencing training success than job performance (e.g., job performance typically covers a longer time period and is probably more heavily influenced by supervision, work-group norms, variation in equipment, family problems, and so on). But it could also be a product of measurement technology: since training success is usually measured with a paper-and-pencil test, the similarity of measurement methods might artificially boost the correlation. Hunter reports substantially larger mean validities for GATB studies using training success. A summary based on the full data set is presented in Table 8-6. Given the magnitude of these differences, the data set is broken down by both job family and criterion type for validity generalization analyses.

TABLE 8-6 Validities for Training Success and Supervisor Ratings, by Job Family, for All 755 Studies

               GVN                      SPQ                      KFM
Job Family     Performance  Training    Performance  Training    Performance  Training
I              .19          .45         .18          .45         .11          .12
II             .15          -           .17          -           .33          -
III            .29          .30         .21          .20         .17          .10
IV             .23          .35         .19          .27         .16          .19
V              .20          .31         .21          .33         .22          .30

Age

Ideally, the effect of age would be examined by computing validity coefficients separately for individuals in different age categories. However, the present data base reports validity coefficients for entire samples and does not report findings by age. What is reported is the mean age for each sample. Thus we can determine whether validity varies by the mean age of the sample. For the 755 studies, the mean "mean age" is 31.8 years, with a standard deviation of 6.3 years. Correlations (r) between mean age and test validity are as follows:

r age/GVN validity = -.15
r age/SPQ validity = -.06
r age/KFM validity = .03

Thus the validity of the cognitive composite (GVN) tends to be somewhat lower for older workers, though not enough to require special consideration in validity generalization analysis. This finding does not seem to hold for SPQ and KFM.

Relationships between mean age and mean test score are also worthy of note:

r age/GVN mean = -.28
r age/SPQ mean = -.45
r age/KFM mean = -.52

Thus studies in which the average age is higher tend to have composite scores that are notably lower, especially on SPQ and KFM. Since the age-validity relationship is low, age is not treated as a moderator in the validity generalization analyses, though the age/mean-test-score relationship certainly merits consideration in examining the GATB program as a whole.

Experience

As with age, what is reported in the validity studies is the mean experience for each sample. In this data base, the mean is 5.5 years, with a standard deviation of 4 years. Note that what is coded is typically job experience rather than total work experience. Experience and age are highly related: the correlation between the two is .58. Correlations (r) between mean experience and test validity are as follows:

r experience/GVN validity = .03
r experience/SPQ validity = .00
r experience/KFM validity = -.16

The pattern is mixed, with less experienced samples producing higher KFM validities. This parallels the relationship between experience and test score means:

r experience/GVN mean = .00
r experience/SPQ mean = -.07
r experience/KFM mean = -.32

Less-experienced samples score higher on KFM; in all likelihood this is age-related. Experience is not treated as a moderator in the validity generalization analyses.

Education

The mean years of education across the 755 samples is 11.4 years, with a standard deviation of 1.5 years. The pattern of correlations between mean education and test validity is as follows:

r education/GVN validity = .15
r education/SPQ validity = -.10
r education/KFM validity = -.36

Thus GVN validity tends to be higher for more-educated samples, and KFM validity higher for less-educated samples. In all likelihood, this effect is caused by the relationship between job family and validity, namely, higher GVN validity for more complex jobs (requiring more education) and higher KFM validity for less complex jobs (requiring less education).

Validity Differences by Race

Validity differences by race are examined in detail in the following chapter. Suffice it to say here that most of the GATB validity studies do not report data by race, but analysis of the 72 studies with at least 50 black and 50 nonminority workers indicates that mean validities for nonminorities are higher than mean validities for blacks for all three composites.

Validity Differences by Sex

Many studies (345 of 755) are based on mixed-sex samples. However, 410 studies were done on single-sex samples (226 male, 184 female). Breaking studies down by job family and criterion type (performance criterion versus training criterion, the two important moderator variables identified in the earlier analyses) leaves few categories with enough studies for meaningful comparisons to be made. Nevertheless, in those

categories in which comparisons can be made (Job Families IV and V with a performance criterion and Job Family IV with a training criterion), the results suggest no effect due to sex. Results are summarized in Table 8-7. Similar conclusions were reached in a USES test research report, which analyzed validity differences by sex for 122 validity studies for which validity could be computed separately for males and females (U.S. Department of Labor, 1984a). That report concluded that there are no meaningful differences in GATB validities between males and females.

TABLE 8-7 Validities for Job Families by Sex: Comparison of the Mean Observed Validity Across Studies Split by Job Family, Type of Criterion Measure, and Sex of Sample (Mixed Samples Not Included in Analyses)

               Performance                Training
Job Family     Male         Female        Male         Female

GVN
I              .28 (21)     .49 (1)       -            -
II             .23 (1)      .13 (16)      -            -
III            .36 (12)     .37 (2)       .32 (7)      .57 (1)
IV             .24 (98)     .23 (31)      .34 (38)     .38 (14)
V              .21 (46)     .20 (118)     .35 (2)      .37 (2)

SPQ
I              .29 (21)     .40 (1)       -            -
II             .22 (1)      .15 (16)      -            -
III            .24 (12)     .25 (2)       .19 (7)      .50 (1)
IV             .23 (98)     .22 (31)      .29 (38)     .29 (14)
V              .25 (46)     .23 (118)     .45 (2)      .40 (2)

KFM
I              .18 (21)     -.03 (1)      -            -
II             .29 (1)      .34 (16)      -            -
III            .13 (12)     .19 (2)       .05 (7)      .48 (1)
IV             .19 (98)     .23 (31)      .21 (38)     .22 (14)
V              .25 (46)     .30 (118)     .32 (2)      .46 (2)

NOTE: Entries are mean validities, with the number of studies in parentheses.

Date of Study

The committee was concerned about reliance on very old validity studies in drawing general conclusions about GATB validity, as the

validity studies had been done over a period of four decades. The date of the study was not coded on the data tape containing the summaries of 755 validity studies. But virtually all written reports were also made available to the committee. These contained study dates for more than 400 studies and validity coefficients for 542 independent samples. The date was extracted from each study and added to the data tape. In the subsequent analysis, date was treated as a continuous variable. Date of study was correlated with GATB composite validity within each job family (Table 8-8). Study date varied from 1945 to 1979, distributed about 10 percent in the 1940s, 40 percent in the 1950s, 40 percent in the 1960s, and 10 percent in the 1970s.

TABLE 8-8 Correlations of Validities with Year of Study

Job Family     Number of Studies     GVN      SPQ      KFM
I               23                   .22     -.24      .27
II              17                   .12     -.08     -.67
III             50                  -.25     -.15      .02
IV             235                   .03     -.03     -.07
V              217                   .00     -.11     -.33

These findings may be artifactual: if, for example, there was a change over time in some study characteristic (e.g., job performance criteria versus training criteria), the true effects of study date would be hidden. Thus partial correlations were computed controlling for criterion type (job performance versus training success) and for study type (predictive versus concurrent), producing the second-order partial correlations shown in Table 8-9. Only Job Families IV and V offer large enough numbers of studies to merit careful attention. In these two job families, there is no evidence of change over time in the validity of the GVN composite, but there is evidence of a significant decrease in SPQ and KFM validity over time.

TABLE 8-9 Correlations of Validity with Time, Adjusting for Criterion Type and Job Type

Job Family     Number of Studies     GVN      SPQ      KFM
I               19                   .24     -.31      .27
II              13                   .00      .00      .00
III             45                  -.37     -.20     -.07
IV             228                  -.04     -.18     -.23
V              210                   .04     -.08     -.33

Note that this analysis is based on studies for which written reports

were available. No written reports are available for the most recent 200 or so of the 755 studies in the data base. This means that the decline in GATB validities portrayed in Table 8-1 is not restricted to the newer studies, although it has become more pronounced in the post-1972 data set.

Exploration of Explanations for the Change in Validity Over Time

The decrease in GATB validities over time is puzzling and obviously somewhat worrisome. In trying to find some reasons, we compared the findings from Hunter's analysis of 515 studies, which overlaps closely with the set of studies for which written reports are available, with the 264 more recent studies.

Two procedural issues are worthy of note. First, we can only approximate the data base used by Hunter. We can identify 513 studies as studies that were available to Hunter. These are contrasted with 264 studies added to the tape since Hunter's analysis was done. These total more than 755, because Hunter included a series of studies that later proved to be nonindependent: if two criteria were available in a single study, two validity coefficients were computed and included on the tape as separate studies. These have not been included separately in the analyses based on all 755 studies, but are included here to recreate Hunter's data base as closely as possible. Second, USES has identified and corrected a number of coding errors in the data base. Thus a reanalysis of the same studies Hunter examined will not produce identical results. The 513 Hunter studies have a total N of 37,674; the 264 new studies have a total N of 38,521.

Several hypotheses about possible causes of the mean differences were explored and rejected. One is that the new studies were validated against a different type of criterion. However, 83 percent of both the old and the new studies were validated against on-the-job measures, primarily supervisor ratings, and 17 percent against training measures.
A second rejected hypothesis is that the type of job studied changed. A comparison of the job family breakdown between the 515 studies and the present 264 studies appears in Table 8-10. In both data sets, Job Families IV and V predominate, although the old set is evenly divided between the two and the new set has significantly more Family IV studies. We can rule out the differences in job families as an explanation by considering validities within job family (Table 8-11). Several comments are in order. First, the sample sizes for Families II and III for the job performance criteria are small; thus we focus solely on Families I, IV, and V. Second, only for Job Family IV is the sample size adequate to put any confidence in the findings using the training criterion.

TABLE 8-10 Distribution of Studies Over Job Families

               Percentage of Studies     Percentage of Studies
Job Family     (N = 515)                 (N = 264)
I               4                        [not legible]
II              4                        [not legible]
III            12                        [not legible]
IV             40                        54
V              40                        31

TABLE 8-11 Validities for the Two Sets of Studies by Job Family and Type of Criterion

               Performance                          Training
Job Family     Hunter Studies (N)  New Studies (N)  Hunter Studies (N)  New Studies (N)

GVN
I              .31 (1,142)         .15 (3,900)      .41 (180)           .54 (64)
II             .14 (1,155)         .19 (200)        -                   -
III            .30 (2,424)         .25 (630)        .27 (1,800)         .30 (347)
IV             .27 (12,705)        .21 (19,206)     .34 (4,183)         .36 (3,169)
V              .20 (13,367)        .18 (10,862)     .36 (655)           .00 (106)

SPQ
I              .32                 .13              .47                 .40
II             .17                 .16              -                   -
III            .22                 .21              .18                 .21
IV             .25                 .16              .29                 .25
V              .23                 .18              .38                 .01

KFM
I              .20                 .07              .11                 .16
II             .35                 .21              -                   -
III            .17                 .17              .11                 .02
IV             .21                 .12              .20                 .17
V              .27                 .16              .31                 .12

Using the performance criterion, validity is lower in the recent studies for all three composites for all families with meaningful sample sizes (I, IV, and V). With the training criterion, only Family IV has a large sample size; GVN validity actually is slightly higher for the new studies, and the drop in SPQ and KFM validities is smaller than for the performance criterion for the same family.

What accounts for this drop in validity? The above analyses have already dealt with two plausible reasons: change over time in the job families studied (from families for which validities are higher to families for which validities are lower) and change over time in the type of criteria used. Both of these factors have been found to moderate GATB validity. However, since analyses reported here present results within job families and within criterion types, this explanation has been ruled out.

Another factor is the role of race. As the next chapter describes in detail, validity for black samples is lower than validity for white samples. The more recent studies contain a heavy minority representation, since many of the studies were undertaken explicitly to build a minority data base. However, even among the recent studies for which separate validities were available by race, total white N is larger than the total black N by a factor of about 5, and the black-white validity difference is substantially smaller than the difference reported here between the earlier and the more recent studies. Thus the inclusion of more minority samples is at best a minor contributor to the validity difference between the earlier and the more recent studies.

Another possible explanation is that the more recent studies exhibit a larger degree of range restriction, thus suppressing validity. Analysis reveals exactly the opposite: the more recent studies show less range restriction (e.g., a slightly larger GATB composite standard deviation).

Some have advanced the argument that the original data base should be trusted and the new studies discounted. The reasoning used is that the new studies were done hurriedly in order to gather data on validity for black workers.
In order to obtain minority samples, two things were done: first, data from many organizations were pooled to increase minority sample size and, second, organizations not typical of those usually studied by USES were used because of access to the minority samples. The second of these arguments does not seem compelling on its face. But the hypothesis that pooling across employers could lower validity seemed plausible, because each employer might have an idiosyncratic performance standard (e.g., an employee whose performance is "average" in one organization may be "above average" in another). This would make the criterion less reliable, and thus lower validity. However, the hypothesis was not borne out when tested empirically. The data tape, containing raw data for 174 studies, included an employer code. Validities were computed two ways: first, pooling data from all employers within a job and, second, computing a separate validity coefficient for each employer within a job. Because many employers contributed only a single case or a handful of cases, separate validity coefficients were computed only for employers contributing 10 or more cases. This reduced the total sample size by about 20 percent. Mean

validities were essentially the same whether pooled across employers or computed separately for each employer, thus failing to support the hypothesis that multiple employer samples are an explanation for the validity drop. The Northern Test Development Field Center has since conducted similar analyses and also concluded that only a small part of the decline in validities can be attributed to single- versus multiple-location studies (U.S. Department of Labor, 1988).

We have not been able to derive convincing explanations for the decrease in GATB validities from the data available to us. The drop is especially marked in KFM validities, and one possibility is that jobs on the whole require less psychomotor skill than previously, but this scarcely explains the general decline. One can speculate as to whether there has been some change in the nature of jobs such that the GATB composite abilities are less valid now than had previously been the case. However, if there were such a change, one would expect it to be noted and commented on widely in the personnel testing literature; similar declines in validity have not been observed with the Armed Services Vocational Aptitude Battery, the military selection and classification battery. It is also possible that the explanation lies in some as yet not identified procedural aspects of the validity studies. In short, the validity drop remains a mystery, and the differences between the early and recent studies demand that USES be cautious in projecting validities computed for old jobs to validities for future jobs.

VALIDITY GENERALIZATION ANALYSES

Having looked at the observed mean validities of two sets of studies, and having noted a substantial decrease in validity in the more recent set, we now turn to the issue of correcting the observed validities.
In order to demonstrate the full range of available options, we report three validity generalization analyses: one correcting only for the effects of sampling error (what is termed "bare bones" analysis), a second correcting for criterion unreliability, and a third correcting for range restriction as well as criterion unreliability. In each example, analyses are reported first for the sample of studies and then broken down by criterion type (job performance versus training success) and by job family. The chapter ends with our conclusions about the most appropriate estimates of the true validity of the GATB for Employment Service jobs.

Correcting Only for Sampling Error

In this analysis, variance expected due to sampling error is computed: the variance is a function of the mean observed validity and the mean

sample size. What is reported in the tables is the mean observed validity coefficient, the observed standard deviation (SD), and the corrected standard deviation. This corrected SD is found by subtracting variance expected due to sampling error from observed variance: this gives a corrected variance, the square root of which is the corrected standard deviation. Thus, within each job family, the mean observed validity estimates the average true validity of the population of jobs in the family, and, provided the population validities are normally distributed, 90 percent of validities can be expected to fall above the point defined by multiplying 1.28 times the corrected standard deviation (1.28 SD units below the mean is the 10th percentile of a normal distribution) and subtracting the result from the mean validity.

TABLE 8-12 Validities Corrected for Sampling Error, Based on 264 Studies

             GVN                           SPQ                           KFM
Job          Mean   Observed  Corrected    Mean   Observed  Corrected    Mean   Observed  Corrected
Family       r      SD        SD           r      SD        SD           r      SD        SD

Overall      .20    .13       .07          .17    .13       .07          .13    .15       .08

Job Performance Criterion
I            .15    .11       .06          .13    .13       .07          .07    .10       .06
II           .19    .13       .07          .16    .14       .08          .21    .18       .10
III          .25    .12       .06          .21    .11       .05          .17    .10       .05
IV           .21    .11       .06          .16    .12       .07          .12    .13       .07
V            .18    .11       .06          .18    .13       .07          .16    .11       .06

Training Criterion
I            .54    .12       .05          .40    .08       .00          .16    .05       .00
II           -      -         -            -      -         -            -      -         -
III          .30    .16       .11          .21    .12       .05          .02    .15       .10
IV           .36    .12       .07          .25    .11       .05          .17    .15       .10
V            .00    .16       .12          .01    .15       .10          .12    .11       .04

NOTE: SD = standard deviation.

Table 8-12 shows that the observed variability is reduced considerably in virtually all test/job family combinations when the effects of sampling error are removed. If there were no variation in true validities, we would expect the standard deviation of the observed validities to be about 0.10, corresponding to an average sample size of 100; the actual standard deviations are only a little larger than they would be if all variation were due to sampling error.
Thus correcting for sampling error produces a marked reduction in the estimated standard deviation of true validities.
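The steps just described (subtract the expected sampling-error variance from the observed variance, take the square root, then move 1.28 corrected SDs below the mean) can be sketched as follows. The sampling-variance formula used here, (1 - r^2)^2 / (N - 1), is a common approximation and is an assumption on our part; the text states only that the variance depends on the mean observed validity and the mean sample size. The input values are hypothetical, so the output should not be expected to reproduce the published table entries.

```python
import math


def bare_bones(mean_r, observed_sd, mean_n, z=1.28):
    """Return (corrected SD, 90 percent credibility value).

    Assumes sampling-error variance is approximately
    (1 - mean_r^2)^2 / (mean_n - 1); this formula is an assumption.
    """
    sampling_var = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    # The corrected variance cannot be negative; floor it at zero.
    corrected_var = max(observed_sd ** 2 - sampling_var, 0.0)
    corrected_sd = math.sqrt(corrected_var)
    # Point above which 90 percent of true validities are expected to fall,
    # given normally distributed true validities.
    credibility_90 = mean_r - z * corrected_sd
    return corrected_sd, credibility_90


# Hypothetical job family: mean r = .25, observed SD = .12, mean N = 120.
corrected_sd, credibility_90 = bare_bones(0.25, 0.12, 120)
print(f"corrected SD = {corrected_sd:.3f}")
print(f"90% credibility value = {credibility_90:.3f}")
```

When the observed variance is no larger than the expected sampling-error variance, the corrected SD is zero and the credibility value equals the mean validity, which corresponds to the .00 entries for corrected SD in Table 8-12.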

Credibility values for the preferred test composite for each job family are shown in Table 8-13. We compute credibility values in each job family such that 90 percent of the true validities of jobs in that family will be greater than the given credibility value. Thus correcting only for sampling error, one finds evidence of modest validity for the GATB for all job families.

TABLE 8-13 Credibility Values for Best Predictors in Each Job Family, Based on 264 Studies

                  Job       Test      Mean        90% Credibility
Criterion         Family    Choice    Validity    Value
Job performance   I         GVN       .15         .06
                  II        KFM       .21         .12
                  III       GVN       .25         .16
                  IV        GVN       .21         .12
                  V         GVN       .18         .09
Training          I         GVN       .54         .39
                  II        -         -           -
                  III       GVN       .30         .16
                  IV        GVN       .36         .26
                  V         GVN       .12         .05

Correcting for Criterion Unreliability

Ideally, a good reliability estimate would be available for each study, in which case each validity coefficient could be corrected for unreliability. Unfortunately, reliability data are available only for 285 of the 755 studies. Thus we will revert to the backup strategy of relying on assumed values. One approach is to use the data from the studies for which reliability estimates are available and project that similar reliability values would have been obtained for the rest of the studies.

A problem that researchers in the area of validity generalization have noted is that some methods of reliability estimation are likely to produce inflated reliability measures. For example, a "rate-rerate" method, in which a supervisor is asked to provide a rating of performance on two occasions, typically about two weeks apart, is likely to produce overestimates of reliability, since it is not at all unlikely that the supervisor will remember the previous rating and rate similarly in order to appear consistent. Unfortunately, this method is the most commonly used in the GATB data base, in which it produces a mean reliability value of .86.
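Why the choice of reliability estimate matters can be seen from the standard correction for attenuation, in which an observed validity is divided by the square root of the criterion reliability. The sketch below applies that correction to a hypothetical observed validity of .20 for several assumed reliabilities; .86 is the rate-rerate mean reported above, and the other reliabilities are illustrative values only.

```python
import math


def correct_for_unreliability(observed_r, criterion_reliability):
    """Divide an observed validity by the square root of criterion reliability."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("reliability must be in the interval (0, 1]")
    return observed_r / math.sqrt(criterion_reliability)


observed_r = 0.20  # hypothetical observed validity
for reliability in (0.60, 0.80, 0.86):
    corrected = correct_for_unreliability(observed_r, reliability)
    increase = 100 * (corrected / observed_r - 1)
    print(f"reliability {reliability:.2f}: corrected r = {corrected:.3f} "
          f"(+{increase:.0f} percent)")
```

The lower the assumed reliability, the larger the upward correction, so an inflated reliability estimate leads to an understated corrected validity and an understated reliability estimate leads to an overstated one.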
More appropriate is an interrater reliability method; unfortunately, only four studies in the GATB data base use this method. On the basis of this lack of meaningful reliability data, Hunter assumed in his validity generalization research for USES that reliability was .60 when job performance was used as the criterion and .80 when training success was used as the criterion. These values were based on a general survey of the criterion measurement literature.

These values have met with some skepticism among industrial/organizational psychologists, many of whom believe that the .60 value is too low, and that interrater reliability is at least on some occasions substantially higher than this. For example, recent research on performance in military jobs, using job sample tests as the criterion, documents interrater reliabilities in the .90s (U.S. Department of Defense, 1989). However, no formal rebuttal of Hunter's position has appeared in print. The .80 for reliability of training success does not appear controversial.

Operationally, we can correct for the effects of criterion unreliability by dividing the mean validity coefficient by the square root of the mean reliability coefficient. Thus, using .60 increases each observed validity by 29 percent and using .80 increases each observed validity by 12 percent. Given the paucity of data, we recommend the more conservative .80 correction.

Correcting for Range Restriction

If the test standard deviation is smaller in the study sample than in the applicant pool, then the validity coefficient for workers will be reduced due to range restriction and will be an underestimate of the true validity of the test for applicants. If the standard deviation for the applicant pool is known, the ratio of study SD to applicant SD is a measure of the degree of range restriction, and the validity coefficient can be corrected to produce the value that would result if the full applicant population had been represented in the study.

In the GATB data base the restricted SD is known for each test; however, no values for the applicant pool SD are available. Hunter dealt with this by making two assumptions: (1) for each job, the applicant pool is the entire U.S. work force and (2) the pooled data from all the studies in the GATB data base can be taken as a representation of the U.S. work force. Thus Hunter computed the GVN, SPQ, and KFM SDs across all 515 jobs that he studied. Then, for each sample, he compared the sample SD with this population SD as the basis for his range-restriction correction.

The notion that the entire work force can be viewed as the applicant pool for each job is troubling. Intuitively we tend to think that people gravitate to jobs for which they are potentially suited: highly educated people tend not to apply for minimum-wage jobs, and young high school graduates tend not to apply for middle-management positions. And indeed there is a large and varied economic literature on educational screening, self-selection, and market-induced sorting of individuals that speaks against the notion that the entire work force can be viewed as the applicant pool for each job (Sueyoshi, 1988).

Some empirical support for the notion that the applicant pool for individual jobs is more restricted than the applicant pool for the entire work force can be found by examining test SDs within job families. Using the logic of Hunter's analysis, if data from all jobs can be pooled to estimate the applicant population SD, then data from jobs in one family can be pooled to estimate the applicant SD for that family. Applying this logic to the GVN composite produces the following:

GVN SD based on all jobs           53.0
GVN SD based on Job Family I       45.6
GVN SD based on Job Family II      48.6
GVN SD based on Job Family III     49.7
GVN SD based on Job Family IV      49.2
GVN SD based on Job Family V       48.4

Since the mean restricted GVN SD for the 755 studies is 42.2, Hunter's method would produce a ratio of restricted to unrestricted SDs of .80, whereas the family-specific ratios would vary from .85 to .93. Thus there is a suggestion that Hunter's approach may overcorrect.

Since Job Families IV and V, which constitute the principal fraction of Employment Service jobs, include a very wide range of jobs, we might expect the standard deviation for actual applicant groups to be smaller than that obtained by acting as if all workers in the job family might apply for each job. Empirical data on test SDs in applicant pools for a variety of jobs filled through the Employment Service are needed to assess whether Hunter's analysis overcorrects for range restriction. In the absence of applicant pool data, the conservative correction for restriction of range would be simply to apply no correction at all.

The effect of Hunter's correction for restriction of range, which assumes a restriction ratio of .80, is to multiply the observed correlations by 1.25 when the observed correlations are modest.
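
The two corrections discussed above can be sketched as follows. The unreliability correction divides the observed validity by the square root of the criterion reliability, as described in the text. For range restriction, the sketch assumes the standard formula for direct restriction on the predictor (often called Thorndike's Case II); the text does not state which formula Hunter applied, so this choice is an assumption, as are the function names. The SD values are those reported above for the GVN composite.

```python
import math


def correct_for_unreliability(r_observed, criterion_reliability):
    """Divide the observed validity by sqrt(criterion reliability)."""
    return r_observed / math.sqrt(criterion_reliability)


def correct_for_range_restriction(r_observed, sd_ratio):
    """Correct for direct range restriction on the predictor.

    sd_ratio is restricted SD / unrestricted SD. The formula used here
    (Thorndike Case II) is an assumption, not quoted from the chapter.
    """
    big_u = 1.0 / sd_ratio
    return (r_observed * big_u
            / math.sqrt(1.0 + r_observed ** 2 * (big_u ** 2 - 1.0)))


if __name__ == "__main__":
    # Proportional increase implied by each assumed reliability.
    for reliability in (0.60, 0.80):
        factor = correct_for_unreliability(1.0, reliability)
        print(f"reliability {reliability:.2f}: multiply by {factor:.3f}")

    # Restriction ratios from the GVN SDs reported in the text.
    restricted_sd = 42.2
    pooled_sds = {"all jobs": 53.0, "I": 45.6, "II": 48.6,
                  "III": 49.7, "IV": 49.2, "V": 48.4}
    for label, sd in pooled_sds.items():
        ratio = restricted_sd / sd
        corrected = correct_for_range_restriction(0.20, ratio)
        print(f"{label}: ratio {ratio:.2f}, r = .20 becomes {corrected:.3f}")
```

For small observed correlations the range-restriction factor approaches 1 divided by the SD ratio, which is where the multiplier of 1.25 for a ratio of .80 comes from.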
The combined effect of his correction for reliability (which assumes reliabilities of .60 for job performance and .80 for training success) and restriction of range is to increase the observed correlations by 61 percent for job performance and by 40 percent for training success. The more conservative correction recommended by the committee, one that allows for reliability of .80 and no correction for restriction of range in the worker population, would increase each correlation by 12 percent. Thus sizable differences in estimated validities will occur according to the correction chosen.

When the more conservative assumptions are applied to the 264 recent studies, one is left with a very different sense of overall GATB validities than that projected by the USES test research reports drafted by Hunter (Table 8-14). Overall GVN validities are .22 instead of .47. The KFM validities shrink from .35 to .15 in the recent studies. These differences are not due only to differences in analytic method. The 264 more recent studies simply produce different empirical findings (that is, lower validities) than the earlier 515.

TABLE 8-14 Validities Corrected for Reliability, Based on 264 Studies, Compared with Hunter's Validities Using His Larger Corrections for Reliability and Restriction of Range, Based on 515 Studies

Job Family   GVN 515   GVN 264   SPQ 515   SPQ 264   KFM 515   KFM 264   Count 515   Count 264
Overall      .47       .22       .38       .19       .35       .15       38,620      38,521

Job Performance Criterion
I            .56       .17       .52       .15       .30       .08       1,142       3,900
II           .23       .21       .24       .18       .48       .24       1,155       200
III          .58       .28       .35       .24       .21       .19       2,424       630
IV           .51       .23       .40       .18       .32       .13       12,705      19,206
V            .40       .20       .35       .20       .43       .18       13,367      10,862

Training Criterion
I            .65       .60       .53       .45       .09       .18       180         64
II           --        --        --        --        --        --        --          --
III          .50       .33       .26       .24       .13       .02       1,800       347
IV           .57       .40       .44       .28       .31       .19       4,183       3,169
V            .54       .00       .53       .01       .40       .13       655         106

Optimal Predictors Based on the Recent 264 Studies

The corrected correlations in Table 8-14 may be used to develop composite predictors of job performance in the different job families based on the recent 264 studies. These predictors are weighted combinations of GVN, SPQ, and KFM, with the weights chosen to maximize the correlation between predictor and supervisor ratings. Because the composites GVN, SPQ, and KFM are themselves highly intercorrelated (Table 7-1), a wide range of weights will give about the same predictive accuracy. For example, the predictor 2 GVN + KFM is very nearly optimal for both Job Family IV and Job Family V.
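
The insensitivity of composite validity to the choice of weights can be illustrated with a short sketch. For standardized scores, the correlation of a weighted composite with the criterion is the weighted sum of the individual validities divided by the standard deviation of the composite. The validities below are the Job Family IV values for the 264 studies from Table 8-14. The intercorrelations among GVN, SPQ, and KFM are hypothetical placeholders, since the Table 7-1 values are not reproduced here, and the weights are assumed to apply to standardized composites. The output therefore illustrates the pattern only and does not reproduce the committee's figures.

```python
import math

# Job Family IV validities (GVN, SPQ, KFM) for the 264 studies, Table 8-14.
FAMILY_IV_VALIDITIES = (0.23, 0.18, 0.13)

# HYPOTHETICAL intercorrelations among GVN, SPQ, KFM (not from Table 7-1).
HYPOTHETICAL_R = (
    (1.00, 0.75, 0.35),
    (0.75, 1.00, 0.50),
    (0.35, 0.50, 1.00),
)


def composite_validity(weights, validities, intercorrelations):
    """Correlation between a weighted composite and the criterion.

    Assumes standardized predictors, so the composite variance is the
    quadratic form of the weights in the intercorrelation matrix.
    """
    covariance = sum(w * v for w, v in zip(weights, validities))
    variance = sum(
        wi * wj * intercorrelations[i][j]
        for i, wi in enumerate(weights)
        for j, wj in enumerate(weights)
    )
    return covariance / math.sqrt(variance)


if __name__ == "__main__":
    candidate_weights = {
        "GVN only": (1, 0, 0),
        "2 GVN + KFM": (2, 0, 1),
        "GVN + KFM": (1, 0, 1),
        "GVN + SPQ + KFM": (1, 1, 1),
    }
    for label, weights in candidate_weights.items():
        r = composite_validity(weights, FAMILY_IV_VALIDITIES, HYPOTHETICAL_R)
        print(f"{label}: {r:.3f}")
```

With these placeholder intercorrelations, quite different weightings yield composite validities within a couple of hundredths of one another.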

The optimal predictor in Job Family IV has a correlation of .24 with supervisor ratings. The optimal predictor in Job Family V has a correlation of .25 with supervisor ratings. The comparable correlations produced in Hunter's analysis are .53 and .50. The differences are partly due to the lower observed correlations in the recent studies and partly due to our use of more conservative corrections.

FINDINGS: THE GATB DATA BASE

Criterion-Related Validity Prior to 1972

1. Validity studies of the GATB completed prior to 1972 produce a mean observed correlation of about .25 between cognitive, perceptual, or psychomotor aptitude scores and supervisor ratings on the job. The mean observed correlation between cognitive or perceptual scores and training success is about .35.

Criterion-Related Validity Changes Since 1972

2. There are notable differences in the results of GATB validity studies conducted prior to 1972 and the later studies. The mean observed correlation between supervisor ratings and cognitive or perceptual aptitude scores declines to .19, and between supervisor ratings and psychomotor aptitude scores declines to .13.

CONCLUSIONS ON VALIDITY GENERALIZATION FOR THE GATB

1. The general thesis of the theory of validity generalization, that validities established for some jobs are generalizable to other unexamined jobs, is accepted by the committee.

Observed and Adjusted Validities

2. The GATB has modest validities for predicting supervisor ratings of job performance or training success in the 755 validity studies assembled by USES over 45 years. The unexplained marked decrease in validity in recent studies suggests caution in projecting these validities into the future.

3. The average observed validity of GATB aptitude composites for supervisor ratings over the five job families of USES jobs in recent years is about .22.

4. In the committee's judgment, plausible adjustments for criterion unreliability might raise the average observed validity of the GATB aptitude composites from .22 to .25 for recent studies. Corresponding adjustments for the older studies produce a validity of .35, and the average corrected validity across all 755 studies is approximately .30, with about 90 percent of the jobs studied falling in the range of .20 to .40. These validities are lower than those circulated in USES technical reports, such as Test Research Report No. 45 (U.S. Department of Labor, 1983b), which tend to be .5 or higher. The lower estimates are due to the drop in observed validities in recent studies and to our use of more conservative analytic assumptions. We have made the correction for unreliability based on an assumed value of .80; we have made no correction for restriction of range.

5. In the committee's judgment, two of the three adjustments to observed GATB validities made in the USES analysis (the adjustment for restriction of range and that for criterion unreliability) are not well supported by evidence. We conclude that the corrected validities reported in USES test research reports are inflated. In particular, we do not accept Hunter's assumption used in correcting for restriction of range, namely that the applicant pool for a particular job consists of all workers in all jobs. This assumption causes the observed correlations to be adjusted upward by 25 percent for small correlations and by 35 percent for observed validities of .50. Restriction-of-range estimates should be based on data from applicants for homogeneous clusters of jobs. Undoubtedly there is an effect due to restriction of range, but in the absence of data to estimate the effect, no correction should be made.

6. Reliability corrections are based in part on data in the GATB validity data base, and so have more empirical support than the corrections for restriction of range. There remains some question whether a reliability value of .60, which has the effect of increasing correlations by 29 percent, is appropriate for supervisor ratings. Given the weakness of the supporting data, we believe that a conservative correction, based on an estimated reliability of .80, would be appropriate.

Validity Variability

7. Validities vary between jobs. Our calculation is that about 90 percent of the jobs in the GATB studies will have true validities between .2 and .4 for supervisor ratings. We cannot ascertain how generalizable this distribution is to the remaining jobs in the population. For those jobs in the population that are found to be similar to those in the sample, it seems reasonable to expect roughly the same distribution as in the sample.

8. The GATB is heavily oriented toward the assessment of cognitive abilities. However, the cognitive composite is not equally predictive of performance in all jobs. Common sense suggests that psychomotor, spatial, and perceptual abilities would be very important in certain types of jobs. But those sorts of abilities are measured much less well. And GATB research has focused more on selection than on classification, with a consequent emphasis on general ability rather than differential abilities.

9. Since GATB validities have a wide range of values over different jobs and have declined over time, introduction of a testing system based on validity generalization does not eliminate the need for continuing criterion-related validity research. The concept of validity generalization does not obviate the need for continuing validity studies for different jobs and for the same job at different times.