6
The Theory of Validity Generalization
META-ANALYSIS
Meta-analysis is the combination of empirical evidence from diverse
studies. Although the term meta-analysis has emerged only in the past
two decades, formal methods for combining observations have a long
history. Astronomical observations at different sites and times have been
combined in order to draw general conclusions since the 1800s (Stigler,
1986). Statistical techniques for combining significance tests and
combining estimates of effects in agricultural experiments date from the
1930s (Hedges and Olkin, 1985).
Several major programs of quantitative synthesis of research have
existed for decades in the physical sciences. For example, the Particle
Data Group, headquartered jointly at Berkeley and Centre Europeen de la
Recherche Nucleaire in Switzerland, conducts meta-analyses of the
results of experiments in elementary particle physics worldwide and
publishes the results every two years as the Review of Particle Properties.
In medicine, meta-analyses are becoming increasingly important as a
technique to systematize the results of clinical trials (Proceedings of the
Workshop on Methodological Issues in Overviews of Randomized
Clinical Trials, 1987), to collect research results in particular areas (the
Oxford Database of Perinatal Medicine), and in public health (Louis et
al., 1985).
In the social and behavioral sciences, meta-analysis has been used
primarily in psychology and education, for such diverse purposes as to
summarize research on the effectiveness of psychotherapy, the effects of
class size on achievement and attitudes, experimental expectancy effects,
and the social psychology of gender differences.
In studying a problem, every scientist must assimilate and assess the
results of previous studies of the same problem. In the absence of a
formal mechanism for combining the past results, it is always tempting
to assume that the present experiment is of prime originality and to
ignore or dismiss inconvenient or contradictory results from the past.
Many superfluous data are collected because it is too difficult or
confusing or unconvincing or unglamorous to assemble and examine
what is known already.
Meta-analysis attempts to provide a formal mechanism for doing so.
By combining information from different studies, meta-analysis in-
creases the precision with which effects can be estimated (or increases
the power of statistical tests of hypotheses). For example, many clinical
trials in medicine are too small for treatment effects to be estimated with
accuracy, but combining evidence across different studies can yield
estimates that are precise enough to be useful. In addition, meta-
analysis produces more robust evidence than any single study. The
convergence of evidence produced under differing conditions helps to
ensure that the effects observed are not the inadvertent result of some
unrecognized aspect of context, procedure, or measurement. And
finally, meta-analysis usually involves some explicit plan for sampling
from the available body of research evidence. Without controls for
selection, it is possible to obtain very different pictures of the evidence
by selecting, perhaps inadvertently, studies that favor one position or
another.
Although there is no general prescription for carrying out a meta-
analysis, the procedure can be divided into four steps:
1. Identify relevant studies and collect results. It is important in this
step to ensure the representativeness of the studies used. One particularly
difficult source of bias to control for is called the file drawer problem,
which alludes to the tendency for statistically insignificant results to
repose unpublished and unknown in file drawers and thus not be available
for collection.
2. Evaluate individual studies for quality and relevance to the problem
of interest.
3. Identify relevant measurements, comparable across studies.
4. Combine relevant comparable measures across studies and project
these values onto the problem of interest.
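Step 4 can be sketched with a standard meta-analytic device, inverse-variance weighting of independent study estimates. This is a hypothetical illustration, not a procedure prescribed by this chapter; the effects and variances below are invented.

```python
# Combining estimates across studies by inverse-variance weighting
# (one common meta-analytic method; the numbers are illustrative).

def combine(effects, variances):
    """Inverse-variance weighted mean of the effects, and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    return mean, 1.0 / total

effects = [0.30, 0.45, 0.25]       # effect estimates from three studies
variances = [0.010, 0.025, 0.020]  # their sampling variances

mean, var = combine(effects, variances)
print(round(mean, 3), round(var, 4))  # 0.318 0.0053
```

The combined variance (1/190, about .0053) is smaller than that of any single study, which is the gain in precision described above.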

VALIDITY GENERALIZATION
Validity generalization is a branch of meta-analysis that draws on
criterion-related validity evidence to extend the results of test research.
Our precise interest is the estimation of the validities of a test for
performance on new jobs, based on meta-analysis of the validities of the
test for studied jobs. There is a very substantial statistical and psychomet-
ric literature on estimating validities measured via correlation coefficients.
This chapter presents in broad outline the statistical analyses used in
validity generalization and focuses particularly on the work of John E.
Hunter and his frequent collaborator, Frank L. Schmidt, because of
Hunter's central role in applying validity generalization to the General
Aptitude Test Battery (GATB) (Hunter, 1986; Schmidt and Hunter, 1977,
1981; Schmidt et al., 1982; U.S. Department of Labor, 1983b,c).
The Theoretical Framework
The fundamental problem addressed by validity generalization is how
to characterize the generalizability of test validities across situations,
populations of applicants, and jobs. The most prominent approach treats
the problem as one of examining the variability across studies of validity
coefficients. The theoretical framework is as follows.
One wants to estimate the true validity of a test for given jobs. (By true
validity, we mean the validity that would obtain in studies conducted under
ideal conditions, with job performance assessed with perfect accuracy by the
criterion.) As a first proposition, it is assumed that there is some distribution
of true validities across a population of jobs, and that this distribution of
validities is then taken to apply to new jobs that have not undergone a
criterion-related validity study. The conclusion will be in the form: the
validity of the test for the new job lies between .3 and .5 with probability .9.
The questions remain: how are the observed validities to be used to
estimate the distribution of true validities across a population of jobs, and
thus what is the probable range of values that can be generalized to a new
job? There are a number of ways in which the correlation coefficient
obtained in any given study of the relation of test scores to job perfor-
mance is affected by situational factors, so that the validity estimate
differs from the true validity of the test for a new job:
¹The criterion-related validity of a test is a measure of the relationship between the test
score and a criterion of job performance (e.g., supervisor ratings). The relationship between
test score and job performance is measured by the product moment correlation. Following
standard practice, we refer to this correlation as validity, although this usage invites
confusion with other psychometric and legal uses of the word validity.

Sampling error. The observed validities are based on a sample of
workers; the true validities are based on a population of applicants. The
difference between sample and population is adjusted for by taking into
account sampling error of the observed validities. The major effect is that
the variability of the observed validities over jobs is greater than the
variability of true validities.
Restriction of range. The observed validities are based on a sample of
workers; the true validities are based on a population of applicants.
Because the worker group may be selected from the applicant group by
criteria correlated with the test score, the distribution of test scores within
the worker group may be different from that in the applicant group. There
will be a corresponding difference between true validities for workers and
applicants. For example, in a highly selective job, range restriction occurs
so that nearly all workers will have a narrow high range of test scores, and
the true validity will be lower than that for an unselected applicant group.
If the applicant and worker distributions can be estimated, it is possible to
correct for range restriction.
Reliability of supervisor ratings. The criterion of supervisor ratings is
assumed to be perfectly measured in computing the true validities.
Unreliable supervisor ratings will tend to make the observed validities
smaller than the true validities; if the reliability of supervisor ratings can
be estimated, an adjustment can be made for it in estimating the true
validity.
Connecting the sample to the population. The new job is different from
the jobs studied. If the jobs studied are assumed to be a random sample
from the population of all jobs, then the sample distribution can be
projected to the population distribution. This is the implicit assumption of
the Schmidt-Hunter validity generalization analyses. If this assumption
cannot be sustained, some other connection must be established between
the jobs studied and the new job.
Each of these factors is considered below in some detail.
Sampling Error
The true validity for a given job, population of subjects, and criterion is
the validity coefficient that would be obtained by conducting a validity
study involving the entire population. Any actual validity study will use
only a sample of subjects (typically a group of job incumbents chosen to
participate in the study) and will yield an observed validity (a sample
correlation), r, that differs from the true validity as a consequence of the
choice of sample. The observed validity r will deviate from the true
validity by a sampling error, e.

If several samples are taken from the same population, each would
have a different observed validity. It is the variability of these observed
validities about the true validity that tells us how confident to be in
estimating the true validity by the observed validity r.
Suppose, for example, there is a population of 1,000 individuals for
which the true validity of a test is .3. We draw a sample of 100 individuals
and compute an observed validity of .41. Other samples of 100 individuals
give validities of .22, .35, .42. The observed values vary around the true
value by a range of about .1.
Now suppose there is another population of 1,000 individuals for which
the true validity is unknown. We draw a sample of 100 individuals and
compute an observed validity of .25. What is the true validity? We think
that it lies somewhere in the range .15 to .35. Thus we use the distribution
of sampling error to indicate how close the true validity is likely to be to
the observed validity. (There may be other evidence such as prior
information about the true validity.)
The average M of the sampling error is very close to zero for modest
true validities. The variance of the sampling error, the average of
(e - M)², is close to 1/(n - 1), where n is the sample size, again for
modest true validities. Thus for sample sizes of 100, the variance is
about .01 and the standard deviation is .1; we expect the observed
validity to differ by about .1 from the true validity.
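This rule of thumb can be checked by simulation. The sketch below is our own (it assumes bivariate normal test and criterion scores with true validity .3, which the chapter does not specify); it draws many samples of 100 and measures the spread of the observed correlations.

```python
# Monte Carlo check that observed validities r scatter around a modest
# true validity with standard deviation close to 1/sqrt(n - 1), i.e.
# about .1 when n = 100. Settings are our own illustrative assumptions.
import math
import random

def sample_r(rho, n, rng):
    """Pearson correlation of one simulated sample of n score pairs."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
rs = [sample_r(0.3, 100, rng) for _ in range(2000)]
mean_r = sum(rs) / len(rs)
sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / (len(rs) - 1))
print(round(mean_r, 2), round(sd_r, 2))
```

The simulated standard deviation comes out slightly below .1, consistent with the text's "about .1" for modest validities.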
Corrections for Sampling Error
To illustrate how corrections for sampling error fit into the estimation of
the distribution of true validities in a population of jobs, we offer a
hypothetical example. Assume, following Hunter and Schmidt, that the
jobs actually studied form a random sample of the population of jobs. For
each job studied a random sample of applicants is taken from the relevant
population for that job, and an observed validity is computed for the
random sample. Note that there are two levels of sampling, from the
universe of jobs and from the universe of applicants for each job.
Provided that the different studies are independent, the expected
variance of the observed validities is the sum of two components: the
variance of the true validities plus the average variance of the sampling
error.
Thus we estimate the mean true validity in the population of jobs by the
average of the observed validities, but we must estimate the variance of
true validities by the observed variance of observed validities less the
average sampling variance. This is the correction for sampling error. A
good practical estimate of the average sampling variance is the average
value of 1/(n - 1), where n is the sample size (Schmidt et al., 1982).

An Example
We have 11 jobs with 1,000 applicants each. If all applicants were tested
and evaluated on the job, the true validities would be (dropping the
decimal point): 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35.
We sample from the 11 jobs at random to get 4 jobs with true validities:
26, 28, 31, 32.
For each of the four jobs, we sample 101 from the 1,000 applicants. For
the four samples we compute observed validities: 34, 16, 26, 40.
We use these observed validities to estimate properties of the original
distribution of true validities. The mean true validity is estimated by the
mean of the sample validities, 29. The sample variance is (5² + 13² +
3² + 11²)/3 = 108, but this overestimates the variance of true validities
because of sampling error. For each sample, the sampling error variance
is 10,000/(n - 1) = 100 approximately (remember that the decimal has
been dropped, multiplying the scale by 100). Thus the average sampling
error variance is 100, and the estimated variance of true validities is
108 - 100 = 8.
The mean true validity is 30, estimated by 29, and the variance of true
validities is 10 estimated by 8. These estimates are closer than we have a
right to expect, but the important point is that a drastic overestimate in
true validity variance may occur if the sampling error correction is not
made.
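The arithmetic of this example can be retraced directly (on the same scale, validities multiplied by 100):

```python
# Worked example from the text: four observed validities, each computed
# on a sample of n = 101 applicants, on the x100 scale.
observed = [34, 16, 26, 40]
n = 101

mean_est = sum(observed) / len(observed)       # estimates the mean true validity
sample_var = sum((v - mean_est) ** 2 for v in observed) / (len(observed) - 1)
sampling_var = 10_000 / (n - 1)                # 1/(n - 1), rescaled by 100²
true_var_est = sample_var - sampling_var       # correction for sampling error

print(mean_est, sample_var, sampling_var, true_var_est)  # 29.0 108.0 100.0 8.0
```

Without the subtraction of the sampling variance, 108 would be reported where the true-validity variance is only about 10.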
Note that these procedures do not make assumptions about the form of
the distribution from which the true validities are sampled (although
distributional estimates derived from the procedures frequently do).
However, the computation of an estimate of the sampling error variance
does require weak assumptions about the distribution of test and criterion
scores within studies. When the population validities are moderate, the
estimate 1/(n - 1) is satisfactory.
The corrections for sampling error in the Hunter-Schmidt analyses, all
in all, follow accepted statistical practice for estimating components of
variance.
Restriction of Range
Observed validities are based on a sample of workers, whereas the true
validities are based on a population of applicants. Since the worker group
presumably has been selected from the applicant group by criteria
correlated with the test score, the distribution of test scores within the
worker group should be different from that in the applicant group. There
will be a corresponding difference between "true" validities for workers
and applicants.

It is necessary to develop a mechanism to relate the validities of
workers and applicants. Since many applicants will never be employed on
the job, it is impossible to assemble job performance data on a typical
group of applicants. We must estimate, by some theoretical model, what
the job performance would have been if the applicants had been selected
for employment.
We make two assumptions. The first is that the linear function of test
score that best predicts job performance, when computed separately for
the population of applicants and the population of workers, has the same
coefficient of test score in both groups. This means that a given increase
in test score produces the same increase in predicted job performance in
both groups. Some such assumption cannot be avoided, because we have
data available only for the worker group but wish to use that data to make
predictions about the applicant group.
The second assumption is that the error of the linear prediction of job
performance by test score has the same variance in both groups. One might
argue against this assumption on the grounds that workers' job performance
will be predicted more accurately if the workers are rationally selected to
maximize job performance. But methods of prediction of job performance
are not so well developed that we would expect a very noticeable decrease
in error variance in the worker group (see Linn et al., 1981).
Under these assumptions there is a remarkable formula connecting the
theoretical validities in the two groups: the quantity (1/validity² - 1)
multiplied by the test score variance is the same in both groups. When the
validities are moderate or small, this means that the ratio of the validities
in the two groups is very nearly the same as the ratio of the standard
deviations of test scores in the two groups. The ratio of the standard
deviation in the worker group to the standard deviation in the applicant
group will be called the restriction ratio. Thus if a worker group is thought
to have a standard deviation only half that of the applicant group, then the
restriction ratio is one-half, and the validity of the test for the applicant
group is close to twice that of the worker group.
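A numeric check of this invariance, with our own illustrative numbers: given an applicant-group validity and a restriction ratio, the exact worker-group validity implied by the formula can be compared with the simple standard-deviation-ratio approximation.

```python
# Invariance from the text: (1/validity² - 1) x (test score variance) is
# the same for applicants and workers. The validity .30 and restriction
# ratio 0.5 below are illustrative assumptions.
import math

def worker_validity(applicant_validity, restriction_ratio):
    """Exact worker-group validity implied by the invariance, given the
    applicant-group validity and the restriction ratio (worker sd / applicant sd)."""
    inv = (1 / applicant_validity ** 2 - 1) / restriction_ratio ** 2
    return 1 / math.sqrt(inv + 1)

exact = worker_validity(0.30, 0.5)   # exact value under the model
approx = 0.5 * 0.30                  # the sd-ratio approximation
print(round(exact, 3), round(approx, 2))  # 0.155 0.15
```

For this moderate validity the approximation (half the applicant validity) is close to the exact value, as the text asserts.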
The main problem in determining the correction for restriction of range
is identifying the appropriate population of applicants for a particular job
and estimating the variance of test scores for those applicants. The
validation study will use as subjects a set of workers on the job, but we
wish to estimate the validity for a set of applicants for the job who will
take the test through the Employment Service. Few data are available on
the distribution of test scores of applicants for particular jobs. It is not
even clear who should be regarded as applicants. Anyone who wishes to
apply for the job? Anyone who wishes to apply for the job and is willing
to take the test? Anyone who wishes to apply for the job and meets the
employer's minimum qualifications?

The pool of applicants for jobs as laborers or as university professors
may be considerably more restricted than the general population (because
of self-selection or qualifications required). Consequently, the correlation
between test score and job performance among the applicants to these
jobs may not be as high as would be the case if the general population
applied for and was employed in these occupations. Note also that the
pool of potential job applicants is not necessarily fixed across localities
and therefore across validity studies. For example, in localities with
chronically high unemployment, the pool of potential applicants for
low-paying jobs may include many people with high test scores who might
not be available (because they are employed) in localities with low
unemployment.
Corrections for Restrictions of Range
Suppose that the above assumptions about the relationship between
test score and job performance are satisfied for worker and applicant
groups. How can the observed validities be corrected for restriction of
range? The standard procedure is as follows: for each job studied, the
restriction ratio (the ratio of the standard deviations of test scores for
workers and applicants) is estimated. The sample validities computed on
the sample of workers are adjusted to give estimated validities for the
population of applicants for the job. The average of the true validities for
the population of jobs, and the variance of the true validities for the
population of jobs, with due adjustment for sampling error, are computed
from the estimated validities adjusted for restriction of range.
The principal effect of the restriction-of-range correction is to increase
or decrease the estimate of average true validity; for example, if the
average restriction ratio is one-half, the effect is to double the estimate of
mean true validity.
Let us trace the theoretical assumptions and the corresponding com-
putations from applicant population to sample of workers on the hypo-
thetical population considered previously.
We have 11 jobs with 1,000 applicants each. If all applicants were tested
and evaluated on the job, the true validities would be (dropping the
decimal point): 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35.
We sample from the 11 jobs at random to get 4 jobs with true applicant
validities: 26, 28, 31, 32.
The four jobs selected have restriction ratios of: 0.5, 0.5, 1, 1. Thus the
true validities for populations of workers in the four jobs are 13, 14, 31,
32.
For each of the four jobs, we sample 101 from 500 workers on the job.
For the four samples we compute observed validities: 21, 2, 26, 40.

The effect of the restriction of range is to lower the observed validities
whenever the restriction ratio is less than 1. Thus the first two observed
validities average 11 although the true validities average 27. When the
ratio of standard deviations varies between jobs, a secondary effect is to
increase the variance of the observed validities.
In practice, only the observed validities are known, and one wants to
infer properties of the true validities. To get from the worker sample back
to the applicant population, we must undo the various operations in going
from the population to the sample. The observed validities are corrected
for restriction of range; the corrected observed validities are 42, 4, 26, 40.
The new estimate of mean true validity is the corrected sample average
28; without the correction the estimate would be 22. The estimate of
variance of true validity is the sample variance 307 less the average error
variance in the four studies 300, yielding an estimated true variance of 7.
Note that the adjusted validities are more variable than the unadjusted
ones.
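The correction arithmetic above can be written out in a few lines (x100 scale; restriction ratios as assumed in the example):

```python
# Observed worker-sample validities and the restriction ratio of each job.
observed = [21, 2, 26, 40]
ratios = [0.5, 0.5, 1.0, 1.0]

# The correction divides each observed validity by its restriction ratio.
corrected = [v / r for v, r in zip(observed, ratios)]

uncorrected_mean = sum(observed) / len(observed)
corrected_mean = sum(corrected) / len(corrected)
print(corrected)                                       # [42.0, 4.0, 26.0, 40.0]
print(round(uncorrected_mean), round(corrected_mean))  # 22 28
```

As the text notes, the corrected validities (42 and 4 versus 21 and 2) are more spread out than the observed ones, which is why a further variance adjustment is needed.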
Estimating Restriction Ratios
In principle, it is possible to estimate the variances in test scores for
different applicant groups, the variances necessary for correcting for
restriction of range. However, in the GATB validity studies, which use
workers on the job, no information is available about applicant groups for
those jobs. It is not even clear how applicant groups should be defined for
those jobs. It could be all people who applied for the job over a period of
time, all people in the local labor market who met the requirements for the
job, or all registrants in the local Employment Service office. The last
definition might best fit the purpose of relating test scores to job
performance for Employment Service registrants.
Methods have been developed to correct for restriction of range in a
large sample of studies without knowing the restriction ratio for every
individual study. It is assumed that the restriction ratios for the various
studies have a known mean and variance, and that the distribution of
restriction ratios is independent of the true validities (Callender and
Osburn, 1980). The known mean and variance are sufficient to determine
the correction. For example, if the restriction ratios have average value
0.5, the average true validity is estimated to be about twice the average
observed validity. Similarly, if the restriction ratios have a large variance,
a reduction will occur in estimating the variance of true validities
compared with the observed variance of sample validities.
The model and calculations are as follows:
Sample validity = restriction ratio x true validity + error

Since the restriction ratio is assumed to have a distribution independent of
the true validity:
average sample validity = average restriction ratio x average true validity

variance of sample validity = average sampling variance
    + [restriction ratio variance x (average true validity)²]
    + [true validity variance x (average restriction ratio)²]
The variance calculation is only approximate, but the approximation is
good whenever the restriction ratio has small percentage variation.
The same model may be used if multiplicative factors other than the
restriction ratio are included; one need only know the mean and variance
of the multiplicative factor.
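The moment calculation above can be evaluated numerically. The moments below are our own illustrative assumptions, not figures from the chapter.

```python
# Numeric reading of the Callender-Osburn moment model (all inputs are
# illustrative assumptions).
mean_ratio, var_ratio = 0.5, 0.01   # restriction ratio: mean and variance
mean_true, var_true = 0.40, 0.002   # true validity: mean and variance
avg_sampling_var = 0.01             # about 1/(n - 1) for n = 101

# average sample validity = average restriction ratio x average true validity
mean_sample = mean_ratio * mean_true

# variance of sample validity = average sampling variance
#   + restriction-ratio variance x (average true validity)²
#   + true-validity variance x (average restriction ratio)²
var_sample = (avg_sampling_var
              + var_ratio * mean_true ** 2
              + var_true * mean_ratio ** 2)

print(round(mean_sample, 2), round(var_sample, 5))  # 0.2 0.0121
```

With an average restriction ratio of 0.5, the average sample validity (.20) is half the average true validity (.40), matching the text's example.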
Can GATB Restriction Ratios be Estimated?
The crucial question remains: What is the average restriction ratio? The
simple option of using the variance derived from all workers who
appeared in the studies (U.S. Department of Labor, 1983c) will lead to
inflated corrections for restriction of range if this group is more variable
in test scores than a typical applicant group for a particular job. This
method of correction is also at odds with assertions made elsewhere by
Hunter (U.S. Department of Labor, 1983e) that the selection methods of
the Employment Service are "equivalent to random selection"; if indeed
that were true, there would be no difference between applicant groups and
worker groups in test score variance. In the absence of direct information
for particular jobs, the conservative response is to apply no correction for
restriction of range.
Lack of adequate reliable data about the variance of test scores in
realistically defined applicant populations is a major problem in validity
generalization from the GATB validity studies. The absence of direct data
is so pronounced that the committee has chosen the conservative re-
sponse of making no corrections for range restrictions in its analysis of
GATB validities.
Reliability of Supervisor Ratings
In each validity study, a worker's job performance is measured by a
supervisor rating. We distinguish between a true rating, done with
exhaustive study of the worker's job performance, and an observed

rating, performed under real conditions by the supervisor. We suppose
that the observed rating differs from the true rating by some error that is
uncorrelated with the true rating over the population of workers.
Reliability is measured by the ratio of the variance of the true ratings to
the variance of the observed ratings. If there is no measurement error, the
reliability would be 1; if the observed rating is unrelated to the true rating,
the reliability would be zero.
The reliability correction is the ratio of the standard deviation of the
true ratings to the standard deviation of the observed ratings. It is the
square root of the reliability. Just as with the restriction ratio, the validity
of test score with observed rating is divided by the reliability correction to
become a validity of test score with true rating.
The main effect of the reliability correction is to increase the estimate
of average true validity. A secondary effect, when reliabilities vary among
studies, is to reduce the estimate of variance of true validities compared
with the observed variance of sample validities.
Much the same things can be said about reliability corrections as for
restriction of range corrections. It is a sensible correction if the required
ratios of variances can be estimated, but in the GATB validity studies
the reliability of the ratings is rarely available. In the Hunter and
Schmidt validity generalization analysis, the mean and variance of the
distribution of reliabilities across studies are assumed, and the mean
and variance of true validities are corrected accordingly. If the reliabil-
ities are underestimated, then the correction will be an overcorrection.
The mean reliability of .60 assumed by Hunter and Schmidt causes a
reliability correction of 0.78; the true validity estimate is increased by
30 percent. Given the dangers of overcorrecting, and given the observa-
tion of reliabilities higher than .60 in many studies, the more conservative
figure of .80 seems more appropriate to the committee and is used in its
calculations.
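The size of the reliability correction is easy to compute: the observed validity is divided by the square root of the assumed criterion reliability. The observed validity of .25 used below is our own illustrative number.

```python
# Reliability correction as described above: divide the observed validity
# by the square root of the assumed criterion reliability.
import math

def correct_for_reliability(observed_validity, reliability):
    """Apply the reliability correction to an observed validity."""
    return observed_validity / math.sqrt(reliability)

factor_60 = 1 / math.sqrt(0.60)  # Hunter-Schmidt's assumed .60: about 1.29
factor_80 = 1 / math.sqrt(0.80)  # the committee's .80: about 1.12
print(round(factor_60, 2), round(factor_80, 2))        # 1.29 1.12
print(round(correct_for_reliability(0.25, 0.60), 3))   # 0.323
```

The gap between the two factors (roughly a 30 percent inflation versus 12 percent) shows why the assumed reliability matters so much.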
Connecting the Sample to the Population
The data available about validities of the GATB consist of some 750
studies, conducted by the USES, in collaboration with employers, over
the period 1945-1985. We wish to draw conclusions about the validity of
the GATB for jobs in new settings, as well as about the population of
12,000 job types in many different settings. In order to justify the
extrapolation, we must establish a connection between the jobs studied
and the targeted population of jobs.
In USES validity generalization studies (U.S. Department of Labor,
1983e), it is asserted that the jobs studied in each of five job families may
be taken to be a sample from the set of all jobs in the corresponding job

family. Inference about population characteristics is then based on the
tacit assumption that the sample is random, that is, that all jobs in a job
family have equal chance of appearing in job studies.
There are a number of reasons to be skeptical about the assertion that
the jobs studied are representative of all jobs. The studies have been
carried out over a long period of time, and it is fair to question whether a
job study carried out in 1950 is as relevant in 1990 as it was then. Standard
job conditions may have changed, the literacy of the work force may have
changed, accepted selection procedures may have changed. There is
indeed evidence of a general decline of validities over time in the USES
data base.
Moreover, certain conditions must be met before a job appears in a
validity study. An employer must be found who is willing to have workers
spend time taking the GATB test and to have supervisors spend time
rating the workers. The employer must be persuaded that the test is of
some value in predicting job performance; why would the employer
participate in a futile exercise? If the test is then more valid for some jobs
and in some settings than others, and if we assume that either USES or
employers are able to identify the more fruitful jobs and settings, then
surely they would study such jobs first. The jobs thought to have low
validity will have less chance of being studied. The net effect is that the
average population validity will be lower than the observed sample
validity, but we do not know enough about the selection rules for
initiating and carrying out studies to estimate the size of the effect.
An example of such selection in GATB studies is provided by jobs
classified as agricultural, fishery, forestry, and related occupations. They
include 2 percent of the jobs in the Dictionary of Occupational Titles, but
only 0.4 percent (3 studies of 777) of the jobs in the USES data base.
A related selection problem in publishing the results of studies is known
as the file drawer problem. Results that show small validities may have
less chance of being written up formally and being included in the
available body of data. We do not have an estimate for the size of this
effect for the GATB studies.
THE INTERPRETATION OF SMALL VARIANCES IN
VALIDITY GENERALIZATION
Most writers in the area of validity generalization have argued that, if
the variance of the validity parameters is estimated to be small, then
validities are highly generalizable. Two justifications for this position are
advanced. The first is that, if most of the variability in the observed
validities can be accounted for by the artifacts of sampling error,
unreliability of test and criterion, and restriction of range, then it is

reasonable to assume that much of the rest can be accounted for by other
artifacts. The second argument is that, if the variance in validity param-
eters is small, then the validities in all situations are quite similar.
There is little empirical evidence to aid in the evaluation of the first
argument. Although it seems sensible to many, reasonable people might
disagree on how much of the variation must be explained for the
argument to be persuasive. For example, Schmidt and Hunter's (1977)
75 percent rule, which suggests that, if the four artifact corrections
explain 75 percent of the variation, then the remaining 25 percent is
probably due to other artifacts (such as clerical errors), is not universally
accepted (see James et al., 1986, 1988; but see also Schmidt et al.,
1988).
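The arithmetic behind the 75 percent rule can be sketched with hypothetical validity coefficients and sample sizes (illustrative values, not GATB data): compare the sample-size-weighted variance of the observed validities with the sampling-error variance expected under Schmidt and Hunter's approximation.

```python
def percent_variance_explained(rs, ns):
    """Percentage of the observed variance in validities attributable
    to sampling error alone (Hunter-Schmidt bookkeeping)."""
    total_n = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / total_n   # weighted mean validity
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling variance of r in a study of size n: (1 - r_bar^2)^2 / (n - 1)
    var_err = sum(n * (1 - r_bar ** 2) ** 2 / (n - 1) for n in ns) / total_n
    return 100 * var_err / var_obs

# Hypothetical studies: observed validities and their sample sizes
rs = [0.10, 0.35, 0.05, 0.40, 0.22]
ns = [60, 75, 50, 90, 80]
pct = percent_variance_explained(rs, ns)
# Under the 75 percent rule, pct >= 75 would be read as evidence that the
# residual variation reflects still other artifacts.
```

With these made-up figures the ratio falls a little short of 75 percent; the rule's proponents would then look for moderators, while its critics question the threshold itself.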
The argument that small variance among validity parameters implies
that all validities are quite similar is more obviously problematic. Suppose
that the sample of studies actually consists of two distinct groups
(differing from one another in job or context), which have different
distributions of validity parameters. If one of the groups in the sample has
only a small number of studies and the other has a much larger number of
studies, then between-group differences in validities need not greatly
inflate the overall variance among validities.
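This point is easy to verify with made-up numbers. In the sketch below (illustrative values, not GATB data), a small minority of studies with a validity .20 higher than the rest raises the overall spread of validities only slightly:

```python
# Illustrative figures only: 95 studies drawn from a group with true
# validity .25 and 5 studies from a group with true validity .45.
rho_a, k_a = 0.25, 95
rho_b, k_b = 0.45, 5
k = k_a + k_b
mean_rho = (k_a * rho_a + k_b * rho_b) / k
between_var = (k_a * (rho_a - mean_rho) ** 2 + k_b * (rho_b - mean_rho) ** 2) / k
sd = between_var ** 0.5
# mean_rho is 0.26 and sd is about 0.044: a .20 gap between the groups adds
# little to the overall variance, so a small estimated variance does not
# rule out a distinct minority group of jobs.
```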
Note also that, when studies in the sample are not representative of
the universe of all jobs or contexts, the size of the two groups in the
sample need not reflect their incidence in the universe. Thus jobs that
might be associated with unusually high validities might occur infrequently
in the sample of validity studies but occur with higher frequency
in the universe of all jobs or contexts. Moreover, the existence
of two groups of studies, each with a different distribution of validity
parameters, cannot be detected from the estimate of the overall mean
and variance of the validities alone. In general, omnibus procedures
designed to estimate the variance of validity parameters (or to test the
hypothesis that this variance is zero) are not well suited to detect the
possibility that validities are influenced by moderator variables that
may act on only a few studies in the sample. The reason is that, because
such omnibus procedures are sensitive to many kinds of departures
from absolute consistency among studies, they are not optimal for
detecting a specific pattern. To put this argument more precisely, the
omnibus statistical test that tests for any difference among validities
does not have as much power to detect a particular difference between
groups of studies as does a test designed to detect that specific, between-
group contrast.
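A small Monte Carlo sketch makes the power gap concrete. The study counts, sample sizes, and validities below are hypothetical, and the simulation uses the standard Fisher-z approximation rather than any USES procedure: the omnibus homogeneity test spreads its sensitivity over 19 degrees of freedom, while a one-degree-of-freedom contrast aimed at the two atypical studies concentrates it.

```python
import math
import random

random.seed(1)
CHI2_CRIT_DF19 = 30.14   # tabled 5% critical value, chi-square with 19 df
Z_CRIT = 1.96            # two-sided 5% normal critical value

def one_trial(n=68):
    # 18 studies with rho = .25 plus 2 moderated studies with rho = .45,
    # simulated on Fisher's z scale, where Var(z) is approximately 1/(n-3).
    zetas = [math.atanh(0.25)] * 18 + [math.atanh(0.45)] * 2
    se = 1 / math.sqrt(n - 3)
    zs = [random.gauss(zeta, se) for zeta in zetas]
    zbar = sum(zs) / len(zs)
    q = sum((z - zbar) ** 2 for z in zs) * (n - 3)   # omnibus homogeneity statistic
    # Targeted contrast: mean of the 2 suspect studies vs. the other 18
    diff = sum(zs[18:]) / 2 - sum(zs[:18]) / 18
    se_diff = math.sqrt(se ** 2 / 2 + se ** 2 / 18)
    return q > CHI2_CRIT_DF19, abs(diff / se_diff) > Z_CRIT

trials = 2000
omni = contrast = 0
for _ in range(trials):
    o, c = one_trial()
    omni += o
    contrast += c
# The targeted contrast rejects far more often than the omnibus test
# under this specific between-group departure.
```

In runs of this sketch the contrast detects the two-group structure in roughly two-thirds of trials, while the omnibus test does so in well under half.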

CONCLUSIONS
1. The general thesis of the theory of validity generalization, that
validities established for some jobs are generalizable to some unexamined
jobs, is accepted by the committee.
Adjustments to Validity Coefficients
Sampling Error
The observed variance in validities is partly due to variance in the
"true" validities computed for a very large number of workers in each
job, and partly due to the differences between those true validities and the
sample validities computed for the actual groups of workers available in
each job.
2. For the GATB, the variance is justifiably adjusted by subtracting
from the observed variance an estimate of the contribution due to
sampling error.
Range Restriction
The adjustments of average validity are designed to correct for two
deficiencies in the data. The first is that, although the correlation between
test score and job performance is based on workers actually on the job,
the prediction will be applied to applicants for the job. If workers have a
narrower range of test scores than applicants, then the worker correlation
will be lower than the applicant correlation; an adjustment for range
restriction produces an adjusted correlation larger than the observed
correlation.
3. Lack of adequate, reliable data about the variance of test scores in
realistically defined applicant populations appears to be a major problem
in validity generalization from the GATB validity studies. Appropriate
corrections remain to be determined by comparisons between
test score variability of workers and of applicants, and, in the meantime,
caution suggests that no corrections for restriction of range be
made.
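For reference, the standard adjustment here is Thorndike's Case II formula, which requires exactly the quantity the conclusion says is missing: the ratio u of applicant to worker standard deviations. A minimal sketch with an assumed ratio (the value 1.3 is purely illustrative):

```python
import math

def correct_range_restriction(r, u):
    """Thorndike Case II: adjust a correlation r observed in a restricted
    (worker) sample, where u = SD(applicants) / SD(workers), u >= 1."""
    return r * u / math.sqrt(1 + r * r * (u * u - 1))

r_workers = 0.25
r_applicants = correct_range_restriction(r_workers, 1.3)   # about .32
```

Because the adjusted value grows with u, an overstated applicant standard deviation directly inflates the corrected validity, which is why caution argues against applying the correction until applicant variability is actually measured.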
Criterion Unreliability
A further deficiency in the data is that the criterion measure, usually
supervisory ratings, is measured with error, which reduces
the observed correlation. Thus an adjustment is used that produces a
correlation between the test score and a theoretical criterion measured
with perfect precision, which may reasonably be taken to be a better
indicator of job performance than the observed criterion.
4. In the GATB validity studies, data on the reliability of the criterion
are rarely available. Correction for criterion unreliability with too low a
figure would inflate the adjusted validity. Given the observation of
reliabilities higher than .60 in many studies, the committee finds that a
conservative value of .80 would be more appropriate than the .60 value
contained in USES technical reports on validity generalization.
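The adjustment at issue is the classical correction for attenuation, which divides the observed validity by the square root of the criterion reliability. A short sketch (with an illustrative observed validity of .25) shows why using .60 rather than .80 inflates the result:

```python
import math

def disattenuate(r, criterion_reliability):
    """Correct an observed validity for unreliability of the criterion."""
    return r / math.sqrt(criterion_reliability)

r_obs = 0.25
adj_60 = disattenuate(r_obs, 0.60)   # USES figure: about .32
adj_80 = disattenuate(r_obs, 0.80)   # committee's conservative value: about .28
```

The smaller the assumed reliability, the larger the adjusted validity, so an understated reliability figure systematically overstates the test's predictive power.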
Connecting the Sample to the Population
The generalization of validities computed for 500 jobs in some 750
USES studies to the population of 12,000 jobs in the Dictionary of
Occupational Titles is justified only to the degree that these jobs are
similar to the other jobs not studied. Thus a necessary component of
validity generalization for the GATB is to establish links between the jobs
studied and the remainder. One way to do so is to select the jobs at
random from a general class. Failing randomness in selection, it is
necessary to establish important similarities between the studied jobs and
the target jobs.
5. The 500 jobs in the GATB data base were selected by unknown
criteria. They cannot be considered a representative sample of all jobs in
the U.S. economy. Nevertheless, the data suggest that a modest level of
validity (greater than .15) will hold for a great many jobs in the U.S.
economy.