Exploring Strategies for Clustering Military Occupations
Paul R. Sackett
CLUSTERING MILITARY OCCUPATIONS
The Joint Services Project on Assessing the Performance of Enlisted Personnel has resulted in the collection of data on a variety of criterion measures for a number of occupational specialties. Intercorrelations among these criteria are being examined, as are relationships between the Armed Services Vocational Aptitude Battery (ASVAB) subtests and composites and performance on these criterion measures. A fundamental issue facing the Services is that of extending the results of these efforts from the limited set of occupational specialties included in the project to the universe of military occupational specialties (MOS).
More specifically, three different types of extension are needed. The first is the issue of ASVAB validity: based on known ASVAB-performance relationships for a small number of MOS, we wish to infer ASVAB-performance relationships for the universe of MOS. The second is the issue of intercorrelations among criteria. For a small number of MOS, intercorrelations among various types of criteria (e.g., hands-on performance tests and training grades) are known; we wish to generalize these relationships to the universe of specialties. The third is the issue of setting predictor cutoffs for various MOS. For MOS for which ASVAB and criterion data are available, it is at least possible (even if not current practice) to set cutoffs to ensure that no more than a specified proportion of applicants will fall below some
specified level of criterion performance. We wish to set justifiable cutoffs for MOS for which high-quality criterion data are not available.
The critical question is what aspects of jobs produce variations in validity coefficients, criterion intercorrelations, and cutoff scores. If this question can be answered, we can then ask two more questions: (1) which MOS can be shown to be sufficiently similar to MOS for which predictor and criterion data are available that we can infer that validity is the same and/or that appropriate cutoffs are the same; and (2) for MOS that are not sufficiently similar to any for which predictor-criterion data are available, can we establish relationships between job characteristics and validity coefficients, criterion intercorrelations, and cutoff scores such that we can make projections to MOS for which predictor-criterion data are not available?
This paper considers approaches to assessing job similarity in the context of the questions stated in the above paragraph; it is not intended as a general review of the job clustering literature. My single greatest concern about both the job analysis and job clustering literatures is the pervasive tendency to ignore the purpose for which job analysis is being done or for which jobs are being compared. When comparing jobs, two major decisions need to be made: (1) what job descriptor to use (e.g., tasks, abilities), and (2) what quantitative clustering procedure to use. The second has received more attention than the first; a detailed review by Harvey (1986) makes it unnecessary to treat this issue in detail here. The first factor has been shown (e.g., Cornelius et al., 1979) to have a large impact on decisions about job similarity. For example, jobs very different at the task level may be quite similar at the ability level. Decisions about the appropriate job descriptor are needed for subsequent efforts to examine the relationships between job characteristics and validities and cutoff scores.
JOB ANALYSIS METHODS: THE CHOICE OF THE JOB DESCRIPTOR
Numerous approaches to analyzing jobs exist. Textbooks in the fields of industrial/organizational psychology and personnel management commonly catalog 6-12 job analysis methods (e.g., functional job analysis, Position Analysis Questionnaire (PAQ), task checklist, job element method, critical incidents, ability requirement scales, threshold traits analysis) (e.g., Cascio, 1982; Schneider and Schmitt, 1986). One way to disentangle these myriad approaches is to characterize them on a number of dimensions. Dimensions include source of information (e.g., incumbent versus supervisor versus job analyst), method of collecting information (e.g., observation versus interview versus questionnaire), purpose (e.g., setting selection standards versus setting wages), and job descriptor (e.g., describing tasks versus describing attributes needed for task performance). Of particular interest here are the last two: the purpose for which the job analysis information is collected and the job descriptor chosen.
Pearlman's (1980) review of the literature on the formation of job families identifies four major categories of job descriptors. The first he labels “job-oriented content,” referring to systems that describe work activities in terms of work outcomes or tasks. In other words, the focus is on what work is accomplished. Such systems are job specific. Pearlman gives two examples of task statements: “turns valves to regulate flow of pulp slush from main supply line to pulp machine headbox,” and “install cable pressurization systems.” I have relabeled this category with the more descriptive title “specific behaviors.” Researchers and practitioners describing jobs at this level typically use the label “tasks,” and generate a detailed list of task statements. Four to five hundred task statements are not uncommon.
Pearlman's second category is labeled “worker-oriented content,” referring to systems that describe work activities in terms of behaviors or job demands that are not job specific. These systems are thus intended to apply to a wide variety of jobs and commonly involve evaluating jobs using a standard questionnaire. I have relabeled this category “general behaviors.” McCormick's Position Analysis Questionnaire (PAQ) typifies this approach. Sample PAQ items include “use quantitative materials” and “estimate speed of moving objects.” Researchers and practitioners describing jobs in these terms typically use an inventory of 100-200 behavioral statements.
Pearlman's third category is labeled “attribute requirements,” referring to systems that describe jobs in terms of the areas of knowledge, skill, or ability needed for successful job performance. Two very different approaches can fall into this category. The first involves the identification of specific areas of knowledge, skill, and ability needed for performance in one specific job in the context of the development of selection tests that will be justified on content validity grounds. This is a very common activity among psychologists developing selection systems in public sector settings. The critical feature is that the applicants for the job in question are expected to already have obtained the training to perform the job; thus the focus is on determining the extent to which applicants possess specific knowledge and skills needed for immediate job performance. In these settings it is not uncommon to develop detailed lists of 100-200 areas of needed knowledge, skill, and ability; these lists are then used to guide test development.
The second approach is more relevant to the military situation, in that it applies to settings in which training takes place after selection. As knowledge and specific skills will be acquired in training, selection is based on abilities shown to be predictive of knowledge and skill acquisition and/or subsequent job performance. Thus, rather than focusing on large numbers of areas of job-specific knowledge and skill, this approach involves describing
jobs in terms of a fixed set of cognitive, perceptual, and psychomotor abilities. I use the label “ability requirements” to refer to this subset of the more general category “attribute requirements.” An example of this approach is Fleishman's work on ability requirements (Fleishman and Quaintance, 1984). Based on an extensive program of research, a list of abilities was created, as well as rating scales for evaluating the degree to which each of the abilities is required. The present list identifies 52 abilities; smaller numbers could be used if, for example, motor requirements were not relevant to the purpose for which job information was being collected. A focus solely on cognitive ability requirements would involve 14 abilities. Examples include “number facility” and “fluency of ideas.” Thus the ability requirements approach involves describing jobs in terms of a relatively limited number of abilities required for job performance.
Pearlman's fourth category is labeled “overall nature of the job,” referring to approaches that characterize jobs very broadly, such as by broad job family (managerial, clerical, sales). An example of this category that may be of particular interest to the Job Performance Measurement Project (JPM Project) is Hunter's (1980) grouping of all 12,000 jobs in the Dictionary of Occupational Titles into one of five categories on a job complexity scale. This complexity scale is based on a recombination of the Data and Things scales used by the U.S. Department of Labor to classify jobs. Hunter shows that validity coefficients for composites of General Aptitude Test Battery (GATB) subtests differ across levels of this complexity variable and are very similar within levels of this variable.
As Pearlman points out, distinctions between these categories are not always clear, and some approaches to job analysis involve multiple categories. However, it is conceptually useful to conceive of these four categories as a continuum from more specific to less specific. A given job can be described in terms of a profile of 400-500 specific behaviors, 100-200 general behaviors, 10-40 abilities, or a single global descriptor, such as job complexity. It should be recognized that this is not merely a continuum of level of specificity; there are clearly qualitative differences in moving from behaviors performed to abilities required. Nonetheless, this discussion should clarify the differences in level of detail involved in the various approaches to describing jobs and should set the stage for a discussion of the relationship between the purpose for which job information is being collected and the type of job descriptor chosen.
These two issues—purpose and job descriptor chosen—are closely intertwined. The question “which job analysis method is most appropriate?” can only be answered in the context of a specific purpose. Consider an example from a job analysis of the job “psychologist.” An issue of concern was whether different specialties within psychology—clinical, counseling, industrial/organizational, and school—were similar enough that a common licensing exam was appropriate for these four specialties. The Educational Testing Service (ETS) was commissioned to conduct a comparative job analysis of these four areas (Rosenfeld et al., 1983). An inventory of 59 responsibilities and 111 techniques and knowledge areas was designed and mailed to a carefully selected sample of licensed psychologists. The study found a common core of responsibilities among all four specialties and chided various practice areas for emphasizing the uniqueness of their own group.
I am not denying that there are commonalities among different types of psychologists. However, I will argue that I could easily have designed a survey instrument that would have produced different results. One thing industrial/organizational psychologists have learned from experience with job analysis is that the more general the data collected, the more likely it is that jobs will appear similar when subjected to statistical analysis; conversely, the more specific the inventory items, the greater the apparent differences among jobs. The art of job analysis lies in determining a level of specificity that meets the purposes of the particular job analysis application. Consider some of the statements making up the ETS inventory. Responsibility 1 leads the inventory, reading: “Conduct interviews with client/patient, family members or others to gain an understanding of an individual's perceived problem.” This is endorsed by a high proportion of respondents from all specialties, yet it can mean dramatically different things, from interviewing a corporate executive to gain insight into an organization's incentive pay plan to interviewing a 7-year-old suspected victim of child abuse. More examples: “observe the behavior of individuals who are the focus of concern,” and “formulate a working hypothesis or diagnosis regarding problems or dysfunctions to be addressed.” Again, these can refer to dramatically different activities. More to the point, given that the purpose of the job analysis is to support the creation of one or more licensing exams, these can require different skills, abilities, training, and experience. By being more specific and rephrasing Responsibility 1 as multiple tasks (“interview business clients,” “interview adult patients,” “interview children”), the chances of concluding that the jobs are different increase. By getting even more general (“gather information verbally”), the chances of concluding that the jobs are similar increase.
Each of these three levels of specificity presents information that is true. However, which level of specificity is appropriate depends on the purpose for which the information is being collected.
In the above example, the three levels of specificity illustrated all focus on worker activities. The job descriptor chosen is in all cases behavioral; they vary on a continuum from general behaviors to specific behaviors. Similarly, one may reach different conclusions about job similarities and differences if different categories of job descriptors are chosen (e.g., focusing on job activities versus focusing on abilities required for job performance).
A multiorganization study of bank teller and customer service jobs illustrates this nicely (Richardson, Bellows, Henry, and Co., 1983). A 66-item behavioral work element questionnaire (e.g., “cashes savings bonds,” “verifies signatures,” “types entries onto standardized forms”) and a 32-item ability requirement questionnaire (e.g., “ability to sort and classify forms,” “ability to compute using decimals,” “ability to pay attention to detail”) were administered. While the vast majority of incumbents held the title “paying and receiving teller,” 20 other job titles were found (e.g., new accounts representative, customer service representative, drive-in teller, safe deposit custodian). The issue was whether these 20 jobs were sufficiently similar to the job of paying and receiving teller that a selection test battery developed for the paying and receiving tellers could also be used for the other jobs. A correlation between each job and the paying and receiving teller was computed, first based on the behavioral work element ratings and then based on the ability ratings. In a number of cases, dramatically different findings emerged. The new accounts representative, customer service representative, and safe deposit custodian correlated .21 with the paying and receiving teller when comparing the jobs based on similarity of rated behavioral work elements. These same three jobs correlated .90, .92, and .88 with the paying and receiving teller when comparing the jobs based on similarity of rated ability requirements. Thus the use of different job descriptors leads to different conclusions about job similarity. Conceptually, one could argue that for purposes of developing an ability test battery, the ability requirements data seem better suited. If data on these same jobs were being collected to determine whether a common training program for new hires was feasible, one might argue that the work behavior data seem better suited. 
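The effect of the descriptor choice can be made concrete with a small computation. The sketch below uses invented rating profiles, not the actual study data: it correlates a reference job's profile with a comparison job's profile twice, once over behavioral work elements and once over ability requirements. The profiles are constructed so that the behavioral comparison suggests dissimilar jobs while the ability comparison suggests nearly identical ones, mirroring the pattern reported above.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two rating profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for a reference job and a comparison job.
# Behavioral work elements: the two jobs perform largely different activities.
behavior_ref  = [5, 4, 5, 1, 1, 2, 1, 5]
behavior_comp = [1, 2, 1, 5, 4, 5, 4, 2]

# Ability requirements: both jobs draw on much the same abilities.
ability_ref  = [4, 5, 3, 4, 2, 5]
ability_comp = [4, 4, 3, 5, 2, 5]

behavior_r = pearson(behavior_ref, behavior_comp)  # low (here, negative)
ability_r = pearson(ability_ref, ability_comp)     # high
```

With these invented profiles the behavioral comparison yields a negative correlation while the ability comparison is high; which answer is “right” depends, as argued above, on the purpose of the comparison.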
Again, the question “which jobs are sufficiently similar that they can be treated the same” cannot be answered without information as to the purpose for which the jobs are being compared.
A study by Cornelius et al. (1979) reinforces this point and takes it one step further. They analyzed seven nominally different first-level supervisory jobs in chemical processing plants. Hierarchical cluster analysis was performed to establish job groupings based on three types of data: task similarity, similarity of Position Analysis Questionnaire profiles, and similarity of ability requirements. Each type of data produced a different pattern of job similarities and a different clustering of jobs. Cornelius et al. properly tell us that purpose will dictate which set of data we should rely on. However, even after this decision has been made, problems remain. Cornelius et al.'s task analysis data, for example, indicate that both five-cluster and three-cluster solutions are feasible. Hierarchical cluster analysis, as well as other grouping methods, can only establish relative similarity among jobs. In the Cornelius et al. study, if 40 percent of tasks in common is seen as sufficient to label jobs similar, the seven jobs would fall into three clusters.
If 60 percent of tasks in common is seen as sufficient to label jobs similar, the seven jobs would fall into five clusters. The question left unanswered is “given that an appropriate job descriptor has been chosen, how large a difference between jobs on the chosen descriptor is needed to have a significant impact on the criterion of interest?” In a selection setting, how different do jobs have to be before validity coefficients are affected? In a training situation, how different do jobs have to be before separate training programs are required? In a performance appraisal situation, how different do jobs have to be before separate performance rating forms need to be constructed? Thus job clustering can only be meaningful with reference to an external criterion.
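The dependence of the cluster solution on the chosen cutoff can be illustrated with a short sketch. The overlap values below are hypothetical, not Cornelius et al.'s data; grouping is done by simple single-linkage chaining, in which jobs join a cluster if their task overlap with any member meets the threshold.

```python
def clusters(jobs, overlap, threshold):
    """Group jobs whose pairwise task overlap meets the threshold,
    chaining through intermediaries (single linkage, via union-find)."""
    parent = {j: j for j in jobs}

    def find(j):
        while parent[j] != j:
            j = parent[j]
        return j

    for (a, b), pct in overlap.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for j in jobs:
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())

# Hypothetical percent-tasks-in-common for seven supervisory jobs A-G;
# pairs not listed share too few tasks to matter.
jobs = list("ABCDEFG")
overlap = {("A", "B"): 0.70, ("B", "C"): 0.50,
           ("D", "E"): 0.65, ("F", "G"): 0.45}

three = clusters(jobs, overlap, 0.40)  # looser standard: fewer, larger clusters
five = clusters(jobs, overlap, 0.60)   # stricter standard: more clusters
```

With these numbers a .40 standard yields three clusters and a .60 standard yields five, paralleling the Cornelius et al. result; nothing in the clustering procedure itself says which standard is correct.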
In summary, the above discussion highlights a number of concerns about job grouping. First, different descriptors can produce very different job groupings. Second, different levels of specificity within a given general type of descriptor (e.g., task) can produce very different job groupings. Third, even if a given type of job descriptor and level of specificity are agreed on, the magnitude of job differences that will be needed to classify jobs differently remains a problem. An external criterion is needed.
The implications of the above discussion for the JPM Project are clear. First, there is reason to expect that different job descriptors will produce different job groupings. The choice of job descriptor should not be a function of the availability of job descriptor data using a particular approach, but rather a function of the type of job descriptor data which is most closely linked to the purpose for which jobs are being grouped. Second, it must be realized that the two goals of grouping jobs with similar test validities and grouping jobs with similar levels of ability required to ensure a specified level of performance must be treated independently. Grouping jobs based on validity may produce very different job clusters than grouping jobs based on required ability levels. Conceivably these two purposes could require different job descriptors for optimal clustering. Approaches to identifying the appropriate job descriptor for these purposes are discussed in a subsequent section of this paper.
One additional aspect of the choice of the job descriptor merits some discussion, namely, the nature of the data to be collected about the descriptor chosen. Given that a descriptor has been chosen (e.g., specific behaviors or abilities), it is common to ask job experts to rate the importance of each job component. However, “importance” can be conceptualized in a number of ways, three of which are discussed here. Using abilities as an example, one approach to importance is in terms of time: what proportion of total time on the job is spent using the ability in question. The Position Analysis Questionnaire, for example, uses this type of scale for some items. A second approach is in terms of contribution to variance in job performance: to what extent does the ability in question contribute to differentiating the
more successful employees from the less successful. The job element approach to job analysis for selection system development uses such a scale. A third approach is in terms of level: what degree of a given ability is needed for successful job performance. Fleishman's Ability Requirement Scales exemplify this approach. Conceptually, it is clear that these three can be completely independent. The abilities that are used most frequently may be possessed by virtually all incumbents and thus not contribute to variance in job performance. A given ability may contribute equally to variance in job performance in two jobs, yet the level of ability needed may differ dramatically across the jobs. Thus, even if it were agreed that abilities required is the appropriate job descriptor for a given application, operationalizing importance as frequency of use, contribution to variance in performance, or level required can lead to different conclusions about job similarity. It would seem logical to hypothesize that judgments about contributions to variance in job performance would be most appropriate for determining for which jobs a given test should have similar validity and that judgments about level required would be most appropriate for determining which jobs should have similar test cutoffs.
The distinctions made in the above paragraph are not typically made. In fact, researchers sometimes seem to feel that the choice of the descriptor is all that is important and do not even mention the aspect of the descriptor that is rated. For example, a paper by Cornelius et al. (1984) describes the construction and use of a 26-item ability element battery to group jobs in the petroleum/petrochemical industry. They used the results of this inventory to assign jobs to one of three occupational groups, but did not tell us whether ability was operationalized as frequency of use, contribution to variance in performance, or level required.
The use of one operationalization of importance where another seems better suited is found in Arvey and Begalla's (1975) examination of the job of homemaker. They administered the PAQ to a sample of homemakers and compared the PAQ profile for this position with each of the large number of profiles in the PAQ data base. These comparisons were made for two human resource management purposes: attempting to associate a wage with the homemaker job and making inferences about job transfer and training decisions. Jobs most similar in PAQ profiles were patrolman, home economist, airport maintenance chief, and kitchen helper; a number of supervisory positions followed closely (electrician foreman, gas plant maintenance foreman, fire captain) in the list of the 20 most similar positions. Arvey and Begalla note that a major theme running through many of the occupations listed was a trouble-shooting emergency handling orientation.
Based on this list of most similar occupations, it is not clear that the goal of identifying jobs amenable to entry by homemakers was met. Arvey and Begalla note this and interpret their findings with appropriate caution. The
predicted salary for the job was $740 per month, in 1969 dollars, which the authors felt was inflated. They suggest that respondents may have distorted their descriptions to make their positions seem more important, offering this as an explanation of the high salary. In light of our discussion of various operationalizations of job element importance, another explanation seems likely: the descriptions provided are accurate (i.e., not intentionally distorted), but the information requested is not well suited to the task at hand. The rating scales used in the PAQ typically reflect time spent: either a direct rating of frequency or a rating of importance, operationalized vaguely as “consider such factors as amount of time spent, the possible influence on overall job performance if the worker does not properly perform the activity, etc.” I would hypothesize that different patterns of similarity would be found if “level required” rather than “time spent” were used to rate items. Conceptually, level required seems better suited to the tasks of identifying jobs amenable to entry by homemakers and setting wage levels. Jobs very similar in the amount of time spent on the PAQ dimension “processing information” may be very different in the level of information processing involved. In short, it is suggested that careful attention be paid both to the selection of the job descriptor and to the operationalization of job element importance.
The following sections of this paper separately address the issues of identifying valid predictors of performance for the universe of MOS and setting minimum standards on these predictors. Multiple potential solutions to the problem are presented.
EXTENDING ASVAB VALIDITY TO THE UNIVERSE OF MOS
Validity generalization is a form of meta-analysis. The application of meta-analytic techniques to the examination of predictor-criterion relationships in the selection arena has been labeled validity generalization; the use of the two terms is the result of the parallel development of data cumulation techniques by two groups of researchers—Glass and colleagues (e.g., Glass et al., 1981) and Schmidt and Hunter and colleagues (e.g., Hunter et al., 1982)—who applied different labels to similar techniques. Note that there are five book-length treatments of cumulation techniques (Glass et al., 1981; Hunter et al., 1982; Rosenthal, 1984; Cooper, 1984; Hedges, 1985) and a number of thorough and critical treatments of the topic in archival journals (e.g., Schmidt et al., 1985; Sackett et al., 1985; Bangert-Drowns, 1986).
An introduction to validity generalization is in order. For years psychologists have observed that when a given test is validated in different settings, the resulting validity coefficients vary; in some cases the amount of varia-
tion is substantial. Historically, the explanation offered for this was that situational factors affected validity. Due to these unspecified situational factors (for example, organizational climate, leadership style, and organizational structure) a test valid in one situation might not be valid in another. Thus there is the doctrine of “situation specificity,” defined as the belief that due to these factors one could not safely rely on validity studies done elsewhere, but rather, one must do a validity study in each new setting.
To understand validity generalization, it is helpful to distinguish between “true validity” and “observed validity.” True validity is the correlation that is obtained if there is an infinitely large sample size that is perfectly representative of the applicant pool of interest and if the criterion measure is a perfectly reliable measure of true job performance. Observed validity is the correlation obtained in our research—typically with smaller Ns than preferred, with samples that may not be perfectly representative of the job applicant population, and with less than perfect criterion measures (e.g., supervisory ratings of performance). Historically, researchers have not differentiated between observed validity and true validity: when observed validity differences were found between studies, it was assumed that the differences were real. Recently, it has been suggested that these differences are not real, but simply reflect differences in sample size, criterion reliability, or range restriction. Could it be that true validity does not differ across situations? If it weren't for these methodological problems, would validities be the same across studies?
These ideas make for interesting speculation; what was needed were ways of testing them. Validity generalization models are means of testing these ideas: they offer a way of estimating true validity and of assessing how much variability in validity coefficients we can expect due to the methodological problems listed above. The amount of variability in observed validity coefficients is compared with the amount of variability expected due to methodological artifacts: if the expected variance equals or nearly equals the observed variance, one concludes that differences in validities across studies are not real, but merely the result of the effects of these artifacts.
Procedurally, validity generalization ties together a number of well-known psychometric ideas. One starts with a number of validity studies and a validity coefficient for each. For each study, one obtains an estimate of criterion reliability, and each validity coefficient is corrected using the standard formula for correction for attenuation in the criterion. Each validity coefficient is also corrected for range restriction—the extent to which the sample used in the study has a narrower range of test scores than would be obtained from job applicants—using standard range-restriction formulas. The mean and variance of this distribution of corrected validity coefficients are then computed, and the variance is compared with the variance expected due to sampling error, which is a function of N and the mean validity coefficient. If the
variance expected due to sampling error and the variance in corrected validity coefficients are nearly equal, we conclude that validity is not situation specific, and that the best estimate of true validity is the mean of the corrected validity coefficients.
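A bare-bones sketch of this procedure follows. The study values are hypothetical, and the sketch simplifies operational validity generalization methods (which weight by sample size and handle the effect of the corrections on sampling-error variance more carefully). The steps match the text: disattenuate each coefficient for criterion unreliability, correct it for range restriction (Thorndike's Case II formula, with u the ratio of restricted to unrestricted predictor standard deviations), and compare the observed variance of the corrected coefficients with the variance expected from sampling error alone.

```python
import math

def correct_validity(r, ryy, u):
    """Correct one observed validity: first disattenuate for criterion
    unreliability (ryy), then apply Thorndike's Case II range-restriction
    correction, where u = restricted SD / unrestricted SD of test scores."""
    r = r / math.sqrt(ryy)
    U = 1.0 / u
    return (r * U) / math.sqrt(1.0 + r * r * (U * U - 1.0))

def sampling_error_variance(mean_r, mean_n):
    """Variance expected among observed rs from sampling error alone."""
    return (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)

# Hypothetical studies: (observed r, criterion reliability, u, sample N).
studies = [(0.25, 0.60, 0.70, 150), (0.32, 0.70, 0.65, 90),
           (0.18, 0.55, 0.75, 200), (0.29, 0.65, 0.70, 120)]

corrected = [correct_validity(r, ryy, u) for r, ryy, u, _ in studies]
mean_rho = sum(corrected) / len(corrected)
observed_var = sum((c - mean_rho) ** 2 for c in corrected) / len(corrected)

mean_r = sum(s[0] for s in studies) / len(studies)
mean_n = sum(s[3] for s in studies) / len(studies)
expected_var = sampling_error_variance(mean_r, mean_n)
# If observed_var is close to expected_var, conclude that validity
# differences across studies reflect artifacts, and take mean_rho as
# the best estimate of true validity.
```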
Validity generalization analyses might appear to be straightforward under the conditions outlined above. However, if criterion reliability values or information about range restriction is not available for each study, assumptions must be made about what criterion reliability was likely to be, about how much range restriction was likely to have occurred, and about the linearity of the predictor/criterion relationship. These assumptions are critical: if the values assumed are incorrect, the estimated value of true validity can be substantially in error. Furthermore, when the range restriction is severe, the extrapolation permitted by these assumptions is tenuous.
A source of confusion in understanding and interpreting validity generalization/meta-analytic research lies in the failure to differentiate between two different statistical tests that can be performed on a set of validity coefficients: tests of the situational specificity hypothesis and of the generalizability hypothesis. The situational specificity hypothesis is rejected when variance in validity coefficients is essentially zero after correcting for artifacts. Rejecting this hypothesis implies accepting the hypothesis that true validity is virtually constant for the job/test combination under consideration. The generalizability hypothesis is less stringent. It involves the recognition that even if one fails to reject the situational specificity hypothesis, and thus acknowledges that validity varies across jobs, it is still possible that even the low end of the distribution of validity coefficients is of a magnitude sufficient to consider the test useful. Thus, if one's interest is not in a point estimate of validity for a given situation but rather simply in the assurance that test validity will be above a level considered minimally acceptable, one can accept the generalizability hypothesis if the low end of a confidence interval around mean validity exceeds this level.
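The generalizability test can be sketched as a small function. The summary values and the minimally acceptable validity of .10 are hypothetical; the multiplier of 1.28 gives the lower bound of a 90 percent one-sided credibility interval, and other choices, such as two standard deviations, are also used in practice.

```python
import math

def generalizability_check(mean_rho, observed_var, sampling_var,
                           minimum=0.10, z=1.28):
    """Generalizability test: remove the variance expected from artifacts,
    then ask whether the low end of the credibility interval around mean
    true validity still clears a minimally acceptable level."""
    residual_var = max(observed_var - sampling_var, 0.0)
    lower_bound = mean_rho - z * math.sqrt(residual_var)
    return lower_bound, lower_bound >= minimum

# Hypothetical summary values for a set of corrected validity coefficients.
lower, generalizes = generalizability_check(
    mean_rho=0.45, observed_var=0.02, sampling_var=0.008)
```

Note that the test can succeed even when residual variance is clearly nonzero, which is exactly the sense in which generalizability is less stringent than situational specificity.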
The research of Schmidt and Hunter has asserted that cognitive ability tests are valid for all jobs (Hunter and Hunter, 1984). Some have interpreted this as implying that tests are equally valid for all jobs. This misinterpretation is based on confusing the situational specificity hypothesis with the generalizability hypothesis. Schmidt and Hunter's statements involve accepting the generalizability hypothesis (i.e., that the validity of cognitive tests is positive for all jobs).
While validity generalization research with cognitive ability tests shows quite strongly that true validity varies little, if at all, across settings for a given job/test combination, it is very clear that the validity of cognitive ability tests does vary across jobs. One of the clearest illustrations of this is found in a study by Schmidt et al. (1981). For a sample of 35 Army jobs, validity coefficients for the 10 subtests of the Army Classification Battery were available for two independent samples of about 300 individuals per job. For each subtest, the 35 validity coefficients from the first sample were correlated with the 35 validity coefficients from the second sample. With the exception of one subtest (radiocode aptitude), correlations between samples were substantial, ranging from .68 to .86. The pattern of validity coefficients was stable across samples: jobs with high validities in one sample had high validities in the other, and vice versa. If true validity did not vary across jobs, variation from sample to sample would be a function of sampling error, and the correlation across samples would be essentially zero. Thus, jobs do moderate validity, if by “moderate” we mean “influence the size of a validity coefficient.” However, in this study, Schmidt et al. define “moderate” as “produce a near zero validity.” Using this definition, they conclude that jobs do not moderate validity, as the low end of a confidence interval (i.e., two standard deviations below the mean validity) is greater than zero. This formulation accepts the generalizability hypothesis and by definition rejects the situational specificity hypothesis. A less extreme definition of “moderate” would lead one to support both hypotheses.
This somewhat lengthy introduction to validity generalization sets the stage for considering its application to the JPM Project. Assume the availability of validity coefficients for ASVAB subtests and Service-wide composites for each of the 27 MOS included in the project, using hands-on performance tests as the criterion. For each subtest and composite, observed and expected variance can be computed and compared, and residual variance can be used to put a confidence interval around the mean of the validity coefficients. If this lower bound is positive and nonzero, the test in question has thus been shown to predict job performance for the MOS studied. If one feels confident that the sampled MOS are representative of the universe of MOS, this conclusion can be generalized to the universe.
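The computations involved can be sketched as follows. This is a "bare bones" Hunter-Schmidt analysis (sampling error only); the coefficients and sample sizes are hypothetical stand-ins for the project MOS, and the lower bound is taken as the mean minus two standard deviations of the residual distribution, following the convention noted earlier.

```python
import math

def bare_bones_meta(rs, ns):
    """Bare-bones meta-analysis of a set of validity coefficients.

    rs, ns : observed correlations and sample sizes, one pair per MOS/study.
    Returns mean validity, observed/expected/residual variance, and a
    lower bound (mean minus two residual SDs) for the generalizability test.
    """
    total_n = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Variance expected from sampling error alone.
    var_err = sum(n * (1 - r_bar**2) ** 2 / (n - 1) for n in ns) / total_n
    var_res = max(var_obs - var_err, 0.0)
    lower = r_bar - 2 * math.sqrt(var_res)
    return r_bar, var_obs, var_err, var_res, lower

# Hypothetical coefficients for five MOS with samples of about 150 each.
rs = [0.28, 0.35, 0.22, 0.31, 0.26]
ns = [150, 140, 160, 155, 145]
print(bare_bones_meta(rs, ns))
```

With these invented values, observed variance is smaller than the variance expected from sampling error alone, so residual variance is zero and the lower bound equals the mean validity; the generalizability conclusion would follow directly.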
A number of comments on this approach are needed. First, offering this as at least a partial solution to the question of demonstrating ASVAB validity for predicting on-the-job performance is contingent on producing the expected results, namely, that the lower bound for validity will prove to be positive nonzero. The body of research leading to the expectation that this result will be found is substantial (see Hunter, 1980; Hunter and Hunter, 1984). The one potentially important difference between the present set of validity studies and the cumulated literature on the validity of cognitive ability tests is the criterion used. Most validity generalization work to date has categorized studies as using training criteria (typically end-of-course knowledge test scores) or performance criteria (typically supervisory ratings) (Hunter and Hunter, 1984). Is there reason to expect a different pattern of results using hands-on job performance criteria? Recent work by Hunter (1986) suggests not. Hunter found 12 studies where three different
types of criteria were collected: hands-on work samples, paper and pencil job knowledge tests, and supervisory ratings. Breaking these studies down into military and civilian subsamples, he found that general cognitive ability, after correcting for restriction of range and criterion unreliability, correlated .80 with knowledge, .75 with work samples, and .47 with ratings in the civilian subsample, and .63 with knowledge, .53 with work samples, and .24 with ratings in the military subsample. Knowledge and work sample criteria correlated .80 and .70 in the civilian and military subsamples, respectively, suggesting that a high degree of similarity between validity findings using knowledge criteria and work sample criteria is likely. However, recently completed and as yet unpublished research undertaken as part of the JPM Project indicates lower levels of validity using hands-on performance measures.
Second, this approach presumes that it is sufficient to demonstrate positive nonzero validity; point estimates of true validity are not necessary. As discussed above, it is clear that the true validity of cognitive ability tests does vary across jobs; if one wishes to estimate true validity for MOS not included in the Job Performance Measurement Project, a system of relating variance in job descriptors to variance in validity coefficients is needed. Approaches to such a system will be discussed in the section below on synthetic validity.
Third, the validity generalization approach discussed here is directly relevant only to the issue of establishing test validity and not to the issue of setting selection standards for various jobs. Both issues could be dealt with simultaneously by a validity generalization model dealing in regression slopes and intercepts rather than correlation coefficients; in fact, such a model has been developed by Raju (1986). However, this type of model is applicable only in situations in which a common predictor and criterion metric are used in all studies. Thus, such a model might be applied within a single organization if, for example, the job performance of sales clerks was measured using the same procedure in 20 retail stores and regression equations were computed for each store. In the JPM Project, as in most validity generalization applications, the criterion metric varies across studies. Standardizing the data within each organization prior to cumulation is not a solution: the resulting standardized beta weight is the correlation. Note, though, that at one level the issue of justifying cutoffs can be addressed. If a test is valid and the relationship between test and criterion is linear, it can be argued that any cutoff is justifiable in the sense that there is no single point above which individuals will perform successfully and below which they will not. Extensive research by Hawk (1970) shows that test-criterion relationships for cognitive ability tests do not depart from linearity at a rate greater than would be expected by chance. Any chosen cutoff is justifiable in the sense that individuals above the cutoff have a higher probability of success than individuals below the cutoff. Thus it could be argued that the issue of “validating” a cutoff score is not intrinsically meaningful, and supply and demand and judgments by policy makers about the relative importance of various MOS can be the basis for establishing cutoffs.
Fourth, it should be noted that some skepticism about validity generalization remains. Some of this is naive, for example: “just because a test predicts performance in these twenty settings is no guarantee that it will predict performance in the twenty-first setting; therefore a local study is needed.” By this logic the local study is also useless: just because the test predicts performance in the validation sample is no guarantee that it will do so with new applicant samples. Some is more sophisticated, such as concerns about the correction of validities for range restriction based on assumed rather than empirically determined measures of the degree of range restriction, or concerns about the statistical power of validity generalization procedures when applied to small numbers of validity coefficients (cf. Sackett et al., 1985). However, one indication of the degree of acceptance of validity generalization can be found in the 1987 Principles for the Validation and Use of Personnel Selection Procedures published by the Society for Industrial and Organizational Psychology, Division 14 of the American Psychological Association: “Current research has shown that the differential effects of numerous variables are not so great as heretofore assumed; much of the difference in observed outcomes of validation research can be attributed to statistical artifacts. . . . it now seems well established from both validity generalization studies and cooperative validation efforts that validities are more generalizable than has usually been believed” (p. 26).
The careful sampling of the full spectrum of MOS provides a basis for more confidence than one would usually have in conducting meta-analyses on 27 effect-size measures. Recent work on the statistical power of meta-analysis to detect existing moderator variables (Sackett et al., 1986) indicates that a meta-analysis of 27 effect-size measures with average sample sizes of about 150 will be quite powerful.
Linking Hands-On Performance Measures and Training Criteria
A second application of meta-analytic techniques may be appropriate for this project. As discussed earlier, Hunter (1986) examined the relationship between hands-on performance measures and paper and pencil job knowledge tests and found an average correlation of .80 in civilian samples and .70 in military samples. Hands-on performance measures will be available for all 27 MOS in the JPM Project; if training performance is retrievable for the subjects in these samples, this finding can be replicated. Correlations between hands-on measures and training grades can be computed for each
sample and used as input for a meta-analysis. Should the lower bound of this distribution of correlations be reasonably high, we can have some confidence that correlations between ASVAB and training can be generalized to on-the-job performance. Hunter et al. (1985) summarize the substantial body of data relating ASVAB scores to training grades; confirming a strong relationship between training criteria and hands-on criteria could serve as a partial response to a critic concerned that the validity generalization analyses discussed earlier included only a limited set of MOS. A linkage to a much larger body of literature would thus be made. Another possibility is to correlate the validity coefficients of ASVAB and training with the validity coefficients of ASVAB and hands-on measures. Such a correlation should be interpreted carefully, however: unless there is meaningful nonartifactual variance in both distributions of validity coefficients, a relationship between the two sets of validity coefficients cannot be obtained.
Synthetic Validity

The concept of synthetic validity is not new. The basic notion is that if various job components can be identified and the validity of predictors of performance on jobs involving each component can be established, one can identify valid predictors of performance in a new job if one knows which job components constitute that new job. A wide variety of techniques have been proposed and/or examined under the rubric of synthetic validity. Trattner (1982) identified four different synthetic validity models: Lawshe's synthetic validity (Lawshe, 1952), Guion's synthetic validity (Guion, 1965), McCormick's job component validity (McCormick et al., 1972), and Primoff's J-coefficient (Primoff, 1975). Mossholder and Arvey's (1984) review of synthetic validity approaches noted that synthetic validity has been talked about substantially more often than it has been applied. Mossholder and Arvey singled out McCormick's job component model and Primoff's J-coefficient as two approaches that have been the focus of serious research efforts. This paper examines the applicability of these two approaches to the present problem of establishing validity for new MOS.
McCormick's Job Component Model
In this approach, the unit of analysis is the job. For a number of jobs, validity coefficients for a given predictor/criterion combination are obtained. Information about each job is obtained through a structured job analysis questionnaire; job dimension scores are derived from this questionnaire and then used as predictors of the validity coefficients for each job. Thus, this is a logical follow-up to validity generalization analysis: for predictor/criterion combinations for which validity coefficients are found to exhibit more variance than would be expected as a result of artifact, job dimensions identified through structured job analysis are examined as possible moderators of the validity coefficients.
The job components used by McCormick are derived from the Position Analysis Questionnaire (PAQ), a 187-item structured worker-oriented job analysis instrument (McCormick et al., 1972). Factor analysis of the PAQ has produced 32 dimensions, or components; these can be further reduced to 13 overall dimensions. Mecham et al. (1977) identified 163 jobs for which both PAQ ratings and validity coefficients for each of the nine General Aptitude Test Battery (GATB) subtests were available. For each GATB subtest they regressed test validity on the PAQ dimensions. Results were disappointing: shrunken multiple correlations were near zero for four tests, in the teens for two tests, in the .20s for two tests (intelligence and spatial aptitude), and .39 for manual dexterity. They conducted similar analyses using mean test score rather than validity coefficients as the criterion with much more success; these analyses will be discussed in a subsequent section dealing with setting cutoff scores.
A similar approach was taken by Gutenberg et al. (1983). In contrast to the raw empiricism of Mecham et al., Gutenberg et al. hypothesized that specific PAQ dimensions would moderate the validity of specific GATB subtests. They found that two PAQ dimensions, decision making and information processing, correlated significantly with cognitive GATB subtests.
The correlations between job dimensions and validity coefficients obtained in the two GATB studies have not been as large as one might hope for. However, it should be noted that the Gutenberg et al. study corrected validity coefficients for range restriction and produced larger correlations than Mecham et al., suggesting that methodological artifacts may be constraining the relationship between job dimensions and validity coefficients. Note that sampling error has accounted for most artifactual variance in meta-analytic studies and that Gutenberg et al. report that sample sizes for the 111 jobs used in their study ranged from 31 to 537. A reanalysis of their data to determine the impact of sample size would be informative. A median split could be made based on sample size and the analyses repeated separately for the group of studies with relatively large sample sizes and the group with relatively small sample sizes. Assuming that sample size is not systematically associated with some job dimension, we would expect a substantially larger relationship between job dimensions and validity coefficients in the large sample size group; this should be a better estimate of the degree to which job dimensions moderate GATB validities.
If validity coefficients for a given predictor/criterion combination can be predicted from PAQ job dimensions, validity for new MOS could be established by obtaining PAQ ratings for the new MOS and applying the appropriate prediction formula. An immediate drawback in this approach is the availability of validity data for only 27 MOS. Obviously, the Mecham et al. approach of using all PAQ dimensions in a regression equation is not feasible: 45 predictors and 27 cases preclude such an approach. More viable is the Gutenberg et al. approach of identifying a small number of job dimensions on a priori grounds for each predictor/criterion combination under consideration. A panel of psychologists could be asked to reach consensus on the five dimensions most likely to moderate validity coefficients for each predictor/criterion combination. Regression equations using these five predictors could be computed using, say, 20 of the 27 MOS included in the JPM Project; these equations could then be applied to each of the seven holdout MOS as a test of the effectiveness of the procedure for estimating validity for new MOS. Implicit in the above discussion is the need to obtain PAQ profiles for each of the 27 MOS.
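The development/holdout procedure just described can be sketched computationally. In the sketch below everything is simulated: the five "dimensions" stand in for the expert-chosen subset of PAQ dimensions, and the validity coefficients are generated from an assumed (invented) relationship so that the mechanics of fitting on 20 MOS and evaluating on 7 holdouts are visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 27 MOS, 5 expert-chosen PAQ dimension scores each,
# and an observed validity coefficient per MOS (all values simulated).
X = rng.normal(size=(27, 5))
true_w = np.array([0.06, 0.04, 0.0, 0.03, 0.0])
r = 0.30 + X @ true_w + rng.normal(scale=0.03, size=27)

# Fit on 20 "development" MOS, evaluate on the 7 holdout MOS.
X_dev, r_dev = X[:20], r[:20]
X_hold, r_hold = X[20:], r[20:]

A = np.column_stack([np.ones(20), X_dev])        # add an intercept column
coef, *_ = np.linalg.lstsq(A, r_dev, rcond=None)

pred = np.column_stack([np.ones(7), X_hold]) @ coef
print("holdout correlation:", round(float(np.corrcoef(pred, r_hold)[0, 1]), 2))
```

In practice the quality of the holdout predictions, not the development-sample fit, is the relevant test of whether validity can be projected to new MOS.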
While the above discussion focused on using PAQ dimensions as the job descriptor, the approach outlined above could be undertaken using any standardized job descriptor. One possible explanation for McCormick et al.'s lack of success in predicting validity coefficients using PAQ dimensions is that PAQ dimensions do not constitute the most appropriate job descriptor for this purpose. Consider the array of job descriptors discussed in an earlier section of this paper: specific behaviors, general behaviors, ability requirements, and global descriptors. Issues related to the use of each for this purpose will be reviewed.
One issue is practicality. This synthetic validity model requires a standardized job descriptor system applicable to all MOS. Thus, a system describing each MOS in terms of job-specific behaviors cannot be used in this approach. The availability of data on only 27 MOS also imposes practical constraints. It was proposed above that if the PAQ were used as the job descriptor, expert judgment would be used to identify a subset of PAQ dimensions for examination as possible moderators of validity. This is clearly a makeshift approach, and the possibility that the optimal dimensions will not be selected is very real. This problem is minimized or eliminated if the job descriptor system used involves a small number of dimensions. Selecting 5 of 10-15 abilities used in an ability requirement approach seems less likely to exclude critical dimensions than selecting 5 of 45 PAQ dimensions. Using a global descriptor, such as Hunter's job complexity scale, eliminates the problem entirely.
Another issue is the conceptual appropriateness of each type of job descriptor for this purpose. This discussion can be avoided and replaced by brute empiricism: for each of the 27 MOS included in the project, job analytic work could be done to produce job profiles in terms of general behavioral dimensions, ability dimensions, and global descriptors. The extent to which each of these factors moderates validity could then be examined. However, there is some basis for predicting the outcome of such an effort. First, the validity generalization literature discussed earlier has led to the recognition that within a class of jobs, such as clerical work, differences in the specific behaviors performed do not have a substantial influence on validity. Commonality of underlying abilities required leads to similar validity despite lack of overlap in specific behaviors performed. This leads to the hypothesis that more general approaches, namely, ability requirements or global descriptors, are better candidates. Second, successful attempts at examining moderators of validity across diverse jobs have used general rather than molecular job descriptors. Hunter (1980) found that regression weights for a general cognitive ability composite predicting performance increased from .07 to .40 in moving from the lowest to the highest levels of his job complexity scale in a sample of 515 GATB validity studies; similarly, the regression weights for psychomotor ability decreased from .46 to .07 in moving from the lowest to the highest levels of the complexity scale. The PAQ dimensions used successfully by Gutenberg et al. (decision making and information processing) are among the most “ability-like” of the PAQ dimensions.
Third, as Pearlman (1980) notes, the more molecular approaches lack the isomorphism with the individual differences variables being considered as predictors of performance that is found with the molar approaches. Isomorphism between job descriptor constructs and predictor constructs is conceptually elegant, making for a readily explainable and interpretable system. Thus, while isomorphism is by no means a requirement for a successful approach to examining moderators of validity, it is certainly a virtue if such an approach proves viable. Pearlman suggests the use of ability requirements as the descriptor to be used for job grouping for validity purposes.
Therefore, I would suggest that ability requirements and global job complexity be considered as moderators of validity. Fleishman's ability requirement scales (Fleishman and Quaintance, 1984) seem particularly worthy of consideration because of the extensive research leading to the development of the scales and the care taken in the definition of each ability. A separate rating scale is provided for each ability, containing a general definition of the ability, definitions of the high and low ends of the scale, a description of how the ability differs from other abilities, and illustrative tasks for various levels of the ability. For example, low, medium, and high levels of the ability “verbal comprehension” are illustrated by “understand a comic book,” “understand a newspaper article in the society section reporting on a recent party,” and “understand in entirety a mortgage contract for a new home.”
Recall our earlier discussion of possible operationalizations of the importance of a given ability: as time spent using the ability, as the contribution of the ability to variance in performance, and as the level of the ability required. The Fleishman scales clearly fall into the third category. Conceptually, this third category, level required, seems better suited as a moderator of predictor cutoffs than of validity. The second, contribution to variance in performance, seems better suited to the task at hand. Thus, a separate importance rating, explicitly defining importance as contribution to variance, might be obtained along with the level-required rating.
Therefore, it is suggested that ratings of each of the 27 project MOS be obtained using the Fleishman scales with the modification discussed above. Ratings should be made by a number of independent raters to achieve adequate reliability. Existing task analyses of these MOS should aid the rating process. If rated ability requirements are found to moderate validity, predictions of validity for new MOS can then be made.
Primoff's J-Coefficient

A wide variety of algebraically equivalent versions of the J-coefficient are available (see Hamilton, 1981; Primoff, 1955; Urry, 1978). Trattner (1982) describes the J-coefficient as the correlation of a weighted sum of standardized work behavior scores with a test score. Exactly what constitutes these standardized work behaviors, or job elements, varies across J-coefficient applications. In other words, the J-coefficient is a means of estimating the correlation between a test and a composite criterion. The computation of a J-coefficient for a given predictor requires (1) the correlation between the predictor and each criterion dimension, (2) the intercorrelations among criterion dimensions, and (3) an importance weight for each criterion dimension.
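Given these three inputs, the computation is simply the correlation of a test with a weighted composite of standardized criterion dimensions. A minimal sketch, with hypothetical correlations and weights:

```python
import numpy as np

def j_coefficient(r_xy, R_yy, w):
    """Correlation between a test and a weighted composite criterion.

    r_xy : test's correlation with each criterion dimension
    R_yy : intercorrelation matrix of the criterion dimensions
           (unit diagonal, i.e., standardized dimension scores)
    w    : importance weight for each dimension
    """
    r_xy, w = np.asarray(r_xy, float), np.asarray(w, float)
    R_yy = np.asarray(R_yy, float)
    num = w @ r_xy                # covariance of test with the composite
    den = np.sqrt(w @ R_yy @ w)   # standard deviation of the composite
    return float(num / den)

# Hypothetical three-dimension example, with the first dimension
# weighted twice as heavily as the other two.
r_xy = [0.40, 0.25, 0.10]
R_yy = [[1.0, 0.5, 0.3],
        [0.5, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
print(round(j_coefficient(r_xy, R_yy, [2.0, 1.0, 1.0]), 3))
```

Because the weights enter both the numerator and the composite's standard deviation, changing the policy weights changes the estimated validity of the same test against the "same" job, which is the point made below about importance weights being a policy matter.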
It is critical to note that the importance of various criterion constructs is a policy issue as well as a scientific one; I take issue with the notion that there is such a thing as “true” overall performance. Consider, for example, two potential dimensions of military job performance: current job knowledge and performance under adverse conditions. It does not seem unreasonable that a policy directive to emphasize combat readiness would increase the importance attached to the second relative to the first. Presuming a lack of perfect correspondence between individuals' standing on the two criterion constructs, the rank order of a group of individuals on a composite criterion would change; which order is “right” reflects policy priorities. Thus the scientific contribution is to identify predictors of each criterion construct; for any given set of weighted criteria we can then estimate the validity of a selection system.
The relevance of the J-coefficient to this project lies mainly in the contribution the approach can make to the issue of determining the correlation between each predictor and each criterion construct. The J-coefficient formulas, of course, accept any validity estimate; users of the approach typically rely on judgments of the relevance of test items or entire tests for each criterion construct. As judgmental approaches to validity estimation will be discussed separately, no further attention is needed for the J-coefficient itself.
Judgmental Estimates of Validity
Recent research has reported considerable success in obtaining validity estimates by pooling direct judgments of test validity across a number of judges. Schmidt et al. (1983) and Hirsh et al. (1986) asked psychologists to provide direct estimates of the validity of six subtests of the Naval Basic Test Battery, using training performance as the criterion, for a set of nine jobs. The jobs were selected because of the availability of criterion-related validity studies with sample sizes greater than 2,000, thus providing a standard, virtually free of sampling error, against which the judgments could be compared. Schmidt et al. used experienced psychologists as judges and found that the pooled judgment of ten psychologists deviated on average from the true value by the same amount as would be expected in a criterion-related validity study with a sample size of 673. In other words, this pooled judgment provided a far better estimate of validity than all but the largest-scale validity studies. In contrast, Hirsh et al. used the same job/test combinations with a sample of new Ph.D.s and found that the judgment of a single experienced Ph.D. was as accurate as the pooled judgment of ten new Ph.D.s. The pooled judgment of ten new Ph.D.s proved as accurate as a validity study with a sample size of 104.
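The "equivalent sample size" framing in these studies can be made concrete with the usual large-sample approximation to the sampling error of a correlation, SE(r) ≈ (1 − ρ²)/√(N − 1). The sketch below simply inverts this to find the N whose sampling error matches a given judgment standard error; the numbers are illustrative and this is not necessarily the exact procedure Schmidt et al. used.

```python
def equivalent_n(judgment_se, rho):
    """Sample size whose validity-study sampling error matches the
    standard error of a (pooled) validity judgment.

    Uses SE(r) ~= (1 - rho**2) / sqrt(N - 1), the large-sample
    approximation to the sampling error of a correlation coefficient.
    """
    return 1 + ((1 - rho**2) / judgment_se) ** 2

# Illustrative: a pooled judgment with a standard error of .035 around a
# true validity of .30 carries the information of a large local study.
print(round(equivalent_n(0.035, 0.30)))
```

The translation makes clear why pooling helps: if judges err independently, the standard error of the pooled judgment shrinks with the square root of the number of judges, and the equivalent N grows accordingly.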
The differences found between experienced and inexperienced judges are of great interest. Schmidt et al. (1983) attribute the success of experienced judges to their experience in conducting validation research and in accumulating information about validity research done by others. This line of reasoning suggests that even experienced judges will not be successful in estimating validity for predictor/criterion combinations for which little prior information is available. An alternative explanation for the success of experienced judges is that it is simply due to broader experience with the world of work: they have spent more time in the workplace and have better insights into job requirements. Thus, even for predictor/criterion combinations for which no validity evidence exists at present, they may be able to make accurate judgments. Note that in the J-coefficient literature there is evidence that job incumbents' judgments of test-criterion relationships are predictive of empirical validity results, suggesting that work experience, rather than test validation experience, may be the critical factor. Thus there is some basis for positing both that experienced psychologists will be able to estimate validity for a wide variety of predictor-criterion combinations and that experienced nonpsychologists, such as job incumbents, may also be able to do so. Panels of psychologists and experienced incumbents could be assembled and asked to make validity judgments. As nonpsychologists are not likely to be comfortable estimating validity coefficients, ratings of the importance of the predictor construct for differentiating between high and low performers on the criterion construct could be obtained and the correspondence between these ratings and empirical validity coefficients then determined.
Schmidt et al.'s (1983) speculation that the success of experienced psychologists is a function of their memory of validation results for other jobs could be examined. Both the psychologist and incumbent samples could first be asked to estimate validity for five MOS included in the Job Performance Measurement Project in the absence of any information about project results and then asked to estimate validity for five additional MOS. For these additional MOS the judges would be presented with Job Performance Measurement Project validity results for the other 22 MOS to serve as anchors for their judgments. Thus, the impact of information about predictor-criterion relationships for other specialties on the accuracy of validity judgments could be examined.
Paired Comparison Judgments of Validity
An alternative approach to estimating validity judgmentally is the use of paired comparison judgments. Rather than estimating validity directly, judges could be presented with pairs of occupational specialties and asked, for each predictor-criterion combination, which specialty in the pair has the higher validity coefficient. Paired comparison judgments could be obtained from psychologists for 20 of the 27 MOS in the Job Performance Measurement Project data base. These judgments could be pooled across raters and scaled, and the scaling solution then compared with obtained validity coefficients from the project. If a substantial degree of correspondence was found between the scaling solution and obtained validity coefficients, then validity estimates for new MOS could be produced by obtaining paired comparison judgments comparing the new MOS with those for which validity is known and thus mapping the new MOS into the scaling solution. The seven holdout MOS could be used to demonstrate the viability of this approach. This approach is dependent on the assumption that the JPM Project data base includes the full range of MOS, such that the scale points represent the full range of validity coefficients likely to be obtained. Note that this approach demands that judges be very knowledgeable of all MOS involved in the judgments. However, a complete set of judgments is not needed from each judge; each judge can rate partially overlapping sets of MOS. Finally, note that with modification of existing software, this judgment task can be administered by computer.
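One standard way to pool and scale such judgments is Thurstone's Case V model: the proportion of judges preferring one specialty over another is converted to a normal deviate, and each specialty's scale value is its mean deviate over all comparisons. The sketch below uses hypothetical pooled proportions for four MOS.

```python
from statistics import NormalDist

def thurstone_case_v(P):
    """Thurstone Case V scaling of a paired-comparison proportion matrix.

    P[i][j] = proportion of judges saying specialty i has the higher
    validity than specialty j. Scale values are mean normal deviates
    per row (diagonal entries are ignored).
    """
    z = NormalDist().inv_cdf
    k = len(P)
    scale = []
    for i in range(k):
        zs = [z(P[i][j]) for j in range(k) if j != i]
        scale.append(sum(zs) / (k - 1))
    return scale

# Hypothetical proportions for four MOS, pooled across raters.
P = [[None, 0.80, 0.70, 0.90],
     [0.20, None, 0.40, 0.70],
     [0.30, 0.60, None, 0.75],
     [0.10, 0.30, 0.25, None]]
vals = thurstone_case_v(P)
print([round(v, 2) for v in vals])
```

A new MOS would be placed on this scale by collecting its paired comparisons against the already-scaled MOS; the resulting scale value is then mapped to a validity estimate via the correspondence established on the development MOS. Note that unanimous judgments (proportions of 0 or 1) must be handled specially, since the normal deviate is undefined there.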
Determination of Minimum Standards
The determination of minimum cutoff scores has been and remains a problem for which no simple or agreed-upon solution exists. Many approaches to setting cutoffs have been identified (e.g., Buck, 1977; Campion and Pursell, 1980; Drauden, 1977; Guion, 1965; Nedelsky, 1954). One thing all have in common is a subjective component: setting a cutting score requires a value judgment (Cronbach, 1949).
Much of the discussion of cutoff scores is in the context of either achievement testing in an educational setting or the use of content-valid work sample or knowledge tests in public-sector employment settings. In both of these settings one is typically setting a predictor cutoff in the absence of criterion information. Judgments about the expected test item performance of minimally satisfactory performers are typically combined to identify minimum test cutoffs. Without criterion data, a standard against which these techniques for setting cutoffs can be evaluated is lacking. With criterion data and a large sample size, a different type of approach is possible. Based on expert judgment, the minimum acceptable level of criterion performance is identified, and the regression equation relating predictor and criterion is used to identify the predictor score corresponding to this minimum level of acceptable performance. Given the probabilistic nature of predictor-criterion relationships, some individuals scoring above this cutoff will fail and some individuals scoring below it will succeed. The relative value assigned by the organization to each of these types of prediction error will influence the choice of the actual cutoff.
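Under a linear regression model with normally distributed prediction errors, this logic can be carried one step further: the cutoff can be set so that an applicant at the cutoff has no more than a tolerated probability of falling below the performance minimum. The equation, residual standard deviation, and performance minimum below are all hypothetical.

```python
from statistics import NormalDist

def cutoff_for_failure_rate(a, b, se, y_min, max_fail):
    """Predictor cutoff such that, under the regression y = a + b*x with
    residual SD `se`, an applicant scoring at the cutoff has at most
    `max_fail` probability of falling below criterion level `y_min`.
    """
    z = NormalDist().inv_cdf(1 - max_fail)
    return (y_min + z * se - a) / b

# Hypothetical equation: criterion = 20 + 0.6 * composite, residual SD 8.
# Minimum acceptable performance judged to be 60; tolerate 10% failures.
print(round(cutoff_for_failure_rate(20, 0.6, 8, 60, 0.10), 1))
```

Setting the tolerated failure rate to 50 percent recovers the simple regression-based cutoff (the predictor score whose predicted criterion value equals the minimum); stricter tolerances push the cutoff upward, which is exactly where the organization's valuation of the two types of prediction error enters.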
This approach could be applied to all predictor-criterion combinations for the 27 MOS included in the project. Panels of officers directly supervising individuals in each of the 27 MOS could be convened to reach consensus on the minimum acceptable level of performance on each performance construct. Two panels of five for each MOS would provide a group size conducive to consensus decision making and allow a comparison of the judgments of two independent panels. Thus, for each predictor-criterion combination for each MOS, the predictor score corresponding to minimum acceptable criterion performance could be identified. The availability of these predictor scores would provide a standard against which techniques for setting cutoff scores can be evaluated even in situations in which empirical predictor and criterion data are not available (e.g., new MOS).
Three techniques for identifying predictor cutoffs are examined. These directly parallel techniques proposed for estimating validity for new MOS. Each is discussed in turn.
Earlier we discussed the use of a synthetic validity approach to examining variance in validity coefficients across MOS. Examples of this approach using the PAQ were examined, and recommendations were made that ability requirements, rather than general or specific job behaviors, be used as the standardized job descriptor. In applying this approach to setting cutoff scores, previous research has reported a high degree of success (median correlation of .73) in using PAQ dimension scores as predictors of mean test scores obtained by job incumbents (Mecham et al., 1977). The Mecham et al. work is based on what they call the “gravitational hypothesis,” namely, that people gravitate to jobs commensurate with their abilities. Mecham et al. advocate cutoff scores based on the predicted mean test score (e.g., a cutoff of 1 or 2 standard deviations below the predicted mean). This approach is conceptually meaningful only when individuals are free to gravitate to particular jobs. If a test is used to assign individuals to jobs, the mean test score merely reflects the organization's a priori assumptions about the job, rather than revealing anything about needed ability levels. Thus, in the military context, predicting mean test score is not very informative. However, the general strategy of using job information (e.g., PAQ dimensions) to predict needed predictor construct scores can be applied by substituting, for the mean predictor score, the regression-based predictor score corresponding to the needed minimum level of criterion performance.
Again, given 45 PAQ dimensions and only 27 MOS, judgments could be obtained, for each predictor construct, of the PAQ dimensions most likely to predict the needed predictor construct scores, so as to achieve a reasonable ratio of predictors to cases. For each predictor construct, a regression equation using 5 PAQ dimensions to predict the needed predictor construct scores could be computed on 20 MOS; the resulting equation could then be applied to the 7 holdout MOS.
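A minimal sketch of this calibration-and-holdout design, on wholly synthetic data (the PAQ dimension scores, weights, and known cutoffs below are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: 27 MOS scored on the 5 PAQ dimensions judged most
# relevant to one predictor construct, plus each MOS's known
# regression-based predictor cutoff.
paq_scores = rng.normal(size=(27, 5))
true_weights = np.array([4.0, 2.0, 0.0, 1.5, -1.0])
known_cutoffs = 50 + paq_scores @ true_weights + rng.normal(scale=2.0, size=27)

calib, holdout = slice(0, 20), slice(20, 27)  # 20 calibration MOS, 7 holdout

# Regress known cutoffs on PAQ dimensions for the 20 calibration MOS
X = np.column_stack([np.ones(20), paq_scores[calib]])
beta, *_ = np.linalg.lstsq(X, known_cutoffs[calib], rcond=None)

# Apply the equation to the 7 holdout MOS and compare with their known cutoffs
X_hold = np.column_stack([np.ones(7), paq_scores[holdout]])
estimated = X_hold @ beta
mean_abs_error = np.abs(estimated - known_cutoffs[holdout]).mean()
```

The holdout error summarizes how well PAQ-based equations would recover cutoffs for MOS lacking criterion data.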
As was the case in considering moderators of validity coefficients, alternatives to the PAQ as the job descriptor of choice should be considered. Without repeating the earlier discussion, the Fleishman ability scales seem particularly well-suited to this task. The explicit measurement of level of ability required links directly to the task of setting predictor cutoffs.
The second approach involves direct estimates of minimum predictor cutoffs. While many approaches to setting cutoffs are based on judgments about predictors (Buck, 1977), such approaches typically involve judgments at the test item level (e.g., judged likelihood that each response option will be chosen by minimally qualified applicants). Such approaches are conceptually more meaningful when dealing with achievement tests, such as those used in an educational setting, than with ability, interest, or biodata measures. Thus, rather than aggregating item-level judgments, direct judgments of minimum predictor cutoffs could be examined. As in the case of direct estimates of validity, panels of psychologists and incumbents could be convened to estimate needed cutoff scores for 5 MOS in the absence of information about cutoff scores for other MOS and then make estimates for 5 additional MOS with access to the regression-based predictor cutoffs for the other 22 MOS.
Finally, a paired comparison process similar to that proposed for validity estimation could be examined. Psychologists could be asked to judge which of a pair of MOS requires the higher predictor score for a given predictor-criterion combination. Judgments could be obtained for all pairs of the 20 project MOS; these judgments could be scaled and compared with the regression-based predictor cutoffs. Each of the 7 holdout MOS could then be compared with the MOS for which cutoffs are known and the results mapped into the scaling solution to produce cutoff estimates for the holdout MOS.
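One standard way to scale such paired comparison judgments is Thurstone's Case V model; the sketch below, using fabricated proportions for three MOS, is an assumption about implementation rather than the procedure the project would necessarily adopt:

```python
from statistics import NormalDist

import numpy as np

def thurstone_case_v(prop):
    """Thurstone Case V scaling.  prop[i][j] is the proportion of judges
    saying MOS i requires a higher predictor score than MOS j.  Extreme
    proportions are clipped before the inverse-normal transform; the row
    means of the resulting z-matrix are the scale values."""
    nd = NormalDist()
    z = np.array([[nd.inv_cdf(min(max(p, 0.01), 0.99)) for p in row]
                  for row in prop])
    return z.mean(axis=1)

# Fabricated judgment proportions for three MOS (rows/columns: A, B, C)
prop = [[0.50, 0.80, 0.95],
        [0.20, 0.50, 0.80],
        [0.05, 0.20, 0.50]]
scale_values = thurstone_case_v(prop)  # A > B > C on the required-score scale
```

The resulting scale values give each MOS a position on an interval scale of required predictor level, which is what would be regressed against the regression-based cutoffs for the known MOS.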
In retrospect, this paper may be mistitled. The focus has not been on clustering per se, but rather on exploring possible approaches to extending validity findings and empirically based predictor cutoffs beyond the 27 MOS included in the Job Performance Measurement Project. No single best approach has been identified; rather, a number of possibilities have been examined.
A critical question is whether point estimates of validity are needed for various MOS, or whether all that is needed is confidence that the predictors in question have meaningful levels of validity for various MOS. If the second will suffice, the dual strategy of conducting a meta-analysis of the validity studies correlating ASVAB subtests and composites with hands-on performance measures and conducting a meta-analysis of hands-on performance-training performance correlations should provide a clear picture of ASVAB validity for the universe of MOS. The analysis of correlations between ASVAB and hands-on measures is expected, at least by this author, to produce a similar pattern of findings to meta-analyses of cognitive ability tests using training or rating criteria; the expected strong relationship between hands-on measures and training criteria provides a link to the larger body of validity studies using training criteria.
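The meta-analytic step can be illustrated with a bare-bones computation in the Hunter-Schmidt style on invented validity coefficients; corrections for unreliability and range restriction, which a full analysis would include, are omitted here:

```python
import numpy as np

def bare_bones_meta(validities, sample_sizes):
    """Bare-bones meta-analysis: sample-size-weighted mean validity,
    observed variance across studies, and the residual variance left after
    subtracting the variance expected from sampling error alone."""
    r = np.asarray(validities, dtype=float)
    n = np.asarray(sample_sizes, dtype=float)
    r_bar = np.average(r, weights=n)
    var_obs = np.average((r - r_bar) ** 2, weights=n)
    var_err = (1.0 - r_bar ** 2) ** 2 / (n.mean() - 1.0)
    return r_bar, var_obs, max(var_obs - var_err, 0.0)

# Invented validities for the same predictor composite across several MOS
r_bar, var_obs, var_residual = bare_bones_meta(
    [0.25, 0.31, 0.22, 0.35, 0.28], [150, 220, 90, 310, 180])
```

A residual variance near zero would indicate that the observed variation across MOS is attributable to sampling error, supporting the conclusion that validity generalizes.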
If point estimates of validity are needed, a number of possibilities have been proposed: synthetic validity, direct estimation of validity, and paired comparison judgments of job similarity. Each could be attempted, and the relative validity, cost, and ease of use of each could be examined. Considerable attention was paid to the choice of job descriptor, as the synthetic validity approach involves regressing validity coefficients on standardized job descriptors. Conceptual arguments as well as empirical data were reviewed concerning the choice of specific behavior, general behavior, ability requirements, or global job information as the job descriptor. While the choice can be viewed as an empirical question to be answered by analyzing the 27 MOS involved in the project using multiple job analytic systems, a strong argument was made for using general, rather than molecular, job descriptors, with particular attention paid to ability requirements as the descriptor of choice. Each of these three approaches to generalizing point estimates of validity was seen as applicable, with minor modification, to the issue of establishing predictor cutoffs.
As indicated in the opening section of this paper, attention has not been paid to quantitative procedures for grouping jobs. The concern with both the descriptive and inferential grouping methods was that groupings were made on the basis of the relative similarity of jobs. What was lacking was an external criterion for determining whether jobs were sufficiently similar that they could be treated the same for the purpose at hand. Data showing that job A is more similar to job B than to job C are not useful without a basis for knowing whether the magnitude of the differences between the jobs is large enough to require that the jobs be treated differently. The synthetic validity approaches discussed in this paper offer the needed criterion. The magnitude of difference on an ability requirement scale needed to produce a change in cutoff score of a given magnitude can be determined and then used to guide clustering decisions. Hierarchical clustering procedures produce a full range of possible clustering solutions, from each job as an independent cluster to all jobs grouped in one large cluster. At each interim stage, the size of within-cluster differences can be determined; with information as to the magnitude of differences needed to affect the personnel decision in question, one has a basis for informed decisions as to the appropriate number of clusters to retain and as to which jobs can be treated the same for the purpose at hand.
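A minimal sketch of such threshold-guided agglomeration follows, on invented ability-requirement profiles; complete linkage is used here because it bounds the largest within-cluster difference, but it is only one of several defensible linkage rules:

```python
import numpy as np

def cluster_until_threshold(profiles, threshold):
    """Complete-linkage agglomerative clustering over ability-requirement
    profiles.  Merging stops once joining any two clusters would place jobs
    differing by more than `threshold` (the profile distance judged large
    enough to change a cutoff decision) into the same cluster."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest jobs
                d = max(np.linalg.norm(profiles[i] - profiles[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break  # any further merge would exceed the decision threshold
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Four hypothetical jobs on two ability-requirement dimensions
profiles = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
clusters = cluster_until_threshold(profiles, threshold=1.0)  # two clusters
```

With the threshold tied to the difference needed to change a cutoff, the solution directly answers which jobs can be treated the same for the purpose at hand.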
REFERENCES
Arvey, R.D., and M.E. Begalla 1975 Analyzing the homemaker job using the Position Analysis Questionnaire. Journal of Applied Psychology 60:513-517.
Bangert-Drowns, R.L. 1986 Review of developments in meta-analytic method. Psychological Bulletin 99:388-399.
Buck, L.S. 1977 Guide to the setting of appropriate cutting scores for written tests: a summary of the concerns and procedures. Washington, D.C.: U.S. Civil Service Commission, Personnel Research and Development Center.
Campion, M.A., and E.D. Pursell 1980 Adverse impact, expected job performance, and the determination of cut scores. Paper presented at the meeting of the American Psychological Association, Montreal, August.
Cascio, W. 1982 Applied Psychology in Personnel Management. Reston, Va.: Reston Publishing.
Cooper, H.M. 1984 The Integrative Research Review. Beverly Hills, Calif.: Sage Publications.
Cornelius, E.T., T.J. Carron, and M.N. Collins 1979 Job analysis models and job classification. Personnel Psychology 32:693-708.
Cornelius, E.T., F.L. Schmidt, and T.J. Carron 1984 Job classification approaches and the implementation of validity generalization results. Personnel Psychology 37:247-260.
Cronbach, L.J. 1949 Essentials of Psychological Testing, 3rd ed. New York: Harper and Row.
Drauden, G. 1977 Setting Passing Points. Minneapolis, Minn.: Minneapolis Civil Service Commission.
Fleishman, E.A., and M.K. Quaintance 1984 Taxonomies of Human Performance. Orlando, Fla.: Academic Press.
Glass, G.V., B. McGaw, and M.L. Smith 1981 Meta-analysis in Social Research. Beverly Hills, Calif.: Sage Publications.
Guion, R.M. 1965 Personnel Testing. New York: McGraw-Hill.
Gutenberg, R.L., R.D. Arvey, H.G. Osburn, and P.R. Jeanneret 1983 Moderating effects of decision-making/information-processing job dimensions on test validities. Journal of Applied Psychology 68:602-608.
Hamilton, J.W. 1981 Options for small sample sizes in validation: a case for the J-coefficient. Personnel Psychology 34:805-816.
Harvey, R.J. 1986 Quantitative approaches to job classification: a review and critique. Personnel Psychology 39:267-289.
Hawk, J. 1970 Linearity of criterion-GATB aptitude relationships. Measurement and Evaluation in Guidance 2:249-251.
Hedges, L.V. 1985 Statistical Methods for Meta-analysis. New York: Academic Press.
Hirsh, H.R., F.L. Schmidt, and J.E. Hunter 1986 Estimation of employment validities by less experienced judges. Personnel Psychology 39:337-344.
Hunter, J.E. 1980 Test validation for 12,000 jobs: an application of synthetic validity and validity generalization to the GATB. Washington, D.C.: U.S. Employment Service, U.S. Department of Labor.
1986 Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior 29:340-362.
Hunter, J.E., J.J. Crosson, and D.H. Friedman 1985 The Validity of the ASVAB for Civilian and Military Job Performance. Rockville, Md.: Research Applications, Inc.
Hunter, J.E., and R.F. Hunter 1984 Validity and utility of alternative predictors of job performance. Psychological Bulletin 96:72-98.
Hunter, J.E., F.L. Schmidt, and G.B. Jackson 1982 Meta-analysis: Cumulating Research Findings Across Studies. Beverly Hills, Calif.: Sage Publications.
Lawshe, C.H. 1952 What can industrial psychology do for small business? Personnel Psychology 5:31-34.
McCormick, E.J., P.R. Jeanneret, and R.C. Mecham 1972 A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire. Journal of Applied Psychology Monograph 56:347-368.
Mecham, R.C., E.J. McCormick, and P.R. Jeanneret 1977 Position Analysis Questionnaire Technical Manual, System II. PAQ Services, Inc. (Available from University Book Store, 360 W. State St., West Lafayette, IN 47906.)
Mossholder, K.W., and R.D. Arvey 1984 Synthetic validity: a conceptual and comparative review. Journal of Applied Psychology 69:322-333.
Nedelsky, L. 1954 Absolute grading standards for objective tests. Educational and Psychological Measurement 14:3-19.
Pearlman, K. 1980 Job families: a review and discussion of their implications for personnel selection. Psychological Bulletin 87:1-28.
Pearlman, K., F.L. Schmidt, and J.E. Hunter 1980 Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology 65:373-406.
Primoff, E.S. 1955 Test Selection by Job Analysis (Technical Test Series, No. 20). Washington, D.C.: U.S. Civil Service Commission, Standards Division.
1975 How to Prepare and Conduct Job Element Examinations. Personnel Research and Development Center. Washington, D.C.: U.S. Civil Service Commission.
Raju, N.S. 1986 An evaluation of the correlation, covariance, and regression slope models. Paper presented at the meeting of the American Psychological Association, Washington, D.C.
Richardson, Bellows, Henry, and Co. 1983 Technical Reports: The Candidate Profile Record. Washington, D.C.
Rosenfeld, M., B. Shimberg, and R.F. Thornton 1983 Job Analysis of Licensed Psychologists in the United States and Canada. Princeton, N.J.: Center for Occupational and Professional Assessment, Educational Testing Service.
Rosenthal, R. 1984 Meta-analytic Procedures for Social Research. Beverly Hills, Calif.: Sage Publications.
Sackett, P.R., N. Schmitt, M.L. Tenopyr, J. Kehoe, and S. Zedeck 1985 Commentary on “40 questions about validity generalization and meta-analysis.” Personnel Psychology 38:697-798.
Sackett, P.R., M.M. Harris, and J.M. Orr 1986 On seeking moderator variables in the meta-analysis of correlation data: a Monte Carlo investigation of statistical power and resistance to Type I error. Journal of Applied Psychology 71:302-310.
Schmidt, F.L., J.E. Hunter, and K. Pearlman 1981 Task differences as moderators of aptitude test validity in selection: a red herring. Journal of Applied Psychology 66:166-185.
Schmidt, F.L., J.E. Hunter, P.R. Croll, and R.C. McKenzie 1983 Estimation of employment test validities by expert judgment. Journal of Applied Psychology 68:590-601.
Schmidt, F.L., J.E. Hunter, K. Pearlman, and H.R. Hirsh 1985 Forty questions about validity generalization and meta-analysis. Personnel Psychology 38:697-798.
Schneider, B., and N. Schmitt 1986 Staffing Organizations. Glenview, Ill.: Scott-Foresman.
Trattner, M.H. 1982 Synthetic validity and its application to the uniform guidelines validation requirements. Personnel Psychology 35:383-397.
Urry, V.W. 1978 Some variations on derivation by Primoff and their extensions. Washington, D.C.: U.S. Civil Service Commission, Personnel Research and Development Center.