Exploring Strategies for Clustering Military Occupations

Paul R. Sackett

CLUSTERING MILITARY OCCUPATIONS

The Joint Services Project on Assessing the Performance of Enlisted Personnel has resulted in the collection of data on a variety of criterion measures for a number of occupational specialties. Intercorrelations among these criteria are being examined, as are relationships between the Armed Services Vocational Aptitude Battery (ASVAB) subtests and composites and performance on these criterion measures. A fundamental issue facing the Services is that of extending the results of these efforts from the limited set of occupational specialties included in the project to the universe of military occupational specialties (MOS).

More specifically, three different types of extension are needed. The first is the issue of ASVAB validity: based on known ASVAB-performance relationships for a small number of MOS, we wish to infer ASVAB-performance relationships for the universe of MOS. The second is the issue of intercorrelations among criteria. For a small number of MOS, intercorrelations among various types of criteria (e.g., hands-on performance tests and training grades) are known; we wish to generalize these relationships to the universe of specialties. The third is the issue of setting predictor cutoffs for various MOS. For MOS for which ASVAB and criterion data are available, it is at least possible (even if not current practice) to set cutoffs to ensure that no more than a specified proportion of applicants will fall below some specified level of criterion performance. We wish to set justifiable cutoffs for MOS for which high-quality criterion data are not available.

The critical question is what aspects of jobs produce variations in validity coefficients, criterion intercorrelations, and cutoff scores. If this question can be answered, we can then ask two more questions: (1) which MOS can be shown to be sufficiently similar to MOS for which predictor and criterion data are available that we can infer that validity is the same and/or that appropriate cutoffs are the same; and (2) for MOS that are not sufficiently similar to any for which predictor-criterion data are available, can we establish relationships between job characteristics and validity coefficients, criterion intercorrelations, and cutoff scores such that we can make projections to MOS for which predictor-criterion data are not available?

This paper considers approaches to addressing the need to assess job similarity in the context of the questions stated in the above paragraph, rather than offering a general review of the job clustering literature. My single greatest concern about both the job analysis and job clustering literatures is the pervasive tendency to ignore the purpose for which job analysis is being done or for which jobs are being compared. When comparing jobs, two major decisions need to be made: (1) what job descriptor to use (e.g., tasks, abilities), and (2) what quantitative clustering procedure to use. The second has received more attention than the first; a detailed review by Harvey (1986) makes it unnecessary to treat this issue in detail here. The first factor has been shown (e.g., Cornelius et al., 1979) to have a large impact on decisions about job similarity. For example, jobs very different at the task level may be quite similar at the ability level. Decisions about the appropriate job descriptor are needed for subsequent efforts to examine the relationships between job characteristics and validities and cutoff scores.

JOB ANALYSIS METHODS: THE CHOICE OF THE JOB DESCRIPTOR

Numerous approaches to analyzing jobs exist. Textbooks in the fields of industrial/organizational psychology and personnel management commonly catalog 6-12 job analysis methods (e.g., functional job analysis, the Position Analysis Questionnaire (PAQ), task checklists, the job element method, critical incidents, ability requirement scales, threshold traits analysis) (e.g., Cascio, 1982; Schneider and Schmitt, 1986). One way to disentangle this myriad of approaches is to characterize them on a number of dimensions: the source of information (e.g., incumbent versus supervisor versus job analyst), the method of collecting information (e.g., observation versus interview versus questionnaire), the purpose (e.g., setting selection standards versus setting wages), and the job descriptor (e.g., describing tasks versus describing attributes needed for task performance). Of particular interest here are these last two: the purpose for which the job analysis information is collected and the job descriptor chosen.

Pearlman's (1980) review of the literature on the formation of job families identifies four major categories of job descriptors. The first he labels “job-oriented content,” referring to systems that describe work activities in terms of work outcomes or tasks. In other words, the focus is on what work is accomplished. Such systems are job specific. Pearlman gives two examples of task statements: “turns valves to regulate flow of pulp slush from main supply line to pulp machine headbox” and “install cable pressurization systems.” I have relabeled this category with the more descriptive title “specific behaviors.” Researchers and practitioners describing jobs at this level typically use the label “tasks” and generate a detailed list of task statements; four to five hundred task statements are not uncommon.

Pearlman's second category is labeled “worker-oriented content,” referring to systems that describe work activities in terms of behaviors or job demands that are not job specific. These systems are intended to be applicable to a wide variety of jobs and commonly involve evaluating jobs using a standard questionnaire. I have relabeled this category “general behaviors.” McCormick's Position Analysis Questionnaire (PAQ) typifies this approach; sample PAQ items include “use quantitative materials” and “estimate speed of moving objects.” Researchers and practitioners describing jobs in these terms typically use an inventory of 100-200 behavioral statements.

Pearlman's third category is labeled “attribute requirements,” referring to systems that describe jobs in terms of the areas of knowledge, skill, or ability needed for successful job performance. Two very different approaches can fall into this category. The first involves the identification of specific areas of knowledge, skill, and ability needed for performance in one specific job, in the context of developing selection tests that will be justified on content validity grounds. This is a very common activity among psychologists developing selection systems in public sector settings. The critical feature is that applicants for the job in question are expected to have already obtained the training needed to perform the job; thus the focus is on determining the extent to which applicants possess specific knowledge and skills needed for immediate job performance. In these settings it is not uncommon to develop detailed lists of 100-200 areas of needed knowledge, skill, and ability; these lists are then used to guide test development. The second approach is more applicable to the military situation in that it fits settings in which training takes place after selection. Because knowledge and specific skills will be acquired in training, selection is based on abilities shown to be predictive of knowledge and skill acquisition and/or subsequent job performance. Thus, rather than focusing on large numbers of areas of job-specific knowledge and skill, this approach involves describing jobs in terms of a fixed set of cognitive, perceptual, and psychomotor abilities. I use the label “ability requirements” to refer to this subset of the more general category “attribute requirements.”

An example of this approach is Fleishman's work on ability requirements (Fleishman and Quaintance, 1984). Based on an extensive program of research, a list of abilities was created, along with rating scales for evaluating the degree to which each of the abilities is required. The present list identifies 52 abilities; smaller numbers could be used if, for example, motor requirements were not relevant to the purpose for which job information was being collected. A focus solely on cognitive ability requirements would involve 14 abilities. Examples include “number facility” and “fluency of ideas.” Thus the ability requirements approach involves describing jobs in terms of a relatively limited number of abilities required for job performance.

Pearlman's fourth category is labeled “overall nature of the job,” referring to approaches that characterize jobs very broadly, such as by broad job family (managerial, clerical, sales). An example of this category that may be of particular interest to the Job Performance Measurement Project (JPM Project) is Hunter's (1980) grouping of all 12,000 jobs in the Dictionary of Occupational Titles into one of five categories on a job complexity scale. This complexity scale is based on a recombination of the Data and Things scales used by the U.S. Department of Labor to classify jobs. Hunter shows that validity coefficients for composites of General Aptitude Test Battery (GATB) subtests differ across levels of this complexity variable and are very similar within levels of this variable.

As Pearlman points out, distinctions between these categories are not always clear, and some approaches to job analysis involve multiple categories. However, it is conceptually useful to conceive of these four categories as a continuum from more specific to less specific. A given job can be described in terms of a profile of 400-500 specific behaviors, 100-200 general behaviors, 10-40 abilities, or a single global descriptor, such as job complexity. It should be recognized that this is not merely a continuum of level of specificity; there are clearly qualitative differences in moving from behaviors performed to abilities required. Nonetheless, this discussion should clarify the differences in level of detail involved in the various approaches to describing jobs and should set the stage for a discussion of the relationship between the purpose for which job information is being collected and the type of job descriptor chosen.

These two issues—purpose and job descriptor chosen—are closely intertwined. The question “which job analysis method is most appropriate?” can only be answered in the context of a specific purpose. An illustration comes from a job analysis of the job “psychologist.” An issue of concern was whether different specialties within psychology—clinical, counseling, industrial/organizational, and school—were similar enough that a common licensing exam was appropriate for these four specialties. The Educational Testing Service (ETS) was commissioned to conduct a comparative job analysis of these four areas (Rosenfeld et al., 1983). An inventory of 59 responsibilities and 111 techniques and knowledge areas was designed and mailed to a carefully selected sample of licensed psychologists. The study found a common core of responsibilities among all four specialties and chided various practice areas for emphasizing the uniqueness of their own group.

I am not denying that there are commonalities among different types of psychologists. However, I will argue that I could easily have designed a survey instrument that would have produced different results. One thing industrial/organizational psychologists have learned from experience with job analysis is that the more general the data collected, the more likely it is that jobs will appear similar when subjected to statistical analysis; conversely, the more specific the inventory items, the greater the apparent differences among jobs. The art of job analysis lies in determining a level of specificity that meets the purposes of the particular job analysis application.

Consider some of the statements making up the ETS inventory. Responsibility 1 leads the inventory, reading: “Conduct interviews with client/patient, family members or others to gain an understanding of an individual's perceived problem.” This is endorsed by a high proportion of respondents from all specialties, yet it can mean dramatically different things, from interviewing a corporate executive to gain insight into an organization's incentive pay plan to interviewing a 7-year-old suspected victim of child abuse. More examples: “observe the behavior of individuals who are the focus of concern” and “formulate a working hypothesis or diagnosis regarding problems or dysfunctions to be addressed.” Again, these can refer to dramatically different activities. More to the point, given that the purpose of the job analysis is to support the creation of one or more licensing exams, these can require different skills, abilities, training, and experience. By being more specific and rephrasing Responsibility 1 as multiple tasks (“interview business clients,” “interview adult patients,” “interview children”), the chances of concluding that the jobs are different increase. By getting even more general (“gather information verbally”), the chances of concluding that the jobs are similar increase. Each of these three levels of specificity presents information that is true. However, the question of which level of specificity is appropriate depends on the purpose for which the information is being collected.

In the above example, the three levels of specificity illustrated all focus on worker activities. The job descriptor chosen is in all cases behavioral; the levels vary on a continuum from general behaviors to specific behaviors. Similarly, one may reach different conclusions about job similarities and differences if different categories of job descriptors are chosen (e.g., focusing on job activities versus focusing on abilities required for job performance).

A multiorganization study of bank teller and customer service jobs illustrates this nicely (Richardson, Bellows, Henry, and Co., 1983). A 66-item behavioral work element questionnaire (e.g., “cashes savings bonds,” “verifies signatures,” “types entries onto standardized forms”) and a 32-item ability requirement questionnaire (e.g., “ability to sort and classify forms,” “ability to compute using decimals,” “ability to pay attention to detail”) were administered. While the vast majority of incumbents held the title “paying and receiving teller,” 20 other job titles were found (e.g., new accounts representative, customer service representative, drive-in teller, safe deposit custodian). The issue was whether these 20 jobs were sufficiently similar to the job of paying and receiving teller that a selection test battery developed for the paying and receiving tellers could also be used for the other jobs. A correlation between each job and the paying and receiving teller was computed, first based on the behavioral work element ratings and then based on the ability ratings. In a number of cases, dramatically different findings emerged. The new accounts representative, customer service representative, and safe deposit custodian correlated .21 with the paying and receiving teller when comparing the jobs based on similarity of rated behavioral work elements. These same three jobs correlated .90, .92, and .88 with the paying and receiving teller when comparing the jobs based on similarity of rated ability requirements.

Thus the use of different job descriptors leads to different conclusions about job similarity. Conceptually, one could argue that for purposes of developing an ability test battery, the ability requirements data seem better suited. If data on these same jobs were being collected to determine whether a common training program for new hires was feasible, one might argue that the work behavior data seem better suited. Again, the question “which jobs are sufficiently similar that they can be treated the same?” cannot be answered without information as to the purpose for which the jobs are being compared.
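Computationally, the comparison made in the teller study reduces to correlating two jobs' rating profiles, once under each descriptor set. The short Python sketch below illustrates only the mechanics; the ratings are invented for illustration and are not the Richardson, Bellows, Henry, and Co. data.

import numpy as np

# Illustrative sketch only: hypothetical mean ratings for two jobs on a short
# behavioral work element inventory and a short ability requirement inventory.
teller_behaviors   = np.array([5, 4, 5, 1, 2, 4, 5, 1])
new_acct_behaviors = np.array([2, 3, 1, 4, 5, 2, 2, 5])

teller_abilities   = np.array([4, 5, 3, 4, 2, 5])
new_acct_abilities = np.array([4, 4, 3, 5, 2, 5])

def profile_similarity(job_a, job_b):
    # Pearson correlation between two jobs' item-rating profiles.
    return np.corrcoef(job_a, job_b)[0, 1]

print("similarity based on behavioral elements:",
      round(profile_similarity(teller_behaviors, new_acct_behaviors), 2))
print("similarity based on ability requirements:",
      round(profile_similarity(teller_abilities, new_acct_abilities), 2))

With profiles like these, the same pair of jobs can look quite dissimilar under one descriptor and nearly interchangeable under the other, which is the pattern reported in the study.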

A study by Cornelius et al. (1979) reinforces this point and takes it one step further. They analyzed seven nominally different first-level supervisory jobs in chemical processing plants. Hierarchical cluster analysis was done to establish job groupings based on three types of data: task similarity, similarity of Position Analysis Questionnaire profiles, and similarity of ability requirements. Each type of data produced a different pattern of job similarities and a different clustering of jobs. Cornelius et al. properly tell us that purpose will dictate which set of data we should rely on. However, even after this decision has been made, problems remain. Cornelius et al.'s task analysis data, for example, indicate that both five-cluster and three-cluster solutions are feasible. Hierarchical cluster analysis, like other grouping methods, can only establish relative similarity among jobs. In the Cornelius et al. study, if 40 percent of tasks in common is seen as sufficient to label jobs similar, the seven jobs fall into three clusters; if 60 percent of tasks in common is required, the seven jobs fall into five clusters.

The question left unanswered is: given that an appropriate job descriptor has been chosen, how large a difference between jobs on the chosen descriptor is needed to have a significant impact on the criterion of interest? In a selection setting, how different do jobs have to be before validity coefficients are affected? In a training situation, how different do jobs have to be before separate training programs are required? In a performance appraisal situation, how different do jobs have to be before separate performance rating forms need to be constructed? Thus job clustering can only be meaningful with reference to an external criterion.
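The dependence of the solution on the similarity standard can be made concrete with a small hierarchical clustering sketch in Python. The overlap matrix below is hypothetical (the Cornelius et al. data are not reproduced here); the point is only that cutting the same tree at a 40 percent versus a 60 percent tasks-in-common standard yields different numbers of clusters, and nothing in the algorithm itself says which standard is right.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise proportions of tasks in common among seven jobs.
overlap = np.array([
    [1.00, 0.70, 0.65, 0.45, 0.40, 0.20, 0.15],
    [0.70, 1.00, 0.62, 0.48, 0.42, 0.22, 0.18],
    [0.65, 0.62, 1.00, 0.50, 0.45, 0.25, 0.20],
    [0.45, 0.48, 0.50, 1.00, 0.68, 0.30, 0.28],
    [0.40, 0.42, 0.45, 0.68, 1.00, 0.32, 0.30],
    [0.20, 0.22, 0.25, 0.30, 0.32, 1.00, 0.66],
    [0.15, 0.18, 0.20, 0.28, 0.30, 0.66, 1.00],
])

# Treat 1 - overlap as a distance and build an average-linkage tree.
tree = linkage(squareform(1.0 - overlap, checks=False), method="average")

# The minimum overlap required to call jobs "similar" is external to the
# clustering algorithm, yet it determines how many clusters emerge.
for min_overlap in (0.40, 0.60):
    labels = fcluster(tree, t=1.0 - min_overlap, criterion="distance")
    print(f"require {min_overlap:.0%} tasks in common -> {labels.max()} clusters: {labels}")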

In summary, the above discussion highlights a number of concerns about job grouping. First, different descriptors can produce very different job groupings. Second, different levels of specificity within a given general type of descriptor (e.g., tasks) can produce very different job groupings. Third, even if a given type of job descriptor and level of specificity are agreed on, the magnitude of job differences needed to classify jobs differently remains a problem. An external criterion is needed.

The implications of the above discussion for the JPM Project are clear. First, there is reason to expect that different job descriptors will produce different job groupings. The choice of job descriptor should not be a function of the availability of job descriptor data using a particular approach, but rather a function of the type of job descriptor data most closely linked to the purpose for which jobs are being grouped. Second, it must be realized that the two goals of grouping jobs with similar test validities and grouping jobs with similar levels of ability required to ensure a specified level of performance must be treated independently. Grouping jobs based on validity may produce very different job clusters than grouping jobs based on required ability levels. Conceivably these two purposes could require different job descriptors for optimal clustering. Approaches to identifying the appropriate job descriptor for these purposes are discussed in a subsequent section of this paper.

One additional aspect of the choice of the job descriptor merits some discussion, namely, the nature of the data to be collected about the descriptor chosen. Given that a descriptor has been chosen (e.g., specific behaviors or abilities), it is common to ask job experts to rate the importance of each job component. However, “importance” can be conceptualized in a number of ways, three of which are discussed here. Using abilities as an example, one approach to importance is in terms of time: what proportion of total time on the job is spent using the ability in question. The Position Analysis Questionnaire, for example, uses this type of scale for some items. A second approach is in terms of contribution to variance in job performance: to what extent does the ability in question contribute to differentiating the more successful employees from the less successful. The job element approach to job analysis for selection system development uses such a scale. A third approach is in terms of level: what degree of a given ability is needed for successful job performance. Fleishman's Ability Requirement Scales exemplify this approach.

Conceptually, it is clear that these three can be completely independent. The abilities that are used most frequently may be possessed by virtually all incumbents and thus not contribute to variance in job performance. A given ability may contribute equally to variance in job performance in two jobs, yet the level of ability needed may differ dramatically across the jobs. Thus, even if it were agreed that abilities required is the appropriate job descriptor for a given application, operationalizing ability importance as frequency of use, contribution to variance in performance, or level required can lead to different conclusions about job similarity. It would seem logical to hypothesize that judgments about contribution to variance in job performance would be most appropriate for determining the jobs for which a given test should have similar validity, and that judgments about level required would be most appropriate for determining which jobs should have similar test cutoffs.

The distinctions made in the above paragraph are not typically made. In fact, researchers sometimes seem to feel that the choice of the descriptor is all that is important and do not even mention the aspect of the descriptor that is rated. For example, a paper by Cornelius et al. (1984) describes the construction and use of a 26-item ability element battery to group jobs in the petroleum/petrochemical industry. They used the results of this inventory to assign jobs to one of three occupational groups, but did not tell us whether ability was operationalized as frequency of use, contribution to variance in performance, or level required.

The use of one operationalization of importance where another seems better suited is found in Arvey and Begalla's (1975) examination of the job of homemaker. They administered the PAQ to a sample of homemakers and compared the PAQ profile for this position with each of the large number of profiles in the PAQ data base. These comparisons were made for two human resource management purposes: attempting to associate a wage with the homemaker job and making inferences about job transfer and training decisions. Jobs most similar in PAQ profiles were patrolman, home economist, airport maintenance chief, and kitchen helper; a number of supervisory positions followed closely (electrician foreman, gas plant maintenance foreman, fire captain) in the list of the 20 most similar positions. Arvey and Begalla note that a major theme running through many of the occupations listed was a trouble-shooting, emergency-handling orientation. Based on this list of most similar occupations, it is not clear that the goal of identifying jobs amenable to entry by homemakers was met. Arvey and Begalla note this and interpret their findings with appropriate caution.

The predicted salary for the job was $740 per month, in 1969 dollars, which the authors felt was overinflated. They offer distortion of responses, based on a desire on the part of the respondents to make their positions seem more important, as an explanation of the high salary. In light of our discussion of various operationalizations of job element importance, another explanation seems likely: the descriptions provided are accurate (i.e., not intentionally distorted), but the information requested is not well suited to the task at hand. The rating scales used in the PAQ typically reflect time spent: either a direct rating of frequency or a rating of importance, operationalized vaguely as “consider such factors as amount of time spent, the possible influence on overall job performance if the worker does not properly perform the activity, etc.” I would hypothesize that different patterns of similarity would be found if “level required” rather than “time spent” were used to rate items. Conceptually, level required seems better suited to the tasks of identifying jobs amenable to entry by homemakers and setting wage levels. Jobs very similar in the amount of time spent on the PAQ dimension “processing information” may be very different in the level of information processing involved. In short, it is suggested that careful attention be paid both to the selection of the job descriptor and to the operationalization of job element importance.

The following sections of this paper separately address the issues of identifying valid predictors of performance for the universe of MOS and setting minimum standards on these predictors. Multiple potential solutions to the problem are presented.

EXTENDING ASVAB VALIDITY TO THE UNIVERSE OF MOS

Validity Generalization/Meta-Analysis

Validity generalization is a form of meta-analysis. The application of meta-analytic techniques to the examination of predictor-criterion relationships in the selection arena has been labeled validity generalization; the use of the two terms is the result of the parallel development of data cumulation techniques by two groups of researchers—Glass and colleagues (e.g., Glass et al., 1981) and Schmidt, Hunter, and colleagues (e.g., Hunter et al., 1982)—who applied different labels to similar techniques. Note that there are five book-length treatments of these cumulation techniques (Glass et al., 1981; Hunter et al., 1982; Rosenthal, 1984; Cooper, 1984; Hedges, 1985) and a number of thorough and critical treatments of the topic in archival journals (e.g., Schmidt et al., 1985; Sackett et al., 1985; Bangert-Drowns, 1986).

An introduction to validity generalization is in order. For years psychologists have observed that when a given test is validated in different settings, the resulting validity coefficients vary; in some cases the amount of variation is substantial. Historically, the explanation offered for this was that situational factors affected validity. Due to these unspecified situational factors (for example, organizational climate, leadership style, and organizational structure), a test valid in one situation might not be valid in another. Thus arose the doctrine of “situational specificity”: the belief that, because of these factors, one could not safely rely on validity studies done elsewhere but must do a validity study in each new setting.

To understand validity generalization, it is helpful to distinguish between “true validity” and “observed validity.” True validity is the correlation that would be obtained with an infinitely large sample that is perfectly representative of the applicant pool of interest and with a criterion measure that is a perfectly reliable measure of true job performance. Observed validity is the correlation obtained in our research—typically with smaller Ns than preferred, with samples that may not be perfectly representative of the job applicant population, and with less than perfect criterion measures (e.g., supervisory ratings of performance). Historically, researchers have not differentiated between observed validity and true validity: when observed validity differences were found between studies, it was assumed that the differences were real. More recently, it has been suggested that these differences are not real, but simply reflect differences in sample size, criterion reliability, or range restriction. Could it be that true validity does not differ across situations? If it were not for these methodological problems, would validities be the same across studies? These ideas make for interesting speculation; what was needed were ways of testing them.

Validity generalization models are means of testing these ideas: they offer a way of estimating true validity and of assessing how much variability in validity coefficients can be expected due to the methodological problems listed above. The amount of variability in observed validity coefficients is compared with the amount of variability expected due to methodological artifacts: if the expected variability equals or nearly equals the observed variability, one concludes that differences in validities across studies are not real, but merely the result of the effects of these artifacts.

Procedurally, validity generalization ties together a number of well-known psychometric ideas. One starts with a number of validity studies and a validity coefficient for each. For each study, one obtains an estimate of criterion reliability, and each validity coefficient is corrected using the well-known formula for correction for attenuation in the criterion. Each validity coefficient is also corrected for range restriction—the extent to which the sample used in the study has a narrower range of test scores than would be obtained from job applicants—using well-known range restriction formulas. The mean and variance of this distribution of corrected validity coefficients are then computed, and the variance is compared with the variance expected due to sampling error, which is a function of N and the mean validity coefficient. If the variance expected due to sampling error and the variance in corrected validity coefficients are nearly equal, we conclude that validity is not situation specific and that the best estimate of true validity is the mean of the corrected validity coefficients.
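The correction-and-comparison logic just described can be sketched in a few lines of Python. The sketch uses the textbook attenuation and range restriction corrections and the usual sampling-error variance expression; published validity generalization models differ in their weighting schemes and artifact-distribution assumptions, and all of the study values here are hypothetical.

import numpy as np

# Hypothetical studies: observed validity, sample size, criterion reliability,
# and U = SD(applicant pool) / SD(restricted sample).
r_obs = np.array([0.22, 0.31, 0.18, 0.27, 0.25])
n     = np.array([120,  85,   200,  150,  95])
r_yy  = np.array([0.60, 0.55, 0.65, 0.60, 0.50])
U     = np.array([1.4,  1.3,  1.5,  1.2,  1.4])

# Correct each coefficient for criterion unreliability, then for range restriction.
r_att  = r_obs / np.sqrt(r_yy)
r_corr = (U * r_att) / np.sqrt(1.0 + (U**2 - 1.0) * r_att**2)
a = r_corr / r_obs                                  # total correction factor per study

mean_rho = np.average(r_corr, weights=n)            # estimate of mean true validity
var_corr = np.average((r_corr - mean_rho) ** 2, weights=n)

# Sampling-error variance expected in the observed coefficients, carried through the corrections.
mean_r_obs = np.average(r_obs, weights=n)
expected_var = np.average(a**2 * (1.0 - mean_r_obs**2) ** 2 / (n - 1), weights=n)

residual_var = max(var_corr - expected_var, 0.0)
print(f"mean corrected validity: {mean_rho:.3f}")
print(f"share of variance attributable to sampling error: {min(expected_var / var_corr, 1.0):.0%}")
print(f"90% credibility value: {mean_rho - 1.28 * np.sqrt(residual_var):.3f}")

The final line anticipates the generalizability question taken up below: even when some residual variation in validity remains, the low end of the credibility interval may still exceed a minimally acceptable level of validity.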

Validity generalization analyses might appear straightforward under the conditions outlined above. However, if criterion reliability values or information about range restriction is not available for each study, assumptions must be made about what criterion reliability was likely to be, about how much range restriction was likely to have occurred, and about the linearity of the predictor-criterion relationship. These assumptions are critical: if the values assumed are incorrect, the estimated value of true validity can be substantially in error. Furthermore, when range restriction is severe, the extrapolation permitted by these assumptions is tenuous.

A source of confusion in understanding and interpreting validity generalization/meta-analytic research lies in the failure to differentiate between two different statistical tests that can be performed on a set of validity coefficients: tests of the situational specificity hypothesis and of the generalizability hypothesis. The situational specificity hypothesis is rejected when variance in validity coefficients is essentially zero after correcting for artifacts. Rejecting this hypothesis implies accepting the hypothesis that true validity is virtually constant for the job/test combination under consideration. The generalizability hypothesis is less stringent. It involves the recognition that even if one fails to reject the situational specificity hypothesis, and thus acknowledges that validity varies across jobs, it is still possible that even the low end of the distribution of validity coefficients is of a magnitude sufficient to consider the test useful. Thus, if one's interest is not in a point estimate of validity for a given situation but simply in the assurance that test validity will be above a level considered minimally acceptable, one can accept the generalizability hypothesis if the low end of a confidence interval around mean validity exceeds this level.

Schmidt and Hunter have asserted that cognitive ability tests are valid for all jobs (Hunter and Hunter, 1984). Some have interpreted this as implying that tests are equally valid for all jobs. This misinterpretation is based on confusing the situational specificity hypothesis with the generalizability hypothesis. Schmidt and Hunter's statements involve accepting the generalizability hypothesis (i.e., that the validity of cognitive tests is positive and nonzero for all jobs). While validity generalization research with cognitive ability tests shows quite strongly that there is little to no variation in true validity within individual job/test combinations, it is very clear that the validity of cognitive ability tests does vary across jobs. One of the clearest illustrations of this is found in a study by Schmidt et al. (1981). For a sample of 35 Army jobs, validity coefficients for the 10 subtests of the Army Classification Battery

extent to which each of these factors moderates validity could be examined. However, there is some basis for predicting the outcome of such an effort. First, the validity generalization literature discussed earlier has led to the recognition that within a class of jobs, such as clerical work, differences in the specific behaviors performed do not have a substantial influence on validity. Commonality of underlying abilities required leads to similar validity despite lack of overlap in specific behaviors performed. This leads to the hypothesis that the more general approaches, namely, ability requirements or global descriptors, are better candidates. Second, successful attempts at examining moderators of validity across diverse jobs have used general rather than molecular job descriptors. Hunter (1980) found that the regression weights for using a general cognitive ability composite to predict performance increased from .07 to .40, moving from the lowest to the highest levels of his job complexity scale, in a sample of 515 GATB validity studies; similarly, the regression weights for psychomotor ability decreased from .46 to .07, moving from the lowest to the highest levels of the complexity scale. The PAQ dimensions used successfully by Gutenberg et al. (decision making and information processing) are among the most “ability-like” of the PAQ dimensions. Third, as Pearlman (1980) notes, the more molecular approaches lack the isomorphism with the individual differences variables being considered as predictors of performance that is found with the molar approaches. Isomorphism between job descriptor constructs and predictor constructs is conceptually elegant, making for a readily explainable and interpretable system. Thus, while isomorphism is by no means a requirement for a successful approach to examining moderators of validity, it is certainly a virtue if such an approach proves viable. Pearlman suggests ability requirements as the descriptor to be used for job grouping for validity purposes.

Therefore, I would suggest that ability requirements and global job complexity be considered as moderators of validity. Fleishman's ability requirement scales (Fleishman and Quaintance, 1984) seem particularly worthy of consideration due to the extensive research leading to the development of the scales and the care taken in the definition of each ability. A separate rating scale is provided for each ability, containing a general definition of the ability, definitions of the high and low ends of the scale, a description of how the ability differs from other abilities, and illustrative tasks for various levels of the ability. For example, low, medium, and high levels of the ability “verbal comprehension” are illustrated by “understand a comic book,” “understand a newspaper article in the society section reporting on a recent party,” and “understand in its entirety a mortgage contract for a new home.”

Recall our earlier discussion of possible operationalizations of the importance of a given ability: time spent using the ability, contribution of the ability to variance in performance, and level of the ability required. The Fleishman scales clearly fall into the third category. Conceptually, this third category—level required—seems better suited as a moderator of predictor cutoffs than of validity. The second—contribution to variance in performance—seems better suited to the task at hand. Thus, a separate importance rating, explicitly defining importance as contribution to variance, might be obtained along with the level required rating. Therefore, it is suggested that ratings of each of the 27 project MOS be obtained using the Fleishman scales with the modification discussed above. Ratings should be made by a number of independent raters to achieve adequate reliability, and existing task analyses of these MOS should aid the rating process. If rated ability requirements are found to moderate validity, predictions of validity for new MOS can then be made.

J-coefficient

A wide variety of algebraically equivalent versions of the J-coefficient are available (see Hamilton, 1981; Primoff, 1955; Urry, 1978). Trattner (1982) describes the J-coefficient as the correlation of a weighted sum of standardized work behavior scores with a test score. Exactly what constitutes these standardized work behaviors, or job elements, varies across J-coefficient applications. In other words, the J-coefficient is a means of estimating the correlation between a test and a composite criterion. The computation of a J-coefficient for a given predictor requires (1) the correlation between the predictor and each criterion dimension, (2) the intercorrelations among the criterion dimensions, and (3) importance weights for each criterion dimension.

It is critical to note that the importance of various criterion constructs is a policy issue as well as a scientific one. I take issue with the notion that there is such a thing as “true” overall performance. Consider, for example, two potential dimensions of military job performance: current job knowledge and performance under adverse conditions. It does not seem unreasonable that a policy directive to emphasize combat readiness would increase the importance attached to the second relative to the first. Presuming a lack of perfect correspondence between individuals' standing on the two criterion constructs, the rank order of a group of individuals on a composite criterion would change; which order is “right” reflects policy priorities. Thus the scientific contribution is to identify predictors of each criterion construct; for any given set of weighted criteria we can then estimate the validity of a selection system.
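Trattner's description corresponds to the standard expression for the correlation between a test and a weighted composite of standardized criterion dimensions, r = w'r_xy / sqrt(w'R_yy w). The Python sketch below computes that quantity with hypothetical inputs; the many published J-coefficient variants differ mainly in how the three ingredients are obtained.

import numpy as np

def j_coefficient(r_xy, R_yy, w):
    # Correlation of a test with a weighted sum of standardized criterion dimensions.
    #   r_xy : test-dimension correlations, one per criterion dimension
    #   R_yy : criterion intercorrelation matrix (unit diagonal)
    #   w    : importance weights for the criterion dimensions
    r_xy, w = np.asarray(r_xy, float), np.asarray(w, float)
    R_yy = np.asarray(R_yy, float)
    return float(w @ r_xy / np.sqrt(w @ R_yy @ w))

# Hypothetical example with three criterion dimensions.
r_xy = [0.40, 0.25, 0.10]
R_yy = [[1.0, 0.5, 0.3],
        [0.5, 1.0, 0.4],
        [0.3, 0.4, 1.0]]

print(j_coefficient(r_xy, R_yy, w=[0.5, 0.3, 0.2]))   # one weighting policy
print(j_coefficient(r_xy, R_yy, w=[0.2, 0.6, 0.2]))   # weight shifted toward dimension 2

Changing the weights changes the estimated validity of the same test against the same dimensions, which is the sense in which the validity of a selection system against a composite criterion reflects policy priorities.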

The relevance of the J-coefficient to this project lies mainly in the contribution the approach can make to the issue of determining the correlation between each predictor and each criterion construct. The J-coefficient formulas, of course, accept any validity estimate; users of the approach typically rely on judgments of the relevance of test items or entire tests for each criterion construct. As judgmental approaches to validity estimation will be discussed separately, no further attention is needed for the J-coefficient itself.

Judgmental Estimates of Validity

Recent research has reported considerable success in obtaining validity estimates by pooling the direct judgments of test validity across a number of judges. Schmidt et al. (1983) and Hirsh et al. (1986) asked psychologists to provide direct estimates of the validity of six subtests of the Naval Basic Test Battery, using training performance as the criterion, for a set of nine jobs. The jobs were selected because of the availability of criterion-related validity studies with sample sizes greater than 2,000, thus providing a standard against which the judgments could be compared that was virtually free of sampling error. Schmidt et al. used experienced psychologists as judges and found that the pooled judgment of ten psychologists deviated on average from the true value by the same amount as would be expected in a criterion-related validity study with a sample size of 673. In other words, this pooled judgment provided a far better estimate of validity than all but the largest-scale validity studies. In contrast, Hirsh et al. used the same job/test combinations with a sample of new Ph.D.s and found that the judgment of a single experienced Ph.D. was as accurate as the pooled judgment of ten new Ph.D.s. The pooled judgment of ten new Ph.D.s proved as accurate as a validity study with a sample size of 104.
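One way to read an "as accurate as a study with N = 673" statement is to translate the average absolute error of the pooled judgments into the sample size whose sampling error would produce the same average error. The Python sketch below does this under simple normal-theory approximations (SD of an observed r of roughly (1 - rho^2)/sqrt(N - 1), and a mean absolute error of sqrt(2/pi) times that SD); it is offered as an illustration of the idea, not as the computation Schmidt et al. (1983) actually performed, and the inputs are hypothetical.

import math

def equivalent_n(mean_abs_error, rho):
    # Sample size whose sampling error in r would match the judges' average error.
    sd_needed = mean_abs_error / math.sqrt(2.0 / math.pi)
    return 1.0 + ((1.0 - rho**2) / sd_needed) ** 2

# Hypothetical: pooled judgments off by .03 on average, true validity near .25.
print(round(equivalent_n(mean_abs_error=0.03, rho=0.25)))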

The differences found between experienced and inexperienced judges are of great interest. Schmidt et al. (1983) attribute the success of experienced judges to their experience in conducting validation research and in accumulating information about validity research done by others. This line of reasoning suggests that even experienced judges will not be successful in estimating validity for predictor/criterion combinations for which little prior information is available. An alternative explanation for the success of experienced judges is that it is simply due to broader experience with the world of work: they have spent more time in the workplace and have better insights into job requirements. Thus, even for predictor/criterion combinations for which no validity evidence exists at present, they may be able to make accurate judgments. Note that in the J-coefficient literature there is evidence that job incumbents' judgments of test-criterion relationships are predictive of empirical validity results, suggesting that work experience, rather than test validation experience, may be the critical factor.

Thus there is some basis for positing both that experienced psychologists will be able to estimate validity for a wide variety of predictor-criterion combinations and that experienced nonpsychologists, such as job incumbents, may also be able to do so. Panels of psychologists and experienced incumbents could be assembled and asked to make validity judgments. As nonpsychologists are not likely to be comfortable with estimating validity coefficients, ratings of the importance of the predictor construct for differentiating between high and low performers on the criterion construct could be obtained, and the correspondence between these ratings and empirical validity coefficients determined. Schmidt et al.'s (1983) speculation that the success of experienced psychologists is a function of their memory of validation results for other jobs could also be examined. Both the psychologist and incumbent samples could first be asked to estimate validity for five MOS included in the Job Performance Measurement Project in the absence of any information about project results and then asked to estimate validity for five additional MOS. For these additional MOS the judges would be presented with Job Performance Measurement Project validity results for the other 22 MOS to serve as anchors for their judgments. Thus, the impact of information about predictor-criterion relationships for other specialties on the accuracy of validity judgments could be examined.

Paired Comparison Judgments of Validity

An alternative approach to estimating validity judgmentally is the use of paired comparison judgments. Rather than estimating validity directly, judges could be presented with pairs of occupational specialties and asked, for each predictor-criterion combination, which specialty in the pair has the higher validity coefficient. Paired comparison judgments could be obtained from psychologists for 20 of the 27 MOS in the Job Performance Measurement Project data base. These judgments could be pooled across raters and scaled, and the scaling solution then compared with the obtained validity coefficients from the project. If a substantial degree of correspondence were found between the scaling solution and the obtained validity coefficients, then validity estimates for new MOS could be produced by obtaining paired comparison judgments comparing the new MOS with those for which validity is known and thus mapping the new MOS into the scaling solution. The seven holdout MOS could be used to demonstrate the viability of this approach. This approach depends on the assumption that the JPM Project data base includes the full range of MOS, such that the scale points represent the full range of validity coefficients likely to be obtained. Note that this approach demands that judges be very knowledgeable about all MOS involved in the judgments. However, a complete set of judgments is not needed from each judge; each judge can rate partially overlapping sets of MOS. Finally, note that with modification of existing software, this judgment task can be administered by computer.
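Pooled paired comparison judgments of this kind are conventionally scaled with a Thurstone Case V solution, in which each specialty's scale value is the mean of the normal transforms of the proportions of judges preferring it in each pairing. The Python sketch below illustrates that step for four specialties with hypothetical proportions; it is one reasonable scaling choice rather than a prescribed procedure.

import numpy as np
from scipy.stats import norm

# Hypothetical pooled judgments: p[i, j] is the proportion of judges saying
# specialty j has the higher validity than specialty i (diagonal set to .5).
p = np.array([
    [0.50, 0.70, 0.80, 0.60],
    [0.30, 0.50, 0.65, 0.40],
    [0.20, 0.35, 0.50, 0.30],
    [0.40, 0.60, 0.70, 0.50],
])

z = norm.ppf(p)            # normal transform of the preference proportions
scale = z.mean(axis=0)     # Case V least-squares scale values
scale -= scale.min()       # anchor the lowest specialty at zero
print(np.round(scale, 2))

# The scale values would then be compared with the validity coefficients observed
# for these specialties; a new MOS is placed on the scale by judging it against
# specialties whose validities are already known.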

Determination of Minimum Standards

The determination of minimum cutoff scores has been and remains a problem for which no simple or agreed-upon solution exists. Many approaches to setting cutoffs have been identified (e.g., Buck, 1977; Campion and Pursell, 1980; Drauden, 1977; Nedelsky, 1954; Guion, 1965). One thing all have in common is a subjective component: setting a cutting score requires a value judgment (Cronbach, 1949). Much of the discussion of cutoff scores occurs in the context of either achievement testing in an educational setting or the use of content-valid work sample or knowledge tests in public sector employment settings. In both of these settings one is typically setting a predictor cutoff in the absence of criterion information. Judgments about the expected test item performance of minimally satisfactory performers are typically combined to identify minimum test cutoffs. Without criterion data, a standard against which these techniques for setting cutoffs can be evaluated is lacking.

With criterion data and a large sample size, a different type of approach is possible. Based on expert judgment, the minimum acceptable level of criterion performance is identified, and the regression equation relating predictor and criterion is used to identify the predictor score corresponding to this minimum level of acceptable performance. Given the probabilistic nature of predictor-criterion relationships, some individuals scoring above this cutoff will fail and some individuals scoring below this cutoff will succeed. The relative value assigned by the organization to each of these types of prediction error will influence the choice of the actual cutoff.
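The regression-based procedure is simple to state concretely: fit the predictor-criterion regression, then invert it at the judged minimum acceptable criterion level. The Python sketch below uses simulated data and a hypothetical minimum; the check at the end illustrates why the relative costs of the two kinds of prediction error still matter when choosing the operational cutoff.

import numpy as np

rng = np.random.default_rng(0)

# Simulated predictor (e.g., an aptitude composite) and criterion scores.
predictor = rng.normal(100, 20, size=500)
criterion = 10 + 0.5 * predictor + rng.normal(0, 12, size=500)

# Least-squares regression of criterion on predictor.
slope, intercept = np.polyfit(predictor, criterion, deg=1)

# Panel-judged minimum acceptable criterion performance (hypothetical value).
minimum_criterion = 55.0

# Predictor score whose predicted criterion equals the judged minimum.
cutoff = (minimum_criterion - intercept) / slope
print(f"regression-based predictor cutoff: {cutoff:.1f}")

# Because prediction is probabilistic, some people above the cutoff still fall
# short of the minimum (and some below it would have succeeded).
selected = predictor >= cutoff
print(f"proportion of selectees below the criterion minimum: "
      f"{np.mean(criterion[selected] < minimum_criterion):.0%}")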

This approach could be applied to all predictor-criterion combinations for the 27 MOS included in the project. Panels of officers directly supervising individuals in each of the 27 MOS could be convened to reach consensus on the minimum acceptable level of performance on each performance construct. Two panels of five for each MOS would provide a group size conducive to consensus decision making and allow a comparison of the judgments of two independent panels. Thus, for each predictor-criterion combination for each MOS, the predictor score corresponding to minimum acceptable criterion performance could be identified. The availability of these predictor scores would provide a standard against which techniques for setting cutoff scores can be evaluated, even in situations in which empirical predictor and criterion data are not available (e.g., new MOS).

Three techniques for identifying predictor cutoffs are examined. They directly parallel the techniques proposed for estimating validity for new MOS. Each is discussed in turn.

Earlier we discussed the use of a synthetic validity approach to examining variance in validity coefficients across MOS. Examples of this approach using the PAQ were examined, and recommendations were made that ability requirements, rather than general or specific job behaviors, be used as the standardized job descriptor. In applying this approach to setting cutoff scores, previous research has reported a high degree of success (a median correlation of .73) in using PAQ dimension scores as predictors of the mean test scores obtained by job incumbents (Mecham et al., 1977). The Mecham et al. work is based on what they call the “gravitational hypothesis,” namely, that people gravitate to jobs commensurate with their abilities. Mecham et al. advocate cutoff scores based on the predicted mean test score (e.g., a cutoff of 1 or 2 standard deviations below the predicted mean). This approach is conceptually meaningful only when individuals are free to gravitate to particular jobs. If a test is used to assign individuals to jobs, the mean test score merely reflects the organization's a priori assumptions about the job, rather than revealing anything about needed ability levels. Thus, in the military context, predicting mean test score is not very informative.

However, the general strategy of using job information (e.g., PAQ dimensions) to predict needed predictor construct scores can be applied by substituting, for the mean predictor score, the regression-based predictor score corresponding to the needed minimum level of criterion performance. Again, given 45 PAQ dimensions and 27 MOS, judgments of the PAQ dimensions most likely to be predictive of the needed predictor construct scores could be obtained for each predictor construct to achieve a reasonable ratio of predictors to cases. For each predictor construct, regression equations using five PAQ dimensions to predict the needed predictor construct scores could be computed for 20 MOS; the resulting equations could then be applied to the 7 holdout MOS.

As was the case in considering moderators of validity coefficients, alternatives to the PAQ as the job descriptor of choice should be considered. Without repeating the earlier discussion, the Fleishman ability scales seem particularly well suited to this task: the explicit measurement of the level of ability required links directly to the task of setting predictor cutoffs.
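A bare-bones Python sketch of that generalization step follows: regress the regression-based cutoffs for 20 calibration MOS on a handful of job-descriptor scores (PAQ dimensions or Fleishman ability ratings), then apply the fitted equation to the holdout MOS and check the correspondence. Everything numeric below is a simulated placeholder; only the 20/7 split and the use of five descriptor scores per predictor construct come from the text.

import numpy as np

rng = np.random.default_rng(1)

n_mos, n_dims = 27, 5                          # 27 project MOS, 5 descriptor dimensions
descriptor = rng.normal(size=(n_mos, n_dims))  # stand-in dimension scores per MOS
cutoff = 50 + descriptor @ np.array([4.0, 2.5, 1.5, 0.5, 3.0]) + rng.normal(0, 2, n_mos)

fit_idx, holdout_idx = np.arange(20), np.arange(20, 27)

# Ordinary least squares on the 20 calibration MOS (intercept column included).
X_fit = np.column_stack([np.ones(20), descriptor[fit_idx]])
beta, *_ = np.linalg.lstsq(X_fit, cutoff[fit_idx], rcond=None)

# Apply the fitted equation to the 7 holdout MOS and check the correspondence.
X_hold = np.column_stack([np.ones(7), descriptor[holdout_idx]])
predicted = X_hold @ beta

print("predicted vs. panel-derived cutoffs for the holdout MOS:")
print(np.round(np.column_stack([predicted, cutoff[holdout_idx]]), 1))
print("holdout correlation:", round(float(np.corrcoef(predicted, cutoff[holdout_idx])[0, 1]), 2))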

The second approach involves direct estimates of minimum predictor cutoffs. While many approaches to setting cutoffs are based on judgments about predictors (Buck, 1977), such approaches typically involve judgments at the test item level (e.g., the judged likelihood that each response option will be chosen by minimally qualified applicants). Such approaches are conceptually more meaningful when dealing with achievement tests, such as those used in educational settings, than with ability, interest, or biodata measures. Thus, rather than aggregating item-level judgments, direct judgments of minimum predictor cutoffs could be examined. As in the case of direct estimates of validity, panels of psychologists and incumbents could be convened to estimate needed cutoff scores for 5 MOS in the absence of information about cutoff scores for other MOS and then make estimates for 5 additional MOS with access to the regression-based predictor cutoffs for the other 22 MOS.

Finally, a paired comparison process similar to that proposed for validity estimation could be examined. Psychologists could be asked to judge which of a pair of MOS requires the higher predictor score for a given predictor-criterion combination. Judgments could be obtained for all pairs of 20 project MOS; these judgments could be scaled and compared with the regression-based predictor cutoffs. Each of the 7 holdout MOS could then be compared with the MOS for which cutoffs are known and the results mapped into the scaling solution to produce cutoff estimates for the holdout MOS.

CONCLUSION

In retrospect, this paper may be mistitled. The focus has not been on clustering per se, but rather on exploring possible approaches to extending validity findings and empirically based predictor cutoffs beyond the 27 MOS included in the Job Performance Measurement Project. No single best approach has been identified; rather, a number of possibilities have been examined.

A critical question is whether point estimates of validity are needed for the various MOS, or whether all that is needed is confidence that the predictors in question have meaningful levels of validity for the various MOS. If the second will suffice, the dual strategy of conducting a meta-analysis of the validity studies correlating ASVAB subtests and composites with hands-on performance measures and conducting a meta-analysis of hands-on performance-training performance correlations should provide a clear picture of ASVAB validity for the universe of MOS. The analysis of correlations between the ASVAB and hands-on measures is expected, at least by this author, to produce a pattern of findings similar to that of meta-analyses of cognitive ability tests using training or rating criteria; the expected strong relationship between hands-on measures and training criteria provides a link to the larger body of validity studies using training criteria.

CONCLUSION

In retrospect, this paper may be mistitled. The focus has not been on clustering per se, but rather on exploring possible approaches to extending validity findings and empirically based predictor cutoffs beyond the 27 MOS included in the Job Performance Measurement Project. No single best approach has been identified; rather, a number of possibilities have been examined.

A critical question is whether point estimates of validity are needed for various MOS, or whether all that is needed is confidence that the predictors in question have meaningful levels of validity for various MOS. If the latter will suffice, the dual strategy of conducting a meta-analysis of the validity studies correlating ASVAB subtests and composites with hands-on performance measures and a meta-analysis of the correlations between hands-on performance and training performance should provide a clear picture of ASVAB validity for the universe of MOS. The analysis of correlations between the ASVAB and hands-on measures is expected, at least by this author, to produce a pattern of findings similar to that of meta-analyses of cognitive ability tests using training or rating criteria; the expected strong relationship between hands-on measures and training criteria provides a link to the larger body of validity studies using training criteria.

If point estimates of validity are needed, a number of possibilities have been proposed: synthetic validity, direct estimation of validity, and paired comparison judgments of job similarity. Each could be attempted, and the relative validity, cost, and ease of use of each could be examined. Considerable attention was paid to the choice of job descriptor, as the synthetic validity approach involves regressing validity coefficients on standardized job descriptors. Conceptual arguments as well as empirical data were reviewed dealing with the choice of specific behavior, general behavior, ability requirements, and global job information as the job descriptor. While the choice can be viewed as an empirical question to be answered by analyzing the 27 MOS involved in the project using multiple job analytic systems, a strong argument was made for using general, rather than molecular, job descriptors, with particular attention paid to ability requirements as the descriptor of choice.
Each of these three approaches to generalizing point estimates of validity was seen as applicable, with minor modification, to the issue of establishing predictor cutoffs.

As indicated in the opening section of this paper, attention has not been paid to quantitative procedures for grouping jobs. The concern with both the descriptive and the inferential grouping methods was that groupings were made on the basis of the relative similarity of jobs. What was lacking was an external criterion for determining whether jobs were sufficiently similar that they could be treated the same for the purpose at hand. Data showing that job A is more similar to job B than to job C are not useful without a basis for knowing whether the magnitude of the differences between the jobs is large enough to require that the jobs be treated differently. The synthetic validity approaches discussed in this paper offer the needed criterion: the magnitude of difference on an ability requirement scale needed to produce a change in cutoff score of a given magnitude can be determined and then used to guide clustering decisions.

Hierarchical clustering procedures produce a full range of possible clustering solutions, from each job as an independent cluster to all jobs grouped in one large cluster. At each interim stage, the size of within-cluster differences can be determined; with information as to the magnitude of differences needed to affect the personnel decision in question, one has a basis for informed decisions as to the appropriate number of clusters to retain and as to which jobs can be treated the same for the purpose at hand.
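A minimal sketch of this logic follows. The ability-requirement ratings, the conversion between profile distance and cutoff-score shift, the distance metric, and the complete-linkage method are all hypothetical choices made for illustration; the sketch uses SciPy's hierarchical clustering routines.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)

# Hypothetical ability-requirement profiles: 27 MOS rated on 10 ability scales.
profiles = rng.normal(size=(27, 10))

# Hypothetical decision rule: suppose the synthetic validity equations imply
# that one unit of profile distance shifts the regression-based cutoff by
# about half a score point, and that shifts under 2 points are negligible
# for the personnel decision in question.
points_per_distance_unit = 0.5
tolerable_cutoff_shift = 2.0
max_within_cluster_distance = tolerable_cutoff_shift / points_per_distance_unit

# With complete linkage, the height of each merge bounds the largest distance
# between members of the merged cluster, so cutting the tree at the threshold
# yields clusters whose members can be treated alike for cutoff setting.
tree = linkage(pdist(profiles, metric="euclidean"), method="complete")
labels = fcluster(tree, t=max_within_cluster_distance, criterion="distance")
print("Number of clusters retained:", int(labels.max()))
```

The same tree could be cut at several thresholds to show how sensitive the number of retained clusters is to the size of cutoff-score shift judged tolerable.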

REFERENCES

Arvey, R.D., and M.E. Begalla
1975 Analyzing the homemaker job using the Position Analysis Questionnaire. Journal of Applied Psychology 60:513-517.
Bangert-Drowns, R.L.
1986 Review of developments in meta-analytic method. Psychological Bulletin 99:388-399.
Buck, L.S.
1977 Guide to the setting of appropriate cutting scores for written tests: a summary of the concerns and procedures. Washington, D.C.: U.S. Civil Service Commission, Personnel Research and Development Center.
Campion, M.A., and E.D. Pursell
1980 Adverse impact, expected job performance, and the determination of cut scores. Paper presented at the meeting of the American Psychological Association, Montreal, August.
Cascio, W.
1982 Applied Psychology in Personnel Management. Reston, Va.: Reston Publishing.
Cooper, H.M.
1984 The Integrative Research Review. Beverly Hills, Calif.: Sage Publications.
Cornelius, E.T., T.J. Carron, and M.N. Collins
1979 Job analysis models and job classification. Personnel Psychology 32:693-708.
Cornelius, E.T., F.L. Schmidt, and T.J. Carron
1984 Job classification approaches and the implementation of validity generalization results. Personnel Psychology 37:247-260.
Cronbach, L.J.
1949 Essentials of Psychological Testing, 3rd ed. New York: Harper and Row.
Drauden, G.
1977 Setting Passing Points. Minneapolis, Minn.: Minneapolis Civil Service Commission.
Fleishman, E.A., and M.K. Quaintance
1984 Taxonomies of Human Performance. Orlando, Fla.: Academic Press.
Glass, G.V., B. McGaw, and M.L. Smith
1981 Meta-analysis in Social Research. Beverly Hills, Calif.: Sage Publications.
Guion, R.M.
1965 Personnel Testing. New York: McGraw-Hill.
Gutenberg, R.L., R.D. Arvey, H.G. Osburn, and P.R. Jeanneret
1983 Moderating effects of decision-making/information-processing job dimensions on test validities. Journal of Applied Psychology 68:602-608.
Hamilton, J.W.
1981 Options for small sample sizes in validation: a case for the J-coefficient. Personnel Psychology 34:805-816.
Harvey, R.J.
1986 Quantitative approaches to job classification: a review and critique. Personnel Psychology 39:267-289.
Hawk, J.
1970 Linearity of criterion-GATB aptitude relationships. Measurement and Evaluation in Guidance 2:249-251.
Hedges, L.V.
1985 Statistical Methods for Meta-analysis. New York: Academic Press.
Hirsh, H.R., F.L. Schmidt, and J.E. Hunter
1986 Estimation of employment validities by less experienced judges. Personnel Psychology 39:337-344.
Hunter, J.E.
1980 Test validation for 12,000 jobs: an application of synthetic validity and validity generalization to the GATB. Washington, D.C.: U.S. Employment Service, U.S. Department of Labor.
1986 Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior 29:340-362.
Hunter, J.E., J.J. Crosson, and D.H. Friedman
1985 The Validity of the ASVAB for Civilian and Military Job Performance. Rockville, Md.: Research Applications, Inc.
Hunter, J.E., and R.F. Hunter
1984 Validity and utility of alternative predictors of job performance. Psychological Bulletin 96:72-98.
Hunter, J.E., F.L. Schmidt, and G.B. Jackson
1982 Meta-analysis: Cumulating Research Findings Across Studies. Beverly Hills, Calif.: Sage Publications.
Lawshe, C.H.
1952 What can industrial psychology do for small business? Personnel Psychology 5:31-34.
McCormick, E.J., P.R. Jeanneret, and R.C. Mecham
1972 A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire. Journal of Applied Psychology Monograph 56:347-368.

Mecham, R.C., E.J. McCormick, and P.R. Jeanneret
1977 Position Analysis Questionnaire Technical Manual, System II. PAQ Services, Inc. (Available from University Book Store, 360 W. State St., West Lafayette, IN 47906.)
Mossholder, K.W., and R.D. Arvey
1984 Synthetic validity: a conceptual and comparative review. Journal of Applied Psychology 69:322-333.
Nedelsky, L.
1954 Absolute grading standards for objective tests. Educational and Psychological Measurement 14:3-19.
Pearlman, K.
1980 Job families: a review and discussion of their implications for personnel selection. Psychological Bulletin 87:1-28.
Pearlman, K., F.L. Schmidt, and J.E. Hunter
1980 Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology 65:373-406.
Primoff, E.S.
1955 Test Selection by Job Analysis. Technical Test Series, No. 20. Washington, D.C.: U.S. Civil Service Commission, Standards Division.
1975 How to Prepare and Conduct Job Element Examinations. Washington, D.C.: U.S. Civil Service Commission, Personnel Research and Development Center.
Raju, N.S.
1986 An evaluation of the correlation, covariance, and regression slope models. Paper presented at the meeting of the American Psychological Association, Washington, D.C.
Richardson, Bellows, Henry, and Co.
1983 Technical Reports: The Candidate Profile Record. Washington, D.C.
Rosenfeld, M., B. Shimberg, and R.F. Thornton
1983 Job Analysis of Licensed Psychologists in the United States and Canada. Princeton, N.J.: Center for Occupational and Professional Assessment, Educational Testing Service.
Rosenthal, R.
1984 Meta-analytic Procedures for Social Research. Beverly Hills, Calif.: Sage Publications.
Sackett, P.R., N. Schmitt, M.L. Tenopyr, J. Kehoe, and S. Zedeck
1985 Commentary on “40 questions about validity generalization and meta-analysis.” Personnel Psychology 38:697-798.
Sackett, P.R., M.M. Harris, and J.M. Orr
1986 On seeking moderator variables in the meta-analysis of correlation data: a Monte Carlo investigation of statistical power and resistance to Type I error. Journal of Applied Psychology 71:302-310.
Schmidt, F.L., J.E. Hunter, and K. Pearlman
1981 Task differences as moderators of aptitude test validity in selection: a red herring. Journal of Applied Psychology 66:166-185.
Schmidt, F.L., J.E. Hunter, P.R. Croll, and R.C. McKenzie
1983 Estimation of employment test validities by expert judgment. Journal of Applied Psychology 68:590-601.
Schmidt, F.L., J.E. Hunter, K. Pearlman, and H.R. Hirsh
1985 Forty questions about validity generalization and meta-analysis. Personnel Psychology 38:697-798.
Schneider, B., and N. Schmitt
1986 Staffing Organizations. Glenview, Ill.: Scott-Foresman.

Trattner, M.H.
1982 Synthetic validity and its application to the uniform guidelines validation requirements. Personnel Psychology 35:383-397.
Urry, V.W.
1978 Some variations on derivation by Primoff and their extensions. Washington, D.C.: U.S. Civil Service Commission, Personnel Research and Development Center.