8

Evaluating the Quality of Performance Measures: Criterion-Related Validity Evidence

Investigations of empirical relationships between test scores and criterion measures (e.g., training grades, supervisor ratings, job knowledge test scores) have long been central to the evaluation and justification of using test scores to select and classify personnel in both civilian and military contexts. Such investigations, commonly known as criterion-related validity studies, seek evidence that performance on criteria valued by an organization can be predicted with a useful degree of accuracy from test scores or other predictor variables. The implications of a criterion-related validity study depend only secondarily, however, on the strength of the statistical relationship that is obtained. They depend first and foremost on the validity and acceptance of the criterion measure itself.

As summarized by the Office of the Assistant Secretary of Defense—Force Management and Personnel (1987), the program of criterion-related validity studies conducted by the Services in the past was generally based on the statistical relationship between aptitude test scores (i.e., the ASVAB) and performance in military training. Performance in basic and technical training has been the traditional criterion with which the Services validate their selection and classification measures because the data are credible, reasonably reliable, and available.

Training criteria are certainly relevant to the mission of the Services. Failures in training are expensive, and a logical case can be made that training outcomes should be related to performance on the job.

The weakness of training outcomes, however, is that they tend to be primarily scores on paper-and-pencil tests of cognitive knowledge that do not tap many of the important aspects of job proficiency, such as psychomotor ability or problem-solving skills (Office of the Assistant Secretary of Defense—Force Management and Personnel, 1987:1-2).

Concerns about the criterion measures that were most commonly used to validate the ASVAB in the past provided much of the motivation for the JPM Project, the goals of which, as has been noted, “are to (1) develop prototype methodologies for the measurement of job performance; and (2) if feasible, link enlistment standards to on-the-job performance” (Office of the Assistant Secretary of Defense—Force Management and Personnel, 1987:3). Realization of the first of these goals would make possible the use of measures of job performance in criterion-related validity studies, the results of which are a necessary, albeit not sufficient, condition for realizing the second goal of the project.

OVERVIEW OF CRITERION-RELATED VALIDATION

Criterion Constructs: Measurement and Justification

Given an adequate criterion measure, the criterion-related test validation paradigm—though subject to a variety of technical complications that will be considered below—is conceptually straightforward. Basically, empirical evidence is needed to establish the nature and degree of relationship between test scores and scores on the criterion measure. But the opening phrase of this paragraph assumes away the “criterion problem,” which is the most fundamental problem in criterion-related validation research. Indeed, as Gottfredson (Vol. II:1) points out, it is one of the most important but most difficult problems of personnel research.

Some 40 years ago, Thorndike (1949) defined the “ultimate criterion” as “the complete final goal of a particular type of selection or training” (p. 121). Thorndike's ultimate criterion is an abstraction that is far removed from actual criterion measures. The notion of the ultimate criterion, however, provides a useful reminder that measures that can be realized in practice are only approximations of the conceptual criteria of interest. The value of a criterion-related study depends on the closeness of the criterion measure to this conceptual ultimate criterion.

The conceptual criterion of interest for the JPM Project is actual on-the-job performance. The reasons for this choice are evident. Among the justifications that might be presented for the use of a test to select or classify applicants, none is apt to be more persuasive or intuitively appealing than the demonstration that test scores predict actual on-the-job performance. Like Thorndike's ultimate criterion, however, actual on-the-job performance is not something that can be simply counted or scored and then correlated with test scores. Rather, as described in Chapter 4, Chapter 6, and Chapter 7, measures of job performance must be developed and justified. They must be accepted as valid, reliable, and relevant to the goals of the Services before they can serve as the criteria by which the validity of aptitude tests will, in turn, be judged. It is for these reasons that previous chapters have devoted so much attention to the development, validation, and assessment of the reliability of job performance measures.

There is no need to repeat the discussion of previous chapters regarding the evaluation of the quality of criterion measures. However, two threats to the validity of any criterion measure deserve special emphasis here and will guide the discussion of specific criterion measures in subsequent sections. Criterion contamination occurs when the criterion measure includes aspects of performance that are not part of the job or when the measure is affected by “construct-irrelevant” (Messick, 1989) factors that are not part of the criterion construct. Criterion deficiency occurs when the criterion measure fails to include or underrepresents important aspects of the criterion construct.

Criterion contamination and criterion deficiency are illustrated by training criteria, whose weaknesses were acknowledged by the Office of the Assistant Secretary of Defense—Force Management and Personnel (1987). Training grades, which are based largely on written tests of cognitive knowledge about the job, may be contaminated by a greater dependence on certain cognitive abilities, such as verbal ability, than is true of actual on-the-job performance. And training measures may be deficient if they leave out tasks that require manipulation of equipment that may be crucial to successful job performance.

Concerns about possible criterion contamination and deficiency are not limited to measures of training performance. A hands-on job performance measure, for example, might lack validity because it represents only a small, or atypical, fraction of the important tasks that an individual is required to perform on the job (criterion deficiency). Ratings of the adequacy of performance of a hands-on task might be influenced by irrelevant personal characteristics, such as race, gender, or personal appearance (criterion contamination).

Criterion contamination is most serious when construct-irrelevant factors that influence the criterion measure are correlated with the predictors. Similarly, criterion deficiency is most serious when the criterion measure fails to include elements of job performance that are related to the predictor constructs (Brogden and Taylor, 1950). Of particular concern are situations in which criterion deficiency or contamination “enhance[s] the apparent validity of one predictor while lowering the apparent validity of another” (Cronbach, 1971:488). An understanding of predictor constructs and criterion constructs is necessary to evaluate these possibilities.

Predictor-Criterion Relationships

The relationship between a predictor and a criterion measure may be evaluated in a variety of ways (see, e.g., Allred, Vol. II). Correlation coefficients are often used to express the relationship, but a more basic summary is provided by simple tables and graphs. In the most basic form, the data of a criterion-related validation study using a single test and a single criterion measure consist of pairs of test and criterion scores for each person, or counts of the number of times each combination of test and criterion scores occurs.

Consider, for example, a simple hypothetical situation with three levels of test scores (low, middle, and high) and four levels of criterion performance (unacceptable, adequate, above average, and superior). Pairs of test and job performance criterion scores are obtained for a sample of 400 individuals. The number of people with each possible combination of test and criterion scores is shown in Table 8-1.

TABLE 8-1 Frequency of Test and Criterion Score Combinations

                         Test Score
Criterion Performance     Low   Middle   High   Total
Superior                    2       16     15      33
Above average              11       24     30      65
Adequate                   40      144     53     237
Unacceptable               47       16      2      65
Total                     100      200    100     400

This simple table contains all the information about the relationship between the test scores and the criterion measure. With such a small number of possible scores on the test and the criterion, this basic two-way table can also be used to summarize the findings. For example, individuals with low test scores are most likely to have performance on the criterion that is unacceptable (47 of 100, compared with 16 of 200 with test scores in the middle range or 2 of 100 for those with high test scores). Similarly, the percentage of individuals with high test scores who had superior criterion performance scores (15 percent) is nearly twice that of the total group (33 of 400, or about 8 percent) and 7.5 times as great as that of individuals with low test scores (2 percent).

Such simple frequencies provide the basis for constructing another useful summary of the data, known as an expectancy table. “Such a table reports the estimated probability that people with particular values on a test, or on a combination of predictors, will achieve a certain score or higher on the criterion” (Wigdor and Garner, 1982:53). The expectancy table corresponding to the Table 8-1 frequencies is shown as Table 8-2.

TABLE 8-2 Illustrative Expectancy Table: Estimated Probability of a Particular Level of Job Performance Given a Particular Test Score (entries are percentages)

                             Test Score
Criterion Performance         Low   Middle   High
Superior                        2        8     15
Above average or better        13       20     45
Adequate or better             53       92     98

As can be seen, almost all individuals with high test scores (98 percent) are predicted to have adequate or better criterion performance, and nearly half of them (45 percent) are predicted to have performance that is above average or superior. And almost half (47 percent; 100 percent minus the 53 percent predicted to have adequate or better performance) of the individuals with low test scores would be expected to perform at an unacceptable level.

In addition to demonstrating a relationship, an expectancy table makes it obvious that even when the relationship is relatively strong, as in the illustrative example, there will be errors of prediction. Although the vast majority of individuals with high test scores would be expected to have adequate criterion performance, 2 percent would still be expected to perform at an unacceptable level. Similarly, 2 percent of the individuals with low test scores would be expected to have superior performance on the criterion.

If the criterion categories of unacceptable, adequate, above average, and superior used for the example in Table 8-1 were given score values of 1, 2, 3, and 4, respectively, a mean score on the criterion measure for individuals in each of the three test score categories could be easily computed (Table 8-3).

TABLE 8-3 Mean Criterion Scores for the Total Sample and for Groups with Low, Middle, and High Test Scores

                   Test Score
                    Low   Middle   High   Total Sample
Criterion mean     1.68     2.20   2.58           2.16

The tendency shown in Table 8-1 and Table 8-2 for individuals with higher test scores also to have higher performance on the criterion than their counterparts with lower test scores is again apparent in Table 8-3. What is lost, however, is an indication of the degree of error in the predicted performance (e.g., the fact that 2 percent of the individuals with low test scores had superior criterion performance).
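
The arithmetic behind Table 8-2 and Table 8-3 can be reproduced directly from the Table 8-1 frequencies. The short Python sketch below is our added illustration (not part of the original analyses): it cumulates the frequencies from the top criterion category down to get the expectancy percentages and then computes the mean criterion score within each test-score group.

```python
# Illustrative sketch: derive the Table 8-2 expectancy percentages and the
# Table 8-3 mean criterion scores from the Table 8-1 frequencies.
freq = {
    "superior":      {"low": 2,  "middle": 16,  "high": 15},
    "above average": {"low": 11, "middle": 24,  "high": 30},
    "adequate":      {"low": 40, "middle": 144, "high": 53},
    "unacceptable":  {"low": 47, "middle": 16,  "high": 2},
}
levels = ["superior", "above average", "adequate", "unacceptable"]
points = {"superior": 4, "above average": 3, "adequate": 2, "unacceptable": 1}
columns = ["low", "middle", "high"]

col_totals = {c: sum(freq[lvl][c] for lvl in levels) for c in columns}

# Expectancy table: percentage at or above each criterion level, by test score.
for i, lvl in enumerate(levels[:3]):          # cumulate from the top category down
    at_or_above = {c: sum(freq[l][c] for l in levels[: i + 1]) for c in columns}
    pct = [round(100 * at_or_above[c] / col_totals[c]) for c in columns]
    print(f"{lvl} or better: {pct}")          # e.g. adequate or better: [53, 92, 98]

# Mean criterion score within each test-score group (Table 8-3).
for c in columns:
    mean = sum(points[lvl] * freq[lvl][c] for lvl in levels) / col_totals[c]
    print(f"mean criterion score, {c} test scores: {mean:.2f}")   # 1.68, 2.20, 2.58
```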

A variety of tabular and graphical summaries similar in general nature to the above tables can be useful in summarizing relationships between test scores and criterion measures. Scatter diagrams and tables or figures showing the spread as well as the average criterion scores of individuals with specified levels of test scores are particularly useful. See Allred (Vol. II) for a detailed discussion of these and other related techniques.

Although graphs and tables have considerable utility, more concise statistical summaries are more typical and can also be useful for certain purposes. The most common statistical summaries of criterion-related validity results are correlation coefficients and regression equations. A correlation coefficient summarizes in a single number ranging from −1.0 to 1.0 the degree of relationship between test scores and a criterion measure (or between other pairs of variables). A correlation of .0 indicates that there is no linear relationship between the two sets of scores, while a correlation of 1.0 (or −1.0) indicates that there is a perfect positive (or negative) relationship. For simplicity, only linear relationships are considered here. Linear relationships are commonly assumed and widely used in criterion-related validity studies. It is important, however, to keep in mind the possibility that relationships are nonlinear; a variety of techniques is available to investigate the possibility of nonlinearity (see, e.g., Allred, Vol. II).

The correlation between test scores and scores on the criterion for the data in Table 8-1 is .40. In practice, an observed correlation (validity coefficient) of this magnitude between test scores and scores on a criterion would not be unusual.

A linear regression equation expresses the relationship between the test and the criterion scores in terms of a predicted level of criterion score for each value of the test score (low = 1, middle = 2, and high = 3). For the example in Table 8-1, the regression equation is as follows: predicted criterion score = 1.265 + .45 × the test score. Thus, the predicted criterion scores are 1.72, 2.16, and 2.62 for individuals with test scores of 1, 2, and 3, respectively. These predicted values may be compared with the mean criterion scores of 1.68, 2.20, and 2.58 for the three respective score levels (see Table 8-3). The small differences between the two sets of values are due to the use of a linear approximation in the regression equation.
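
The correlation of .40 and the regression equation reported above can be verified directly from the Table 8-1 frequencies. The sketch below is an added illustration (it assumes Python 3.10 or later for the statistics routines); it expands the frequency table into 400 paired scores and recomputes both summaries.

```python
# Check of the correlation (.40) and the regression equation
# (predicted criterion = 1.265 + .45 x test score) for the Table 8-1 data.
from statistics import correlation, linear_regression   # Python 3.10+

# Expand Table 8-1 into paired (test, criterion) scores, coding low/middle/high
# as 1/2/3 and unacceptable/adequate/above average/superior as 1/2/3/4.
freq = {
    (1, 4): 2,  (2, 4): 16,  (3, 4): 15,    # superior
    (1, 3): 11, (2, 3): 24,  (3, 3): 30,    # above average
    (1, 2): 40, (2, 2): 144, (3, 2): 53,    # adequate
    (1, 1): 47, (2, 1): 16,  (3, 1): 2,     # unacceptable
}
x, y = [], []
for (test, criterion), n in freq.items():
    x.extend([test] * n)
    y.extend([criterion] * n)

print(f"r = {correlation(x, y):.2f}")                     # r = 0.40
slope, intercept = linear_regression(x, y)
print(f"predicted criterion = {intercept:.3f} + {slope:.2f} * test score")
predicted = [intercept + slope * t for t in (1, 2, 3)]    # about 1.72, 2.16, 2.62,
print([round(p, 2) for p in predicted])                   # vs. means 1.68, 2.20, 2.58
```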

This general overview has ignored a number of complications that must be considered in criterion-related validity studies. The effects of the reliability of the criterion measure, the effects of basing coefficients only on samples of job incumbents who have already been selected on the basis of test scores and successful completion of training, and the possibility that validities and prediction equations may differ as a function of subgroup (e.g., men and women, or blacks, whites, and Hispanics) are all important considerations in a criterion-related validity study. Issues of how multiple criterion measures should be combined, the degree of generalization of validities across jobs, and the degree to which different combinations of predictors yield different validities across jobs are also critical considerations. Some of these complications are considered in this chapter. Because they are better dealt with in the specific context of the JPM Project, we now turn to a discussion of some of the specifics of the project that are most relevant to an evaluation of the criterion-related validity evidence.

THE NATURE AND INTERRELATIONSHIPS OF CRITERION MEASURES

As discussed in previous chapters, hands-on performance measures are viewed as providing the “benchmark data to evaluate certain surrogate (less expensive, easier to administer tests and/or existing performance information) indices of performance as substitutes for the more expensive, labor intensive hands-on job performance” (Office of the Assistant Secretary of Defense—Force Management and Personnel, 1987:3). Consequently, the quality of hands-on measures takes on special importance within the context of the JPM Project.

It is reasonable to consider hands-on performance measures as benchmarks only to the degree that they are valid and reliable measures of job performance constructs. The threats to validity of criterion contamination and criterion deficiency apply as much to hands-on measures as to alternative criterion measures, such as job knowledge tests, ratings, or administrative records. As is also true of other types of measures, the quality of hands-on measures also depends on the reliability of the measures, that is, the degree to which the scores that are obtained can be generalized across test administrators, tasks, and administration occasions. Further, as Gottfredson (Vol. II) has noted, “job performance can be measured in many ways, and it is difficult to know which are the most appropriate, because there is generally no empirical standard or ‘ultimate’ criterion against which to validate criterion measures.” Thus, it is important to consider the strengths and weaknesses of each of the criterion measures investigated in the JPM Project as well as their relationships.

Hands-On Measures

The development of hands-on measures and the evaluation of their reliability and content representativeness were discussed in previous chapters. Here our focus is the construct validity (see Chapter 4) of the measures. The scoring weights given to steps and tasks, the correlations of part-scores with total scores, and the correlations of hands-on measures with other criterion and predictor measures—all contribute to the evaluation of the construct validity of a hands-on measure. Consider, for example, the hands-on measures developed for the occupational specialty (MOS) of Marine Corps infantry rifleman.

The total hands-on test score (TOTAL) for an infantry rifleman consisted of a weighted sum of the score from the hands-on basic infantry core (CORE) and scores obtained from MOS-unique (UNIQUE1) and supplementary (UNIQUE2) tasks (Mayberry, 1988). The CORE was a weighted sum of scores obtained from tasks in 12 basic infantry duty areas. The UNIQUE1 task involved live fire with a rifle; the UNIQUE2 task consisted of more advanced versions of two of the tasks in CORE (squad automatic weapon and tactical measures). The task weights, test-retest reliabilities, correlations of task scores with the hands-on TOTAL score, and correlations of task scores with GT, the General Technical aptitude area composite from the ASVAB used for classification into infantry occupational specialties, are shown in Table 8-4.

TABLE 8-4 Task Scoring Weights, Test-Retest Reliabilities, and Correlations with Hands-On Total Score (TOTAL) and General Technical (GT) Aptitude Area Composite (Marine Infantry Rifleman)

                                                                    Correlation with
Duty Area/Task                      Scoring Weight   Reliability    TOTAL      GT

Basic infantry core (CORE)
Land navigation                           3              .73         .64       .51
Tactical measures 1                       3              .61         .58       .38
Squad automatic weapon 1                  2.5            .20         .44       .18
Communications                            2.5            .47         .48       .30
NBC defense                               2              .39         .53       .29
First aid                                 2              .27         .48       .29
Security/intelligence                     1.5            .22         .42       .22
Grenade launcher                          1.5            .48         .40       .21
Mines                                     1.5            .25         .32       .20
Night vision device                       1              .22         .30       .13
Light antitank weapon                     1              .48         .40       .26
Hand grenades                             1              .25         .13       .02

MOS-unique and supplementary tasks
Rifle, live fire                          *              .45         .66       .13
Tactical measures 2                       *              .45         .49       .32
Squad automatic weapon 2                  *              .17         .33       .11

* The hands-on total score is defined by TOTAL = .60(CORE) + .25(UNIQUE1) + .15(UNIQUE2), where CORE is the weighted sum of the basic infantry core duty area/task scores, UNIQUE1 is the score from the rifle live fire task, and UNIQUE2 is the score from the MOS supplementary tasks (tactical measures 2 and squad automatic weapon 2).

SOURCE: Based on Mayberry (1988).
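
To make the scoring scheme concrete, the following sketch shows how a TOTAL composite of this form can be computed. The weights and the composite formula are taken from Table 8-4 and its footnote; the examinee scores and the normalization of CORE by the sum of the task weights are our own illustrative assumptions.

```python
# Illustrative computation of the Marine infantry rifleman hands-on composite:
# TOTAL = .60(CORE) + .25(UNIQUE1) + .15(UNIQUE2), with CORE a weighted sum of
# the 12 basic infantry duty-area/task scores (weights from Table 8-4).
core_weights = {
    "land navigation": 3.0, "tactical measures 1": 3.0, "squad automatic weapon 1": 2.5,
    "communications": 2.5, "NBC defense": 2.0, "first aid": 2.0,
    "security/intelligence": 1.5, "grenade launcher": 1.5, "mines": 1.5,
    "night vision device": 1.0, "light antitank weapon": 1.0, "hand grenades": 1.0,
}

def total_score(core_task_scores, unique1, unique2):
    """Weighted hands-on composite; all task scores are assumed to share one scale."""
    core = sum(core_weights[task] * score for task, score in core_task_scores.items())
    core /= sum(core_weights.values())   # dividing by the weight sum keeps CORE on the
                                         # task scale; this normalization is our assumption
    return 0.60 * core + 0.25 * unique1 + 0.15 * unique2

# Hypothetical percentage-correct scores for a single examinee.
example = {task: 75.0 for task in core_weights}
print(round(total_score(example, unique1=80.0, unique2=70.0), 2))   # 75.5
```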

The pattern of correlations of task scores with the TOTAL is consistent with what would be expected from knowledge of the scoring weights and the test-retest reliabilities. The land navigation task score, for example, would be expected to have a relatively high correlation with the TOTAL hands-on score because it is one of the two basic infantry duty area tasks with the highest scoring weights and it has the highest test-retest reliability. The observed correlation of .64 between land navigation and TOTAL is consistent with this expectation. The fact that rifle, live fire, has the highest correlation (.66) with TOTAL of any of the tasks despite its marginal test-retest reliability is due in part to the fact that it has the largest weight of any single task used to define TOTAL. As noted in Table 8-4, UNIQUE1, which is the single rifle live fire task, has a weight of .25 in the computation of TOTAL. The CORE score, which is given a weight of .60 in computing TOTAL, is a composite based on 12 tasks; the UNIQUE2 score, with a weight of .15, is a composite based on 2 tasks.

From an inspection of the correlations of the task scores with TOTAL and a review of the actual measures for each duty area, it is clear that TOTAL measures a relatively complex job performance construct. It involves a combination of tasks that depend heavily on cognitive knowledge of duty area responsibilities (e.g., land navigation; tactical measures 1 and 2; nuclear, biological, and chemical (NBC) defense; and communications). TOTAL is also strongly related to tasks requiring complex psychomotor skills, most notably live fire with a rifle. This apparent complexity and the differential dependency of subtasks on cognitive ability are supported by an inspection of the correlations of the task scores with GT, the General Technical ASVAB composite score. As would be expected, tasks judged to have a greater cognitive component have relatively high correlations with GT, and the duty areas that generally involve manipulation of weapons (rifle, live fire; squad automatic weapon 1 and 2; and hand grenades) have correlations of less than .20 with GT.

It is evident that the predictive validity that can be obtained using scores based on the ASVAB for the Marine infantry rifleman MOS depends not only on the way in which hands-on task measures are obtained but also on the way in which an overall composite hands-on score is defined. Increasing the relative weight that is given to such tasks as land navigation and tactical measures could be expected to increase the criterion-related validities of ASVAB composites. Conversely, increasing the relative weight that is given to such tasks as rifle, live fire, and squad automatic weapon would be expected to decrease ASVAB validities.

We do not mean to suggest that the Marines should use weights other than the ones reported in Table 8-4. Those weights are based on judgments of subject matter experts regarding the importance of each task to the job of a marine rifleman. The point is, however, that before a hands-on measure is accepted as a benchmark, or even as the most important criterion measure to consider, it is critical that the construct validity of the hands-on measure be evaluated and the relevance of the construct as measured to the mission of the Service be judged.

Job Knowledge Tests

Written job knowledge tests are sometimes used as performance criteria. Compared with hands-on performance tests, they are relatively inexpensive to construct, administer, and score. The rationale for using written tests as a criterion measure is generally based on a showing of content validity (using job analyses to justify the test specifications) and on arguments that job knowledge is a necessary, albeit not sufficient, condition for adequate performance on the job.

Some have suggested more elaborate justifications for paper-and-pencil tests of job knowledge. Hunter (1983, 1986), for example, has argued that very high correlations between job knowledge and job performance measures are to be expected (on the assumption that knowing how and being able to do something are much the same) and has reported estimated correlations, based on corrections for reliability and range restriction, as high as .80 between job knowledge and work-sample measures of job performance. Such estimates are obtained only after substantial adjustments for reliability and range restriction, however, and those adjustments depend on strong assumptions. Moreover, as noted by Wigdor and Green (1986:98), written tests “require a much greater inferential leap from test performance to job performance” than do job-sample tests.

Paper-and-pencil job knowledge tests are widely criticized as criterion measures on the grounds of contamination and deficiency. The written format itself introduces a factor of vocabulary-grammar-verbal facility into the performance test that may not be a part of the job—or, even if relevant to the job, not the object of measurement. In multiple-choice tests, a small set of alternatives is identified for the examinee, a situation that is unlikely to be reproduced in the actual work setting. The major deficiency, of course, is that such tests do not deal directly with the ability to perform a task.

One problem involved in using a written test as a criterion measure is particularly pertinent to the JPM Project. The fact that the ASVAB is also a paper-and-pencil test means that all aspects that are common to such tests will lead to high correlations between the predictor and the criterion. This represents a special evaluation problem; because the predictor and the criterion are similarly contaminated, the degree of correlation may be spurious.

Correlations of job knowledge tests with other criterion measures, particularly with hands-on job performance measures, take on particular importance due to the concern about criterion contamination that may be correlated with the predictor test scores. Results reported for 15 specialties/ratings (9 Army, 4 Marine Corps, and 2 Navy) for which hands-on performance measures and written job knowledge tests were administered are shown in Table 8-5. As can be seen, the correlations are consistently positive, ranging from .35 to .61.

TABLE 8-5 Correlations of Paper-and-Pencil Job Knowledge Test Scores With Hands-On Job Performance Total Score

Service         Specialty (MOS/Rating)                            Correlation
Army            Infantryman                                            .44
                Cannon crewman                                         .41
                Tank crewman                                           .47
                Radio teletype operator                                .56
                Light wheel vehicle/power generator mechanic           .35
                Motor transport operator                               .43
                Administrative specialist                              .57
                Medical specialist                                     .46
                Military police                                        .37
Marine Corps    Infantry assaultman                                    .49
                Infantry machinegunner                                 .61
                Infantry mortarman                                     .55
                Infantry rifleman                                      .52
Navy            Machinist's mate (engine room)                         .43
                Machinist's mate (generator room)                      .39
                Radioman                                               .54

SOURCES: Army results are based on Hanser's report to the Committee on the Performance of Military Personnel at the September 1988 workshop in Monterey, Calif. The Marine Corps results are based on tables provided for the September 1988 workshop. The Navy results are based on Laabs's report at that workshop and Office of the Assistant Secretary of Defense (1987:45-46).

These correlations demonstrate that job knowledge tests are significantly related to hands-on performance measures. The degree of relationship would appear to be even stronger if adjustments were made for the less-than-perfect reliabilities of both measures. For example, the estimated reliability (relative G coefficient) for the hands-on measure for machinist's mates in the engine room is .72 (Laabs, 1988; see also the results in Table 6-3 for 2 examiners and 11 tasks). Adjusting the .43 correlation in Table 8-5 for the .72 reliability of the hands-on measure and an assumed job knowledge test reliability of .85 would yield a corrected correlation of .55. Increases of a similar order of magnitude might reasonably be expected for the other correlations in Table 8-5.

Even with adjustments for reliability, the correlations between job knowledge and hands-on job performance tests would remain substantially less than 1.0. Thus, as would be anticipated, the two criterion measures do not measure exactly the same constructs. In other words, using a strict standard of equivalence, job knowledge tests are not interchangeable with hands-on performance tests. Compared with other variables, however, the link be-
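
The adjustment applied above is the standard correction for attenuation. The sketch below is our illustration of that computation for the machinist's mate (engine room) figures; the .85 job knowledge reliability is, as in the text, an assumed value.

```python
# Correction for attenuation: estimate the correlation between the underlying
# constructs from an observed correlation and the two measures' reliabilities.
from math import sqrt

def disattenuate(r_observed, reliability_x, reliability_y):
    return r_observed / sqrt(reliability_x * reliability_y)

# Machinist's mate (engine room): observed r = .43, hands-on reliability = .72,
# assumed job knowledge test reliability = .85.
print(round(disattenuate(0.43, 0.72, 0.85), 2))   # 0.55
```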

TABLE 8-12 Number of Studies by Focal Group Sample Size for Current ASVAB Differential Prediction Data

Sample Size in Focal Group    Hands-On Performance    Job Knowledge
Race comparisons
  Fewer than 25                        7                    2
  25 to 75                             8                    5
  75 or more                           6                    6
Gender comparisons
  Fewer than 25                        6                    0
  25 to 75                             7                    5
  75 or more                           3                    2

SOURCE: Data submitted to the Committee on the Performance of Military Personnel.

... than in the comparison groups. This degree of instability in the results for women and minorities should be kept in mind in evaluating the group-to-group differences presented in this section.

It should also be remembered that validity correlations will differ depending on the heterogeneity of the group scores. When one group has substantially less variance on either variable in the correlation, the size of the correlation will be smaller, even if, from a regression perspective, the test is equally predictive for both groups. Also, when one group has a considerably lower mean value on the predictor, and there is a substantial degree of selectivity for the job, the selection will impinge more strongly on the group with the lower mean; corrections for range restriction, which are typically made on the total population, might be inappropriate unless the same regression system is applicable to both groups.
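
One common form of range-restriction correction, often described as Thorndike's Case II (direct selection on the predictor), is sketched below. It is included only to make the preceding point concrete; it is not a description of the Services' actual procedures, and the input values are invented. The corrected value depends on how much the predictor's variance has been curtailed, which is exactly why a single total-population correction can misstate matters for a group that was selected more severely.

```python
# Illustrative range-restriction correction (Thorndike Case II form).
from math import sqrt

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the unrestricted predictor-criterion correlation."""
    u = sd_unrestricted / sd_restricted        # degree of restriction on the predictor
    r = r_restricted
    return r * u / sqrt(1.0 + r * r * (u * u - 1.0))

# Hypothetical: the same observed validity of .25 implies a larger corrected
# value in a group whose predictor scores were more heavily restricted.
print(round(correct_for_range_restriction(0.25, sd_unrestricted=10.0, sd_restricted=6.0), 2))  # 0.40
print(round(correct_for_range_restriction(0.25, sd_unrestricted=10.0, sd_restricted=8.0), 2))  # 0.31
```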

Comparisons between minorities and nonminorities and between women and men are discussed separately below.

Minority Group Comparisons

Validity Coefficients

Although comparisons of correlations can be quite misleading when concern centers on entire prediction systems, they are nevertheless useful when concern centers on groups for which it may be hypothesized that the predictor has no utility in selection. Table 8-13 summarizes correlations between criterion measures and selection composites used by the Services. (One occupational specialty with only three black recruits in the validation sample was excluded from this analysis.)

TABLE 8-13 Weighted Average Correlations Between ASVAB Predictors and Job Performance Criteria for Black and Nonminority Samples

                                        Number of   Blacks            Nonminorities
Predictor             Criterion         Studies     N        Avg. R   N        Avg. R
Selection composite   Hands-on          21          1,487     .22     5,557     .29
                      Job knowledge     12          1,304     .26     4,485     .43
AFQT                  Hands-on          20          1,118     .14     4,303     .17
                      Job knowledge     11            935     .20     3,231     .38

NOTE: Average correlations using within-group sample sizes as weights.

SOURCE: Data submitted to the Committee on the Performance of Military Personnel.

When viewed in this global manner, these results reveal the already-noted fact that validities tend to be higher when a written job knowledge test is used as the criterion as opposed to a hands-on measure of performance (due at least in part to method effects). However, the average correlations for blacks and nonminorities are much more similar in magnitude when the hands-on measure serves as the criterion (.22 and .29, as compared with correlations of .26 and .43 on the written test). For both criteria, the average predictive validity of the selection test is lower for minorities, but only markedly so when another written test serves as the criterion. Although there is substantial variability in the size of the difference from one job specialty to another, scatterplots of correlations that had been transformed to Fisher's Z's showed that, in the majority of cases, when a predictor lacked utility for minorities, its utility for nonminorities was also questionable.
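
The note to Table 8-13 indicates that the averages are weighted by within-group sample sizes. A minimal sketch of that averaging follows; the study-level correlations and sample sizes in the example are hypothetical, since only the aggregated values are reported in the table.

```python
# Sample-size-weighted average of study-level validity coefficients, as in the
# note to Table 8-13.  The (n, r) pairs below are hypothetical.
def weighted_average_r(studies):
    """studies: iterable of (sample_size, correlation) pairs for one group."""
    total_n = sum(n for n, _ in studies)
    return sum(n * r for n, r in studies) / total_n

hypothetical_studies = [(120, 0.18), (60, 0.25), (200, 0.21), (45, 0.30)]
print(round(weighted_average_r(hypothetical_studies), 2))   # 0.22
```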

Prediction Equations

The comparisons of prediction equations consider only cases for which a measure of hands-on performance served as the criterion. These were the only data in which all four Services were represented. Customary procedures were followed in this evaluation of group-to-group differences in prediction equations. The sections that follow discuss standard errors of estimate and regression slopes and intercepts, in that order, followed by direct comparisons of predicted scores using the within-group and combined-group equations.

Throughout this section, the results of the differential prediction analyses that were provided by the four Services for this study are presented without regard to the levels of statistical significance observed in individual analyses. Rather than base decisions about the extent of prediction differences among groups on individual studies, this section attempts to describe general trends that can be seen from a description of observed results from all pertinent studies. To discourage potentially misleading overinterpretations of these trends, results from studies with particularly small samples of recruits in the focal group of a comparison are identified.

Standard Errors of Prediction

Studies of group-to-group differences in prediction systems have generally found regression equations to yield more accurate predictions of performance for nonminorities and women than for minorities and men (Linn, 1982). The use of hands-on performance criteria in the JPM Project, as opposed to paper-and-pencil tests, does not appear to have altered that finding to any great extent. The stem-and-leaf plot in Table 8-14 shows the distribution of ratios of mean-squared errors of prediction for blacks and nonminorities. Ratios based on samples of fewer than 25 blacks are shown in regular type, others in boldface type. Ratios above 1.0 are cases in which prediction for blacks is less accurate than prediction for nonminorities. Thus, in 13 of 20 occupational specialties the errors of prediction for blacks were larger than those for nonminorities, given the selection composite as predictor.

TABLE 8-14 Stem-and-Leaf Plot of Ratios of Mean-Squared Errors of Prediction for Blacks and Nonminorities

AFQT (leaf)    Stem    Selection Composite (leaf)
               1.8
          5    1.7     4
               1.6
               1.5
          6    1.4
          0    1.3     29
          2    1.2     125
       8653    1.1     27
         50    1.0     046899
       8530     .9     47
        631     .8     23
          7     .7     9
          9     .6     7
                .5
          3     .4
                .3
                .2     9

Ratios above 1.0 indicate that predictions for blacks are less accurate than those for nonminorities; ratios below 1.0 indicate that they are more accurate.

NOTE: Leaves in boldface type are based on samples of 25 or more black recruits.

SOURCE: Based on data submitted to the Committee on the Performance of Military Personnel.

The same picture emerges when the AFQT is the predictor. However, the largest differences were for jobs with small sample sizes in the black cohort. Thus, the stable values in the table suggest relatively minor group differences in the accuracy of prediction.

Slopes and Intercepts

Hypothesis tests of group-to-group differences in slopes were conducted for all occupational specialties with appropriate data. The t-ratios on which such tests were based are shown in the stem-and-leaf plot in Table 8-15. In this plot, t-ratios based on the selection composite form leaves to the right; those based on the AFQT form leaves to the left. Positive t-ratios indicate that the slope for blacks was steeper than the slope for nonminorities. (There seemed to be no pattern that would indicate that the regression line for blacks was systematically flatter than that for nonminorities, as might be expected from other comparisons of this type; see Linn, 1982; Hartigan and Wigdor, 1989.)

TABLE 8-15 Stem-and-Leaf Plot of T-Ratios for Differences Between Slopes for Blacks and Nonminorities

AFQT (leaf)    Stem    Selection Composite (leaf)
          0     2.
          7     1.     8
         10     1.     023
       9865     0.     5699
        320     0.     13
         10    −0.     022233
               −0.     6
        100    −1.     001
         86    −1.     5
               −2.
         85    −2.

NOTE: Leaves in boldface type are based on samples of 25 or more black recruits.

SOURCE: Based on data submitted to the Committee on the Performance of Military Personnel.
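
For readers unfamiliar with the statistic plotted in Table 8-15, the sketch below shows one conventional way to form a t-ratio for the difference between two slopes estimated in independent samples. It is a generic illustration with invented numbers, not the committee's exact computation.

```python
# Approximate t-ratio for the difference between regression slopes estimated
# in two independent groups (focal group minus reference group); a positive
# value here corresponds to a steeper slope in the focal (black) sample.
from math import sqrt

def slope_difference_t(b_focal, se_focal, b_reference, se_reference):
    return (b_focal - b_reference) / sqrt(se_focal ** 2 + se_reference ** 2)

# Hypothetical slopes and standard errors for one occupational specialty.
print(round(slope_difference_t(0.42, 0.15, 0.30, 0.06), 2))   # 0.74
```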

As the values in the table indicate, the t-ratios for slope differences between blacks and nonminorities tend to be larger in absolute value when the AFQT is the predictor variable in the equation. When the selection composite is the predictor, no slope differences are significant at the 10 percent level. Many of the t-ratios in the plot are instances in which the significance test had very low power due to the small sample of black recruits (Drasgow and Kang, 1984). If one were to argue that the alpha level for these tests should be raised to guard against Type II errors (here, concluding the slopes are the same when, in fact, they are not), only two slope differences involving the selection composite would be significant at the 25 percent level. However, 5 of 20 slope differences would be judged significant at the 10 percent level when the AFQT is the predictor.

The contrasting results of tests for slope differences provide a useful illustration of the possible effects of selection on the outcome of a differential prediction study. Because the AFQT is not an explicit selection variable in this context (see Dunbar and Linn, Vol. II), it is possible that the more frequent slope differences are at least a partial artifact of differing degrees of range restriction in the comparison groups. The fact that the two largest slope differences for the selection composite were from jobs with nonsignificant slope differences for the AFQT is consistent with this hypothesis of a selection effect.

Given the above results regarding slopes, within-group intercepts in jobs in which slopes were not significantly different at the 25 percent level were recalculated based on a pooled estimate of the slope. For each job, the intercept for blacks was subtracted from that for nonminorities, and the difference was divided by the standard deviation of the criterion in the black sample. These standardized differences between intercepts are given in the stem-and-leaf plot shown as Table 8-16.

TABLE 8-16 Stem-and-Leaf Plot of Standardized Differences Between Intercepts for Blacks and Nonminorities

AFQT (leaf)    Stem    Selection Composite (leaf)
          2     1.
                1.     1
                0.
          6     0.
          4     0.     45
     333322     0.     2223
      11110     0.     000011
          1    −0.     011111
          2    −0.
               −0.

NOTE: Leaves in boldface type are based on samples of 25 or more black recruits.

SOURCE: Based on data submitted to the Committee on the Performance of Military Personnel.

Although in most cases small in magnitude, the intercept differences do show a trend found in previous differential prediction studies toward positive values (Linn, 1982; Houston and Novick, 1987), which implies that the use of prediction equations based on the pooled sample in these jobs would result in overprediction of black performance more often than not. However, the amount of overprediction that would typically be found appears to be smaller than in previous studies, which have used less performance-oriented criterion variables.

Criterion Predictions

Perhaps a more complete picture of the magnitude of group-to-group differences in regression lines can be obtained by examining the actual criterion predictions made by group-specific and pooled-group equations. Such differences were calculated for each occupational specialty at three points on the predictor scale: one standard deviation below the black mean (−1 SD), the mean in the black sample, and one standard deviation above the black mean (+1 SD). Means and standard deviations of these differences (expressed in black standard deviation units) were calculated for groups of studies based on the size of the black sample. The results are shown in Table 8-17.

TABLE 8-17 Standardized Differences Between Predicted Scores from Black and Pooled Equations

                              Selection Composite          AFQT
Black Sample Size             −1 SD   Mean   +1 SD         −1 SD   Mean   +1 SD
Fewer than 25       (Avg.)     .12     .17    .22           .20     .27    .34
                    (SD)       .37     .44    .56           .38     .47    .67
Between 25 and 75   (Avg.)     .08     .07    .05           .12     .08    .04
                    (SD)       .14     .11    .20           .17     .18    .33
More than 75        (Avg.)     .07     .10    .13           .07     .15    .23
                    (SD)       .10     .09    .11           .10     .09    .20

SOURCE: Based on data submitted to the Committee on the Performance of Military Personnel.

Three features of the results in the table should be noted. First, all average differences in the table are positive, which reflects once again the fact that the combined-group equations, on average, lead to overprediction of black performance (i.e., predictions that are somewhat higher on average than the performance achieved by blacks with particular scores on predictor variables). Second, the largest differences in the table tend to be those that are the least stable with respect to sampling (from jobs with data on fewer than 25 black recruits). And third, the differences tend to be larger when the AFQT is the predictor. This last result, again, may well be due to the fact that the AFQT is not an explicit selection variable for most jobs; hence, group differences in this case are confounded by differing degrees of range restriction in the comparison groups.
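
The entries in Table 8-17 are formed by comparing the criterion score predicted from the pooled equation with the score predicted from the black equation at fixed points on the predictor scale, expressed in black criterion standard-deviation units. The sketch below illustrates that comparison; all of the regression coefficients and moments in it are invented.

```python
# Standardized differences between pooled-equation and within-group predictions
# at -1 SD, the mean, and +1 SD of the focal group's predictor distribution.
def prediction_differences(pooled, focal, x_mean, x_sd, y_sd_focal):
    """pooled and focal are (intercept, slope) pairs; returns standardized diffs."""
    diffs = []
    for x in (x_mean - x_sd, x_mean, x_mean + x_sd):
        pred_pooled = pooled[0] + pooled[1] * x
        pred_focal = focal[0] + focal[1] * x
        diffs.append(round((pred_pooled - pred_focal) / y_sd_focal, 2))
    return diffs

# Hypothetical pooled and black-sample equations for one specialty.
print(prediction_differences(pooled=(10.0, 0.50), focal=(9.8, 0.49),
                             x_mean=48.0, x_sd=8.0, y_sd_focal=6.0))
# [0.1, 0.11, 0.13] -- positive values indicate overprediction of black performance
```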

The Performance-Based Focus on Fairness

As we have seen in the previous analyses, there are small differences in the accuracy of prediction of job performance from test scores for blacks and nonminorities, but for practical purposes the same regression lines predicted performance about as well for both groups. Remembering always the thinness of the data for the black job incumbents, we observe that the ASVAB appears to fulfill that particular definition of fairness. However, this analysis does not address imperfect prediction per se as it affects the lower-scoring group.

A comparison of the relative size of group differences on the predictor and criterion measures is illuminating. Table 8-18 summarizes the mean differences between the scores of black and nonminority enlisted personnel on the AFQT and two types of criterion measure, a written job knowledge test and the hands-on performance test. In these figures, a negative difference indicates that the average black job incumbent scored below the average nonminority job incumbent by the given number of standard deviation units. The average group difference in scores across the 21 studies was −.85 of a standard deviation on the AFQT, −.78 of a standard deviation on the job knowledge test, and −.36 of a standard deviation on the hands-on test. In addition to these mean differences, the stem-and-leaf plot in Table 8-19, which compares group differences on hands-on and job knowledge criterion measures, shows that the entire distribution of differences on the hands-on criterion is shifted in a direction that indicates greater similarity between blacks and nonminorities on the more concrete and direct measure of job performance.

To the extent that we have confidence in the hands-on criterion as a good measure of performance on the job, these findings strongly suggest that scores on the ASVAB exaggerate the size of the differences that will ultimately be found in the job performance of the two groups. These results are in line with results for the private sector, as recently reviewed in Fairness in Employment Testing (Hartigan and Wigdor, 1989), which found an average score difference of 1 standard deviation between blacks and whites on the General Aptitude Test Battery compared with a one-third standard deviation difference on the criterion measure (typically, supervisor ratings).

One might interpret the JPM Project results as meaning that the initial edge given nonminorities as a group by somewhat higher general aptitudes (as measured by verbal and mathematical reasoning tests) is greatly reduced by experience on the job. Note, however, that the written job knowledge test used as the second criterion measure in this analysis exhibits only a small diminution in the size of the average group differences. Since the AFQT and the job knowledge test are both paper-and-pencil multiple-choice tests, a strong method effect seems to be involved. This issue goes well beyond the JPM Project; it calls for the attention of the measurement profession as a whole.

TABLE 8-18 Stem-and-Leaf Plots of Standardized Differences Between (a) Mean AFQT and Job Knowledge Criterion Performance of Black and Nonminority Enlisted Personnel and (b) Mean AFQT and Hands-On Criterion Performance of Black and Nonminority Enlisted Personnel

(a) Job Knowledge Criterion (Mean = −.78, SD = .24) versus AFQT (Mean = −.85, SD = .41)

Job Knowledge (leaf)    Stem    AFQT (leaf)
                        +.0
                        −.0
                        −.1
                        −.2     11
                        −.3     2
                  87    −.4     27
                   7    −.5
                 951    −.6     568
                   7    −.7     9
                   3    −.8     579
                   0    −.9
                   3   −1.0     04678
                  75   −1.1     3
                       −1.2     38
                        ...
                       −1.8     7

(b) Hands-On Criterion (Mean = −.36, SD = .31) versus AFQT (Mean = −.85, SD = .41)

Hands-On (leaf)    Stem    AFQT (leaf)
             55    +.0
              3    −.0
             85    −.1
           9410    −.2     11
         974222    −.3     2
             62    −.4     27
             72    −.5
                   −.6     568
                   −.7     9
                   −.8     579
                   −.9
              0   −1.0     04678
                  −1.1     3
              4   −1.2     38
                   ...
                  −1.8     7

NOTE: Leaves in boldface type are based on samples of 25 or more black recruits.

SOURCE: Based on data submitted to the Committee on the Performance of Military Personnel.

TABLE 8-19 Stem-and-Leaf Plot of Standardized Differences Between Mean Criterion Performance of Black and Nonminority Enlisted Personnel

Job Knowledge Criterion (leaf)    Stem    Hands-On Criterion (leaf)
(Mean = −.78, SD = .24)                   (Mean = −.36, SD = .31)
                                  +.0     55
                                  −.0     3
                                  −.1     58
                                  −.2     0149
                                  −.3     222479
                            87    −.4     26
                             7    −.5     27
                           951    −.6
                             7    −.7
                             3    −.8
                             0    −.9
                             3   −1.0     0
                            75   −1.1
                                 −1.2     4

NOTE: Leaves in boldface type are based on samples of 25 or more black recruits.

SOURCE: Based on data submitted to the Committee on the Performance of Military Personnel.

Gender Comparisons

Data on the differences between prediction systems for men and women in military occupational specialties are limited not only by the small number of female recruits in individual jobs, but also by the number of jobs for which the relevant data are available. Accordingly, only general discussion of apparent trends in these data is possible.

Validity Coefficients

Weighted average correlations between ASVAB predictors and performance criteria for men and women are shown in Table 8-20. As was true in the comparisons based on race, the average correlation is higher when the criterion is a written test than when the criterion is a direct measure of performance. However, the dominant feature of these data is the markedly higher average correlation between the AFQT and hands-on performance among women. In addition, and unlike the black/nonminority comparison, there seems to be no clear relationship between the size of the group difference between validity coefficients and the type of criterion variable.

TABLE 8-20 Weighted Average Correlations Between ASVAB Predictors and Performance Criteria for Samples of Men and Women

                                        Number of   Women             Men
Predictor             Criterion         Studies     N        Avg. R   N        Avg. R
Selection composite   Hands-on           3            814     .24     3,454     .26
                      Job knowledge      7            606     .41     2,318     .43
AFQT                  Hands-on          13            814     .27     3,454     .19
                      Job knowledge      7            606     .40     2,318     .37

NOTE: Average correlations using within-group sample sizes as weights.

SOURCE: Data submitted to the Committee on the Performance of Military Personnel.

Regression Equations

Standard errors of estimate, slopes, and intercepts from the regressions of hands-on performance on the ASVAB selection composite were examined in the 16 occupational specialties for which data on male and female recruits were available. Although the results from individual jobs were quite mixed, when viewed as a whole the data do suggest some inconsistencies with previous findings from gender comparisons of this type.

In general, the accuracy of prediction, as measured by the standard error of estimate, was similar for men and women. In 10 of the 16 jobs, predictions were more accurate among women than among men. This is generally consistent with findings reported by Linn (1982) in educational contexts and by Dunbar and Novick (1988) in a military training context. In instances in which the standard errors of estimate were markedly different in magnitude, sampling error appeared to be a likely cause of the difference.

Comparisons of slopes and intercepts showed mixed results. Whereas the typical finding in studies of differential prediction by gender is that slopes among samples of women are steeper than among samples of men, half of the jobs in the current data showed the opposite trend. One might suspect that the use of a performance criterion rather than a written test of job knowledge might account for this discrepancy with previous studies. However, even among the seven jobs that provided criterion data from a test of job knowledge, there were four in which the slope in the samples of women was flatter than it was in the samples of men.

A second and perhaps more notable contrast with the findings of previous studies is the difference between intercepts in instances in which slopes were judged to be similar. In 8 of 12 such instances, the intercept of the regression equation in the female sample was smaller than that of the male sample. Such findings would imply that the use of a combined prediction equation for men and women in these instances would result in the overprediction of female performance on the hands-on criterion; that is, female recruits would be predicted to perform better than they actually would. This finding is counter to the results summarized by Linn (1982), in which the dominant trend was for the criterion performance of women to be underpredicted by the use of a combined equation. Whether this contrast can be attributed to the use of the hands-on performance criterion is only suggested by results from the seven specialties that also used a job knowledge test as the criterion. Among those seven, five showed intercept differences that would suggest underprediction of female performance on the test of job knowledge, a result more in line with the literature on gender differences in prediction systems. Given the inconsistencies reported here, further investigations of prediction differences between men and women in the military should continue to examine the role of the criterion in the interpretation of such differences.

CONCLUSION

The JPM Project has succeeded in demonstrating that it is possible to develop hands-on measures of job performance for a wide range of military jobs. The project has also shown that the ASVAB can be used to predict performance on these hands-on measures with a useful degree of validity. With the addition of measures of spatial ability and perceptual-psychomotor abilities to the ASVAB, some modest increments in the prediction of hands-on performance could be obtained. Noncognitive predictors, however, are apt to be useful for purposes of predicting criteria such as discipline and leadership but are unlikely to improve the prediction of actual hands-on job performance.

Although the 23 jobs studied to date in the JPM Project are relatively diverse, many more jobs were not included in the research than were. The validation task that remains to be accomplished is to develop a means of generalizing the results for these 23 jobs to the full range of jobs in the Services. Each Service is currently conducting research that seeks to accomplish this task (Arabian et al., 1988).

Another task that remains to be accomplished, building on these results, is to link enlistment standards to job performance. The Department of Defense is currently conducting research that seeks “to develop a method or methods for linking recruit quality requirements and job performance data” (Office of the Assistant Secretary of Defense—Force Management and Personnel, 1989:8-1). The hands-on job performance data are a critical aspect of that effort.