Part II:
Job Performance Measurement Issues

The JPM Project was an ambitious effort to measure on-the-job performance of enlisted military personnel. The project offered researchers the opportunity to test hypotheses about differences between hands-on measures of job performance and paper-and-pencil surrogates of performance, about ethnic group differences in performance, and about gender differences in performance. Among the many concerns of the JPM Project and the committee, three were of primary interest. The first concern centered on the adequacy of test administration activities, such as scheduling, test security, and administration consistency from one individual to the next. A second concern was that the scaling of hands-on performance scores should go beyond rank ordering. That is, there was a need for a score scale that could be interpreted in terms of, at a minimum, acceptable and unacceptable performance, and preferably at finer gradations. A third concern centered on how job tasks should be selected. The committee recommended that stratified random sampling of tasks be used rather than purposive sampling. It argued that purposive selection might capture only a certain type of task amenable to testing and might not be representative of the job, whereas stratified random sampling provided an unbiased selection of representative tasks and could more easily be defended.

Part II of this volume contains two papers dealing with various aspects of job performance measurement. In the first paper, Lauress Wise addresses the three concerns listed above in a thorough analysis of issues surrounding the validity of the JPM data and data from other sources such
as the Synthetic Validity (SYNVAL) project. He also examines the appropriateness of using these data for setting performance goals in the cost/performance trade-off model. In the second paper in this section, Rodney McCloy pursues further the critical issue of generalizing performance results from jobs on which performance has been measured to jobs for which no data are available. This issue has been of particular importance because only a few jobs were selected for detailed study in the JPM Project and there was a need to generalize the findings to the several hundred jobs performed by first-term enlisted personnel. For the current model, the multilevel regression analysis method was recommended because of its contributions to performance prediction at the job level. The SYNVAL approach was considered but was deemed too time-consuming for the present project. McCloy discusses the application of the multilevel regression analysis in detail.

Setting Performance Goals for the DoD Linkage Model

Lauress L. Wise

The times certainly are changing, particularly for the Department of Defense (DoD). The types of threats to which we must be ready to respond are changing; the size of the forces available to respond to these threats has decreased significantly and is likely to decrease further yet; and the resources available for recruiting, training, and equipping our forces have also declined dramatically. The debate continues as to how much we can afford to spend on defense in the post-cold war era and how much we can afford to cut. Efforts to keep missions, forces, and resources in some kind of balance are now focused on an emerging concept of readiness.

The DoD cost/performance trade-off model can play a central role in balancing readiness and resources. Those who sponsored the development of this model could not possibly have anticipated the importance of their efforts, but now that the model is nearing completion, the need for this type of linkage is all too obvious. The model actually contains two separate linkages. Recruiting resources are linked to the levels of recruit quality, defined in terms of aptitude scores and educational attainment, obtained through application of the recruiting resources. In this first linkage, the model also suggests optimal mixtures of expenditures for recruiters, advertising, and incentives that will yield a given recruit quality mix with the smallest possible total cost. The second linkage is between recruit quality and performance in specific occupational specialties.

The full cost/performance trade-off model takes specifications for required performance levels for each different job or family of jobs, determines a recruit quality mix that will yield the desired performance levels, and predicts the recruiting costs required to obtain this mix of recruit quality.

One more linkage is needed. The final step is to tie the emerging concept of readiness to levels of performance in different military specialties. This is, of course, a rather large step. Readiness is not yet a well-defined notion, but it is doubtlessly related to the number and effectiveness of different types of units, with unit effectiveness further related to individual performance levels. The focus of this paper is not, however, on how this last linkage might be achieved, but rather on how we might best set goals for performance levels today, while we are waiting for this final linkage to be created. The DoD model has allowed us to replace the question of "What level of recruit quality do we need?" with the question "What level of performance do we need?" The goal of this paper is to discuss issues and methods of trying to answer the latter question with information that is currently available to DoD personnel planners and policy makers.

The remainder of this paper is organized into four sections. The first section discusses issues related to the performance metric used in the DoD model. What is the meaning of the performance scale and what is a reasonable answer to the question "What level of performance do we need?" The second section describes a normative approach to setting performance level goals. The general idea is to look at predicted performance levels for new recruits at different times and see how these levels varied across time and by job. At the very least, this normative approach will provide plausible ranges for performance level goals. The third section describes criterion-referenced approaches to setting performance level goals. In such an approach, judgments about the acceptability of different levels of performance are analyzed, and then additional judgments about minimum rates of acceptable performance are also collected. The final section lays out suggestions for additional research to further strengthen the support for specific performance level goals.

THE PERFORMANCE METRIC: DESCRIPTION OF THE SCALE FOR HANDS-ON PERFORMANCE QUALIFICATION

Performance in the DoD model is defined as percent-GO. The percent-GO scale is derived from the hands-on performance tests developed in the Joint-Service Job Performance Measurement/Enlistment Standards (JPM) Project. A general description of their development is provided by Wigdor and Green (1991:Chapter 4). More detailed descriptions of the development of these measures are provided by the researchers from each Service who worked on their development.

Campbell et al. (1990) describe the Army measures; the Air Force procedures are documented in Lipscomb and Hedge (1988). The description by Carey and Mayberry (1992) of the development and scoring of tests for Marine Corps mechanics specialties is a particularly good source, since this was one of the last efforts undertaken and it built on lessons learned in earlier efforts.

As described by Green and Wigdor (1991), the hands-on performance test scale is an attempt to create a domain-referenced metric in which scores reflect the percentage of relevant tasks that a job incumbent can perform successfully. The tests were developed as criteria for evaluating the success of selection and classification decisions. An estimate of the proportion of the job that the selected recruit could perform (after training and some specified amount of on-the-job experience and when sufficiently motivated) was judged the most valid measure of success on the job. In general terms, hands-on test scores do provide at least relative information about success on the job that is reliable and valid. As such they are quite satisfactory for the purposes for which they are intended. There are several issues, however, that affect the level and linearity of the scores derived from them. Among others, these include the sampling of tasks, the scoring of tasks, and the way in which scores were combined across tasks.

Task Sampling

In the JPM Project, a limited number of tasks was selected for measuring performance in each job. If these tasks had been selected randomly from an exhaustive list of job tasks, generalization from scores on the hands-on tests to the entire domain of tasks would be simple and easy to defend. This was not, however, the case.

Task sampling procedures varied somewhat across the Services. In nearly all cases, there was some attempt to cluster similar tasks and then sample separately from each cluster. In the Army, for example, a universe of up to 700 tasks (or task fragments) was consolidated into a list of 150 to 200 tasks; these tasks were then grouped into 6 to 9 task clusters. One, two, or possibly three tasks were then sampled from each of these clusters. This stratified sampling approach actually leads to a more representative sample of tasks than simple random sampling does. Technically, however, it also meant that tasks in different clusters were sampled with different probabilities. Statistical purists might require differential weighting of task results, inversely proportional to sampling probabilities, in order to create precise estimates of scores for the entire domain.
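
To make the weighting issue concrete, the short sketch below draws a stratified sample of tasks from hypothetical clusters and compares an unweighted mean score with a mean weighted inversely to each task's sampling probability. The cluster names, cluster sizes, and scores are invented for illustration and are not taken from the JPM data.

```python
import random

# Hypothetical task clusters (names and sizes are illustrative only).
clusters = {
    "communications": ["comm_task_%d" % i for i in range(30)],
    "vehicle_maintenance": ["veh_task_%d" % i for i in range(60)],
    "basic_soldiering": ["basic_task_%d" % i for i in range(90)],
}
TASKS_PER_CLUSTER = 2  # one to three tasks per cluster were drawn in the JPM work

random.seed(0)
sample = []
for cluster_name, tasks in clusters.items():
    drawn = random.sample(tasks, TASKS_PER_CLUSTER)
    # Inclusion probability differs by cluster because the clusters differ in size.
    p_inclusion = TASKS_PER_CLUSTER / len(tasks)
    for task in drawn:
        sample.append({"cluster": cluster_name, "task": task, "weight": 1.0 / p_inclusion})

# Suppose each sampled task later yields a percent-of-steps-passed score (invented here).
scores = {entry["task"]: random.uniform(50, 90) for entry in sample}

# The unweighted mean treats every sampled task alike; the weighted mean estimates the
# mean over the full task domain by weighting inversely to the sampling probability.
unweighted = sum(scores[e["task"]] for e in sample) / len(sample)
weighted = (sum(e["weight"] * scores[e["task"]] for e in sample)
            / sum(e["weight"] for e in sample))
print(f"unweighted mean: {unweighted:.1f}   domain-weighted mean: {weighted:.1f}")
```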

A second and more serious concern with the task sampling procedures is that many types of tasks were either excluded altogether or were selected with very low frequency. There was an attempt to collect judgments about the importance of each task as well as the frequency with which it was performed. Few, if any, low-importance or infrequent tasks were selected. In addition, some tasks were ruled out because it would be difficult or dangerous to collect work samples. Tasks for which poor performance by unsupervised recruits could result in damage to individuals or equipment were generally excluded. Tasks that were too easy (did not discriminate among incumbents) and, in a very few cases, too difficult were often excluded from consideration.

A consequence of these exclusions is that performance on the sampled tasks generalized more precisely to performance on all job tasks that were judged moderately to highly important, were frequently performed, were not too dangerous, and were challenging enough to be at least a little difficult. This generalization is not necessarily bad, but the relevant domain should be kept in mind when it comes to setting performance standards. Higher performance levels would almost surely be expected for important and frequent tasks than for less important and less frequent tasks, but lower performance levels might be required for less dangerous tasks in comparison to more dangerous tasks; lower performance levels would also be expected for more difficult tasks in comparison to trivially easy tasks. In theory, these differences might offset each other, but to an unknown extent, so that performance on the sampled tasks might not be much different from performance across the entire job domain as called for by the Committee on the Performance of Military Personnel.

Task Scoring

In their "idealized" description of a competency interpretation of the hands-on performance test scores, Green and Wigdor (1991:57) talk about the percentage of tasks in the job that an individual can do. Task performance is not, of course, dichotomous in most cases. For the most part, tasks were divided into a number (from 3 or 4 to as many as 20 or 30) of discrete steps, and criteria were established for successful performance of each of these steps. Naturally there were exceptions: the "type straight copy" task for Army clerks was scored in terms of words-per-minute adjusted for errors, and one of the gunnery tasks for Marine Corps infantrymen was scored in terms of number of hits on target. For the most part, however, dichotomous scores were awarded for each of a discrete number of observable steps. In many or most cases, the criterion for successfully performing a step was clear and unambiguous. A mechanic changing a tire either did or did not tighten the lug nuts before replacing the cover, for example. In other cases, the criterion was somewhat arbitrary, as in "the grenade landed within some fixed (but mostly arbitrary) distance of the target" or "the rifle was disassembled within an arbitrarily fixed amount of time." (These standards may have had some strong rationale, but they were not always obvious to the test developers.)

The discrete performance steps varied in terms of their criticality. If a job incumbent had to perform every step successfully in order to be considered successful on the task as a whole, then task success rates would be very low. The scoring generally focused on the process followed more than the overall output. In many cases, it was possible to achieve a satisfactory output even if some of the less critical steps were skipped or poorly performed. Weighting the individual steps according to their importance would have required an enormous amount of judgment by subject matter experts and would, in most cases, have led to less reliable overall scores. For purposes of differentiating high and low performers, the percentage of steps performed correctly, without regard to the importance of each step, proved quite satisfactory. When it comes to interpreting the resulting scores, however, it is in most cases impossible to say how many tasks an individual performed correctly because task standards were not generally established. Thus, the real interpretation of the hands-on test scores should be the percentage of task steps that an individual can perform correctly, not the percentage of tasks.

Combining Scores from Different Tasks

For the most part, the scores for each task were put onto a common metric—the percentage rather than number of steps performed successfully—and then averaged to create an overall score. Since the individual task scores were not dichotomous, there was some room for compensation, with very high performance on one task (e.g., all steps completed successfully on a difficult task) compensating for somewhat lower performance on another task (e.g., several steps missed on a relatively easier task). As noted above, the tasks were not a simple random sample from a larger domain, and some form of task weighting—either by importance and frequency or by sampling probabilities—would have been possible. The fact that weights were not used should not create problems in interpretation so long as there were not highly significant interactions between task difficulty and importance or frequency. Some bias in the overall scale would also have resulted from the conversion from number to percentage of steps if there were a strong interaction between the number and difficulty of the steps within each task.
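
The scoring and aggregation rules just described can be illustrated with a short sketch. The task names and step results below are fabricated; the point is only the mechanics: dichotomous step scores become a per-task percentage, and the task percentages are averaged onto the common metric that underlies the percent-GO interpretation.

```python
# Each task is scored as a list of dichotomous step results
# (1 = step performed correctly, 0 = not). All values are invented.
hands_on_results = {
    "change_tire":     [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],          # 10 observable steps
    "assemble_rifle":  [1, 1, 1, 1, 0, 1],                       # 6 steps
    "radio_procedure": [1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],     # 12 steps
}

def task_percent(steps):
    """Percentage of steps performed correctly, ignoring step importance."""
    return 100.0 * sum(steps) / len(steps)

# Put every task on the common metric (percentage of steps), then average across
# tasks to obtain the overall hands-on score.
task_scores = {name: task_percent(steps) for name, steps in hands_on_results.items()}
overall_score = sum(task_scores.values()) / len(task_scores)

for name, score in task_scores.items():
    print(f"{name}: {score:.1f}% of steps performed correctly")
print(f"overall hands-on score: {overall_score:.1f}")
```

Note that the overall score is a percentage of steps, not of tasks, which is exactly the interpretive point made above.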

Several of the Services also examined different ways of grouping tasks or task steps into clusters in order to create meaningful subscores. The Army analyzed scores from six general task clusters: communications, vehicle operation and maintenance, basic soldiering, identifying targets or threats, technical or job-specific, and safety. Groupings of individual task steps into four knowledge and two skill categories were also analyzed. The Marine Corps created a matrix that mapped task steps onto different "behavioral elements." Although interesting, these subscores did not lead to significant findings and have little bearing on the issue of setting overall performance standards.

In summary, the hands-on test scores derived from the JPM work do lend themselves to an interpretation of job competency. They are scaled in terms of the percentage of steps for important tasks that an individual will perform successfully. It is not unreasonable to interpret these scores in a general sense as the percentage of the central or important parts of the job that the individual can perform successfully. Since a great deal of aggregation is involved in setting performance requirements for the DoD model, a general-sense interpretation is probably quite sufficient. Many of the issues raised above could be of significant concern if scores on individuals were being interpreted. Given the imprecision of the prediction of the performance scores from enlistment tests and high school credentials and the highly aggregated nature of the predictions, it seems reasonable to proceed with a general "percentage of job" interpretation.

A NORMATIVE APPROACH

One approach to setting performance level requirements is to ask what levels we have experienced in the past. This is essentially a normative approach in which requirements for future years are tied to norms developed from prior years. If an important objective of the DoD model is to determine whether current quality levels are sufficient or perhaps excessive, then this normative approach is entirely circular, since performance level requirements will be tied back to current quality levels. Furthermore, we would be better off simply using aptitude scores to define quality requirements, since very little new information would be generated in linking performance requirements back to current or past quality levels.

At a more detailed level, however, several interesting questions can be addressed through analyses of normative data. First is the question of the degree of variability in predicted performance levels across jobs. It may well be, for example, that observed differences in recruit quality are evened out if high-quality recruits are more likely to be assigned to difficult jobs. A high-quality recruit assigned to a difficult job may end up being able to successfully perform the same percentage of job tasks as a lower-quality recruit in an easier job. If this were the case, then performance level requirements might generalize to new jobs more easily than quality requirements would. Another question is how much predicted performance levels have varied over time, overall and by job. If performance levels have varied considerably, then using past performance levels to set future requirements would be questionable.

If, however, performance levels (and performance level differences among jobs) are relatively stable across time, using past performance levels as a benchmark would be more defensible, although, with dramatic changes in force levels, job requirements may not be as stable in the future as they have been in the past.

Samples

To examine these questions, the fiscal 1982 and 1989 accession cohorts were selected for analysis. The 1982 cohort was the earliest cohort for which the current form of the Armed Services Vocational Aptitude Battery (ASVAB, beginning with forms 8/9/10) was used exclusively in selection. This is important because the performance prediction equations in the linkage model are based on the subtests in the current ASVAB. Earlier ASVAB forms had different subtests, and a number of assumptions would be required in generating AFQT and technical composite scores from these prior forms for use in the prediction equation. The 1989 cohort was the most recent for which data on job incumbents with at least two years of service are available. In addition, recruits from this cohort participated extensively in Operation Desert Storm, and so some global assessment of their readiness is possible.

For each cohort, the active-duty roster as of 21 months after the end of the enlistment year was examined to identify incumbents in the 24 JPM specialties. The primary military occupational specialty (MOS) at time of enlistment was considered as the basis for sorting recruits into jobs, but it was discovered that many recruits are not enlisted directly into several of the JPM specialties. Consequently, it was decided to select for the JPM specialties on the basis of MOS codes at about 24 months of service. This decision meant that recruits who left service prior to 24 months were not included, and we were thus not modeling the exact enlistment policies. However, examining score distributions among job incumbents considered successful had many advantages and was deemed entirely appropriate.

The JPM samples included 24 different specialties. One of these specialties, Air Force avionics communications specialist, was deleted from the current study. The specialty code was changed prior to the 1989 accession year, and it was not possible to determine whether there was an appropriately comparable specialty.

Variables

ASVAB scores of record were obtained. A small number of cases in the 1982 cohort had enlisted using ASVAB forms 5, 6, or 7. These cases were deleted from the analyses since the ASVAB tests included in the AFQT and technical composites were not all available in these forms.

Educational credential status was also obtained and coded as either high school graduate or nongraduate. In these analyses, recruits with alternative high school diplomas were counted among the nongraduates.

Job performance prediction equations for the JPM specialties that were developed in the Linkage Project (McCloy et al., 1992) were used. These equations use AFQT and technical composite scores (expressed as sums of standardized subtest scores), educational level, time in service, and the interaction (product) of time in service and the technical composite in a prediction equation. The weights for each predictor are determined from job analysis information from the Dictionary of Occupational Titles. A constant value of 24 months was used for time in service so that predictions would reflect "average" first tour performance for a 3- to 4-year enlistment. Since predicted performance is a linear function of time in service, the average across the first tour will be equal to the value predicted for the midpoint. Ignoring the first six months as mostly training time, the midpoint would occur at month 21 for a 3-year tour and at month 27 for a 4-year tour.

Analyses

The primary results were summarized in an analysis of variance with MOS (23 levels) and accession year (2 levels) treated as independent factors and predicted performance and the AFQT and technical composites each analyzed as dependent variables. One hypothesis tested with these analyses was that predicted performance might show smaller differences among jobs in comparison with the AFQT or technical composites. Separate intercepts (determined from job characteristics) were estimated for each job. It was plausible to believe that required performance score levels might be reasonably constant across jobs, even if input quality was not. A second hypothesis tested was that predicted performance would show relatively smaller differences across recruiting years in comparison to the AFQT and technical composites, since predicted performance combines both AFQT and technical composite scores and the latter might be less affected by differences in recruiting conditions than the AFQT.
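
A sketch of the general form of the prediction equation described above follows. The coefficient values and the example recruit are hypothetical placeholders, not the Linkage model weights; in the model itself the job-specific weights are derived from Dictionary of Occupational Titles job-analysis information rather than set by hand.

```python
def predicted_performance(afqt, tech, hs_grad, months_in_service=24.0,
                          intercept=60.0, b_afqt=0.8, b_tech=0.6,
                          b_educ=1.5, b_time=0.10, b_time_x_tech=0.02):
    """Hypothetical job-specific equation with the structure used in the Linkage model:
    intercept + AFQT composite + technical composite + education + time in service
    + (time in service x technical composite) interaction.

    afqt, tech        -- sums of standardized ASVAB subtest scores (composites)
    hs_grad           -- 1 for high school graduate, 0 for nongraduate
    months_in_service -- fixed at 24 so predictions reflect average first-tour performance
    All coefficient defaults are placeholders for illustration only.
    """
    return (intercept
            + b_afqt * afqt
            + b_tech * tech
            + b_educ * hs_grad
            + b_time * months_in_service
            + b_time_x_tech * months_in_service * tech)

# Example: a hypothetical graduate with composite scores of 2.0 and 1.5.
print(round(predicted_performance(afqt=2.0, tech=1.5, hs_grad=1), 2))
```

Fixing time in service at 24 months follows the midpoint reasoning above: ignoring the first six months as training, the midpoint of the post-training portion of the tour falls at (6 + 36)/2 = 21 months for a 3-year enlistment and (6 + 48)/2 = 27 months for a 4-year enlistment, so 24 is a reasonable single value for a 3- to 4-year enlistment.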

Findings

Across all jobs and both recruiting years, the mean test level was 68.4. Table 1 shows the sample sizes and the mean and standard deviation of predicted performance scores for each job and entry cohort. Mean predicted performance scores by job and year are also plotted in the table. Table 2 shows the means for the AFQT and technical composites and predicted performance by year. These means are adjusted for differences in the MOS distributions for the two years. Table 3 shows F-ratios testing the significance of differences across years, MOS, and the year-by-MOS interaction for these same three variables.

TABLE 1  Mean Predicted Performance by Job and Entry Cohort

                       1982                    1989
Service  MOS        N     Mean   S.D.       N     Mean   S.D.    Mean 89-82
Army     11B    10657    62.49   3.42   10341    62.71   3.15        0.22
         13B     5178    60.33   3.74    4065    59.69   3.55       -0.64
         19E     2585    64.98   3.27    2892    64.83   3.34       -0.15
         31C     1497    71.35   2.94    1615    73.04   1.88        1.69
         63B     1915    74.17   3.00    3123    73.84   2.91       -0.33
         64C     3079    59.76   2.85    3902    59.45   3.06       -0.31
         71L     3455    63.98   3.45    1490    64.72   2.26        0.74
         91A     2863    66.10   2.71    4013    65.69   2.83       -0.41
         95B     3397    69.56   2.12    3725    68.85   2.23       -0.71
USAF     122      205    67.14   2.95     235    67.97   2.15        0.83
         272      587    69.11   2.27     623    69.77   1.83        0.66
         324      355    76.17   1.89     152    75.52   2.09       -0.65
         328      432    78.04   1.73       0       --     --          --
         423     1111    68.32   2.48     711    69.52   2.03        1.20
         426      867    78.60   2.53     744    78.87   2.12        0.27
         492      210    71.73   3.00     146    71.86   1.94        0.13
         732      816    62.08   2.87     765    63.05   2.07        0.97
Navy     ET      1990    74.85   2.04    1547    74.43   2.17       -0.42
         MM      3133    75.78   3.08    3805    75.30   3.07       -0.48
         RM      2202    70.09   3.07    2075    69.54   2.54       -0.55
USMC     031     4392    61.39   3.21    3871    61.23   3.04       -0.16
         033      947    61.98   3.01    1026    61.07   2.74       -0.91
         034      960    78.56   4.87    1084    77.29   4.68       -1.27
         035     1001    65.15   2.92    1421    65.71   2.58        0.56
Average         53834    66.13          53371    66.26               0.12

The first significant finding from these analyses was that there was virtually no change in mean predicted performance between the 1982 and the 1989 cohorts, overall or for any of the jobs analyzed. A statistically significant mean gain in AFQT was offset by a significant mean drop in technical scores between the 1982 and 1989 cohorts, resulting in no significant difference in predicted performance. Second, there was some consistent variation among jobs in predicted performance levels, with lows of around 60 for Army field artillery (13B) and truck driver (64C), Air Force 732, and two of the Marine Corps infantry jobs, and highs above 75 for the Air Force 328 and 426 jobs and Marine Corps 034. This variation is consistent with the assumption that higher competency levels might be required in more critical or complex jobs. The variation across jobs in predicted performance was much more significant (much greater F-ratio) than the varia-

The least-squares equations are optimal for the samples on which they were derived; the job-specific linkage equations are not. The largest differences between R²OLS and R²cv primarily occur in the jobs having the smallest sample sizes (e.g., EM, GSM, 328X0). The absolute magnitude of the differences is not particularly large, however, ranging from .006 for 11B to .083 for 328X0. The question remaining is what to make of this difference in R² values.

Shrinkage Formulae

The value of R² obtained for the least-squares job-specific regression equation (R²OLS in Table 8) can be viewed as an upper bound because calculating least-squares regression weights capitalizes on chance fluctuations specific to the sample in which the equation is developed. Applying the weights from this equation to another sample would result in a decrease in R², because the weights are suboptimal for the second sample. Thus, the R² yielded by the regression weights "shrinks" relative to the original R². The amount of shrinkage to be expected may be estimated using a shrinkage formula. Perhaps the best known of these is a formula developed by Wherry (1931):

$$\hat{\rho}^2 = 1 - \frac{N - 1}{N - k - 1}\,(1 - R^2_{yx}),$$

where N is the size of the sample used to estimate the equation, k is the number of predictors, and R²yx is the sample coefficient of determination (R²OLS from Table 8). Wherry's formula gives the value for R² expected if the equation were estimated in the population rather than a sample. Because the population will virtually never be at the researcher's disposal, Wherry's formula is of little practical value. As noted by Darlington (1968) and Rozeboom (1978), the Wherry formula does not answer the more relevant question of what the R² would be if the sample equation were applied to the population. Both Cattin (1980) and Campbell (1990) reported that no totally unbiased estimate for this value exists, although the amount of bias inherent to current estimates is generally small. They recommended a formula developed by Browne (1975), on the basis of its desirable statistical properties. Browne's formula, appropriate when the predictor variables are random (as opposed to fixed), is

$$\hat{\rho}_c^2 = \frac{(N - k - 3)\,\hat{\rho}^4 + \hat{\rho}^2}{(N - 2k - 2)\,\hat{\rho}^2 + k},$$

where ρ̂² is the adjusted R² from the Wherry formula; N and k are defined as above.

In truth, Browne's formula contains two terms, this equation being the first (and by far the larger). Browne reported the bias introduced by neglecting the second term of his R² adjustment to be no greater than .02. (He also provided an equation for fixed predictor variables.) A second formula for estimating the validity of the sample equation in the population was provided by Rozeboom (1978):

$$\hat{\rho}_c^2 = 1 - \frac{N + k}{N - k}\,(1 - R^2_{yx}),$$

with N, k, and R²yx defined as above. The shrinkage formulae just described allow one to estimate the population multiple correlation for the full sample equation. If the average sample cross-validity coefficient is of interest, Lord (1950) and Nicholson (1960) independently developed a shrinkage formula for estimating this value:

$$\hat{\rho}_{cv}^2 = 1 - \frac{N + k + 1}{N - k - 1}\,(1 - R^2_{yx}),$$

with N, k, and R²yx defined as above.

Comparison of Adjusted and Cross-Validity R² Values

Because the job-specific least-squares equations are optimal for the samples on which they were developed but the job-specific linkage equations are not, the comparison of R²OLS to R²cv is not exactly fair. A more equitable comparison obtains through adjustment of the R²OLS values for shrinkage. Thus, the four shrinkage formulae were applied to the R² values from the least-squares job-specific regression equations (i.e., R²OLS). These adjusted R² values (R²adj) were then compared to the cross-validity R² values obtained from the job-specific equations generated by the 23-job and primary (24-job) linkage equations in the holdout and new-job analyses, respectively (i.e., R²cv). The results appear in Table 8.

In general, the decrease in R² associated with using the job-specific linkage equation as compared to the least-squares equation is virtually identical to that expected based on the Browne, Rozeboom, and Lord-Nicholson formulae (i.e., R²cv ≈ R²adj), the unweighted and weighted (by sample size) average differences (R²cv - R²adj) being -.007, -.003, and .002 and -.014, -.011, and -.008, respectively. In contrast, R²adj as given by the Wherry formula is typically larger than R²cv (unweighted and weighted differences of -.019 and -.021, respectively), but this comparison is not particularly appropriate because no population equation exists.
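
The four adjustments as written above are straightforward to compute. The sketch below implements them and applies them to an invented example (the values are not taken from Table 8) to show how an observed R²OLS would be adjusted before being compared with R²cv.

```python
def wherry(r2, n, k):
    """Wherry (1931): estimate of the population squared multiple correlation."""
    return 1.0 - (n - 1.0) / (n - k - 1.0) * (1.0 - r2)

def browne(r2, n, k):
    """Browne (1975), random predictors: population cross-validity of the sample
    equation (leading term only), computed from the Wherry-adjusted value."""
    rho2 = wherry(r2, n, k)
    return ((n - k - 3.0) * rho2**2 + rho2) / ((n - 2.0 * k - 2.0) * rho2 + k)

def rozeboom(r2, n, k):
    """Rozeboom (1978): alternative estimate of the cross-validity of the sample
    equation in the population."""
    return 1.0 - (n + k) / (n - k) * (1.0 - r2)

def lord_nicholson(r2, n, k):
    """Lord (1950) / Nicholson (1960): expected (average) sample cross-validity."""
    return 1.0 - (n + k + 1.0) / (n - k - 1.0) * (1.0 - r2)

# Invented example: R2_OLS = .18 from a job-specific equation with 5 predictors
# estimated on 400 incumbents (illustrative values, not figures from Table 8).
r2_ols, n, k = 0.18, 400, 5
for name, fn in [("Wherry", wherry), ("Browne", browne),
                 ("Rozeboom", rozeboom), ("Lord-Nicholson", lord_nicholson)]:
    print(f"{name:>15}: adjusted R2 = {fn(r2_ols, n, k):.3f}")
```

With these invented inputs the Wherry value is the largest and the Lord-Nicholson value the smallest, which is the ordering the comparison above would lead one to expect.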

Of the four shrinkage formulae presented in Table 8, the Lord-Nicholson adjustment probably provides the best referent for the holdout analyses (i.e., the 23-job linkage equations), because job-specific linkage equations were generated from a primary linkage equation estimated on a partial sample. The job-specific linkage equations were then applied to a second "sample" (i.e., the holdout job). Thus, the regression parameters for the 23-job equations are not full-sample weights and therefore not the best estimates available. This, in turn, means the parameters for the job-specific linkage equations generated from the 23-job equations are not the best estimates available. Nevertheless, the use of equations containing partial-sample weights rather than full-sample weights suggests that the Lord-Nicholson shrinkage formula provides an appropriate comparison.

For the 7 new jobs that were not part of the original 24-job estimation sample, however, the sample-based linkage equations were generated using full-sample weights (i.e., the 24-job primary linkage equation) and used to estimate performance scores for all individuals in a new job (as will be the case upon implementation by manpower planners). Here, one could argue that the Browne formula is the correct referent (i.e., a sample equation based on full-sample weights, applied to a new sample from the population). One might also consider the 24-job equation to be a partial-sample equation, however, given that the data from the new Navy ratings were not incorporated into the sample to yield a 31-job equation. If so, for reasons given above, Lord-Nicholson remains a viable referent.

The conclusion is the same no matter which comparison one chooses: the preponderance of small differences between R²cv and R²adj values demonstrates that the linkage methodology provides a means of obtaining predictions of job performance for jobs without criterion data that are nearly as valid (and sometimes more valid) as predictions obtained when (1) criterion data are available for the job, (2) a job-specific least-squares prediction equation is developed, and (3) the equation is applied in subsequent samples.

Comparison of Validity Coefficients to the Literature

Another means of assessing the predictive power of the job-specific linkage equations is to compare their validity coefficients with those reported in the literature for similar predictor/criterion combinations. McCloy (1990) demonstrated that the determinants of relevant variance in performance criteria differ across criterion measurement methods (i.e., written job knowledge tests, hands-on performance tests, and personnel file data and ratings of typical performance), leading to different correlations between a predictor or predictor battery and criteria assessing the same content but measured with different methods. Hence, the most relevant comparisons for the R² values given in Table 8 are validity studies involving cognitive ability as a predictor and hands-on measures as performance criteria.

Unfortunately, relatively few studies employ hands-on performance tests as criteria. The vast majority of validity research has used supervisory ratings or measures of training success (e.g., written tests or course grades) as criteria. The preference for these measures is probably due to the ease and lower cost of constructing them, relative to hands-on tests. Nevertheless, there are a few studies that may serve as a standard of comparison.

In a meta-analysis of all criterion-related validity studies published in the Journal of Applied Psychology and Personnel Psychology from 1964 to 1982, Schmitt et al. (1984) reported the mean correlation between various predictors and hands-on job performance measures to be r = .40, based on 24 correlations. They also provided mean validities for specific types of predictors when predicting performance on hands-on tests. General mental ability measures yielded a mean validity of r = .43 (based on three correlations). Note that meta-analysis corrects the distribution of validity coefficients for range restriction and criterion unreliability. Hunter (1984, 1985, 1986) reported the correlation between measures of general cognitive ability and hands-on job performance measures to be r = .75 in civilian studies and r = .53 in the military. These correlations were adjusted for range restriction. A study of military job performance by Vineberg and Joyner (1982) reported an average validity of various predictors for task performance of r = .31, based on 18 correlations. In a later study, Maier and Hiatt (1984) reported validities of the ASVAB when predicting hands-on performance tests to range from r = .56 to .59. Finally, Scribner et al. (1986) obtained a multiple correlation of r = .45 when predicting range performance for tankers in the U.S. Army using general cognitive ability (AFQT), experience, and demographic variables.

The R² values for the job-specific least-squares and linkage equations given in Table 8 have not been corrected for range restriction or criterion unreliability. For the job-specific least-squares regression equations, values of the multiple correlation range from r = .26 (Army MOS 13B) to r = .71 (Marine Corps MOS 6113), with unweighted and weighted (by sample size) means of r = .43 and r = .40, respectively. For the job-specific linkage equations, values range from r = .18 (Air Force specialty 272X0) to r = .68 (Marine Corps MOS 6112), with unweighted and weighted means of r = .38 and r = .36, respectively. Clearly, the predictive validity of the job-specific linkage equations lies well within the range of validities that have appeared in the literature.

Summary

Taken together, the results from the cross-validity analyses suggest that the linkage methodology has yielded a performance equation that provides predictions for out-of-sample jobs that are not much below the best one could expect. Predictions are generally better for high-density jobs than for low-density jobs. Nevertheless, the cross-validity analyses have strongly suggested that there is relatively little loss in predictive accuracy when predictions are made for jobs devoid of criterion information. Tempering this conclusion, however, is the finding that the absolute level of prediction typically ranges from about R² = .10 to R² = .20, even when using optimal (i.e., job-specific OLS) prediction equations. Clearly, there remains room for improvement in the prediction of hands-on performance. Nevertheless, from a slightly different perspective, the utility of prediction of job performance for out-of-sample jobs is increased by R² percent over what it would be without the primary linkage equation. These results are positive and supportive of the multilevel regression approach to predicting performance for jobs without criterion data.

Discussion

One characteristic shared by validity generalization, synthetic validation, and multilevel regression is that they act as "data multipliers"—they take the results of a set of data and expand their application to other settings when the collection of complete data is too expensive or impossible. Validity generalization does not yield information that is directly applicable to the development of prediction equations for jobs without criteria. Rather, the results suggest (1) whether measures of a particular construct would be valid across situations and (2) whether there is reliable situational variance in the correlations. Synthetic validity does provide information directly applicable to the task of performance prediction without performance criteria. In fact, as mentioned earlier, no performance criteria of any kind are required. Judgments and good job analytic data alone are sufficient for the production of prediction equations. This would appear to be highly advantageous to small organizations that might otherwise be unable to afford a large-scale performance measurement/validation effort. Furthermore, the largest synthetic validity study ever undertaken, the Army's SYNVAL project, demonstrated these equations to be nearly as predictive as optimal least-squares equations that had been adjusted for shrinkage.

Although not developed for this purpose, multilevel regression analysis has been shown to provide a means of generating equations that occasionally exceed appropriately adjusted validity values from least-squares equations. The results compare favorably with the results from the SYNVAL project, although, unlike the SYNVAL data, the data supplied to the multilevel regression analyses had not been corrected for range restriction.
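
To make the "data multiplier" idea concrete, the following simplified two-stage sketch mimics the logic of the multilevel approach: job-specific regression slopes estimated where criterion data exist are themselves regressed on a job-characteristic variable, and that second-level relationship is then used to generate an equation for a job with no criterion data. All job names and values are illustrative assumptions, and the stage-wise shortcut is only an approximation; the Linkage project estimated its primary equation with full multilevel (random-coefficient) methods rather than this two-step procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stage 0: fabricate criterion data for a handful of jobs. Each job has one
# job-characteristic variable M (e.g., complexity) that moderates the aptitude slope.
jobs = {"job_A": 0.2, "job_B": 0.5, "job_C": 0.8, "job_D": 1.1}   # invented M values
level1 = {}   # job -> (intercept, slope) from within-job least squares
for job, m in jobs.items():
    aptitude = rng.normal(size=300)
    performance = 60 + (3 + 4 * m) * aptitude + rng.normal(scale=5, size=300)
    slope, intercept = np.polyfit(aptitude, performance, 1)
    level1[job] = (intercept, slope)

# Stage 1 ("primary" equation): regress the job-specific slopes on the job
# characteristic M across the jobs that do have criterion data.
m_vals = np.array([jobs[j] for j in level1])
slopes = np.array([level1[j][1] for j in level1])
gamma1, gamma0 = np.polyfit(m_vals, slopes, 1)   # slope_j modeled as gamma0 + gamma1 * M_j
mean_intercept = float(np.mean([level1[j][0] for j in level1]))

# Stage 2: generate a job-specific equation for a new job that has no criterion
# data, using only its job-analysis value M.
m_new = 0.65
slope_new = gamma0 + gamma1 * m_new
print(f"new-job equation: performance = {mean_intercept:.1f} + {slope_new:.2f} * aptitude")
```

In the Linkage equations the job-analysis information enters through multiple Mj variables and the intercepts also differ by job; the sketch collapses this to a single characteristic and a single varying slope only to keep the two levels visible.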

It is possible that the results could be more positive if more appropriate job analytic information were used. Recall that the job characteristic data used in the Linkage project were originally collected on civilian jobs and transferred to the most similar military occupations. A job analysis instrument specifically applied to military jobs might result in better Mj variables and therefore better estimates of the job-specific regression parameters. The Navy has finished a job clustering project that used a job analysis questionnaire developed for Navy jobs, the Job Activities Inventory (JAI; Reynolds et al., 1992), which could easily be modified for application to all military jobs.

One potential drawback of applying multilevel regression techniques is that a number of jobs must have criterion data for estimating the primary linkage equation. The 24 jobs used in the Linkage project did supply enough stability to obtain statistically reliable results based on across-job variation, but including more jobs in the estimation sample would certainly have resulted in better estimates. Increasing the estimation sample should not be unreasonably difficult for larger organizations with some form of performance assessment program in place. For one, the performance criterion does not need to be a hands-on performance test. Written tests of job knowledge or supervisory ratings could serve as criteria just as easily. Performance prediction equations could be developed for new jobs or jobs not having the performance criterion in question. For example, assessment center research might be helped by this method of estimating predicted performance scores. Sending promising young managers to assessment centers is very costly. A primary equation could be developed based on the individuals who were sent to the assessment centers. Estimated assessment center scores could then be obtained from job-specific regression equations developed from the primary equation. There are a couple of potential drawbacks to this application, including the ability to differentiate between various managerial positions and the effects of range restriction.

The application of multilevel regression techniques also might provide benefits to organizations that are members of larger consortia. The organizational consortium could pool its resources and develop a primary performance prediction equation on a subset of jobs having criterion data across organizations within the consortium. Job-specific equations could then be developed for the remaining jobs.

The research from the Synthetic Validation and Linkage projects has advanced our knowledge of the degree to which performance equations may be created for jobs without criteria. The methodology provided by multilevel regression analysis closely resembles synthetic validation strategies. Both rely heavily on sound job analytic data. After SYNVAL, Mossholder and Arvey's (1984) observation that little work had been done in the area of synthetic validity is no longer true.

Further, the Linkage project has demonstrated another successful procedure for generating performance prediction equations that operates without judgments about the validity of individual attributes for various job components. Both procedures should be examined closely in future research because they have the potential for turning an initial investment into substantial cost savings—they make a few data go a long, long way.

ACKNOWLEDGMENTS

The author wishes to thank Larry Hedges and Bengt Muthén for their invaluable help and patience in communicating the details of multilevel regression models and their application, the Committee on Military Enlistment Standards for their challenging comments and creative ideas, Linkage project director Dickie Harris for his support and good humor throughout this research, and the reviewers of the manuscript for their careful reading of a previous version of this chapter. Any errors that remain are the responsibility of the author.

REFERENCES

Browne, M.W. 1975 Predictive validity of a linear regression equation. British Journal of Mathematical and Statistical Psychology 28:79–87.

Campbell, J.P., ed. 1986 Improving the Selection, Classification, and Utilization of Army Enlisted Personnel: Annual Report, 1986 Fiscal Year (Report 813101). Alexandria, Va.: U.S. Army Research Institute.

Campbell, J.P. 1990 Modeling the performance prediction problem in industrial and organizational psychology. Pp. 687–732 in M.D. Dunnette and L.J. Hough, eds., Handbook of Industrial and Organizational Psychology, 2nd ed., Vol. 1. Palo Alto, Calif.: Consulting Psychologists Press.

Campbell, J.P., McCloy, R.A., Oppler, S.H., and Sager, C.E. 1992 A theory of performance. Pp. 35–70 in N. Schmitt and W.C. Borman, eds., Personnel Selection in Organizations. San Francisco, Calif.: Jossey-Bass.

Campbell, J.P., and Zook, L.M., eds. 1992 Building and Retaining the Career Force: New Procedures for Accessing and Assigning Army Enlisted Personnel (ARI Research Note). Alexandria, Va.: U.S. Army Research Institute.

Cattin, P. 1980 Estimation of the predictive power of a regression model. Journal of Applied Psychology 65:407–414.

Crafts, J.L., Szenas, P.L., Chia, W.J., and Pulakos, E.D. 1988 A Review of Models and Procedures for Synthetic Validation for Entry-Level Army Jobs (ARI Research Note 88–107). Alexandria, Va.: U.S. Army Research Institute.

Darlington, R.B. 1968 Multiple regression in psychological research and practice. Psychological Bulletin 69:161–182.

Green, W.H. 1990 Econometric Methods. New York: Macmillan.

Harris, D.A., McCloy, R.A., Dempsey, J.R., Roth, C., Sackett, P.R., Hedges, L.V., Smith, D.A., and Hogan, P.F. 1991 Determining the Relationship Between Recruit Characteristics and Job Performance: A Methodology and a Model (FR-PRD-90-17). Alexandria, Va.: Human Resources Research Organization.

Hedges, L.V. 1988 The meta-analysis of test validity studies: Some new approaches. Pp. 191–212 in H. Wainer and H.I. Braun, eds., Test Validity. Hillsdale, N.J.: Erlbaum.

Hollenbeck, J.P., and Whitener, E.M. 1988 Criterion-related validation for small sample contexts: An integrated approach to synthetic validity. Journal of Applied Psychology 73:536–544.

Hunter, J.E. 1984 The Prediction of Job Performance in the Civilian Sector Using the ASVAB. Rockville, Md.: Research Applications.

Hunter, J.E. 1985 Differential Validity Across Jobs in the Military. Rockville, Md.: Research Applications.

Hunter, J.E. 1986 Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior 29:340–362.

Hunter, J.E., and Hunter, R.F. 1984 Validity and utility of alternative predictors of job performance. Psychological Bulletin 98(1):72–98.

Knapp, D.J., and Campbell, J.P. 1993 Building a Joint-Service Classification Research Road Map: Criterion-Related Issues (FR-PRD-93-11). Alexandria, Va.: Human Resources Research Organization.

Lawshe, C.H. 1952 Employee selection. Personnel Psychology 5:31–34.

Laurence, J.H., and Ramsberger, P.F. 1991 Low Aptitude Men in the Military: Who Profits, Who Pays? New York: Praeger.

Longford, N.T. 1988 VARCL: Software for Variance Component Analysis of Data with Hierarchically Nested Random Effects (Maximum Likelihood). Princeton, N.J.: Educational Testing Service.

Lord, F.M. 1950 Efficiency of prediction when a regression equation from one sample is used in a new sample (Research Bulletin 50–40). Princeton, N.J.: Educational Testing Service.

Maier, M.H., and Hiatt, C.M. 1984 An Evaluation of Using Job Performance Tests to Validate ASVAB Qualification Standards (CNR 89). Alexandria, Va.: Center for Naval Analyses.

McCloy, R.A. 1990 A New Model of Job Performance: An Integration of Measurement, Prediction, and Theory. Unpublished doctoral dissertation, University of Minnesota.

McCloy, R.A., Harris, D.A., Barnes, J.D., Hogan, P.F., Smith, D.A., Clifton, D., and Sola, M. 1992 Accession Quality, Job Performance, and Cost: A Cost-Performance Tradeoff Model (FR-PRD-92-11). Alexandria, Va.: Human Resources Research Organization.

McCormick, E.J., Jeanneret, P.R., and Mecham, R.C. 1972 A study of job characteristics and job dimensions based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology 56:347–367.

Mossholder, K.W., and Arvey, R.D. 1984 Synthetic validity: A conceptual and comparative review. Journal of Applied Psychology 69:322–333.

Nicholson, G.E. 1960 Prediction in future samples. Pp. 424–427 in I. Olkin et al., eds., Contributions to Probability and Statistics. Stanford, Calif.: Stanford University Press.

Primoff, E.S. 1955 Test Selection by Job Analysis: The J-Coefficient, What It Is, How It Works (Test Technical Series, No. 20). Washington, D.C.: U.S. Civil Service Commission.

Reynolds, D.H. 1992 Developing prediction procedures and evaluating prediction accuracy without empirical data. In J.P. Campbell, ed., Building a Joint-Service Research Road Map: Methodological Issues in Selection and Classification (Draft Interim Report). Alexandria, Va.: Human Resources Research Organization.

Reynolds, D.H., Barnes, J.D., Harris, D.A., and Harris, J.H. 1992 Analysis and Clustering of Entry-Level Navy Ratings (FR-PRD-92-20). Alexandria, Va.: Human Resources Research Organization.

Rozeboom, W.W. 1978 The estimation of cross-validated multiple correlation: A clarification. Psychological Bulletin 85:1348–1351.

Sackett, P.R., Schmitt, N., Tenopyr, M.L., Kehoe, J., and Zedeck, S. 1985 Commentary on "Forty questions about validity generalization and meta-analysis." Personnel Psychology 38:697–798.

Schmidt, F.L., and Hunter, J.E. 1977 Development of a general solution to the problem of validity generalization. Journal of Applied Psychology 62:529–540.

Schmidt, F.L., Hunter, J.E., and Pearlman, K. 1981 Task differences as moderators of aptitude test validity in selection: A red herring. Journal of Applied Psychology 66:166–185.

Schmidt, F.L., Hunter, J.E., Pearlman, K., and Hirsh, H.R. 1985 Forty questions about validity generalization and meta-analysis. Personnel Psychology 38:697–798.

Schmidt, F.L., Hunter, J.E., Pearlman, K., and Shane, G.S. 1979 Further tests of the Schmidt-Hunter Bayesian validity generalization procedure. Personnel Psychology 32:257–281.

Schmitt, N., Gooding, R.Z., Noe, R.D., and Kirsch, M. 1984 Meta-analysis of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology 37:407–422.

Scribner, B.L., Smith, D.A., Baldwin, R.H., and Phillips, R.L. 1986 Are smart tankers better? AFQT and military productivity. Armed Forces and Society 12(2):193–206.

Steadman, E. 1981 Relationship of Enlistment Standards to Job Performance. Paper presented at the 1st Annual Conference on Personnel and Training Factors in Systems Effectiveness, San Diego, California.

U.S. Department of Defense 1991 Joint-Service Efforts to Link Military Enlistment Standards to Job Performance. Report to the House Committee on Appropriations. Washington, D.C.: Office of the Assistant Secretary of Defense (Force Management and Personnel).

U.S. Department of Labor 1977 Dictionary of Occupational Titles. Fourth Edition. Washington, D.C.: U.S. Department of Labor.

Vineberg, R., and Joyner, J.N. 1982 Prediction of Job Performance: Review of Military Studies. Alexandria, Va.: Human Resources Research Organization.

Waters, B.K., Barnes, J.D., Foley, P., Steinhaus, S.D., and Brown, D.C. 1988 Estimating the Reading Skills of Military Applicants: Development of an ASVAB to RGL Conversion Table (FR-PRD-88-22). Alexandria, Va.: Human Resources Research Organization.

Waters, B.K., Laurence, J.H., and Camara, W.J. 1987 Personnel Enlistment and Classification Procedures in the U.S. Military. Paper prepared for the Committee on the Performance of Military Personnel. Washington, D.C.: National Academy Press.

Wherry, R.J. 1931 A new formula for predicting the shrinkage of the coefficient of multiple correlation. Annals of Mathematical Statistics 2:446–457.

Wigdor, A.K., and Green, B.F., Jr., eds. 1991 Performance Assessment for the Workplace, Volume I. Committee on the Performance of Military Personnel, Commission on Behavioral and Social Sciences and Education, National Research Council. Washington, D.C.: National Academy Press.

Wing, H., Peterson, N.G., and Hoffman, R.G. 1985 Expert judgments of predictor-criterion validity relationships. Pp. 219–270 in N.K. Eaton, M.H. Goer, J.H. Harris, and L.M. Zook, eds., Improving the Selection, Classification, and Utilization of Army Enlisted Personnel: Annual Report, 1984 Fiscal Year (Report 660). Alexandria, Va.: U.S. Army Research Institute.

Wise, L.L., Campbell, J.P., and Arabian, J.M. 1988 The Army synthetic validation project. Pp. 76–85 in B.F. Green, Jr., H. Wing, and A.K. Wigdor, eds., Linking Military Enlistment Standards to Job Performance: Report of a Workshop. Committee on the Performance of Military Personnel. Washington, D.C.: National Academy Press.

Wise, L.L., Peterson, N.G., Hoffman, R.G., Campbell, J.P., and Arabian, J.M. 1991 Army Synthetic Validity Project: Report of Phase III Results, Volume I (Report 922). Alexandria, Va.: U.S. Army Research Institute.

Wright, G.J. 1984 Crosscoding Military and Civilian Occupational Classification Systems. Presented at the 26th Annual Conference of the Military Testing Association, Munich, Federal Republic of Germany.
