7

Evaluating the Quality of Performance Measures: Content Representativeness

Having established that scores on well-developed hands-on assessment instruments are not simply haphazard, we now begin to examine the question of determining whether the scores are a meaningful indicator of the job performance the instruments were intended to assess. As a first approach, we examine the concept of content representativeness, frequently called content validity, as it relates to job performance measures and present some criteria and tools for evaluating whether a given performance sample can adequately represent an entire job. Chapter 8 further explores the meaningfulness of hands-on test scores by examining their relationships with other variables of interest.

Job performance tests attempt to replicate the full job as faithfully as possible within the constraints of time, cost, and assessment technique. Because of time and cost especially, not all job tasks and the behaviors that accomplish them can be used for the measurement. Rather, a sample is chosen to represent the job and turned into a standardized assessment device. The scores produced by this assessment are used to characterize an incumbent's performance on the entire job. One of the greatest challenges posed by performance-based measurement is the matter of sampling job content. It is technically difficult to provide a scientifically supportable basis—either judgmental or empirical—for extrapolating from performance on a subset of job tasks to performance on the job as a whole. And there is the added complication (not unknown in other forms of testing) of the expectations of decision makers. From their perspective—and this is as
likely to be true of civilian-sector managers as the JPM Project showed it to be true of military officials—a performance measure must “look like” the job to be deemed credible.

From Concept to Practice

The content representativeness of a hands-on performance measure is the extent to which the content of the measure represents the tasks and required performance on the entire job (see Wigdor and Green, 1986:40-44; see also Guion, 1975). In other words, content representativeness refers to “how well a small sample of behavior observed in the measurement procedure represents the whole class of behavior that falls within the boundaries defining the content [job] domain” (Guion, 1977:3). This attribute is of particular importance in a test of job performance that purports to say something about competence.

Content representativeness can be logically argued on the grounds that (1) the hands-on performance measure was constructed by systematically sampling a set of tasks/behaviors from a universe of tasks defined by a job analysis and (2) the translation of those job tasks/behaviors into the test preserved the important features of the tasks themselves and the behaviors they require.

A practical example of the process of selecting representative tasks is provided by Lammlein (1987; see also Laabs and Baker, 1989) for the first-term Navy radioman hands-on performance measure (see Table 7-1). First, from among the universe of possible radioman tasks, the job domain for purposes of this research project was defined. Note that “Navy-wide” tasks were eliminated from the domain of interest, as were certain job-unique tasks, based “upon such factors as feasibility of testing, operational requirements, availability of equipment, testing time, etc.” (Lammlein, 1987:25).
Second, the domain of tasks was stratified by content area (e.g., preparing and processing messages) to help ensure that the tasks sampled for the hands-on performance measure covered the wide range of tasks on the job. Third, a job analysis survey and judgments of job experts were used to identify “critical” tasks: tasks important for mission success, tasks complicated to perform, and so on. And fourth, the most critical tasks, paying attention to content strata, were systematically selected. The resulting job sample was evaluated by subject matter experts; four tasks were dropped and four new tasks were included as more appropriate. This resulted in a sample of 15 tasks on the radioman hands-on performance measure.

TABLE 7-1 Selection of Tasks for the First-Term Navy Radioman Job

Step 1. Develop task list (i.e., define job in terms of universe of tasks) based on:
    Review of training and job documentation (pay grades E2-E4).
    Two earlier job analyses.
    Judgment of subject matter experts (experienced supervisors and trainers).
    Job observation by project staff to ensure completeness of list.

Step 2. Identify job content categories based on:
    Experts (N = 16) sorting 124 cards, each representing a task, into “piles of similar task content” (Lammlein, 1987:9) and labeling each of those piles.
    Factor analyzing a 124 by 124 matrix of similarities among tasks.
    Labeling the four interpretable factors: (I) preparing and processing messages (69 tasks), (II) setting up equipment (23 tasks), (III) maintaining equipment (21 tasks), and (IV) handling secure messages (11 tasks).

Step 3. Identify task criticality by:
    Using a job analysis survey to define criticality along dimensions of (1) importance for mission success, (2) percentage of time each task is performed, (3) how complicated each task is to perform correctly, and (4) how often the task is performed incorrectly.
    Defining respondents for the survey: (a) first-term radiomen and (b) supervisors.
    Stratified random sampling of 1,042 incumbents and experts to collect criticality data (53% usable return rate).

Step 4. Select 15 critical tasks based on:
    Mean criticality rating.
    Expert judgment as to whether the tasks fit within operational requirements, call for available equipment, fit within testing time, and reflect future operational needs.

Not surprisingly, practice deviates from theory. Trade-offs had to be made in what tasks could reasonably be used in the radioman hands-on performance measure, how those tasks were selected, and the fidelity with which the hands-on performance measure replicated the tasks as they are actually performed on the job. Such trade-offs, however, may compromise content representativeness.

In the JPM Project, each of the Services eliminated certain tasks from its definition of the domain of tasks defining the job. For example, all Services eliminated tasks involving live fire because of safety, cost, or some combination of the two; only the Marine Corps retained a live-fire task. Likewise, all Services eliminated tasks that job experts judged to be redundant, too time-consuming, or trivial. The Navy and the Air Force eliminated hands-on performance of certain tasks performed on jet aircraft engines because of the potential cost of an examinee's error; they substituted incumbents' oral explanations of how they would perform those tasks.

Some restrictions of the universe of possible tasks are less problematic than others. Proper wiring down of screws on an aircraft engine is critical to aircraft safety, but the skill need be demonstrated only once, not the 20 or 30 times a maintenance task might actually call for. Some omissions, however, although understandable or even necessary, are inherently threatening to the concept of representativeness—for example, a hands-on performance measure for a grenade launcher that does everything but test the accuracy with which the enlistee can actually fire a grenade. When the full task domain of interest is reduced to a convenient task domain, content representativeness may be seriously compromised. At the very least, such reductions must be clearly documented and explained to users of the performance data, with accompanying cautions regarding the interpretation of a hands-on performance score as representing on-the-job performance.

Sampling Issues

There are two schools of thought on drawing samples from a domain of tasks that defines the job: one advocates purposive sampling and the other random sampling. We examine each in turn.

Purposive Sampling

The purposive sampling school holds that samples should be chosen by job experts exercising informed judgment as to which tasks should be included in a hands-on performance measure that, because of cost and time constraints, can contain only a small number of tasks from a considerably larger domain (e.g., Guion, 1975, 1979). This approach was put into practice widely in the JPM Project, as it is in the private sector.
Purposive sampling is justified on a number of grounds (see, e.g., Wigdor and Green, 1986:49-51; see also Guion, 1979). One justification is that because hands-on performance measures contain a very limited sample of tasks (e.g., 15), each task must be carefully selected to reflect an important (or critical or difficult or frequent), nonredundant job task. To leave task selection to a haphazard procedure, or to a random sampling procedure, according to this argument, would be to risk creating a test that does not cover the essential job elements. When one considers that military jobs contain up to 800 tasks, that job tasks appear to be markedly heterogeneous, and that only 15 to 30 tasks could be tested in the 6 to 8 hours allotted, the intuitive appeal of the argument is clear.

A second reason for the popularity of purposive sampling is the face validity of the resulting test. Adherents argue that random sampling is too likely to produce an instrument that policy makers will reject as not looking like the job. This reasoning was powerful for the designers of the JPM Project. The degree of congressional interest and the enormous investment of resources by the Services convinced most of the JPM researchers that the stakes were simply too high to risk anything other than purposive sampling.

The untoward character of certain tasks might also incline one toward purposive sampling. Some tasks are simply too lengthy, intricate, dangerous, or costly to include in a hands-on performance measure. Leaving such tasks to chance selection through a process of random sampling strikes many as unwise and unwarranted.

Those involved in the JPM Project advanced another argument for purposive sampling that was related to their goal of validating the ASVAB. By selecting tasks of moderate difficulty that are frequently performed on the job and that are judged to be important by experts, the variance of hands-on performance scores would be maximized and the potential correlation between predictor (e.g., the AFQT) and criterion (job performance measurement) increased.1

Random Sampling

The second school of thought holds that the better scientific ground for arguing content representativeness is provided by random selection of tasks from the job domain, because only random sampling permits one to make, with known margins of error, statements that can be generalized to the entire domain of tasks. In other words, random sampling techniques provide a strong inferential bridge from test performance to performance on the job. The key consideration is that each task have a known and nonzero probability of selection into the sample (Wigdor and Green, 1986:46).
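The key requirement above (a known, nonzero selection probability for every task) and the stratified refinement discussed next can be sketched in a few lines. The task domain below is a hypothetical stand-in, sized like the radioman example from Table 7-1; the allocation rule (proportional to stratum size, with at least one task per stratum) is one simple choice among several.

```python
import random

# Hypothetical task domain, tagged with expert-defined content strata.
# The four category names and task counts echo Table 7-1 but the tasks
# themselves are placeholders, not the actual Navy task list.
DOMAIN = (
    [("preparing/processing messages", i) for i in range(69)]
    + [("setting up equipment", i) for i in range(23)]
    + [("maintaining equipment", i) for i in range(21)]
    + [("handling secure messages", i) for i in range(11)]
)

def stratified_sample(tasks, n_total, seed=None):
    """Stratified random sample: slots are allocated to each stratum in
    proportion to its share of the domain, then filled at random, so
    every task has a known, nonzero probability of selection."""
    rng = random.Random(seed)
    strata = {}
    for stratum, task in tasks:
        strata.setdefault(stratum, []).append(task)
    sample = []
    for stratum, members in strata.items():
        # Proportional allocation, at least one task per stratum.
        n_s = max(1, round(n_total * len(members) / len(tasks)))
        sample.extend((stratum, t) for t in rng.sample(members, n_s))
    return sample

hands_on_tasks = stratified_sample(DOMAIN, n_total=15, seed=1)
```

With 15 slots and strata of 69, 23, 21, and 11 tasks, proportional allocation yields 8, 3, 3, and 1 tasks respectively, guaranteeing every content area some representation.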
To increase precision, an initial stratification of tasks (e.g., manual versus nonmanual) might be employed prior to selection. Expert advice can also be incorporated into a random sampling scheme through stratified random sampling: the strata reflect experts' judgments as to the most salient content areas of the job. In this way, the best of purposive sampling and the best of random sampling can be merged.

The random sampling school points out that purposive sampling falls prey to selection bias. That is, what experts judge to be representative of the job may be more a function of their most recent experience supervising the job, or it may conform to their conception of the job, which may not encompass the job's full range. Put another way, the random sampling school holds that human judgment may contain predictable errors, and job experts are human. This concern gains credence from reports from the Army and the Navy that panels of job experts disagreed substantially in their judgments of important or critical tasks or samples of tasks for hands-on performance measures (e.g., Lammlein et al., 1987).

1   Note that, although the correlation may increase, it may not provide the correct estimate if it is based on an unrepresentative sample of tasks/behaviors.

Adherents of random sampling hold, in contrast to the purposive sampling school, that the purpose of a hands-on performance measure goes beyond rank ordering individuals in correlational analyses (relative decisions). The purpose also includes interpretations about the levels of performance or competency represented by scores on the hands-on performance measure (absolute decisions). As Cronbach (1971:453) pointed out:

    Content validation . . . looks on the test as an instrument for absolute measurement, though a test validated in this way may have differential [relative-decision] uses also. From an absolute point of view the score on a task indicates that the person does or does not possess, in conjunction, all the abilities required to perform it successfully.

The committee was convinced that, on balance, the stratified random sampling approach to the construction of hands-on performance measures is preferable. The conviction was partly a matter of encouraging the most scientifically defensible procedures, but it was also linked to our espousal of a competency approach to the measurement of military performance. We argued (largely unsuccessfully) for absolute scores, which are interpretable in terms of the level of an incumbent's performance, not in terms of how well he or she performed compared with others. We recommended this approach because it corresponds to decisions policy makers must address (e.g., What is the least costly way to fill Service personnel needs and still maintain readiness?) and to the development of personnel assignment algorithms that attempt to maximize some performance-level criterion (discussed in Chapter 9; see also Green et al., 1988).

Although the Marine Corps adopted random sampling techniques and attempted to construct its hands-on tests to permit a competency interpretation of test scores (Mayberry, 1987, 1989), purposive sampling is far more prevalent in practice. Thus, the committee turned its attention to the problem of building an interpretive framework for purposive samples.

How Representative Are Purposive Samples?

Having espoused a random sampling perspective for arguing content representativeness and recognizing that most hands-on performance measures are constructed with purposive samples, we are in a position to ask: From this perspective, how representative is a purposive sample? Our approach to answering the question is based on the following conceptualization of the problem.

To begin with, the purposive sample of tasks/behaviors used for a hands-on performance measure is but one of many possible samples that might be chosen from the domain of interest. The sample can be characterized by its critical features: the average importance of the tasks contained in the hands-on performance measure, the average frequency with which those tasks are performed on the job, the average difficulty of the tasks, the average number of errors made while performing the tasks, and so on. A 2nd, 3rd, 4th, . . . 2,000th sample of tasks could also be chosen from the job domain and characterized on the same critical features.

Now, focusing on the average importance of each of the 2,000 job samples drawn from the domain, a frequency distribution can be constructed with mean importance on the x-axis and frequency on the y-axis. This frequency distribution can be called a sampling distribution of means, or sampling distribution for short. The next step is to consider where the mean importance of the purposive sample falls within this sampling distribution. If it falls near the center of the distribution (roughly the mean for the entire domain of tasks, if 2,000 samples were really drawn), the sample can be considered representative, at least in terms of the importance feature; indeed, anywhere within plus or minus two standard deviations (standard errors) of the mean of the sampling distribution, it can still be considered representative. However, if it is more than two standard deviations from the mean, then it is among the 5 percent least probable samples with respect to the characteristic being evaluated. In that case, it would be considered an extreme sample, not representative of the job on the particular feature (importance).
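The thought experiment above is easy to run directly. The sketch below uses simulated importance ratings (hypothetical numbers, not the Navy data): it draws 2,000 random 15-task samples, builds the empirical sampling distribution of mean importance, and applies the two-standard-error rule.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical importance ratings for a 124-task domain (1-5 scale);
# simulated stand-ins, not Lammlein's actual ratings.
domain = [min(5.0, max(1.0, rng.gauss(3.4, 0.6))) for _ in range(124)]
n = 15  # tasks per hands-on measure

# Empirical sampling distribution: mean importance of 2,000 samples.
sample_means = [statistics.mean(rng.sample(domain, n)) for _ in range(2000)]

# Analytic solution: mean = domain mean, SE = domain SD / sqrt(n).
mu = statistics.mean(domain)
se = statistics.pstdev(domain) / n ** 0.5

def is_representative(purposive_mean):
    """Within two standard errors of the domain mean on this feature."""
    return abs(purposive_mean - mu) <= 2 * se

# A purposive sample whose mean importance sits near the domain mean is
# representative; one more than two SEs away is an extreme sample.
```

The empirical mean and spread of `sample_means` closely match the analytic values, which is the chapter's point: the simulation is unnecessary once the formula is in hand.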
This process could be repeated for each of the other critical features of the job; a decision could then be made as to how representative the purposive sample is from a random sampling perspective.

The above process could be simulated on a computer, but there is no need: it has a long history and a straightforward analytic solution. The sampling distribution of means described above will be normally distributed, especially with increasing task sample size. It will have a mean equal to the domain mean and a standard deviation equal to the domain standard deviation divided by the square root of the sample size.

Finally, recognizing that the critical features of a job are most likely correlated, the features could be characterized not only one at a time but also simultaneously. In this case, a set of mean feature scores (one each for importance, frequency, difficulty, and errors) would be sampled. Each hands-on measure, then, could be characterized by a set of four mean scores, or by a point in a multivariate space that corresponds to that set. By drawing repeated hands-on measures, a multivariate frequency distribution—a multivariate sampling distribution—could be constructed. This sampling distribution can be modeled by the multivariate normal distribution, just as the normal distribution characterized the sampling of mean scores for a single feature. And just as areas of the normal distribution can be marked off to characterize representative samples, so too can areas of the multivariate normal distribution. Consequently, if one had data on each of the salient features of the domain of tasks that constitute a job, one could determine just how representative a purposive sample is for each feature and for the features taken together.

Data collected for the Navy (Lammlein, 1987) can be used to illustrate the univariate portion of the analysis. Lammlein collected experts' (incumbents' and supervisors') ratings of the salient features of the tasks that defined the radioman job domain. For each of the 124 tasks, job incumbents indicated whether they had performed the task (PCTPERF) and rated the frequency with which they performed each (FREQ) and how complicated each is to perform correctly (COMP). For each task, supervisors indicated whether they had supervised it (PCTSUP) and rated its importance for mission success (IMPORT) and how often it is performed incorrectly (ERROR). Using these data, the mean and standard deviation were calculated for each of the six features; the correlations among the features are reported in Table 7-2, with domain parameters above the main diagonal and purposive sample statistics below.
TABLE 7-2 Means, Standard Deviations (SD), and Correlations Among the Salient Features of the Navy Radioman Job

Feature    PCTSUP  PCTPERF  IMPORT  ERROR   FREQ   COMP  | Domain Mean  Domain SD
PCTSUP       —       .94      .11    −.10    .59   −.49  |    42.31       21.30
PCTPERF     .88      —       −.03    −.06    .70   −.57  |    36.19       21.47
IMPORT     −.37     −.37      —      −.20   −.07    .16  |     3.42        0.56
ERROR       .25      .31     −.25     —     −.04    .43  |     1.50        0.27
FREQ        .40      .51      .04     .27    —     −.65  |     3.00        0.78
COMP       −.40     −.44      .14     .48   −.46    —    |     1.80        0.27

Purposive sample (22 tasks):
Mean       65.27    59.68    3.54    1.56   3.53   1.71
SD         11.95    15.26    0.51    0.29   0.56   0.24

NOTE: Domain (124 tasks) correlations are presented above the main diagonal, purposive sample (22 tasks) correlations below.
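The means and standard deviations in Table 7-2 are all that is needed to place the purposive sample within the sampling distribution for each feature: divide the sample-domain mean difference by the standard error, the domain SD over the square root of 22. A sketch, treating the domain as infinite (small discrepancies from the distances published later in the chapter reflect rounding in the tabled values):

```python
import math

N_SAMPLE = 22  # tasks in the purposive sample

# (domain mean, domain SD, purposive sample mean), from Table 7-2.
FEATURES = {
    "PCTSUP":  (42.31, 21.30, 65.27),
    "PCTPERF": (36.19, 21.47, 59.68),
    "IMPORT":  (3.42, 0.56, 3.54),
    "ERROR":   (1.50, 0.27, 1.56),
    "FREQ":    (3.00, 0.78, 3.53),
    "COMP":    (1.80, 0.27, 1.71),
}

def distance_in_se(domain_mean, domain_sd, sample_mean, n=N_SAMPLE):
    """Distance of the sample mean from the domain mean, in standard
    errors, assuming simple random sampling from an infinite domain."""
    return (sample_mean - domain_mean) / (domain_sd / math.sqrt(n))

for name, (mu, sd, xbar) in FEATURES.items():
    print(f"{name:8s} {distance_in_se(mu, sd, xbar):+.2f}")
```

Distances beyond about two standard errors flag a feature on which the purposive sample is an improbable draw under random sampling.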

The job domain correlations among the salient features of the radioman tasks follow a predictable pattern. The features that reflect how often the tasks are performed correlate highly with each other (the correlations among PCTSUP, PCTPERF, and FREQ). The correlation between supervisors' ratings of how often an error is made on a task and incumbents' ratings of how complicated the task is to perform (.43) seems reasonable. And the negative correlations between performance frequency and the error and complicated-to-perform variables suggest either that complicated tasks occur less frequently than do easier ones or that tasks are complicated because they are not performed frequently. Finally, the importance feature tends to be unrelated to the other features.

The purposive sample correlations preserve the relationships among the frequency-of-performance task characteristics, somewhat underestimating their magnitude in the job domain. The same is true for the correlations between performance frequency and the complicated-to-perform variable. However, the relationships between the frequency variables and error are in the opposite direction of the domain relationships. This flip-flop may be due, in part, to the small sample size (n = 22) and to the small magnitudes of the domain correlations (−.10, −.06, and −.04). The sample accurately reflects the magnitude of the domain correlation between error and complicated-to-perform. Finally, the sample correlations between importance and the other features do not accurately reflect the domain correlations, due, in part, to the low magnitude of the domain correlations and the small sample size.

In sum, the correlational structure of the purposive sample tends to follow that in the universe; the exceptions can be explained by sampling error arising from small sample size and low domain correlations. The question still remains: How representative is the purposive sample from the random sampling perspective?
To complete the answer, we next measured the distance between the features of the sample tasks on the hands-on performance measure and the features of all tasks in the job domain. To do this, we calculated, for each feature, the difference between the sample mean (over 22 tasks) and the domain mean. We then divided this difference by the standard deviation of the sample means, the standard error. This produced a measure of the distance between the purposive sample and what would be expected with random sampling, in standard deviation (standard error) units (Table 7-3).

TABLE 7-3 Evaluation of Purposive Sampling from the Perspective of Random Sampling: Navy Radioman

                     Distance Between Domain and Sample Centers
Task Feature    Infinite, Simple Random   Finite, Simple Random   Finite, Stratified Random
PCTSUP                   5.06                     6.29                     3.11
PCTPERF                  5.06                     6.25                     3.25
IMPORT                   1.00                     1.20                     1.06
ERROR                    1.00                     1.20                     0.29
FREQ                     3.12                     4.08                     1.50
COMP                    −1.50                    −1.80                    −0.89

NOTE: Distance between purposive and random samples in standard deviation units.

The results of these calculations for the six features characterizing the radioman's tasks are presented in column 1 of Table 7-3. The distance scores indicate that the purposive sample, as intended, included tasks that very much look like the job if “look like” is defined as “performed frequently.” Put another way, the purposive sample is not representative of the job domain; it disproportionately contains frequently performed tasks. Another important finding is that the purposive sample tended to include tasks that incumbents rated as less complicated to perform than the average task on the job.

Also included in the table is information based on somewhat different assumptions about the size of the job domain and the sampling process. The second column of data provides distance scores assuming that a simple random sample was drawn from a finite domain (124 tasks) rather than from an indefinitely large (infinite) universe. Under this assumption, the interpretation does not change; the increase in the magnitudes of the distances emphasizes that the purposive sample contains “unrepresentative” tasks in terms of frequency.2

2   The magnitude is greater because the domain standard deviation is reduced in magnitude by a factor relating the sample size (n) to the domain size (N): (N − n)/N.

The last column in Table 7-3 provides distance scores based on the assumption that tasks were selected by stratified random sampling from a finite domain. The stratification reflects the process used by the Navy in creating content categories to ensure that the full range of critical tasks was included in the hands-on performance measure (see Table 7-1, Step 2). Once again, the magnitudes change because the content categories are weighted proportionally to the number of tasks in each category relative to the number of tasks in the domain. Nevertheless, the story told by the last column, although somewhat less dramatic, remains unchanged.

From a random sampling perspective, then, the hands-on performance measure is not representative of the job. It consistently overemphasizes frequent tasks, as measured by the percentage of supervisors supervising the tasks, the percentage of incumbents performing the tasks, and the incumbents' ratings of how frequently the tasks are performed. One way to resolve the representativeness problem is to stratify on frequency as well. (From a purposive sampling perspective, the hands-on performance measure may well do just what it was intended to do: it overemphasizes frequent tasks so that, from a face validity perspective, the hands-on performance measure “looks like” the job.3)

Rapprochement

This evaluation of a purposive task sample from a random sampling perspective leads to a possible rapprochement between the two methods. If the critical features of the domain of tasks are known, random or stratified random samples can be drawn from the domain of interest, and the representativeness of each of a large number of samples can be evaluated, as we have just done. This procedure could easily be performed on a computer, and unrepresentative samples could be eliminated.

At this point, several options arise for incorporating the judgments of subject matter experts. One option is to have experts choose a particular sample for the hands-on test from the set of representative samples. The limitation of this approach is that it may not be possible to make probability statements about inferences to performance on the total job. A second option is to ask experts to remove from the representative set those samples that they find unacceptable, up to some percentage of the samples (say, 10 percent); then a single sample would be randomly selected for the hands-on test. This approach would allow a probability statement, but the exact formulation of the statement might be difficult to determine. Either approach, however, significantly reduces the risk of erroneously inferring total job performance from a sample of tasks. With these options, then, the complementary strengths of both the purposive and random approaches can be used to reduce the weaknesses of each.

Even if a rapprochement is reached with regard to task sampling, however, content representativeness is still limited.
By virtue of translating job tasks into assessment devices, some aspects of the job are ignored; that limitation is the subject of the next section.

3   That purposive samples tend to overrepresent frequently performed tasks in a job performance measure makes sense in light of the literature on judgment bias. One heuristic frequently used in judging representativeness is how easily something is recalled. Frequent tasks will be easily recalled, thus giving the impression that the performance measure is representative of the job (see Tversky and Kahneman, 1974).

PERFORMANCE MEASUREMENTS AS JOB SIMULATIONS

Job performance measurements attempt to replicate job tasks as faithfully as possible within constraints imposed by time, cost, and assessment techniques and contexts. But since a hands-on performance measure does not replicate tasks and their variation as encountered day to day on the job, it is a simulation of the job, albeit as concrete a simulation as possible. Consequently, hands-on performance measures can be thought of as job simulations (see Guion, 1979).

In characterizing hands-on performance measures and other measurement devices as simulations of a job, it is helpful to distinguish simulations along two dimensions: fidelity and abstractness (e.g., Shavelson, 1968). Fidelity refers to how closely the simulation incorporates the real-world variables of interest. Abstractness refers to how concrete or abstract the simulation is. A mathematical model of the trajectory of a missile provides a high-fidelity simulation because it incorporates relevant real-world variables. Likewise, a hands-on performance measure provides a high-fidelity simulation by including content-representative measurements. The mathematical model of the missile trajectory, however, is quite abstract and looks nothing like the actual trajectory. In contrast, the hands-on performance measure is quite concrete; an attempt is made to replicate the job task on the test.

Hands-on performance measures tend to be concrete representations, the fidelity of which is reduced in a number of ways. First, military hands-on performance measures are carried out under peacetime conditions even though decision makers ultimately want to know how incumbents would perform in combat. Second, hands-on performance measures are standardized so that, to the extent possible, the tasks presented to one incumbent are the same as those presented to another. But on the job, a single task is performed under myriad conditions by one incumbent, and it is performed in myriad ways by different incumbents.
Third, hands-on performance measures, by their very nature, place incumbents under the watchful eye of examiners, a condition rarely encountered on the job. Consequently, one can assume incumbents are motivated to perform the hands-on tasks; one does not know what they would actually do on the job. Fourth, hands-on performance measures present a sequence of tasks to incumbents that may not fit the sequence typically encountered on the job. Changing the typical sequence of tasks, although necessary to sample job tasks adequately and to standardize the job performance measurement, introduces an artificiality into the hands-on performance measure. Fifth, hands-on performance measures sometimes remove incumbents from direct performance of a sequence of tasks because of the cost, time, or danger involved. Finally, the hands-on performance measures focus on the job performance of individuals, not on units of individuals as normally occurs on the job.

If hands-on performance measures are simulations of the job that vary in fidelity, even more so are their surrogates, such as pencil-and-paper tests, supervisory ratings, computer simulations, and the like. For example, cognitive paper-and-pencil tests are highly abstract simulations of the job with arguably low fidelity. Pencil-and-paper job knowledge tests move closer to the real world, but only moderately so. Computer simulations and walk-through tests move along the abstractness continuum toward the concrete pole; they tend to be of higher fidelity than paper-and-pencil tests. Because hands-on performance measures can provide high-fidelity, concrete representations of jobs, the JPM Project considers them to have a certain inherent credibility. In interpreting hands-on performance measures, however, it must be remembered that the measure is itself an approximation—a simulation—of the reality one wants to know about. Although it is the closest thing to the job, it is still an abstraction from the job.