Read "Performance Assessment for the Workplace, Volume II: Technical Issues" at NAP.edu

Page 27 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Work Samples as Measures of Performance

Frederick D. Smith

INTRODUCTION

The ability to predict future performance of incumbent employees or candidates for employment has been one of the main contributions of psychologists to industry. Traditionally, this process has involved interviews, psychological testing, and biographical information as predictors of on-the-job performance. The predictive validities of these methods, while often good enough to satisfy legal requirements, are not as high as could be expected given the long history of research and development in testing and interviewing.

One method that appears to overcome some of the problems of these traditional selection techniques is the work sample. Work samples measure job skills by requiring an individual to demonstrate competency in a situation parallel to that at work, under realistic and standardized conditions. Their primary purpose is to evaluate what one can do rather than what one knows (Cascio and Phillips, 1979).

This paper is a review of the work sample literature, with particular attention paid to the theory underlying work sample testing, the use of work samples as criterion measures, the adverse impact of work samples, and measurement issues associated with such tests. In order to understand the work sample in its nontraditional use as a criterion measure, however, it was felt that some of the research employing work sample tests as predic-

Page 28 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

tors would be illuminating. For this reason, a section of the paper is devoted to some representative studies using work samples as predictors.

A work sample, whether used as a predictor or a criterion, is constructed to allow a measure of performance on a structured task that is directly reflective of the types of behaviors required in the job situation. Consequently, regardless of its use in the literature, either as a predictor of future performance or as a measure of present ability, the work sample by its very nature can be considered a criterion measure.

In a similar vein, criterion work samples that appear in the literature are most often used to gauge success in training. To the extent that success in training is felt to be a predictor of eventual performance on the actual job, the work samples used in this manner serve the dual purpose of measuring current performance and predicting future success.

For both of these reasons, the author feels that the predictor/criterion distinction is superficial in the case of work sample testing, and that including certain predictive work samples in this review will be beneficial for an understanding of these types of tests.

Theoretical Bases for Work Sample Testing

Theoretically, work samples should possess high validity since the test itself is a subset or a sample of the criterion domain. Wernimont and Campbell (1968) have suggested that performance prediction based on work samples would be fruitful. They propose a behavioral consistency model founded on the tenet that the best predictor of future performance is past performance. In applying their approach, Wernimont and Campbell recommend searching an applicant's work experience for specific examples of required job behaviors, and if these do not exist, using a work sample or simulation. In proposing this view, Wernimont and Campbell (1968:372) hope to overcome what they see as “the unfortunate marriage ” of the “classic validity model with the use of tests as signs, or indicators of predispositions to behave in certain ways, rather than as samples of the characteristic behavior of individuals.”

Similarly, Asher (1972) and Asher and Sciarrino (1974) suggest that predictive power is enhanced when there is a point-to-point correspondence between the predictor and the criterion space. For this reason, tests of a single dimension are less powerful predictors than more complex tests such as work samples, which are designed to be miniature replicas of the criterion task. While the validities reviewed later in this paper seem to bear out the predictive power of work samples, Asher and Sciarrino themselves mention several other possible explanations. One is an interaction hypothesis. A complex task may elicit an interaction among aptitudes rather than a simple additive effect, so that the criterion will be poorly predicted based on an additive model using measures of single aptitudes or traits. If a prediction

Page 29 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

model is built by combining aptitude test scores additively, the model may in fact overlook aptitudes that interact. A work sample, by its design, allows these interactions to occur naturally, and would therefore be expected to predict future performance to a greater degree than a series of individual tests.

Another possible explanation for the high validities of work samples is that of work methods. The work sample may elicit realistic work habits that individuals use to solve specific problems, and these work methods may account for a greater amount of individual differences than combinations of basic motor or verbal abilities tapped by paper-and-pencil tests. By having individuals perform actual work-related behaviors, they are able to demonstrate ability more specific to the job itself rather than some generalized aptitude. The point of a work sample is, after all, to reduce the inferential leap that must be made between performance in a standardized testing situation (be it motor or written or verbal) and actual job performance. There is less of a leap needed between behavior in a work sample and behavior in the actual job situation than between performance or problem solving on a paper-and-pencil test and actual job behavior.

A final point made by Asher and Sciarrino (1974), and one somewhat related to that of work methods, is that experience may play a role in the high validities found in work sample testing. The work sample may be measuring prior experience that has transferred to the criterion task, and may therefore be identifying people with more experience but not necessarily more aptitude.

In addition to these comments regarding the point-to-point theory, Gordon and Kleinman (1976) suggest that the face validity of work samples may influence motivation among testees, which is related to the interest in and motivation for a particular job. Therefore, a less identifiable set of elements than point-to-point correspondence between predictor and criterion may be responsible for high validities reported. The Gordon and Kleinman study is examined in further detail below.

Many of these same points can be made for work samples as criterion measures. If the purpose of a work sample is to obtain an accurate measure of current job performance, then the work sample must accurately reflect job behaviors critical for success. Rather than rating employees on a number of gross performance dimensions, the work sample allows appraisal on a specific, standardized, and possibly complex task. For example, an appraisal on the independent dimension of troubleshooting is a poor measure compared to appraisal on a complex task that requires troubleshooting for successful completion. In traditional performance appraisal, correlation between dimensions is halo and is often thought of as error; in a work sample task, the natural and perhaps critical correlation between several abilities or dimensions can occur as part of the testing process.

Page 30 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Similarly, work methods can be accurately reflected in a work sample criterion. The techniques that an employee actually uses in the job situation can be observed and measured, rather than some generalized performance dimension of mechanical ability or problem solving.

Experience would also be expected to play a role in work sample criterion measures. This would be especially true if the task required a substantial amount of training or job specific knowledge. More experienced employees should be expected to perform better than those less experienced. For this reason, care must be taken that a work sample meant to measure job performance note the ability level of testees. A single work sample may not be appropriate for all employees, or different performance standards may be needed.

The theoretical underpinnings of work samples just reviewed imply that they are superior measures of performance. Empirical results seem to support this contention. What follows is a summary of several reviews of work samples as predictors, and then a more detailed look at the work sample as a method of measuring performance.

Previous Reviews of Work Sample Testing

There have been a number of reviews containing validity data on work sample tests. Asher and Sciarrino (1974) classified over 60 work sample studies into either motor or verbal tasks. A work sample was considered motor if it involved the physical manipulation of things, such as operating a sewing machine, tracing a complex electrical circuit, or repairing a gear box. Verbal work sample tests involved language-oriented or people-oriented tasks. These included tests of common facts in law for students, inbasket exercises, role plays for making telephone contacts with customers, and skill in writing business letters. The criteria in the studies reviewed were either job proficiency or success in training, and criterion measures were generally supervisor ratings, output (number of items produced), completion of training, or grade in training. For some of the verbal work samples, criteria also included salary, promotions, job level, sales, or number of leadership offices held.

As can be seen in Table 1, with job proficiency as the criterion, motor work samples were second only to biographical information in terms of predictive validity. Forty-three percent of the motor work samples reviewed had validity coefficients greater than .50, and 70 percent of the motor work samples had validity coefficients exceeding .40. Verbal work samples fared less well when job proficiency was the criterion, with only 21 percent of the validity coefficients exceeding .50, and 41 percent exceeding .40. However, with success in training as the criterion measure, verbal work samples were superior to motor work samples. Thirty-nine percent of the verbal work

Page 31 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

TABLE 1 Proportions of Work Sample Validity Coefficients Exceeding Particular Levels

	Criterion
	Job Predictor			Success in Training
Predictor	r ≥ .50	r ≥ .40	r ≥ .30	r ≥ .50	r ≥ .40	r ≥ .30
Motor work samples	.43	.70	.78	.29	.47	.79
Verbal work samples	.21	.41	.60	.39	.65	.81
SOURCE: Asher and Sciarrino (1974).

samples had validity coefficients in excess of .50, while only 29 percent of the motor work samples' validities were larger than .50. Sixty-five percent of the verbal work samples reviewed had validity coefficients greater than .40, while only 47 percent of the motor work samples' validities were greater than .40. Asher and Sciarrino (1974) conclude from their review that work samples fare well when compared to other predictors, being second only to biographical information in terms of predictive power.

A more recent review of the work sample was performed by Robertson and Kandola (1982). They divided 60 work sample tests into four categories: psychomotor, individual situational decision making, job-related information, and group discussions/decision making. The psychomotor category corresponds to Asher and Sciarrino's motor tests, while the other three could be considered more verbal. Criteria included job performance, job progress, and training. It was found that psychomotor tests had a median validity of .39 (78 coefficients), job-related information tests a median of .40 (27 coefficients), situational decision making a median of .28 (53 coefficients), and group discussion tests a median of .34 (27 coefficients).

Table 2 was constructed from data appearing in the Robertson and Kandola (1982) review, and allows some comparison to the Asher and Sciarrino results. Table 2 shows the proportions of validity coefficients exceeding particular levels for the four types of work sample predictors with job performance and training performance as the criteria. (The only predictive validities reported for the criterion training were for psychomotor tests.) A pattern of results similar to those of Asher and Sciarrino can be found. When job performance is the criterion, psychomotor work samples outperform work samples that are more verbal. Comparing psychomotor validities across the two criteria, it can be seen that they predict less well for training than for job performance.

An important difference between the earlier reviews by Asher and Sciarrino and Robertson and Kandola, which is of particular relevance to this paper, is the use of a work sample as a criterion measure. The Robertson and

Page 32 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

TABLE 2 Proportions of Work Sample Validity Coefficients Exceeding Particular Levels

	Criterion
	Job Proficiency			Training
Predictor	r ≥ .50	r ≥ .40	r ≥ .30	r ≥ .50	r ≥ .40	r ≥ .30
Psychomotor	.31	.69	.88	.16	.50	.75
Job-related information	.09	.27	.36
Situational decision making	.08	.27	.50
Group discussion	.30	.40	.80
SOURCE: Based on Robertson and Kandola (1982).

Kandola review is not complete, since they only include studies in which a work sample was used both as the predictor and the criterion. As a criterion, the work samples reviewed were usually similar to but longer than the predictor work sample. Robertson and Kandola report a median validity of .49 between psychomotor predictors and work samples consisting of psychomotor tasks (based on 10 validity coefficients). A median validity of .75 (based on 7 validity coefficients) was found between situational decision making predictors and a work sample criterion of performance in an assessment center.

While these validities are impressive, Robertson and Kandola caution that the idea of increasing the similarity between predictor and criterion (as per point-to-point theory, for instance) may have been pushed beyond a reasonable limit. These correlations can be interpreted as measures of reliability rather than validity. By comparing one job-related test with another job-related test, the relationship between the two tests may be discovered, but inferences of how this relationship relates to job performance will still have to be made. It is precisely this inference that work sample testing is supposed to reduce. Robertson and Kandola (1982) caution that researchers should not attempt to increase validity by simply developing criteria that are likely to relate closely to the predictor. Rather, care should be taken that the criteria themselves are job performance measures.

Meta-analytic Reviews of Validity Studies Involving Work Samples

Two recent articles reviewed validities of predictors of job performance. Schmitt et al. (1984) performed a number of meta-analyses on validity studies published between 1964 and 1982. Their analyses revealed that work samples, assessment centers, and superior/peer evaluations yielded validities superior to general mental ability tests or special aptitude tests. When

Page 33 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

work samples are used as predictors, the average validity coefficient was .378 (based on 18 coefficients); when work samples were the criterion, the average validity was .401 (24 coefficients).

Hunter and Hunter (1984), using meta-analytic techniques, found that for entry level jobs for which training will occur after hiring, combined cognitive ability and psychomotor ability test scores had a mean correlation of .53 with performance on the entry level job (425 validity coefficients), while a job tryout had a mean correlation of .44 with the same criterion (20 coefficients). For selection on the basis of current job performance, the work sample was slightly better than the ability composite, with average validity coefficients of .54 and .53, respectively. In all these cases, the work sample served as a predictor. Hunter and Hunter also report a meta-analysis involving studies in which work sample performance was used as a criterion and a job knowledge test was the predictor. A mean validity of .78 was obtained (based on 11 coefficients). The authors note, however, that job knowledge tests can be used for prediction only if the examinees are already trained for the job.

Hunter and Hunter's (1984) results differ somewhat from those of Schmitt et al. (1984), and this could be due to the fact that Hunter and Hunter (1984) include a large portion of unpublished data in their study, while Schmitt et al. drew data from published studies in the Journal of Applied Psychology and Personnel Psychology. In any case, again it appears that work samples have validities comparable to, and in many cases, superior to other predictors. While their use as criteria has been more limited, these two meta-analytic reviews do report rather impressive average validity coefficients for work samples as criteria.

WORK SAMPLES AS PREDICTORS

Work samples have traditionally been used as predictors of future job performance. This section is intended as a brief review of these types of studies.

Campion (1972) developed a work sample test for maintenance mechanics in a food processing company. The work sample consisted of four tasks, each broken down into the number of steps required to complete it. The tasks were: installing pulleys and belts, disassembling and repairing a gearbox, installing and aligning a motor, and pressing a bushing into a sprocket and reaming it to fit a shaft. In addition to these work sample tasks, each mechanic was given several paper-and-pencil tests: the Test of Mechanical Comprehension, Form AA; the Wonderlic Personnel Test, Form D; and the Short Employment Tests. Criterion measures were supervisor evaluations of three factors: use of tools, accuracy of work, and overall mechanical ability. It was found that performance on the work sample was significantly and positively correlated with supervisor evaluations of work performance on

Page 34 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

all three criteria, but that none of the validity coefficients for the paper-and-pencil tests was statistically significant.

Gordon and Kleinman (1976) also compared a work sample test to a paper-and-pencil test, with the criterion being training scores. Three classes of recruits in a police training academy were given a work sample test including firearms and defense tactics (motor tests), and a written work sample addressing the relationship of the police department to other civic agencies, department rules and regulations, and an introduction to law enforcement. A general intelligence test, the Otis-Lennon Mental Ability Test: Form J, was also administered. For all three classes of recruits, the work sample scores predicted overall training scores, while the intelligence test was significantly correlated with the criterion for only one class. As mentioned previously, Gordon and Kleinman suggest that the face validity of work sample tests may influence the motivation of testees, and this is also related to the interest in the job.

A study that failed to find any correlation between the work sample and a performance criterion was reported by Inskeep (1971). Three work sample tasks used to select and place sewing machine operators were examined using a concurrent validation design. The tests were developed to reflect actual shirt-making operations. The clipboard test uses a table with a sliding center board on which are mounted a number of metal clips. When the center board is moved into proper position, a clip may be opened by a foot pedal linked to the table top. The subject is provided with two piles of cloth rectangles. The subject must pick up a rectangle from each pile, align them, and place them in a clip. Then he or she slides the center board to align the next clip and repeats the procedure until all clips are filled. Performance score is the total time to fill all clips.

The needle board is the second work sample task. Ten spindles of thread and a metal crossbar are mounted on a table. In the crossbar are 10 needle holes corresponding to the 10 spindles. The subject is required to pass the thread through both a vertical and a horizontal needle hole. The score for the test is total time required to complete all 10 threadings.

The final test is called the hurdles. This involves using a standard production sewing machine geared down to a lower operating speed. The subject must sew along a specified pattern and complete a certain number of stitches. Test score is the number of seconds required to complete the sewing exercise.

Inskeep (1971) used a performance criterion of piece-rate earnings. It was found that the correlation between the clipboard test and earnings was −.02, between the needle board test and earnings was −.06, and between the hurdles and earnings was −.08, all nonsignificant. The Inskeep findings are somewhat surprising in that the work sample tasks are almost identical to some of the actual job behaviors required of incumbent sewing machine

Page 35 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

operators. This may reflect a problem with the performance criterion of piece-rate earnings, although Inskeep did not offer possible reasons for the negative findings.

A work sample test in the form of a minicourse for telephone switching repairmen was examined by Reilly and Manese (1979). The minicourse was a short (about 40 to 60 hours) training program designed to be a content valid sample of a 6-month electronic switching system (ESS) course. Predictors were total time to complete the minicourse and test performance based on seven self-paced lessons on electronic switching system fundamentals, plus the score on an ESS minicourse summary test. The criteria were total time to complete the full electronic switching system course, which consisted of two separate self-paced courses, one containing four modules, and the other containing five modules. It was found that minicourse test scores were significantly and negatively correlated with time to complete the full course, and that time to complete the short electronic switching system course was significantly correlated with time to complete the full course. Reilly and Manese (1979) comment that since the average cost per trainee for the long electronic switching system course is $25,000, the cost benefit of a valid selection procedure can be substantial.

Assessing Trainability Using Work Samples

It appears, then, that a work sample can be a valid means of assessing trainability of job candidates. Robertson and Downs (1979) distinguish between work sample tests and trainability tests: trainability tests include a structured and controlled period of learning and are used to select personnel for training rather than to choose people who are already competent. The procedure usually involves three steps:

Using a standardized form of instruction and demonstration, the instructor teaches the applicant the task, during which time the applicant is free to ask questions.
The applicant performs the task unaided.
The instructor records the applicant's performance and also makes a rating of the applicant's likely performance in training.

They review 16 studies, in which 24 validities are reported. The criterion in most is training success. Of the 24 correlations, 20 are significant, with coefficients in excess of .50 found in 10 cases. Robertson and Downs (1979) conclude that trainability tests display high content and face validity and allow the applicant to get a clear understanding of the job in question (a realistic job preview, in a sense), but that they are very job-specific and need to be redesigned and revalidated as jobs change, as well as being expensive to administer.

Page 36 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Robertson and Mindel (1980) examined the correlations between trainability tests and performance after 3 weeks of training for six craft trades: bricklaying, capstan, carpentry, milling, welding, and center lathe. They found that for three of the crafts, scores on a trainability test correlated significantly with training performance. Robertson and Mindel caution that the lack of predictive validity for some tests illustrates that although there is a generalized procedure for designing and administering the tests, each must be validated individually.

It would seem to be a short methodological step from using a work sample as a predictor of training ability to using a work sample as a measure of training success and also as a job performance criterion. The next section begins by examining a study that used a work sample both during training and to evaluate on-the-job performance. This will be followed by a review of studies in which work samples are primarily the criterion measures.

WORK SAMPLES AS CRITERIA

Relatively few studies have employed work samples as criterion measures, and those that do generally measure training achievement. Work samples used as criteria are useful because they provide a standardized testing situation in which to evaluate employees, and would seem to lend themselves well to jobs that are highly structured or jobs for which a core of representative behaviors could be identified and developed into a work sample.

One particular study nicely bridges the gap between work samples as predictors and the use of work samples as criteria. Siegel and Bergman (1975) developed what they called the miniature job training and evaluation approach. This is similar to trainability testing in that the examinee is trained, through demonstration and practice, to perform a particular task. The examinee is then scored on how well he or she performs what was taught with regard to following proper procedures, safety, and care and use of tools. The approach is based on demonstration of the ability to learn parts of the job as predictive of total job success.

Subjects in the Siegel and Bergman (1975) study were low aptitude U.S. Navy recruits who had failed a standardized paper-and-pencil test for admission into machinist's mate school. The paper-and-pencil test had three parts: a general classification test, an arithmetic test, and a mechanical test.

In developing their training program, the authors identified six behaviors as most representative of those performed by a journeyman machinist's mate: tool identification and use, gasket cutting, meter reading, trouble shooting, equipment operation, and assembly. Training sessions of 15 to 30 minutes were built around these behaviors. Once training was completed, each subject was tested on the amount learned during the training phase. The test was a procedural review of what was taught during the training session.

Page 37 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Following the completion of training, subjects were assigned to the fleet for duty.

Criterion measures were taken 9 and 18 months following completion of the training program. The criterion tasks were developed based on the opinions of experienced Navy chief machinist's mates. These reflected a diversified sample of the range of behaviors involved in the job of journeyman machinist's mate, and included the following: standing messenger watch, breaking and making a flange, packing a valve, demonstrating procedures in common malfunction and in emergency situations, knowledge of use and names of common equipment and tools, manifesting general alertness and common sense in the work situation, and adequacy of technical job knowledge. These criteria were administered individually to each of the subjects at the 9- and 18-month follow-ups. At the first criterion follow-up, 54 of the original sample of 99 subjects were available for testing, and 34 of the original 99 subjects were available for the second criterion follow-up. Siegel and Bergman do not say whether any of the subjects were the same from the 9-and 18-month follow-ups.

In order to compare the Navy paper-and-pencil predictors with their work sample predictors, Siegel and Bergman created a composite criterion score. The three work sample predictors with the highest zero correlation with the composite criterion (gasket cutting, trouble shooting, and assembly) were then used to determine the multiple correlation with each of the criterion tests. Siegel and Bergman reasoned that since only three predictors were used in the standard Navy selection technique, they would employ only three predictors.

For the 9-month criterion test, significant multiple correlations were found between the work sample and the standing messenger watch, knowledge of equipment and tools, and alertness and common sense criteria. The Navy predictors were correlated with knowledge of equipment and tools and with alertness and common sense. Disregarding significance levels, five of the seven performance criteria were predicted better by the training work samples than by the Navy tests, and Siegel and Bergman find this directional difference significant using a sign test.

At 18 months, directly opposite results were found. There were significant multiple correlations between the Navy tests and all but one of the criteria (alertness and common sense), while none of the criteria were predicted by the work sample training scores.

Siegel and Bergman conclude that the miniature job training and evaluation concept possesses merit for predicting performance of low aptitude applicants. They suggest that the lower predictive power of the work sample scores over time may be due to basing predictions on specific training scores rather than general abilities. While the job training work samples may be adequate for predicting success at initial job entry, over time

Page 38 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

continued success depends on generalized verbal and conceptual factors. In other words, the work sample training was very specific and applied in nature, while the Navy predictors measure a more generalized ability that remains stable over time. While this is a possible explanation for the results, it certainly does not help their case for using the miniature job training approach in place of traditional Navy selection tests. In effect they are saying that while their type of work samples are useful for getting low aptitude candidates started in the craft positions, these individuals are still lacking in some important basic aptitudes that make continued job success problematic.

In addition to this conceptual problem, Cohen and Penner (1976) identify several methodological problems with the Siegel and Bergman study, among them the improper use of a sign test, the lack of cross validation, and the fact that Siegel and Bergman (1975) performed a discriminant analysis and a validation study on the same sample. Siegel (1983), in a follow-up study employing larger samples and a greater number of Navy job specialties, again found a modest number of significant validities between miniature job training predictors and performance at 9 and 18 months. The criterion measure in this study, however, was not a work sample but commanding officer ratings of a subject 's performance on technical aspects of his or her work. The ratings were on a 7-point scale ranging from “very poorly” to “very well.” Siegel again concludes that the miniature job training approach shows good predictive validity for the 9-month period, less so for the 18-month period, and that the approach has merit compared to traditional paper-and-pencil testing.

Physical Ability as a Predictor of Work Sample Performance

A work sample was developed to validate selectors for filling steelworking positions on the basis of physical ability (Arnold et al., 1982). Work samples for entry level positions in the general labor pool were developed based on job analyses and interviews with managers and incumbent laborers. Some of the work sample activities were shoveling slag, lifting and moving 75-pound bags, carrying jackhammers, and wheelbarrowing. An abstracted work sample was then developed that tapped the general physical abilities required to perform the tasks in the work sample. This abstracted work sample drew on the work of Fleishman (1964) and included static strengths, dynamic strengths, balance, and flexibility. These abilities were the predictors in the study, and were measured by arm, leg, and back dynamometers, balance beam, leg lifts, push ups, squat thrusts, pull ups, and a step test. It was found that the correlations between the abstract work sample items tapping strength and the work sample performance measures were consistently high across three worksites: 82 percent of the correlations were above

Page 39 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

.40. The arm dynamometer was found to have particularly high correlations with work sample performance: average correlation across three work sites with a composite work sample measure was .84. Using multiple regression analyses, Arnold et al. (1982) conclude that the arm dynamometer measure alone is sufficient for selection purposes. They also conclude that using a common regression line would have only a slight bias toward men. Thus, a work sample was successfully used to validate the strength test, and Arnold et al. report that using the arm dynamometer as a selection device could potentially save the company over $9 million a year (using the Hunter et al., 1979, utility techniques).

Another study that used physical ability tests as predictors of work sample performance was by Reilly et al. (1979). In testing telephone company craft jobs, they found that dynamic arm strength and reaction time correlated .34 and −.33, respectively, with an overall work sample performance score. The work sample included pole testing, climbing stepped poles, placing ladders on a cable, placing ladders on a building, climbing unstepped poles, and climbing unstepped poles and removing a drop wire. A common regression line could be used for both males and females.

Paper-and-Pencil Tests as Predictors of Work Sample Performance

Frank and Wilcox (1978), in a study of 22 firemen, used a 6-hour work sample of firefighting skills as a criterion to cross-validate the Strong-Campbell Interest Inventory's moderating effect on the Raven's Progressive Matrices (short form) intelligence test. The work sample covered seven major areas: handling hose lines, ladders, ropework, ventilation procedures, first aid, small tool knowledge, and use of oxygen masks. However, due to lack of variability, the last four tasks were eliminated from the test battery. A single fire captain, who had no previous contact with the trainees, evaluated individual segments of each task as either being performed right or wrong, and then assigned an overall score to each area. It was found that for subjects above and below the median on the Strong-Campbell, the correlation of the Progressive Matrices with the criterion was significantly different. No racial bias was found in either the moderator or the predictor.

In a study of 211 minority and 219 nonminority telephone company repairmen and installers, Grant and Bray (1970) validated five aptitude tests against proficiency measures obtained from a learning assessment program. The learning assessment program is organized into seven levels of training, in ascending order of difficulty, and includes basic electricity, basic telephone, Bell System practices, station circuits, advanced circuits, and trouble location. The learning assessment program is programmed, and a trainee continues through the seven levels until he or she fails to meet the requirements of a particular section. Grant and Bray found that all of the aptitude

Page 40 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

tests were predictive of success in the learning assessment program, and that correlation coefficients between the minority and nonminority samples were comparable. In addition, regression equations for the two samples were compared, and it was found that the slopes were almost identical but that the intercepts differed slightly.

A series of studies by Gael and others (Gael and Grant, 1972; Gael et al., 1975a, 1975b) used work sample criteria to validate employment tests for several telephone company occupations. In the 1972 study, minority and nonminority service representatives completed a general learning ability test (the Bell System Qualification Test I), five clerical aptitude tests (spelling, number comparison, arithmetic, number transcription, and filing), and a role-play interview modeled after actual service representative contacts with customers. There were two general criteria. A paper-and-pencil achievement test was used to measure comprehension and retention of company policies and job procedures and practices. The second criterion was a work sample composed of typical calls that a service representative would encounter, plus the associated clerical work. Performance measures obtained from the work sample were: record preparation, a comparison of records prepared to a model set of records; verbal contact, a sum of the ratings of verbal interaction with the customers; and filing, the sum of records not in the proper location when the work sample was ended. A composite criterion was also used, consisting of the sum of the paper-and-pencil test, record preparation, and verbal contact, minus filing, since filing was in effect an error score.

Gael and Grant (1972) found that six of the seven predictors were significantly related to the composite score for the total and the nonminority samples (number comparison was not predictive), and that three tests were significantly related for the minority sample. The Bell System Qualification Test I, number transcription, and role-play interview were significantly related to the composite score for both the minority and nonminority samples. The Bell System Qualification Test I was the best predictor of both the paper-and-pencil criterion alone (r = .40), and the composite score (r = .33).

A composite predictor score consisting of the Bell System Qualification Test I, number transcription, and the role-play interview score was compared to the composite criterion score. A multiple correlation of .37 was obtained for the total sample, with a multiple correlation of .39 for the nonminority sample and .28 for the minority sample. Regression line slopes and intercepts for the two samples were not significantly different, indicating that the composite predictor was unbiased.

The studies by Gael et al. (1975a, 1975b) used work samples to validate 10 tests of intellectual ability and perceptual speed for two occupations, telephone operators and clerks. The d10 predictors used in both studies were: The Bell System Qualification Test I, spelling, number comparison, arith-

Page 41 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

metic, number transcription, filing, perceptual speed (circling pairs of like numbers that appear together in rows of a 40 × 25 matrix of random digits), area codes (a table of cities and area codes is presented along with a randomly arranged list of cities; the task is to associate correct area codes with the randomly listed cities), marking (numbered boxes must be marked that correspond to a 10-digit telephone number appearing above the boxes), and coding (sets of three letters are presented, and a code must be associated with each set, depending on whether the three letters are the same, whether two are the same, or all are different).

For the telephone operators (Gael et al., 1975a), the work sample consisted of handling a steady stream of incoming calls for one hour. Each activity to be performed on each call was listed on an evaluation form. Supervisors observed the subjects and underlined each activity performed incorrectly, not in accordance with trained procedures, or not at all. In addition, the overall effectiveness of each call was rated on a 5-point scale. A composite criterion was used that included the proportion of activities correct (a ratio of activities completed correctly to activities completed), cumulative work units (the number of calls completed and the complexity of the call-associated activities), and the average rating per call (averaging the 5-point ratings assigned to each call processed).

Gael et al. (1975a) compared the mean scores for white and black telephone operators on all measures. The white sample had a significantly higher mean score on every predictor but filing, and significantly higher mean scores on each criterion and the composite criterion. Every predictor was significantly related with the composite criterion for both the black and white samples. In comparing the two samples, the authors found that a common regression line overpredicts black operator proficiency and underpredicts white operator proficiency for scores below the total sample composite predictor mean. This study, then, found that the paper-and-pencil tests were valid predictors of work sample performance for both white and black operators, and that the possibility of adverse impact is more likely for nonminority than for minority candidates.

In the second study involving clerical positions (Gael et al., 1975b), the same 10 predictors mentioned above were used. Eight separate tests comprised the work sample: filing, classifying, posting, checking, coding, toll fundamentals, punched card fundamentals, and plant repair service. The first five tasks involved a variety of standardized forms that were to be processed according to specific instructions, while the last three tasks were programmed instruction booklets typically used in training courses for certain clerical jobs. The eight scores were standardized and combined into an overall proficiency score for each subject. Black, Spanish-surnamed, and white samples of newly hired clerical employees were used in the study.

In examining the mean scores for each sample on the predictors and

Page 42 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

criteria, it was found that whites scored significantly higher than blacks on 17 of 19 measures, whites scored higher than Spanish-surnamed subjects on 12 of 19 measures, and on 7 of 19 measures Spanish-surnamed subjects obtained a higher mean than black subjects. The number of significant validity coefficients differed to a small degree across the three samples. For blacks, all 10 predictors were significantly correlated with the composite criterion, and for the white sample, only number comparison was not significantly related to the criterion. Arithmetic, number comparison, and perceptual speed were not predictive of the criterion for the Spanish-surnamed sample. Gael et al. (1975b) correlated the composite predictor of the Bell System Qualification Test I, filing, area codes, and marking with the composite criterion. The three sample regression lines had significantly different intercepts, but the slopes were not significantly different. The total sample regression line does not underpredict minority criterion scores. The authors conclude that the composite predictor is highly valid for all samples and that success in clerical work seems best predicted by tests of intellectual ability and perceptual speed and accuracy.

Performance Ratings Validated Against Work Samples

Using a slight procedural twist, Olson et al. (1981) use a functional job analysis (Fine and Wiley, 1971) to develop a work sample test for heavy equipment operators. This work sample was then used to validate the performance ratings of 360 operators, who were divided into four skill levels: high, average, low, and apprentice. The operators were tested on five different pieces of equipment, and required to perform a number of tasks on each. It was found that about 80 percent of the work sample tasks discriminated among the pre-judged operator skill levels.

Behavior Modeling Measured by Work Sample Performance

A number of studies of behavior modeling have used role playing as a form of work sample to evaluate the effectiveness of training. Moses and Ritchie (1976) had 90 managers receive supervisory relations training in such problems as reducing absenteeism, providing performance feedback, quality and quantity of work produced, insubordination, and handling discrimination complaints. A second group of 93 supervisors matched on biographical variables received no training. The work sample then consisted of three problems. Two were related to trained material: excessive absence and a discrimination complaint. The third problem, a case of suspected theft, was designed to test transfer and application of concepts learned in training. It was found that the trained group's performance on each of the three tasks was rated as significantly higher than the group that received no training. In

Page 43 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

a similar study, Burnaska (1976) trained 62 managers in nine interpersonal skills areas. These 62 trained managers, and an additional 62 managers who had not attended the interpersonal skills training course, were then evaluated 1 month and 5 months after training. The evaluation was in the form of role play with three problems: a performance problem discussion, a work assignment discussion, and giving recognition to an average employee. A judge took the role of the employee and rated the manager on four dimensions: maintaining the employee's self-esteem, establishing open and clear communication, maintaining control of the situation, and accomplishing the objective of the discussion. Trained managers outperformed untrained managers for all three problems both at the 1- and 5-month evaluations, and the 5-month ratings were higher than the 1-month ratings.

ASSESSMENT CENTERS

An assessment center is a process that uses multiple techniques for evaluating employees for selection, promotion, placement, or special training and development (Thornton and Byham, 1982). The technique has generally been applied to managerial jobs. It seems to have its greatest value when the participant is being considered for a position very different from the one currently held, since the assessment center allows for the evaluation of skills that may not be available from observation on the current job.

Individuals are usually assessed in groups, and the assessment center staff usually consists of trained management personnel, professional psychologists, or both. The ratio of assessees to staff is usually low. The techniques used in an assessment center allow for evaluation of the assessee individually and in settings involving peer interaction.

Assessment techniques of course differ from center to center, but Thornton and Byham (1982), in reviewing approximately 500 centers, found that in-basket exercises are used by 95 percent of the centers. Some other assessment exercises and their frequency of use include: assigned-role leaderless group discussions (85 percent), interview simulations (75 percent), nonassigned-role leaderless group discussions (45 percent), management games (10 percent), reading, math, and personality tests (1 percent).

Several of the studies mentioned earlier used assessment center kinds of evaluation techniques, but for the most part there was only one type of exercise (interview simulations). A study by Petty (1974) used a leaderless group discussion as the criterion measure of the effect of training and experience on initiating structure in the group, consideration toward others in the group, and overall effectiveness in the group. One hundred ROTC students were assigned to one of four experimental conditions: experience and training (participation in a leaderless group discussion plus a 15-minute lecture on the leaderless group discussion), experience and no training, training and no experience, and

Page 44 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

no experience and no training. Several days later the subjects were randomly assigned to groups of four students who were to discuss and prepare a complete plan of attack for an offensive tactics problem. Each group was observed by two senior ROTC students, who rated each subject on a 14-item behavioral checklist. Petty (1974) found that the training effect was statistically significant across all three criteria, but that experience had no effect. There was a significant interaction effect for training and experience on overall effectiveness. It was also found that most of the variance in the overall effectiveness rating was accounted for by the initiating structure score. Consideration provided only a negligible increase in variance accounted for, and Petty asserts that the leaderless group discussion is therefore more a measure of initiating structure than of consideration.

Ritchie and Boehm (1980) evaluated the use of biographical data and personality tests as prescreening devices for an assessment center. Eighty assessees completed a biographical information form, the Gordon Personal Profile, and the Gordon Personal Inventory. Of 10 subscales, 7 correlated significantly with final assessment center ratings. The authors applied a composite of the prescreening scores to another sample to estimate the pass rate that could be expected from using the pretests. It was found that the pass rate was raised from 44 to 48.5 percent, which would save an estimated $80,000 a year in assessment center operation costs.

MEASUREMENT ISSUES

Few work sample studies report the reliability of behavioral observations obtained by raters. In many cases, only a single rater or observer evaluates testees, and rarely is there a follow-up evaluation to provide any kind of test-retest reliability. However, three studies do address the issue.

Reliability

In Petty's (1974) study, two senior ROTC students observed each of the four subject leaderless group discussions. Each of the subjects participated in two leaderless group discussions, about 2 days apart. Three criterion measures were obtained: initiating structure, consideration, and an overall effectiveness rating. Split-half reliabilities for initiating structure and consideration were .90 and .77, respectively. Test-retest reliabilities were .62 for initiating structure, .38 for consideration, and .57 for overall effectiveness. Interrater reliabilities for the first leaderless group discussion were .74 for initiating structure, .54 for consideration, and .71 for overall effectiveness. For the second leaderless group discussion, interrater reliabilities were .65, .23, and .65 for the initiating structure, consideration, and overall effectiveness criteria, respectively.

Page 45 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Moses (1973) used prediction of managerial potential in a 2-day assessment center to validate a 1-day assessment center. The 2-day assessment center had previously been shown to be valid. Correlations exceeding .70 were obtained between the 2- and 1-day assessment center evaluations, indicating that the shorter, less expensive center could be used as a substitute for the 2-day center.

A laboratory study of 60 undergraduates examined the concurrent and predictive validity of a simple work sample task, as well as its test-retest reliability (Mount et al., 1977). Three predictors were used: assembling a 40-piece erector set model, the Bennett Test of Mechanical Comprehension, and the Wonderlic Personnel Test. The criterion was assembling an 80-piece erector set model. The concurrent validity group built the 40- and 80-piece models in a single session; the predictive validity group built the 40-piece model, and 9 weeks later built the 80-piece model; and the test-retest group built the 40-piece model and 9 weeks later built the 40-piece model again. For the concurrent validity group, the work sample and criterion were significantly correlated, but neither of the paper-and-pencil tests correlated significantly with the criterion. The work sample and the Bennett were both predictive of the criterion measure (.67 and .62, respectively). Finally, there was a test-retest reliability of .86 for the work sample and the criterion.

A metal trades skills work sample was designed by Schmidt et al. (1977) to emphasize oral over written instructions and tests. For the performance criteria of total tolerance and total finish, interrater reliabilities were .95 and .89, respectively. Coefficient Alpha for total tolerance was .50, for total finish was .59, and for the criterion of total work speed was .61.

Response Formats for Work Sample Evaluations

In general, work sample evaluations can use three types of response formats. Global ratings are very general evaluations of behavior and are usually on a Likert-type scale with anchors such as “performs safely or unsatisfactorily” or “performs very well or better than expected. ” These global ratings can be for a number of specific tasks within a work sample or for the work sample as a whole. Quite often evaluations of individual tasks are summed to obtain an overall evaluation. But again, these ratings are very nonspecific and not necessarily tied to specific behaviors observed. Assessment centers and work samples that have a pass/fail criterion quite often use this technique.

A second type of response format used in work sample evaluations is behavioral recording forms. These allow the assessor to rate work sample performance using specific examples of good and poor task behavior. Anchors are developed by job experts and indicate the specific tasks that a

Page 46 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

testee must perform. The rater then makes a judgment as to what degree the testee exhibited the behavior required. While more specific than global ratings, this response format still requires the rater to make an evaluation along some continuum of performance. This response format is probably the most common and was used in studies by Olson et al. (1981), Frank and Wilcox (1978), and Reilly et al. (1979), for example.

A third type of response format is behavioral checklists. These are distinct from the global ratings or behavioral recording forms because the rater describes rather than evaluates the testee's behavior. A standardized checklist is developed that consists of scoring weights for each behavior, and the behaviors are particularly observable and independent of other behaviors. This method is most applicable to jobs that have a definite sequence of steps that must be performed in a particular task. This method was used by Campion (1972) in developing a work sample test for mechanics.

ADVERSE IMPACT

Work samples appear to have less adverse impact against minority groups than do paper-and-pencil tests (Howard, 1983). Two studies directly compared the adverse impact of work sample predictors to paper-and-pencil predictors. Field et al. (1977) compared a minority sample of 52 production workers with 48 nonminority workers in a boxboard container plant. The paper-and-pencil tests used were the Personnel Tests for Industry-Numerical (Form A) and Personnel Tests for Industry-Oral Directions Test (Form S), which measured basic math skills and general mental ability, respectively. Two short work samples were designed to test use of a ruler in measuring various dimensions of a three-dimensional figure and the ability to read and decipher computer printout specifications for making a box. Two criteria were used: a supervisor performance rating, requiring the supervisor to rate each employee on six dimensions of the job; and a productivity measure of the number of boxboard containers produced. Field et al. found that the mean score on the four predictors and the two criteria was higher for nonminority than for minority employees. However, validity coefficients for the two samples showed no adverse impact. Both work samples were significantly related to the two criterion measures for the two samples of employees. For the two paper-and-pencil predictors, the numerical test was significantly related to the performance appraisal criterion. All other validities failed to reach statistical significance. The work samples in this study, therefore, showed no adverse impact and better predictive validity than the two paper-and-pencil tests.

Kesselman and Lopez (1979) compared a paper-and-pencil predictor (Personnel Classification Test, yielding a verbal, numerical, and total score) with a written, accounting job knowledge test. (The personnel classification test was

Page 47 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

chosen prior to a detailed job analysis for the accountant position, while the job knowledge test was designed specifically from the job analysis. The following results must be considered with this in mind.) The two criteria were composite job knowledge proficiency, measured by an 18-item behavioral observation scale completed by the employee 's supervisor, and an overall job performance rating, also completed by the employee's supervisor.

The sample of 52 accountants was analyzed according to sex (27 male and 25 female), and race (28 minority and 24 nonminority). On the job knowledge test, no significant differences were found between the means of the two ethnic groups and the male-female groups. The personnel classification test means showed differences between minorities and nonminorities and between sexes: a higher average score was obtained by whites and by males. No differences were found in the group means for the proficiency criterion, but on overall job performance, females were rated significantly higher than males.

The job knowledge test was found to be a valid predictor of the proficiency criterion for all groups except the minority sample. The job knowledge test was predictive of overall job performance only for the female sample. No validity coefficients for the personnel classification test reached significance. The job knowledge test showed no significant minority-nonminority differences for the slopes and intercepts of the regression lines for the two criteria. While the slopes for the personnel classification test are not significantly different for the two racial groups, the intercepts are different, and a common regression line would underpredict minority criterion values. Kesselman and Lopez (1979) conclude that while the paper-and-pencil predictor showed adverse impact for the minority sample, a job knowledge test carefully constructed from a job analysis eliminated this problem. It should be emphasized again, however, that Kesselman and Lopez chose the personnel classification test prior to a job analysis. The findings of this study would be more impressive had some effort been made to use a standardized predictor that tapped abilities and aptitudes uncovered by the job analysis.

Grant and Bray (1970) found that slopes of the regression lines were equal for minority and nonminority telephone company repairmen in a job training situation, and that the difference in intercepts was actually slightly biased against nonminority candidates. Arnold et al. (1982) found that a strength test for selecting steelworkers would have, at most, a slight adverse impact against males.

The series of studies by Gael, Grant, and Ritchie (Gael and Grant, 1972; Gael et al., 1975a, 1975b) all specifically compared the validities of paper-and-pencil predictors and work sample criteria for minority and nonminority employees. All three studies found that nonminority employees scored significantly higher than minority employees on the predictor and the criterion measures, but that validity coefficients were comparable. Also, in each case

Page 48 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

the authors found that a common regression line did not underpredict minority employee proficiency.

In a study of 87 metal trades apprentices, Schmidt et al. (1977) found that all five subscores of a written job knowledge test (Machine Trades Achievement Test) showed large differences between minorities and non-minorities. However, two of three job sample subscores (tolerance and finish) that required completing a workpiece with oral instructions showed no significant subgroup differences.

Cascio and Phillips (1979) compared white, black, and Latin raters and ratees on 10 verbal and 11 motor work sample tests. By systematically training raters, clearly defining performance standards, and using content valid tests, the authors found average interrater reliabilities of .93 for promotional tests, .87 for entry level tests, .91 for motor tests, and .89 for verbal tests. No evidence for disparate impact was found for any of the rater and ratee race combinations, leading Cascio and Phillips (1979) to term performance tests as “a rose among thorns.”

Brugnoli et al. (1979) examined racial bias in a work sample test for maintenance mechanics. Fifty-six white, male maintenance mechanics evaluated a videotape of a black job applicant and a white job applicant performing a relevant task of laying out, drilling, and tapping, and an irrelevant task of indexing drill bits. The raters then used a highly specific behavioral recording form, a global evaluating form, or both. The only condition in which bias was found involved global evaluations of irrelevant behavior. No bias was found when a behavioral recording form or a global form was used for relevant behaviors, nor for global evaluations made following behavioral recordings. The authors conclude that work samples based on performance that is critical to success or failure on the job, especially when combined with behavioral recordings, will have little potential for racial bias.

One study that did find race and sex bias in a work sample was reported by Hamner et al. (1974). Undergraduate college students acting as managers rated all eight combinations of male/female and black/white job performers. This laboratory study's work sample task was stocking a grocery shelf with large cans. Performance was systematically varied: high performers stocked 48 cans in 3 minutes, while low performers stocked 24 cans in 3 minutes. Global performance was rated on a 15-point scale ranging from weak in overall performance to exceptionally good in overall performance. It was found that 30 percent of the variance in ratings was due to performance, but the higher ratings were given to performers of the same race, higher ratings were given to females, high performing females were rated higher than high performing males, high performing blacks were rated only slightly higher than low performing blacks, while high performing whites were rated much higher than low performing whites. Twenty-three percent of the variance in ratings was due to sex/race combinations. While this study did find some

Page 49 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

instances of race and sex bias, it was an extremely simple task using global evaluation. Both of these study characteristics could explain the results, especially in light of Brugnoli et al.'s (1979) study.

It appears that work sample tests offer an opportunity for reducing adverse impact while at the same time obtaining comparable or even better validities than more traditional predictors or criteria. A thorough job analysis, which normally precedes the development of any predictor or criterion measure, is particularly advantageous in the case of a work sample. By tying the work sample closely to the knowledge, skills, and abilities actually required in a job, any racial differences that do appear should be no greater than actual job performance differences. This approach of course supposes that unfair bias will not enter into the performance appraisal process through global evaluations of irrelevant job behavior (Brugnoli et al., 1979).

CONCLUSION

The research concerning work sample testing suggests that they can produce high predictive validities, and that when used as criteria they compare favorably with supervisor ratings and productivity measures. Work samples appear to be particularly relevant in training situations, as both a measure of training success and as a means of assessing the trainability of individuals prior to a full-length training program. Also of considerable importance is the fact that work sample tests seem to reduce adverse impact, particularly if the ratings concentrate on relevant job tasks. In the few studies that address reliability issues, work samples show good test-retest and interrater reliabilities.

Unfortunately, a large gap exists in the literature with regard to work samples as measures of incumbent employees' performance. This use of work samples as a criterion measure apart from other forms of performance appraisal may be beneficial, however. In jobs that require a high degree of specific technical skills, or in which a core of critical job behaviors can be identified, work samples would be an additional method of obtaining performance data. If properly constructed, they eliminate rating biases by requiring the evaluator to describe the employee's behavior on a standardized form rather than to evaluate the behavior observed. This may reduce or eliminate some of the more common rating errors, such as halo, leniency, or central tendency, since the rater's only judgment is whether a behavior has in fact occurred, and not to what degree or how appropriate that behavior is.

Because of their standardized nature, and the fact that rating occurs while the behavior is taking place, work samples are less prone to the errors arising from a time lag between observation and rating. As mentioned early in this paper, because of their close tie to actual work behaviors, work

Page 50 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

samples also allow an interaction of abilities and skills to occur, an interaction that is often artificially eliminated by rating forms with generalized dimensions of work behavior.

Work samples, however, are not directly substitutable for all forms of performance appraisal. While appropriate for testing certain skills, other, more traditional supervisory evaluations may provide data about interpersonal skills, initiative, etc. if they are indeed critical for job success. These types of skills may be evaluated in an assessment center setting. However, the direct link between assessment center behaviors and job behaviors might not be as clear as the link between motor work samples, for example, and job behaviors. Nonetheless, work sample evaluations can provide an additional source of criterion data that can be thought of as more objective and standardized than supervisory performance ratings. In fact, as Dunnette and Borman (1979) note, we should perhaps not expect high agreement between differing sources of performance evaluation information. While traditional organizational structure makes it the responsibility and even the right of the supervisor to evaluate his or her employees, this practice does not automatically define the supervisor's view as reality. Multiple sources of criterion data should more accurately define an employee 's performance, and work samples appear to be extremely useful in this regard.

REFERENCES

Arnold, J.D, J.M. Rauschenberger, W.G. Soubel, and R.M. Guion 1982 Validation and utility of a strength test for selecting steelworkers , Journal of Applied Psychology 67:588-604.

Asher, J.J. 1972 The biographical item: can it be improved? Personnel Psychology 25:251-269.

Asher, J.J., and J.A. Sciarrino 1974 Realistic work sample tests: a review. Personnel Psychology 27:519-533.

Brugnoli, G.A., J.E. Campion, and J.A. Basen 1979 Racial bias in the use of work samples for personnel selection. Journal of Applied Psychology 64:119-123.

Burnaska, R.F. 1976 The effects of behavior modeling training upon managers' behaviors and employees' perceptions. Personnel Psychology 29:329-335.

Campion, J.E. 1972 Work sampling for personnel selection. Journal of Applied Psychology 56:40-44.

Cascio, W.F., and N.F. Phillips 1979 Performance testing: a rose among thorns? Personnel Psychology 32:751-766.

Cohen, S.L., and L.A. Penner 1976 The rigors of predictive validation: some comments on “A job learning approach to performance prediction.” Personnel Psychology 29:595-600.

Dunnette, M.D., and W.C. Borman 1979 Personnel classification systems. Annual Review of Psychology 30:477-525.

Field, H.S., G.A. Bayley, and S.M. Bayley 1977 Employment test validation for minority and nonminority production workers. Personnel Psychology 30:37-46.

Page 51 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Fine, S.A., and W.W. Wiley 1971 An Introduction to Functional Job Analysis, Methods for Manpower Analysis. Monograph No. 4. Kalamazoo, Mich.: W.E. Upjohn Institute.

Fleishman, E.A. 1964 The Structure and Measurement of Physical Fitness. Englewood Cliffs, N.J.: Prentice-Hall.

Frank, H., and C. Wilcox 1978 Development and preliminary cross-validation of a two step procedure for firefighter selection. Psychological Reports 43:27-36.

Gael, S., and D L. Grant 1972 Employment test validation for minority and nonminority telephone company service representatives. Journal of Applied Psychology 56:135-139.

Gael, S., D.L. Grant, and R.J. Ritchie 1975a Employment test validation for minority and nonminority telephone operators. Journal of Applied Psychology 60:411-419.

1975b Employment test validation for minority and nonminority clerks with work sample criteria. Journal of Applied Psychology 60:420-426.

Gordon, M.E., and L.S. Kleinman 1976 The prediction of trainability using a work sample test and an aptitude test: a direct comparison. Personnel Psychology 29:243-253.

Grant, D.L., and D.W. Bray 1970 Validation of employment tests for telephone company installation and repair occupations. Journal of Applied Psychology 54:7-14.

Hamner, W.C., J.S. Kim, L. Baird, and W.J. Bigoness 1974 Race and sex as determinants of ratings by potential employers in a simulated work sampling task. Journal of Applied Psychology 59:705-711.

Howard, A. 1983 Work samples and simulations in competency evaluations. Professional Psychology: Research and Practice 14:780-796.

Hunter, J.E., and R.F. Hunter 1984 Validity and utility of alternative predictors of job performance Psychological Bulletin 96:72-98.

Hunter, J.E., F.L. Schmidt, and R. Hunter 1979 Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin 86:721-735.

Inskeep, G.C. 1971 The use of psychomotor tests to select sewing machine operators: some negative findings. Personnel Psychology 24:707-714.

Kesselman, G.A., and F.E. Lopez 1979 The impact of job analysis on employment test validation for minority and nonminority accounting personnel. Personnel Psychology 32:91-108.

Moses, J.L. 1973 Assessment center for the early identification of supervisory and technical potential. In W.C. Byham and D. Bobin, eds., Alternatives to Paper and Pencil Testing. Proceedings of a conference at the Graduate School of Business, University of Pittsburgh.

Moses, J.L., and R.J. Ritchie 1976 Supervisory relations training: a behavioral evaluation of a behavior modeling program. Personnel Psychology 29:337-343.

Mount, M.K., P.M. Muchinsky, and L.M. Hanser 1977 The predictive validity of a work sample: a laboratory study. Personnel Psychology 30:637-645.

Page 52 Cite

Suggested Citation:"Work Samples as Measures of Performance." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

Olson, H.C., S.A. Fine, D.D. Myers, and M.C. Jennings 1981 The use of functional job analysis in establishing performance standards for heavy equipment operators. Personnel Psychology 34:351-364.

Petty, M.M. 1974 A multivariate analysis of the effects of experience and training upon performance in a leaderless group discussion. Personnel Psychology 27:271-282.

Reilly, R.R., and W. Manese 1979 The validaton of a minicourse for telephone company switching technicians Personnel Psychology 32:83-90.

Reilly, R.R., S. Zedeck, and M.L. Tenopyr 1979 Validity and fairness of physical ability tests for predicting performance in craft jobs. Journal of Applied Psychology 64:262-274.

Ritchie, R., and V. Boehm 1980 Reducing costs by prescreening assessment center candidates. Assessment and Development 7(2):5.

Robertson, I., and S. Downs 1979 Learning and the prediction of performance: development of trainability testing in the United Kingdom. Journal of Applied Psychology 64:42-50.

Robertson, I.T., and R.S. Kandola 1982 Work sample tests: validity, adverse impact, and applicant reaction Journal of Occupational Psychology 55:171-183.

Robertson, I.T., and R.M. Mindel 1980 A study of trainability testing. Journal of Occupational Psychology 53:131-138.

Schmidt, F.L., A.L. Greenthal, J.E. Hunter, J.G. Berner, and F.W. Seaton 1977 Job sample vs. paper-and-pencil trades and technical tests: adverse impact and examinee attitudes. Personnel Psychology 30:187-197.

Schmitt, N., R.Z. Gooding, R.A. Noe, and M. Kirsch 1984 Metaanalyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology 37:407-422.

Siegel, A.I. 1983 The miniature job-training and evaluation approach: additional findings Personnel Psychology 36:41-56.

Siegel, A.I., and B.A. Bergman 1975 A job learning approach to performance prediction. Personnel Psychology 28:325-339.

Thornton, G.C., III, and W.C. Byham 1982 Assessment Centers and Managerial Performance. New York: Academic Press.

Wernimont, P.F., and J.P. Campbell 1968 Signs, samples, and criteria. Journal of Applied Psychology 52:372-376.