Suggested Citation:"6. Evaluating the Quality of Performance Measures: Reliability." National Research Council. 1991. Performance Assessment for the Workplace: Volume I. Washington, DC: The National Academies Press. doi: 10.17226/1862.

6 Evaluating the Quality of Performance Measures: Reliability

Previous chapters have discussed the development and administration of formal measures of job performance in the psychometric tradition. Much of the emphasis has been on building quality into the measures and into the measurement process. The discussion turns now to the results—the performance scores—and to a variety of analyses used to provide evidence that the measurements mean something. Traditionally this has meant examining the reliability and validity of the test or measure—which, broadly put, means in the first instance whether the test measures anything at all, and in the second, the extent to which it measures what it was intended to measure. This chapter looks at reliability, which is a property of the measurement process itself. Our discussion of the quality of job performance measures continues in Chapter 7 with an examination of content representativeness, and in Chapter 8 with analyses of the predictive validity and construct validity of the measures, which entail statistical analysis of their relations to other variables.

RELIABILITY

The first question to ask of any assessment instrument is whether the scores can be relied on, or whether they are so haphazardly variable that they cannot be said to signify anything. If a test taker is tested, and then tested again with an equivalent test, we would expect his or her score to be about the same. In principle, we could contemplate testing an individual repeatedly, with measures considered to be equivalent, so that we could then examine the resulting distribution of test scores for that person. Ideally the scores would be identical, but few measures are that good. Even measures of a person's height will typically vary by a few millimeters. With complex, performance-based measures, the scores might vary because of the choice of tasks for the tests, because of differences between test scorers, and because of other chance factors. Still, the scores would be useful if their standard deviation were small relative to the spread of scores in the population of incumbents, or relative to the potential score range of the test.

The Classical Formulation

The standard deviation of the potential scores of an individual on a test is called the standard error of measurement; it is an important psychometric property of the test. Expressed in the units of the score scale, it indicates how much the score would be expected to change if the measurement were repeated countless times. Assuming normally distributed measurement errors, roughly two-thirds of the scores for a person would be within one standard error of measurement from his or her mean score. Of course repeated testing is usually out of the question, but one repetition is often possible, especially in a research setting. The correlation between the scores on two equivalent testings for a representative group of test takers is called the test reliability. If one is willing to assume that errors are independent of score level—that is, all test scores are subject to about the same kind and amount of potential variation—then a formula can be developed relating the test reliability to its standard error of measurement.

The relation of test reliability to the standard error of measurement is based on the very simple model of classical test theory, in which an individual's test score, x, is made up of a systematic or consistent (“true”) part, t, that is invariant over equivalent tests, and an error part, e, that varies independently of t: x = t + e. Across individuals, the standard deviation of e is the standard error of measurement. It is assumed to be the same for all individuals. On an equivalent test (which might be the very same test on another occasion or a different version of the test), the theory says that the individual would have exactly the same true score, t, but a different and independent error, which we might here call e′ . It is assumed that the standard deviation of e′ is also the same for all individuals. It then follows that

σ_e = σ_x √(1 − r),

where σ_e is the standard error of measurement, σ_x is the standard deviation of test scores in the representative population of test takers, and r is the reliability, that is, the correlation of the two equivalent testings in the same population. The details of this classical model can be found in Lord and Novick (1968) or Gulliksen (1950), along with many variations, and need not concern us here.

For our purposes it is sufficient to say that the reliability coefficient provides a way of estimating the standard error of measurement. Reliability has the disadvantage of being population-dependent, whereas, in theory, the standard error of measurement is independent of the population being tested. Also, the standard error of measurement, being expressed in the units of the test score scale, is useful for interpreting test scores. It is, however, inconvenient as an index, precisely because it is expressed in score units. The test reliability is a much more useful index, being a unitless correlation coefficient varying between 0 and 1.0. Accepted standards in the field are vague and depend on the characteristic being measured: generally speaking, reliabilities of .6 to .7 are considered marginal, .7 to .8 acceptable, .8 to .9 very good, and above .9 excellent.
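The relation between reliability and the standard error of measurement can be sketched in a few lines of Python. This is only an illustration of the classical formula under its stated assumptions; the function names and the score scale with a standard deviation of 10 points are hypothetical and do not come from the JPM data.

```python
import math


def standard_error_of_measurement(sd_x: float, reliability: float) -> float:
    """Classical test theory relation: SEM = sd_x * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_x * math.sqrt(1.0 - reliability)


def reliability_from_sem(sd_x: float, sem: float) -> float:
    """Inverse relation: r = 1 - (SEM / sd_x) ** 2."""
    return 1.0 - (sem / sd_x) ** 2


# Hypothetical score scale with a population standard deviation of 10 points.
for r in (0.6, 0.7, 0.8, 0.9):
    sem = standard_error_of_measurement(10.0, r)
    print(f"reliability {r:.1f} -> standard error of measurement {sem:.2f}")
```

On such a scale, even a reliability of .9 leaves a standard error of roughly 3 points, which is why the two indices are best read together.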

Approaches to Reliability Analysis

Test reliability and the standard error of measurement are both global concepts. In practice, the reliability of a test or measure can be examined in a number of ways, each of which involves repeated observations on the same group of people to determine how much their scores fluctuate. For example, the correlation of scores obtained by giving a group the same test on two separate occasions provides an index of test-retest reliability, sometimes called test stability. This index indicates how much error is due to occasion-to-occasion fluctuations. Consistency in a person's scores on the two test occasions represents the true part of the performance score. Fluctuations in the person's scores from one occasion to the next are attributed to error. If the pair of scores is very close, error is small. If this holds for all people in the group, then the differences in scores from one person to another are considered to be due to consistent performance differences between them, that is, due to true-score differences, not error. If each person's pair of scores fluctuates widely, however, the differences among all people's scores are attributed primarily to measurement error.
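The reasoning behind the test-retest index can be illustrated with a small simulation of the classical model x = t + e. It is a sketch under assumed values: the population mean of 50, the two standard deviations, the sample size, and the function names are all hypothetical. Each simulated person keeps the same true score on both occasions and receives an independent error each time, so the correlation between occasions should approach the ratio of true-score variance to total variance.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_test_retest(n_people, sd_true, sd_error, seed=0):
    """Correlate two simulated testings that share true scores but not errors."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, sd_true) for _ in range(n_people)]
    first = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    second = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    return pearson(first, second)


# Theoretical reliability: sd_true**2 / (sd_true**2 + sd_error**2) = 4 / 5.
estimate = simulate_test_retest(20000, sd_true=2.0, sd_error=1.0)
print(f"simulated test-retest correlation: {estimate:.3f} (theory: 0.800)")
```

Raising `sd_error` relative to `sd_true` in this sketch lowers the correlation, which mirrors the case in which each person's pair of scores fluctuates widely.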

Similarly, if two raters view and score the performance of each examinee on a job performance test, then the consistency of the raters can be evaluated by intercorrelating their scorings, providing an index called interrater reliability. A third approach to reliability analysis is to compute the correlation of scores on two equivalent or “parallel” versions of a test given to a group of examinees, either at the same time or within a short time span. Differences between scores on two parallel forms of a test or measure provide an index of how much error is introduced by the particular set of items used in a given form.

If resources do not permit repeated test administrations, multiple scorers, or the administration of parallel forms, so that the data are limited to just one occasion, one rater, and one form, it is still possible to get a reliability estimate by looking at the consistency of responses to items or tasks internal to the test. In essence, the average interitem score correlations can be manipulated to provide an estimate of parallel-form reliability in what has come to be called an index of internal consistency reliability. This procedure depends on all the items being interchangeable; i.e., it assumes that all items are measuring the same thing, and that any set of items is equivalent to any other set. Item heterogeneity is interpreted as error. Of course, when performance is legitimately multidimensional, not all item inconsistency is error, and the internal consistency reliability index can be a misleading tool.

Plainly, each method of assessing reliability addresses a different aspect of measurement error. Parallel-form reliability assesses error due to selection of items. The test-retest method assesses differences introduced by the occasion of testing. Interrater reliability investigations consider errors deriving from the observers who score the performance, and internal consistency focuses on item-to-item variation as a possible source of error.

Reliability Analysis in the JPM Project

Each of these approaches was used by one or more Services in the JPM Project to learn about the reliability of hands-on job sample tests. Given the central role played by observers in this assessment format, the question of consistency in scoring among raters was of particular interest. And the problem of small numbers of test items (tasks) that characterizes performance assessments made some sort of parallel-forms reliability analysis especially salient.

In the Navy study, for example, the reliability of hands-on performance scores for machinist's mates was evaluated separately by having two examiners observe the performance of a subset of 26 machinist's mates as they carried out 11 tasks in the engine room of a ship similar to that on which they served. The examiners, retired Navy machinist's mates, observed each incumbent perform all 11 tasks. The examiners had been given extensive training (see Bearden, 1986; Kroeker et al., 1988) and were accompanied by other trained personnel at all times. The hands-on tasks included reading gauges, operating equipment, and carrying out casualty control. Each task had multiple steps, ranging in number from 12 to 49, that were scored in a go/no go format by each of the two observers. A task score was the proportion of steps correctly completed, and the total performance score was the average across tasks.

The consistency of the observers' scoring was calculated by correlating the machinist's mates' scores given by examiner 1 with their scores from examiner 2, after averaging over all 11 tasks. Table 6-1 presents a schematic of the data collected. Based on the full data set for the two examiners, 26 machinist's mates, and 11 tasks, the Navy reported a median interrater reliability coefficient of .99, with a range of .97 to .99. This extraordinarily high reliability reflects the fact that there was near-perfect agreement in the total scores assigned by the observers. (The agreement among raters using performance appraisal rating forms, a far more common assessment technique, typically has not been nearly so close.)

The Navy study also focused on tasks, although the development costs and the large amount of time required by hands-on performance testing dictated an internal consistency analysis rather than the more satisfying parallel-forms approach. The internal consistency reliability was calculated by averaging over the two examiners' scores to produce an average proportion correct for each machinist's mate on each of the 11 tasks. The machinist's mates' scores on the 11 tasks were then intercorrelated for all task pairs: r_1,2, r_1,3, ..., r_10,11. The average correlation for single pairs of tasks, which estimates the reliability of a one-task measure, was .19, and the internal consistency reliability of the hands-on performance measure based on all 11 tasks was .72. This relatively modest value means that some machinist's mates performed well on certain tasks and some performed well on other tasks. It could be taken to mean that the measurement instrument is of rather low quality, with scores fluctuating haphazardly, or it could indicate that the tasks in the hands-on performance measure are heterogeneous, that is, that they measure different facets of performance on which the job incumbents would exhibit stable differences in skill or ability over time.
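The step from the single-task value of .19 to the 11-task value of .72 is consistent with the Spearman-Brown relation of classical test theory. The sketch below assumes that .19 can be treated as the average correlation between tasks and that the tasks are interchangeable; the function name is hypothetical, and this is not necessarily the computation the Navy analysts performed.

```python
def stepped_up_reliability(mean_r: float, n_tasks: int) -> float:
    """Spearman-Brown reliability of an average over n_tasks interchangeable tasks.

    mean_r is the average correlation between pairs of tasks, which is
    taken here as the reliability of a single task.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return n_tasks * mean_r / (1.0 + (n_tasks - 1) * mean_r)


# Values reported for the machinist's mate hands-on test.
print(round(stepped_up_reliability(0.19, 11), 2))  # 0.72
```

With a single task the function returns the average correlation itself, which is why short performance tests built from heterogeneous tasks tend to show modest internal consistency.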

TABLE 6-1 Schematic of the Actual Data Collected on Machinist's Mates' Hands-On Performance

Machinist's Mate	Task 1, E1	Task 1, E2	Task 2, E1	Task 2, E2	...	Task 11, E1	Task 11, E2
01	.93	.93	.93	.93	...	.78	.78
02	.57	.64	.67	.80	...	.78	.67
...	...	...	...	...	...	...	...
26	.79	.79	.60	.60	...	.67	.67
E1 = examiner 1; E2 = examiner 2.

In the JPM Project, internal consistency was used as the main index of reliability. Table 6-2 summarizes the internal consistency reliabilities obtained by the Services for their various hands-on job performance tests.

TABLE 6-2 Internal Consistency Reliability of Hands-On Performance Measures

Service	Number of Specialties	Median Reliability	Range of Reliabilities
Marine Corps	4	.87	.82 to .88
Army	9	.85	.75 to .94
Navy	2	.81	.77 to .85
Air Force	8	.75	.65 to .81

By contemporary standards, the reliabilities of most of these tests are very good, but not outstanding. Widely used paper-and-pencil tests like the ASVAB have similar reliabilities for their subtest scores. According to classical test theory, a test can be made more reliable by increasing the number of items, since true-score variance grows faster than error variance as items are added. This theoretical result is easily verified in practice; for example, the composites of ASVAB test scores used for selection and placement, like the AFQT, are based on many more items and have reliabilities over .90. Here, the job performance tests would have had to be lengthened, in some cases to include up to twice as many tasks, to reach reliabilities of .90. Since the tests were already very long, with one task sometimes taking 30 minutes, adding more tasks was generally seen as impractical.
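How much longer a test must be to reach a target reliability can be estimated by inverting the Spearman-Brown relation. The sketch below assumes that added tasks are interchangeable with the existing ones; the function name is hypothetical, and the example simply uses the Navy median reliability of .81 from Table 6-2.

```python
def lengthening_factor(current_r: float, target_r: float) -> float:
    """Factor by which test length must grow to move from current_r to target_r.

    Inverts the Spearman-Brown relation, so it inherits the assumption that
    new tasks behave like the existing ones.
    """
    if not (0.0 < current_r < 1.0 and 0.0 < target_r < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (target_r * (1.0 - current_r)) / (current_r * (1.0 - target_r))


# Starting from the Navy median of .81, roughly twice as many tasks are needed.
factor = lengthening_factor(0.81, 0.90)
print(f"lengthening factor to reach .90: {factor:.2f}")
```

The factor depends strongly on the starting point: tests that begin with lower reliabilities call for proportionally more added tasks.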

UNDERSTANDING MULTIPLE SOURCES OF ERROR¹

Traditional approaches to reliability analysis present researchers with an obvious dilemma: Which reliability coefficient should be used to characterize the performance measurement? The Navy analyses cited above, showing interrater reliabilities of .99 and an internal consistency coefficient of .72, tell quite different stories about hands-on job performance measures. Which one is the more compelling? The former gives a more optimistic accounting than the latter, but magnitude is no reason for preferring it. Classical test theory provides no help since it lumps all error together. Total test variance is simply taken to be the sum of true and error variance, and reliability is simply the ratio of true-score variance to total test variance. What is needed is a method for taking the variability of both raters and tasks into account simultaneously. Cronbach et al. (1972) extended classical test theory to the consideration of multiple sources of error by the simple and elegant use of the analysis of variance paradigm, following earlier work by Lindquist (1953). This procedure, although straightforward, is neither widely known nor much used in the testing community. Shavelson (Vol. II) provided the committee and the JPM Project with a detailed account of this method, which Cronbach et al. called generalizability theory, or simply G theory.

¹	This section draws on the work of Noreen Webb and Weichang Li at the University of California, Los Angeles. We are indebted to them for providing results of a number of analyses prior to their publication.

G-Theory Analysis

The term generalizability stems from the premise that the purpose of assessing reliability is to determine to what other circumstances the scores can be generalized. If one wants to generalize beyond the particular tasks on the test, then variability of individuals' scores across tasks (items) should be considered a component of error. If one wants to generalize beyond the particular examiners used, then variability of scores across examiners should be considered to be error. If time of day were thought to contribute to error, then each test taker would have to be tested at two or more different times of the day, to generate a temporal error component. Instead of asking how accurately observed scores reflect their corresponding true scores, G theory asks how accurately observed scores permit generalization about people's behavior in a defined universe of generalization (in our example, generalization of an individual's score across tasks, examiners, and time). More specifically, it examines the generalization of a person's observed score to a “universe score,” the average score for that individual in the universe of generalization.

Just as classical test theory decomposes an observed score into true-score and error components (x = t + e), G theory decomposes an observed score into a number of components. For the machinist's mates' hands-on performance data, G theory would (roughly) define an individual's observed score as follows:

Observed score = universe-score effect + examiner effect + task effect + [universe-score effect by examiner effect] + [universe-score effect by task effect] + [examiner effect by task effect] + [universe-score effect by examiner effect by task effect].

The multiplicative terms, or interactions, reflect the unique contributions of a combination of components.
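In symbols, the decomposition can be written as a linear model for the score of person p as rated by examiner e on task t. The notation below is an assumption introduced here for illustration (it follows common G-theory usage rather than anything printed in this chapter), and it makes the grand mean μ explicit, so that a person's universe score is μ + ν_p. With one observation per cell, the three-way term cannot be separated from residual error.

```latex
\begin{align*}
X_{pet} &= \mu + \nu_p + \nu_e + \nu_t + \nu_{pe} + \nu_{pt} + \nu_{et} + \nu_{pet}, \\
\sigma^2(X_{pet}) &= \sigma^2_p + \sigma^2_e + \sigma^2_t
  + \sigma^2_{pe} + \sigma^2_{pt} + \sigma^2_{et} + \sigma^2_{pet}.
\end{align*}
```

The second line holds when the effects are uncorrelated random effects, which is the assumption under which each variance component can be estimated separately.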

Each of the effects underlying observed scores has a distribution. One can calculate a variance associated with each of the components that defines the expected observed-score variance (for technical details, see Shavelson, Vol. II). Generalizability theory permits a reliability index (called a generalizability coefficient) to be constructed as the ratio of the sum of universe-score variance components to the expected observed-score variance. The statistical mechanism used to estimate each of the variance components underlying a person's observed score is a variance component model of the expected mean squares in a standard analysis of variance (ANOVA). Usually the random-levels model for variance components is used.
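The estimation step can be sketched for a fully crossed persons × examiners × tasks design with one observation per cell, using the standard expected-mean-square equations of the random-effects model. The dictionary keys and function names are hypothetical, and truncating negative estimates at zero is one common convention rather than a documented feature of the JPM analyses. Because the chapter does not report mean squares, the round trip at the end builds them from the variance components in Table 6-3 and then recovers those components.

```python
def expected_mean_squares(vc, n_p, n_e, n_t):
    """Expected mean squares for a crossed p x e x t random-effects design."""
    return {
        "p": vc["pet"] + n_t * vc["pe"] + n_e * vc["pt"] + n_e * n_t * vc["p"],
        "e": vc["pet"] + n_t * vc["pe"] + n_p * vc["et"] + n_p * n_t * vc["e"],
        "t": vc["pet"] + n_e * vc["pt"] + n_p * vc["et"] + n_p * n_e * vc["t"],
        "pe": vc["pet"] + n_t * vc["pe"],
        "pt": vc["pet"] + n_e * vc["pt"],
        "et": vc["pet"] + n_p * vc["et"],
        "pet": vc["pet"],
    }


def variance_components(ms, n_p, n_e, n_t):
    """Solve the expected-mean-square equations for the variance components."""
    estimates = {
        "pet": ms["pet"],
        "pe": (ms["pe"] - ms["pet"]) / n_t,
        "pt": (ms["pt"] - ms["pet"]) / n_e,
        "et": (ms["et"] - ms["pet"]) / n_p,
        "p": (ms["p"] - ms["pe"] - ms["pt"] + ms["pet"]) / (n_e * n_t),
        "e": (ms["e"] - ms["pe"] - ms["et"] + ms["pet"]) / (n_p * n_t),
        "t": (ms["t"] - ms["pt"] - ms["et"] + ms["pet"]) / (n_p * n_e),
    }
    # Negative estimates can arise from sampling error; set them to zero.
    return {name: max(value, 0.0) for name, value in estimates.items()}


# Round trip with the Table 6-3 components (x 1,000): 26 mates, 2 examiners, 11 tasks.
table_6_3 = {"p": 6.26, "e": 0.00, "t": 9.70, "pe": 0.00,
             "pt": 25.84, "et": 0.03, "pet": 1.46}
mean_squares = expected_mean_squares(table_6_3, n_p=26, n_e=2, n_t=11)
recovered = variance_components(mean_squares, n_p=26, n_e=2, n_t=11)
print({name: round(value, 2) for name, value in recovered.items()})
```

In a real analysis the mean squares would come from an ANOVA of the observed task scores; the round trip only checks that the two sets of equations are mutually consistent.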

It is necessary here to distinguish between random error and systematic measurement error. Very little measurement error can be simply interpreted as random error in the strict sense of mathematical statistics. If some examinees are relatively better at some tasks whereas other examinees perform better on others, a person-by-task interaction will be found; this is not random error, but would be considered measurement error in most contexts. A person's score depends on the particular task sampled on a test, something systematic but not predictable since tasks are randomly sampled. If one wishes to generalize an individual's score on a performance test to other tasks in the job, which one will almost always want to do, then a person-by-task interaction contributes to measurement error. A large variance component for this term reduces the generalizability of the scores to other tasks. Likewise, generalizing to other raters is problematic if the analysis reveals a main effect of rater, indicating that some raters are more lenient than others. A substantial variance component for the rater-by-task interaction would indicate that raters vary in their lenience in scoring different tasks, which would certainly be considered measurement error in most score interpretations.

None of these effects is simply random error, but all of the effects can be considered to be measurement error. If one's score depends on which tasks were done, or which raters evaluated which tasks, then generalizing to other tasks and other raters can be done only with considerable uncertainty. The score represents effects due not only to job performance but also to other unwanted effects.

Generalizability theory allows for different interpretations of what constitutes measurement error. For example, a main effect of rater is of no consequence if all raters rate every test taker, provided that the score is interpreted only in a relative (“norm-referenced”) sense (comparing one test taker to another). If the score is to be interpreted in an absolute, or competency, sense, then rater effects do contribute to measurement error. That is, had there been different raters, the level of competency might have appeared to be different. Thus it is common in a G-theory analysis to identify all sources of variance and then to decide which variance components are to be considered part of the measurement error.

In constructing the G coefficient, it is necessary to recognize that the variance components are initially determined at the level of the item or, in the JPM case, at the level of the task. When, as is usual, the total score is the average of several pieces of data, the variance components in the reliability index must be weighted inversely by the number of such items in the average score. The amount of error in one task or item is typically large. If one neglects to account for the fact that the test score is an average of several item scores, then the reliability will seem very small, as indeed the reliability of a one-item test would be.

JPM Applications of G-Theory Analysis

Because of the character of hands-on job sample tests, the committee was especially interested in raters and tasks as probable sources of measurement error. At the committee's urging, two Services conducted G-theory analyses of such effects with some surprising results. The Navy used ANOVA techniques to look again at the machinist's mate data described above and the Marine Corps did some fairly elaborate G-theory analyses of the measurement of infantry performance (Webb et al., 1989).

The list of terms in the analysis of variance of task scores for Navy machinist's mates is shown in Table 6-3. Note that there were 26 mates (M), 2 examiners (E), and 11 tasks (T) in a completely crossed analysis of variance. The levels of all three variables were considered to be a few out of many possible, leading to the random-effects model for variance components. The table shows the variance components calculated from the mean squares in the analysis (multiplied by 1,000 for convenience). From these component estimates, the theory permits calculation of the average reliability of a single task score, as scored by a single examiner, as the ratio of the M component to the sum of the M + ME + MT + MET components. For an absolute score, the effects of the particular task means, T, and examiners, E, and the ET interaction would have to be considered measurement error as well and added to the denominator. For the test as constituted, each error term is divided by the number of observations averaged over, since the test score is, in effect, an average of task and examiner scores: terms involving tasks are divided by the number of tasks, terms involving examiners by the number of examiners to be used in the actual test, and terms involving both by the product of the two. In this instance, the number of examiners makes almost no difference, because the raters introduced virtually no measurement error. It is possible to extrapolate to the potential use of more tasks, assuming that new tasks would be sampled from the same universe. The reliability for relative (norm-referenced) interpretation of scores would be about .80 using 18 tasks rather than 11.
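The coefficients in Table 6-3 can be reproduced from its variance components with a short calculation. The sketch below follows the error terms listed in the note to that table; the function name and dictionary keys are hypothetical labels introduced here.

```python
def g_coefficient(vc, n_examiners, n_tasks, absolute=False):
    """Generalizability coefficient for a score averaged over examiners and tasks.

    vc holds variance components keyed by source: M, E, T, ME, MT, ET, MET.
    Relative error uses only the interactions with persons (M); absolute
    error adds the examiner, task, and examiner-by-task components.
    """
    both = n_examiners * n_tasks
    error = vc["ME"] / n_examiners + vc["MT"] / n_tasks + vc["MET"] / both
    if absolute:
        error += vc["E"] / n_examiners + vc["T"] / n_tasks + vc["ET"] / both
    return vc["M"] / (vc["M"] + error)


# Estimated variance components (x 1,000) from Table 6-3.
TABLE_6_3 = {"M": 6.26, "E": 0.00, "T": 9.70, "ME": 0.00,
             "MT": 25.84, "ET": 0.03, "MET": 1.46}

print(round(g_coefficient(TABLE_6_3, 1, 1), 2))                   # 0.19
print(round(g_coefficient(TABLE_6_3, 1, 1, absolute=True), 2))    # 0.14
print(round(g_coefficient(TABLE_6_3, 2, 11), 2))                  # 0.72
print(round(g_coefficient(TABLE_6_3, 2, 11, absolute=True), 2))   # 0.65
```

Changing `n_tasks` shows how the coefficient would respond to a longer test, on the assumption that added tasks come from the same universe as the original 11.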

TABLE 6-3 Estimated Variance Components and Generalizability Coefficients for the Hands-On Job Performance Test of Machinist's Mates by Examiner by Task

Source of Variation	Degrees of Freedom	Estimated Variance Component (× 1,000)
Machinist's mates (M)	25	6.26
Examiners (E)	1	0.00
Tasks (T)	10	9.70
M × E	25	0.00
M × T	250	25.84
E × T	10	0.03
M × E × T (error)	250	1.46

Generalizability Coefficients	One Examiner, One Task	Two Examiners, Eleven Tasks
Relative	0.19	0.72
Absolute	0.14	0.65

NOTE: Generalizability is the ratio of true to true plus error variance. The true variance component is M in all cases. The error variance components are:
Relative, 1E, 1T: M × E + M × T + M × E × T
Absolute, 1E, 1T: M × E + M × T + M × E × T + E + T + E × T
Relative, 2E, 11T: (M × E)/2 + (M × T)/11 + (M × E × T)/22
Absolute, 2E, 11T: (M × E + E)/2 + (M × T + T)/11 + (M × E × T + E × T)/22
SOURCE: Webb et al. (1989).

The Marine Corps study involved 150 infantrymen at two sites, Camp Pendleton and Camp Lejeune. The Marine Corps hands-on test had 35 scorable units. For this study, performance was rated by two raters, who were retired Marines. The study was quite complicated, with many special design features (see Shavelson et al., 1990). For our purposes, a simplified version of the results is shown in Table 6-4. In this version, the Marine Corps study was parallel to the Navy study, and the similarity of the Marine Corps and Navy results is startling. Reliability for a 35-item test, for relative scores, was about .8 at both sites: Table 6-4 reports coefficients of .81 for Camp Lejeune and .78 for Camp Pendleton, and its variance components imply roughly .83 and .80, respectively, when scores are averaged over the two raters. There was marked heterogeneity in the tasks and virtually no disagreement between the raters.

TABLE 6-4 Estimated Variance Components in Generalizability Study of Performance Test Scores for Marine Infantrymen, Replicated at Camps Lejeune and Pendleton

Source of Variation	Variance Component, Lejeune	Variance Component, Pendleton
Marines (M)	11.69	9.13
Examiners (E)	0.00	0.00
Tasks (T)	33.05	35.56
M × E	0.35	0.28
M × T	72.91	67.38
E × T	0.07	0.02
M × E × T	11.69	12.70

Generalizability Coefficients for 35 Tasks	Lejeune	Pendleton
Relative	.81	.78
Absolute	.76	.72

In both studies the size of the reliability is satisfactory, but the main result of interest is the relative size of the variance components. Note that the variance components for raters or interactions involving raters are so small as to be essentially zero. The Marine Corps study was replicated at two sites, Camp Pendleton and Camp Lejeune, so there can be no denying that the result is real. Some observers worried that the raters were not independent but were adjusting their ratings to be consistent. Others felt that monitoring by project staff and by occasional observers avoided collusion between the raters. Probably some inadvertent or intentional cooperation occurred, especially in the tight quarters on shipboard, but it was not widespread. What appears to be happening in these studies is that careful development of scorable items, daily monitoring of results by the research team, and careful and detailed scoring criteria left almost nothing to chance. In each case, tasks consisted of several steps, each of which was scored in the go/no go format, and the steps occurred in a predefined order. With such a format, a step is either done successfully or not, and there appears to be little room for disagreement. In achievement tests such as the College Board English Essay Test, when graders have to give a qualitative grade for each essay, more disagreement is generally found. With care, discussion, and occasional negotiation, such disagreement can be managed. But apparently the JPM design of hands-on task tests is such that discrepancies among raters can be eliminated.

The Services evolved different strategies for obtaining qualified raters. A careful comparison of the Service results would be instructive. A comparison of the reliabilities of the hands-on tests with the various surrogates would likewise be instructive.

Although more elaborate studies of error sources were done, with complex results, the main findings continued to emerge. Raters appeared to introduce very little measurement error. Tasks, however, regularly turned up as large contributors to measurement error, indicating the need for more tasks to get a clearer picture of the stable performance differences among the Service personnel in the study.

It should be noted that some committee members were surprised that the hands-on tests turned out to be as reliable as they are. Their expectation was that 10 to 15 tasks would be woefully inadequate. That would probably have been true had each task been scored dichotomously. But each task score is the proportion of steps completed satisfactorily, so the score for a task has more information than a dichotomously scored multiple-choice test item.

It should also be noted that some of the tests were assessed for other sources of error. For example, the Marine Corps found a test-retest reliability of about .90 and a parallel-form reliability of .78 for their infantryman performance test that did not include live fire; with live fire, the parallel-form reliability was .70. Note that these results are entirely consistent with the earlier conclusion that item heterogeneity, which contributes to error in the parallel-form assessment but not in the test-retest assessment, is the main contributor to measurement error.

The Navy and Marine Corps studies of multiple sources of measurement error briefly reported here show the promise of the G-theory approach for getting a better appreciation of the properties of complex, performance-based measures. As this measurement method seems to be growing more prevalent in educational and employee assessment, the traditional approaches to reliability and validity will not suffice. These early G-theory analyses indicate that, with care and good design, quite reliable hands-on performance measurements can be made.