6. Evaluating the Quality of Performance Measures: Reliability
Pages 116-127

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 116...
... This chapter looks at reliability, which is a property of the measurement process itself. Our discussion of the quality of job performance measures continues in Chapter 7 with an examination of content representativeness, and in Chapter 8 with analyses of the predictive validity and construct validity of the measures, which entail statistical analysis of their relations to other variables.
From page 117...
... The relation of test reliability to the standard error of measurement is based on the very simple model of classical test theory, in which an individual's test score, x, is made up of a systematic or consistent ("true") part, t, that is invariant over equivalent tests, and an error part, e, that varies independently of t: x = t + e.
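As an illustration of this decomposition (not drawn from the report), the following Python sketch simulates scores under x = t + e and recovers the reliability coefficient, the ratio of true-score variance to observed-score variance, along with the standard error of measurement; all distributions and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Classical test theory: each observed score x is a consistent ("true")
# part t plus an error part e that varies independently of t: x = t + e.
t = rng.normal(loc=50, scale=10, size=n)   # true (systematic) part
e = rng.normal(loc=0, scale=5, size=n)     # error part, independent of t
x = t + e                                  # observed scores

# Reliability is the proportion of observed-score variance that is
# true-score variance: var(t) / var(x).
reliability = t.var() / x.var()

# The standard error of measurement follows from the same model:
# SEM = sd(x) * sqrt(1 - reliability), which here recovers sd(e) of about 5.
sem = x.std() * np.sqrt(1 - reliability)

print(f"reliability ~ {reliability:.2f}, SEM ~ {sem:.2f}")
```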
From page 118...
... If each person's pair of scores fluctuates widely, however, the differences among all people's scores are attributed primarily to measurement error. Similarly, if two raters view and score the performance of each examinee on a job performance test, then the consistency of the raters can be evaluated by intercorrelating their scores, providing an index called interrater reliability.
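A minimal sketch of how such an index might be computed: with two raters, the consistency of their scores for the same examinees can be summarized by a Pearson correlation. The rater arrays below are hypothetical, not data from the Navy study.

```python
import numpy as np

# Hypothetical scores assigned by two raters to the same ten examinees.
rater_1 = np.array([78, 85, 62, 90, 71, 88, 95, 67, 80, 74])
rater_2 = np.array([75, 88, 65, 92, 70, 85, 97, 70, 78, 76])

# In this simple two-rater case, interrater reliability can be indexed
# by the correlation between the two sets of scores.
interrater_r = np.corrcoef(rater_1, rater_2)[0, 1]
print(f"interrater reliability ~ {interrater_r:.2f}")
```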
From page 119...
... that characterizes performance assessments made some sort of parallel-forms reliability analysis especially salient. In the Navy study, for example, the reliability of hands-on performance scores for machinist's mates was evaluated in a separate study by having two examiners observe the performance of a subset of 26 machinist's mates as they carried out 11 tasks in the engine room of a ship similar to that on which they serve.
From page 120...
... (The agreement among raters using performance appraisal rating forms, a far more common assessment technique, typically has not been nearly so close.) The Navy study also focused on tasks, although the development costs and the large amount of time required by hands-on performance testing dictated an internal consistency analysis rather than the more satisfying parallel forms approach.
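For illustration only, an internal consistency analysis of this kind can be approximated with coefficient alpha computed over an examinees-by-tasks score matrix; the small matrix below is hypothetical, not the Navy data, and alpha is just one of several internal consistency indices such a study might report.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items (here, tasks) score matrix."""
    k = scores.shape[1]                          # number of tasks
    item_vars = scores.var(axis=0, ddof=1)       # variance of each task score
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical scores for 6 examinees on 4 tasks.
scores = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
])
print(f"internal consistency (alpha) ~ {cronbach_alpha(scores):.2f}")
```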
From page 121...
... UNDERSTANDING MULTIPLE SOURCES OF ERROR
Traditional approaches to reliability analysis present researchers with an obvious dilemma: Which reliability coefficient should be used to characterize the performance measurement? The Navy analyses cited above, showing interrater reliabilities of .98 and an internal consistency coefficient of .72, tell quite different stories about hands-on job performance measures.
From page 122...
... Instead of asking how accurately observed scores reflect their corresponding true scores, G theory asks how accurately observed scores permit generalization about people's behavior in a defined universe of generalization: in our example, generalization of an individual's score across tasks, examiners, and time. More specifically, it examines the generalization of a person's observed score to a "universe score," the average score for that individual in the universe of .
From page 123...
... The statistical mechanism used to estimate each of the variance components underlying a person's observed score is a variance component model of the expected mean squares in a standard analysis of variance (ANOVA)
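As a sketch of that mechanism, consider a simplified, fully crossed persons-by-tasks design with one observation per cell: equating the observed mean squares to their expectations yields the variance component estimates. The design sizes and mean squares below are hypothetical, and the actual analyses also involved examiners as a facet.

```python
# Estimating variance components from ANOVA mean squares for a fully
# crossed persons-by-tasks design with one observation per cell.
# (A simplified one-facet sketch with hypothetical values.)

n_persons, n_tasks = 150, 11      # hypothetical design sizes
ms_persons = 0.90                 # hypothetical mean square for persons
ms_tasks = 2.40                   # hypothetical mean square for tasks
ms_residual = 0.30                # person-by-task interaction plus error

# Expected mean squares for this design:
#   E[MS_residual] = var_pt
#   E[MS_persons]  = var_pt + n_tasks   * var_p
#   E[MS_tasks]    = var_pt + n_persons * var_t
var_pt = ms_residual
var_p = (ms_persons - ms_residual) / n_tasks
var_t = (ms_tasks - ms_residual) / n_persons

print(f"var(persons) ~ {var_p:.4f}, var(tasks) ~ {var_t:.4f}, "
      f"var(p x t, error) ~ {var_pt:.4f}")
```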
From page 124...
... The table shows the variance components calculated from the mean squares in the analysis (multiplied by 1,000 for convenience). From these component estimates, the theory permits calculation of the average reliability of a task score as the ratio of the M component to the sum of the M + ME + MT + MET components.
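That ratio can be computed directly once the components are estimated. The sketch below uses placeholder values, not the figures reported in Table 6-4.

```python
# Average reliability of a single task score, as described in the text:
# the Marines (M) component divided by the sum of M and the interaction
# components ME, MT, and MET.  All values are placeholders.
var_m = 12.0     # Marines (persons)
var_me = 0.5     # Marines x examiners
var_mt = 30.0    # Marines x tasks
var_met = 8.0    # Marines x examiners x tasks (with residual error)

reliability_task = var_m / (var_m + var_me + var_mt + var_met)
print(f"average single-task reliability ~ {reliability_task:.2f}")
```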
From page 125...
... TABLE 6-4 Estimated Variance Components in Generalizability Study of Performance Test Scores for Marine Infantrymen, Replicated at Camps Lejeune and Pendleton. [Table listing estimated variance components by source of variation (e.g., Marines (M)) for the Lejeune and Pendleton samples; the individual values are garbled in the scanned excerpt.]
From page 126...
... Note that the variance components for raters or interactions involving raters are so small as to be essentially zero. The Marine Corps study was replicated at two sites, Camp Pendleton and Camp Lejeune.
From page 127...
... For example, the Marine Corps found a test-retest reliability of about .90 and a parallel-form reliability of .78 for its infantryman performance test when it did not include live fire; with live fire, the parallel-form reliability was .70. Note that these results are entirely consistent with the earlier conclusion that item heterogeneity, which contributes to error in the parallel-form assessment but not in the test-retest assessment, is the main contributor to measurement error.

