
Performance Assessment for the Workplace: VOLUME II
Generalizability Theory and Military Performance Measurements: I. Individual Performance
Richard J. Shavelson
INTRODUCTION
This paper sketches a statistical theory of the multifaceted sources of error in a behavioral measurement. The theory, generalizability (G) theory (Cronbach et al., 1972), models traditional measurements such as aptitude and achievement tests. It provides estimates of the stability of a measurement (“test-retest” reliability in classical test theory), the consistency of responses to parallel forms of a test (“equivalent-forms” reliability), and the consistency of responses to test items (“internal-consistency” reliability). Each type of classical reliability coefficient defines measurement error somewhat differently. One of G theory's major achievements is that it simultaneously estimates the magnitude of the errors influencing all three classical reliabilities. Hence, we speak of G theory as a theory of the multifaceted sources of error.
Performance measurements may contain the same sources of error as traditional pencil-and-paper measurements: instability of responses from one occasion to the next, nonequivalence of supposedly parallel forms of a performance measurement, and heterogeneous subtask responses. And more. Two additional, pernicious sources of error are inaccuracies due to scoring, where observers typically score performance in real time, and inaccuracies
The author gratefully acknowledges helpful and provocative comments provided by Lee Cronbach and the graduate students in his seminar on generalizability theory. The author alone is responsible for the contents of this paper.

due to unstandardized testing conditions, where performance testing is typically carried out under widely varying laboratory and field conditions.1 G theory's ability to estimate the magnitude of each of these sources of error, individually and in combinations, enables this theory to model human performance measurement better than any other.
The next section provides an example of how generalizability theory can be applied to military job performance measurements, using hypothetical data. The third section presents G theory formally, but with a minimum of technical detail. Key features of the theory are illustrated with concrete numerical examples. The fourth section presents applications of the theory. These applications were chosen to highlight the theory's flexibility in modeling a wide range of measurements. The fifth section concludes the paper by discussing some limitations of the theory.
APPLICATION OF GENERALIZABILITY THEORY TO THE MEASUREMENT OF MILITARY PERFORMANCE
Background
Military decision makers, ideally, seek perfectly reliable measures of individuals' performance in their military occupational specialties.2 Even with imperfect measures, the decision maker typically treats as interchangeable measures of an individual's performance on one or another representative sample of military occupational specialty tasks (and subtasks) that were carried out at any one of many test stations, on any of a wide range of occasions, as scored by any of a large number of observers. Because he wants to know what the person's performance is like, rather than what he did at one particular moment of observation, he is forced to generalize from a limited sample of behavior to an extremely large universe: the individual's job performance across time, tasks, observers, and settings. This inference is sizable. Generalizability theory provides the statistical apparatus for answering the question: Just how dependable is this measurement-based inference?
To estimate dependability, an individual's performance needs to be observed on a sample of tasks/subtasks, on different occasions, at different stations, with different observers. A generalizability study (G study), then, might randomly sample five E-2s,3 who would perform a set of tasks (and subtasks) on two different occasions, at two different stations, with four
1. By design, traditional pencil-and-paper tests control for scoring errors by using a multiple-choice format with one correct answer, and testing conditions are standardized by controlling day, time of day, instructions, etc.
2. “Military occupational specialty” is used generically and applies to Air Force specialties and Navy ratings as well as to Army and Marine Corps military occupational specialties.
3. In practice, large samples should be used; small samples are used here because they are more instructive for illustration.

observers scoring their performance. An individual would be observed under all possible combinations of these conditions, for a total of 16 observations (2 occasions × 2 stations × 4 observers) on the set of tasks/subtasks.
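The fully crossed design just described can be enumerated directly; the sketch below (illustrative only, using integer labels for conditions) shows that each soldier is observed under every occasion × station × observer combination.

```python
# Enumerate the crossed G-study conditions: every soldier is observed
# under each occasion x station x observer combination.
from itertools import product

occasions = [1, 2]
stations = [1, 2]
observers = [1, 2, 3, 4]

conditions = list(product(occasions, stations, observers))
assert len(conditions) == 16   # 2 occasions x 2 stations x 4 observers
```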
If performance is consistent across tasks, occasions, stations, and observers—i.e., if these characteristics of the measurement do not introduce systematic or unsystematic variation in the measurement—the measurement is dependable and the decision maker's ideal has been met. More realistically, however, if the individual's score depends on the particular sample of tasks to which he was assigned, on the particular occasion or station at which the measurement was taken, and/or on the particular observer scoring the performance, the measurement is less than ideally dependable. In this case, interest attaches to determining how to minimize the impact of different sources of measurement error.
Performance Measurement: Operate and Maintain Caliber .38 Revolver
To make this general discussion concrete, an example is in order. One of the Army's military occupational specialty-specific performance measures involves operating and maintaining a caliber .38 revolver. The soldier is told that this task covers the ability to load, reduce a stoppage in, unload, and clean the caliber .38 revolver, and that this will be timed. The score sheet for this measurement is presented in Table 1. Note that there are two measurements taken: time and accuracy.
In the G study, suppose that each of five soldiers performed the revolver test four times: on two different occasions (e.g., week 1 and week 2) at two different test stations.4 The soldiers' performance on each of the three tasks and subtasks (see Table 1) was independently scored by four observers. Also, each task as a whole is independently timed. Hypothetical results of this study are presented in Table 2 for the time measure. Note that time is recorded for each of three tasks and not for individual subtasks (Table 1); hence, subtasks are not shown in Table 2.
Classical Theory Approach
With all the information provided in Table 2, how might classical reliability be calculated? With identical performance measurements taken on
4. There is good reason to worry about an order effect. This is why “tuning” subjects before they are tested is strongly recommended (e.g., Shavelson, 1985); “tuning” is familiarizing subjects with the task before they are tested. (If a subject can “fake” the task in a performance test, this means that she can perform it.) Nevertheless, soldiers would be counterbalanced such that half would start at station 1 and half at station 2. Finally, as will be seen, an alternative design with occasions nested within stations might be used.

TABLE 1 Caliber .38 Revolver Operation and Maintenance Task

Each subtask is scored Go or No Go; each task as a whole is timed.

Task: Load the weapon (a)
  (1) Held the revolver forward and down
  (2) Pressed thumb latch and pushed cylinder out
  (3) Inserted a cartridge into each chamber of the cylinder
  (4) Closed the cylinder
  (5) Performed steps 1-4 in sequence
  Time to load the weapon: _______________

Task: Reduce a stoppage (b)
  (6) Recocked weapon
  (7) Attempted to fire weapon
  (8) Performed steps 6-7 in sequence
  Time to reduce stoppage: _______________

Task: Unload and clear the weapon (c)
  (9) Held the revolver with muzzle pointed down
  (10) Pressed thumb latch and pushed cylinder out
  (11) Ejected cartridges
  (12) Inspected cylinder to ensure each chamber is clear
  (13) Performed steps 9-12 in sequence
  Time to unload and clear the weapon: _______________

NOTES: Instructions to soldier:
(a) This task covers your ability to load the revolver; we will time you. Begin loading the weapon.
(b) You must now apply immediate action to reduce a stoppage. Assume that the revolver fails to fire. The hammer is cocked. Begin.
(c) You must now begin unloading the weapon.

TABLE 2 Caliber .38 Revolver Operation and Maintenance Task: Time to Complete Tasks

                                        Observer
Station   Occasion   Task   Soldier    1    2    3    4
1         1          1      1         84   85   86   87
                            2         82   84   85   85
                            3         91   92   92   94
                            4         83   82   84   85
                            5         75   76   78   78
                     2      1         76   76   77   77
                            2         75   84   75   76
                            3         83   81   83   81
                            4         77   78   76   77
                            5         69   70   70   70
                     3      1         94   95   96   97
                            2         91   92   93   94
                            3         99   99   99   99
                            4         93   94   94   95
                            5         83   83   84   85
                         * * *
2         2          1      1         80   81   81   82
                            2         78   78   81   80
                            3         84   84   84   85
                            4         80   81   80   82
                            5         73   74   74   75
                     2      1         73   73   74   76
                            2         74   73   74   75
                            3         77   75   76   75
                            4         73   74   72   77
                            5         69   70   70   71
                     3      1         90   89   90   92
                            2         90   89   90   91
                            3         89   91   93   93
                            4         87   87   89   89
                            5         83   84   85   84

NOTE: * * * indicates rows omitted for the remaining station × occasion combinations.

two occasions, a test-retest reliability can be calculated. By recognizing that tasks are analogous to items on traditional tests, an internal consistency reliability coefficient can be calculated.
A test-retest coefficient is calculated by correlating the soldiers' scores at occasion 1 and occasion 2, after summing over all other information in Table 2. The correlation between scores at the two points in time is .97. If soldiers' performance times are averaged over the two occasions to provide a performance-time measure, the reliability is .99, following the Spearman-Brown prophecy formula.
An internal-consistency coefficient is calculated by averaging, for each task, soldiers' performance times across stations, occasions, and observers. The soldiers' average task performance times would then be intercorrelated: r(task 1,task 2), r(task 1,task 3), and r(task 2,task 3). The average of the three correlations would provide the reliability for a single task, and the Spearman-Brown formula could be used to determine the reliability for performance times averaged over the three tasks. The reliability of performance-time measures obtained on a single task is .99, and the reliability of scores averaged across the three tasks is .99.
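The classical calculations above can be sketched in a few lines. The sketch below uses made-up occasion means for five soldiers (not the values in Table 2) to show the two-step recipe: a Pearson correlation for the single-occasion reliability, then the Spearman-Brown formula to step it up to the two-occasion average.

```python
# Hypothetical occasion-1 and occasion-2 performance times (one value per
# soldier, already averaged over stations, tasks, and observers). These
# numbers are illustrative only, not taken from Table 2.
from statistics import mean
from math import sqrt

occ1 = [84.0, 79.0, 91.0, 80.0, 74.0]
occ2 = [82.0, 77.0, 90.0, 78.0, 73.0]

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r, k):
    """Reliability of a score averaged over k parallel measurements."""
    return k * r / (1 + (k - 1) * r)

r12 = pearson_r(occ1, occ2)      # test-retest reliability of one occasion
r_avg = spearman_brown(r12, 2)   # reliability of the two-occasion average
```

With real Table 2 data this recipe yields the .97 and .99 values cited in the text.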
Generalizability Theory Approach
Two limitations of classical theory are readily apparent. The first limitation is that a lot of information in Table 2 is ignored (i.e., “averaged over”). This information might contain measurement error that classical theory assumes away. This could lead to false confidence in the dependability of the performance measure. The second limitation is that separate reliabilities are provided; which is the “right one”? G theory overcomes both limitations. The theory uses all of the information obtained in the G study, and it provides a coefficient that includes a definition of error arising from each of the sources of error in the measurement. Finally, G theory estimates each source of variation in the measurement separately so that improvements can be made by pinpointing which characteristics of the performance measurement gave rise to the greatest error.
Generalizability theory uses the analysis of variance (ANOVA) to accomplish this task. A measurement study (called a generalizability study) is designed to sample potential sources of measurement error (e.g., raters, occasions, tasks) so that their effects on soldiers' performance can be examined. Thus soldiers and each source of error can be considered factors in an ANOVA. The ANOVA, then, can be used to estimate the effects of soldiers (systematic, “true-score” variation), each source of error, and their interactions. More specifically, the ANOVA is used to estimate the variance components associated with each effect in the design (“main effects” and “interactions”). As Rubin (1974:1050) noted, G theory concentrates on mixed models analysis

of variance designs, that is, designs in which factors are crossed or nested and fixed or random. Emphasis is given to the estimation of variance components and ratios of variance components, rather than the estimation and testing of effects for fixed factors as would be appropriate for designs based on randomized experiments.
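The variance-component estimation that G theory rests on can be sketched for a smaller, two-facet crossed persons × items × judges (p × i × j) random design. The mean squares below are made up for illustration; the solving equations are the standard expected-mean-square identities for the random model.

```python
# Sketch: solving the expected-mean-square (EMS) equations of a crossed
# p x i x j random model for its variance components. Mean squares are
# hypothetical; any negative estimate is set to 0, the usual convention.
n_p, n_i, n_j = 20, 4, 2   # persons, items, judges sampled in the G study

ms = {"p": 60.0, "i": 12.0, "j": 9.0,
      "pi": 6.0, "pj": 5.0, "ij": 3.0, "pij_e": 2.0}

vc = {}
vc["pij_e"] = ms["pij_e"]                                   # residual
vc["pi"] = max(0.0, (ms["pi"] - ms["pij_e"]) / n_j)
vc["pj"] = max(0.0, (ms["pj"] - ms["pij_e"]) / n_i)
vc["ij"] = max(0.0, (ms["ij"] - ms["pij_e"]) / n_p)
# Universe-score (persons) component: EMS_p = vc_e + n_j*vc_pi + n_i*vc_pj
#                                             + n_i*n_j*vc_p
vc["p"] = max(0.0, (ms["p"] - ms["pi"] - ms["pj"] + ms["pij_e"]) / (n_i * n_j))
vc["i"] = max(0.0, (ms["i"] - ms["pi"] - ms["ij"] + ms["pij_e"]) / (n_p * n_j))
vc["j"] = max(0.0, (ms["j"] - ms["pj"] - ms["ij"] + ms["pij_e"]) / (n_p * n_i))
```

The same algebra, extended to five factors, produces the variance components reported in Table 3.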
Variance Components
The statistical machinery for analyzing the results of a G study is the analysis of variance. The ANOVA partitions the multiple sources of variation into separate components (“effects” in ANOVA terminology) corresponding to their individual main effects (soldiers, stations, occasions, tasks, and judges) and their combinations or interactions. The total variation in performance times (shown in Table 2) is partitioned into 31 separate components—five main effects and all their possible combinations (Cartesian products)—accounting for the total variation in the performance-time data (see Table 3).
Of the 30 sources of variation, 1 accounts for performance consistency: the soldier (or P for person) effect represents systematic differences in the speed of performance among the five soldiers (variance component for soldiers in Table 3). By averaging the time measure across observers, tasks, occasions, and stations, we find that soldier 5 performed the task the fastest and soldier 3 performed the task the slowest. The other three soldiers fell in between. This variation in mean performance can be used to determine systematic differences among soldiers, called true-score variance in classical test theory and universe-score variance in generalizability theory. This universe-score variance—variance component for P = 14.10 (Table 3)—is the signal sought through the noise created by error. It is the “stuff” that the military decision maker would like to know as inexpensively and as feasibly as possible.
The 29 other sources of variation represent potential measurement error. The first four sources of variation are attributable to each source of error considered singly (“main effects” in ANOVA terminology). The station effect (variance component for station in Table 3) shows whether mean performance times, averaged over all other factors, systematically vary as to the location at which the measurement was taken. Apparently performance time did not differ according to station (variance component for station = 0). This is not surprising; unlike many other performance measurements, the revolver task appears self-contained. The occasion effect shows whether performance times, averaged over all other factors, change from one occasion to the next. Relative to other variance components, performance appears stable over occasions. The task effect shows whether performance times differed over tasks 1-3. Since task 2 contained fewer subtasks (three)

TABLE 3 Generalizability Study for a Soldier (P) × Station (S) × Occasion (O) × Task (T) × Judge (J) Design

Source of Variation    df    Mean Squares    Variance Components
Soldiers (P)            4       1020.80           14.10
Stations (S)            1          1.00            0.00
Occasions (O)           1       1273.00            7.40
Tasks (T)               2       1659.80           20.00
Judges (J)              3        349.80            2.45
PS                      4          1.00            0.00
PO                      4        239.00            9.55
PT                      8          9.80            0.00
PJ                     12        106.80            8.75
SO                      1          1.00            0.00
ST                      2          1.00            0.00
SJ                      3          1.00            0.00
OT                      2         59.80            1.25
OJ                      3         97.80            3.20
TJ                      6          1.80            0.00
PSO                     4          1.00            0.00
PST                     8          1.00            0.00
PSJ                    12          1.00            0.00
POT                     8          9.80            1.00
POJ                    12          1.80            0.00
PTJ                    24          1.80            0.00
SOT                     2          1.00            0.00
SOJ                     3          1.00            0.00
STJ                     6          1.00            0.00
OTJ                     6          1.80            0.00
PSOT                    8          1.00            0.00
PSOJ                   12          1.00            0.00
PSTJ                   24          1.80            0.00
SOTJ                    6          1.00            0.00
PSOTJ (residual)       24          1.00            1.00
than tasks 1 and 3 (five each), performance time on task 2, averaged over all other sources of variation, should be shorter. The task effect reflects this characteristic of the performance measurement (variance component for task = 20). And variation across judges shows whether observers are using the same criterion when timing performance. From a measurement point of view, main-effect sources of error influence absolute decisions about the

speed of performance, regardless of how other soldiers performed. The soldiers' performance times will depend on whether they are observed by a “fast” or “slow” timer, at a “fast” or “slow” station, and so on.
The remaining sources of variation in Table 3 reflect combinations or “statistical interactions” among the factors. Interactions between persons and other sources of error variation represent unique, unpredictable effects; the particular performance times assigned to soldiers have one or more components of unpredictability (error) in them. As a consequence, different tasks, observers, or occasions might rank order soldiers differently and unpredictably.5 The soldier × judge effect (variance component = 8.75), for example, indicates that observers did not agree on the times they assigned to each soldier. If observer 1, for example, were used in the performance measurement, soldier 1 might be timed as faster than soldier 4. If observer 4 were used, the rank ordering would be reversed. The soldier × task interaction indicates that soldiers who performed quickly on task 1 also performed quickly on the other tasks, compared to their peers. The rank ordering of soldiers apparently does not depend on the task they performed. This is why the internal consistency coefficient, based on classical theory, was so high (.99). The soldier × occasion × judge interaction indicates judges disagreed on performance times they assigned each soldier, and the nature of this disagreement changed from one occasion to the next (negligible, Table 3). The most complex interaction, soldiers × stations × occasions × tasks × observers, reflects the effect of an extremely complex combination of error sources and other unmeasured and random error sources. It is the residual that accounts for the remaining variation in all performance times.
The remainder of the interactions do not involve persons. As a consequence, they do not affect the rank ordering of soldiers. However, they do affect the absolute performance-time score received by each soldier. For example, a sizable occasion × judge interaction would indicate that the performance times received by soldiers depend both on who observes them and on what occasion that observation occurs. A sizable task × judge interaction would indicate that the performance times received by soldiers depend on the particular task and observer. In doing task 1, for example, the soldiers would want judge 3 because she assigns the fastest times on this task while, in performing task 3, they might want judge 1 because he assigns the fastest times on that task.
5. Technically, an interaction could also occur when soldiers have identical rank orders across, say, occasions but the distance between soldiers' performance times on each occasion differs (an ordinal interaction). An interaction with reversals in rank order (a disordinal interaction) is more dramatic and, for simplicity, is used to describe interpretations of interactions in this paper.

Improvement of Performance Measurement
Just as the Spearman-Brown prophecy formula can be used to determine the number of items needed on a test to achieve a certain level of reliability, the magnitudes of the sources of error variation can also be used to determine the number of occasions, observers, and so on that are needed to obtain some desired level of generalizability (reliability). For example, the effects involving judges (soldier × judge, judge × task, judge × task × occasion, etc.) can be used to determine whether several judges are needed and whether different judges can be used to score the performance of different soldiers, or whether the same judges must rate all soldiers due to disagreements among them. The analysis of the performance-time data in Table 3 suggests, based on the pattern of the variance component magnitudes, that several judges are needed and that the same set of judges should time all soldiers (e.g., variance components for PJ and OJ).
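This kind of decision-study ("D-study") trade-off can be sketched numerically. The code below takes the nonzero variance components involving soldiers from Table 3 and recomputes the relative G coefficient for different numbers of occasions and judges; the error model assumed is the standard random-model relative error (each component involving P divided by the number of conditions sampled for its facets), with stations and tasks held at one.

```python
# D-study sketch: relative G coefficient as occasions (n_o) and judges
# (n_j) are added, using the nonzero components from Table 3. The error
# model assumed is the usual random-model relative error.
vc_p   = 14.10   # soldiers (universe-score variance)
vc_po  = 9.55    # soldier x occasion
vc_pj  = 8.75    # soldier x judge
vc_pot = 1.00    # soldier x occasion x task
vc_res = 1.00    # residual (PSOTJ)

def rel_g(n_o, n_j, n_t=1):
    rel_err = (vc_po / n_o + vc_pj / n_j
               + vc_pot / (n_o * n_t)
               + vc_res / (n_o * n_t * n_j))
    return vc_p / (vc_p + rel_err)

for n_o in (1, 2, 4):
    for n_j in (1, 2, 4):
        print(f"n_o={n_o} n_j={n_j}  E-rho^2 = {rel_g(n_o, n_j):.2f}")
```

The sweep shows the coefficient climbing as occasions and judges are added, which is why the text recommends several judges and repeated occasions.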
Generalizability of the Performance Measurement
Generalizability theory provides a summary index representing the consistency or dependability of a measurement. This coefficient, the “generalizability coefficient,” is analogous to the reliability coefficient in classical theory. The coefficient for relative decisions reflects the accuracy with which soldiers have been rank ordered by the performance measurement, and is defined as:
Eρ² = σ²(P) / [σ²(P) + σ²(δ)]

where σ²(δ), the relative error variance, is the sum of every variance component involving P (other than σ²(P) itself), each divided by the product of the n′ for the facets entering it; n′ is the number of times each source of error is sampled in an application of the measurement. For the data in Table 3, with n′ = 1 station, occasion, task, and judge, the nonzero terms give:

Eρ² = 14.10 / (14.10 + 9.55 + 8.75 + 1.00 + 1.00) = 14.10/34.40 = .41
The G coefficient for absolute decisions is defined as:
Φ = σ²(P) / [σ²(P) + σ²(Δ)]

where σ²(Δ), the absolute error variance, is the sum of every variance component other than σ²(P), each divided by the product of the n′ for the facets entering it; n′ is the number of times each source of error is sampled in an application of the measurement. For the data in Table 3, with n′ = 1 station, occasion, task, and judge, the nonzero terms give:

Φ = 14.10 / (14.10 + 7.40 + 20.00 + 2.45 + 9.55 + 8.75 + 1.25 + 3.20 + 1.00 + 1.00) = 14.10/68.70 = .21
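The relative and absolute coefficients can be checked directly from the Table 3 components. The sketch below assumes the standard random-model definitions (relative error = components involving P; absolute error = all components other than P) with every facet sampled once.

```python
# Relative (E-rho^2) vs. absolute (phi) coefficients from the nonzero
# Table 3 variance components, all n' = 1, under the usual random-model
# error definitions.
universe = 14.10                               # soldiers (P)
rel_error = 9.55 + 8.75 + 1.00 + 1.00          # PO + PJ + POT + residual
abs_error = rel_error + 7.40 + 20.00 + 2.45 + 1.25 + 3.20   # + O, T, J, OT, OJ

e_rho2 = universe / (universe + rel_error)     # rank-ordering decisions
phi = universe / (universe + abs_error)        # absolute-level decisions
```

Because the main effects for occasions, tasks, and judges count only against absolute decisions, phi is markedly lower than E-rho^2 for these data.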

Regardless of whether relative or absolute decisions are to be made on the basis of the performance measurement, the dependability of the measure based on the G theory analysis is considerably different from that based on classical theory. In these examples, it is especially important to sample occasions and judges extensively for relative decisions and to sample tasks extensively as well for absolute decisions.
Summary: Revolver Test With Accuracy Scores
Recall that both time and accuracy were recorded by four observers judging soldiers' performance in the caliber .38 revolver performance test. By way of reviewing the application of G theory to performance measurements, hypothetical data on accuracy are presented. This is not merely a repeat of what has gone before: the accuracy data call for a somewhat different analysis than the performance-time data.
Design of the Revolver Test Using Accuracy Scores
In the generalizability study, each of five soldiers performed the revolver test four times: on two different occasions (O) at two different test stations. The soldiers' (P) performance on each of the three tasks (T) and subtasks (S) (see Table 1) was independently judged by four observers (J). Hypothetical accuracy scores for this G-study design are presented in Table 4. The data in Table 4 have been collapsed over stations. This seemed justifiable. Because of the nature of the revolver task, stations did not introduce significant measurement error. Further, to simplify the analysis, only two of the three tasks were selected: loading and unloading/cleaning the revolver. Including the stoppage removal task would have created an “unbalanced” design, with five subtasks for tasks 1 and 3 each and only three subtasks for task 2. (See the later discussion of unbalanced designs.)
The data in Table 4 represent a soldiers × occasion × task × subtask:task × observer (P × O × T × S:T × J) design. Notice that each of the two tasks—loading and unloading—contain somewhat different subtasks. So identical subtasks do not appear with each task and we say that subtasks are nested within tasks (cf. a nested analysis of variance design). The consequence of nesting can be seen in Table 5, where not all possible combinations of P, O, T, S:T, and J appear in the source table as was the case in Table 3. This is because all terms that include interactions of T and S:T together cannot be estimated due to the nesting (see the later discussion of nesting).

averaged over the two occasions and ignoring the effect of platoon and company, the reliability is .64.
Clearly, this reliability coefficient is influenced by leniency of different observers, the difficulty of the terrain or terrains on which the missions were conducted, the differences between missions, the time of day (day or night), the day that the performance was observed, and so forth. However, the importance of these possible sources of measurement error cannot be estimated using classical theory, even if the measurement facets had been systematically identified. Furthermore, performance might be influenced by the policies and leadership skills within particular companies or platoons. Classical reliability is mute on how to treat these hierarchical data.
Generalizability Theory Approach
The generalizability analysis proceeded along the lines suggested by symmetry:
1. Choose the facets of measurement and compute mean squares.
2. Estimate variance components.
3. Specify the facet (or combination of facets) that is the focus of measurement, and specify the sources of error.
4. Examine alternative D-study designs.
Steps 1 and 2 are shown in Table 15 for the Company (C) × Platoon:Company (P:C) × Crew:Platoon:Company (Cr:P:C) × Occasion (O) partially nested design.
Interpretation of Variance Components. In theory, a variance component cannot be negative, yet a negative estimate occurred (as indicated in
TABLE 15 Variance Components for the Study of Tank-Crew Performance Measurement (a)
Source of Variation      Mean Squares    Estimated Variance Component
Companies (C)                55461              0 (b)
Platoons:C (P:C)             78636           1607.19
Crews:P:C (Cr:P:C)           45383          15967.50
Occasions (O)               244505           3573.21
C × O                        83711           3538.79
P:C × O                      30629           3436.17
Cr:P:C × O                   31448          13448.20

(a) The design is crews nested in platoons nested in companies, crossed with occasions.
(b) Negative variance component set to 0.

Table 15). With sample data, a negative variance component can arise either from sampling error or from misspecification of the measurement model. If the former, the most widely accepted practice is to set the variance component to 0, as was done in Table 15. If the latter, the model should be respecified and variance components estimated with the new model. The rationale for setting the company variance component to 0 was the following. First, the difference in the mean performance of the three companies was small: 770.90, 763.33, and 692.93. Variation among company means accounted for only 0.3 percent of the total variation in the data. The best estimate of the variance due to companies, then, was 0. (See the concluding section for additional discussion on estimating variance components.)
The largest variance component in Table 15 is for crews: the universe-score variance. Crew performance differs systematically, and the measurement procedure reflects this variation. The next largest component is associated with the residual, indicating that error is introduced by inconsistency in tank-crew performance from one occasion to the next and by other unidentified sources of error (e.g., inconsistency due to time of day, observer, terrain, and the like). The remaining variance components are roughly one-fourth the size of the residual, with the exception of the component for companies. Since the variance component for companies is 0 and the variance component for platoons is the smallest one remaining, neither influences variation in performance enough to matter much if they are considered part of the universe-score variance.
Generalizability Coefficients. Since decision makers are interested in the generalizability of unit performance, one possible method for calculating the G coefficient for crews is:

Eρ²(Cr) = σ²(Cr:P:C) / {σ²(Cr:P:C) + [σ²(P:C × O) + σ²(Cr:P:C × O)] / n′(o)}
        = 15967.50 / [15967.50 + (3436.17 + 13448.20)/2] = .65
The generalizability of tank crew performance, averaged over the two observation occasions, is .65. If, however, the decision maker is interested in the generalizability of the score of a single tank crew selected randomly and observed on a single occasion, the coefficient drops to .48 due to the large residual variance component.
The principle of symmetry states that the universe-score variance is comprised of all components that give rise to systematic variation among crews. In this case, variation due to companies and platoons, as well as variation due to crews, must be considered universe-score variation. Characteristics of companies and platoons, such as leadership ability, contribute to systematic variation among crews. Following symmetry, the G coefficient for crews, averaged over two occasions, is:

Eρ²(C,P,Cr) = [σ²(C) + σ²(P:C) + σ²(Cr:P:C)] / {σ²(C) + σ²(P:C) + σ²(Cr:P:C) + [σ²(C × O) + σ²(P:C × O) + σ²(Cr:P:C × O)] / n′(o)}
            = 17574.69 / (17574.69 + 10211.58) = .63

This coefficient is given a distinct symbol to distinguish it from the one above.
Surprisingly, increasing the universe-score variance decreased the G coefficient, for two reasons. First, the increase in universe-score variance from incorporating systematic variation due to companies and platoons was negligible: the universe-score variance grew only from 15967.50 to 17574.69 (companies contributed 0.00; platoons, 1607.19).
Second, the additional error introduced (σ²(C × O) and σ²(P:C × O)) by considering variation due to companies and platoons as universe-score variance, while not large relative to other sources of variation (e.g., σ²(Cr:P:C × O)), was large relative to the systematic variability of companies and platoons.
Finally, if the decision maker is interested in the dependability of platoon performance, the generalizability of the measurement is estimated (aggregating over the n′(cr) crews within each platoon and the n′(o) occasions) as:

Eρ²(P:C) = σ²(P:C) / {σ²(P:C) + σ²(Cr:P:C)/n′(cr) + σ²(P:C × O)/n′(o) + σ²(Cr:P:C × O)/[n′(cr)n′(o)]}
Notice here that crews are considered a source of error; variability among crews introduces uncertainty in estimating the performance of the entire platoon—the average of the performance of a platoon's individual crews. The low generalizability coefficient, then, reflects the fact that there was greater variability among crews within a platoon than among platoons.
CONCLUDING COMMENTS ON GENERALIZABILITY THEORY: ISSUES AND LIMITATIONS
In the preceding sections, I argued that generalizability theory was the most appropriate behavioral measurement theory for treating military performance measures and showed how the theory could be used to model and improve performance measures. Even the best of theories have limitations in their applications, and generalizability theory is no exception. In concluding, I address the following topics: negative estimated variance components; assumption of constant universe scores; and dichotomous data (for a more extensive treatment, see Shavelson and Webb, 1981; Shavelson et al., 1985).
Small Samples and Negative Estimated Variance Components
Two major contributions of generalizability theory are its emphasis on multiple sources of measurement error and its deemphasis of the role played by summary reliability or generalizability coefficients. Estimated variance components are the basis for indexing the relative contribution of each source of error and the undependability of a measurement. Yet Cronbach et al. (1972) warned that variance-component estimates are unstable at the usual sample sizes of, for example, a couple of occasions and observers. Unstable variance-component estimates are not unique to G theory, however; they afflict all sampling theories. One virtue of G theory is that it brings estimation problems to the fore and puts them up for examination.
Small Samples and Variability of Estimated Variance Components
The problem of fallible estimates can be illustrated by expressing an expected mean square as a sum of population variances. In a two-facet, crossed (p × i × j), random-model design, the expected mean square for persons is

E(MS_p) = σ²(pij,e) + n_j σ²(pi) + n_i σ²(pj) + n_i n_j σ²(p),

and the universe-score variance is estimated as

σ̂²(p) = [MS_p − MS_pi − MS_pj + MS_pij,e] / (n_i n_j).

Under normality, the sampling variance of this estimate sums a term for every mean square involved:

Var[σ̂²(p)] = (1 / (n_i n_j)²) { 2[E(MS_p)]² / df_p + 2[E(MS_pi)]² / df_pi + 2[E(MS_pj)]² / df_pj + 2[E(MS_pij,e)]² / df_pij,e }.

With all of these components entering the variance of the estimated universe-score variance, the fallibility of such an estimate is quite apparent, especially if n_i and n_j are quite modest. In contrast, the variance of the estimated residual variance involves only one mean square:

Var[σ̂²(pij,e)] = 2[E(MS_pij,e)]² / df_pij,e.

In a crossed design, then, the number of components and hence the variance of the estimator increase from the highest-order interaction component to the main effect components. Consequently, sample estimates of the universe-score variance—estimates of crucial importance to the dependability of a measurement—may reasonably be expected to be less stable than estimates of components of error variance.
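This contrast can be checked numerically. The sketch below plugs hypothetical variance components into the normal-theory approximation Var(MS) = 2·[E(MS)]²/df and compares the sampling variance of the estimated universe-score variance with that of the estimated residual; all component values and sample sizes are invented for illustration.

```python
n_p, n_i, n_j = 20, 2, 2                              # persons, items, judges (invented)
vc = {"p": 0.5, "pi": 0.1, "pj": 0.1, "pij,e": 0.3}   # hypothetical components

# Expected mean squares containing the person effect (p x i x j random model).
ems = {
    "p":     vc["pij,e"] + n_j*vc["pi"] + n_i*vc["pj"] + n_i*n_j*vc["p"],
    "pi":    vc["pij,e"] + n_j*vc["pi"],
    "pj":    vc["pij,e"] + n_i*vc["pj"],
    "pij,e": vc["pij,e"],
}
df = {"p": n_p - 1, "pi": (n_p - 1)*(n_i - 1),
      "pj": (n_p - 1)*(n_j - 1), "pij,e": (n_p - 1)*(n_i - 1)*(n_j - 1)}

# sigma2(p) is estimated by (MS_p - MS_pi - MS_pj + MS_pij,e)/(n_i*n_j), so
# its sampling variance sums a 2*EMS^2/df term for every mean square involved.
var_universe_est = sum(2*ems[k]**2 / df[k] for k in ems) / (n_i*n_j)**2

# The residual estimate is MS_pij,e itself: a single term.
var_residual_est = 2*ems["pij,e"]**2 / df["pij,e"]

print(var_universe_est > var_residual_est)
```

The universe-score estimate is markedly noisier, exactly as the text's counting-of-components argument predicts.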
Negative Estimates of Variance Components
Negative estimates of variance components can arise because of sampling errors or because of model misspecification (Hill, 1970; see also previous discussion). With respect to sampling error, the one-way ANOVA illustrates how negative estimates can arise. The expected mean squares are:

E(MS_Within) = σ²(within)

and

E(MS_Between) = σ²(within) + n σ²(between),

where E(MS_Within) is the expected value of the mean square within groups, E(MS_Between) is the expected value of the mean square between groups, and n is the number of observations per group. Estimation of the variance components is accomplished by equating the observed mean squares with their expected values and solving the linear equations. If MS_Within is larger than MS_Between, the estimate of σ²(between) will be negative.
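A minimal numerical sketch of that mechanism, with data invented so that the within-groups mean square happens to exceed the between-groups mean square:

```python
import statistics

# Invented data: three groups of three observations with large
# within-group scatter relative to the spread of group means.
groups = [[4.9, 5.1, 4.8], [5.0, 5.2, 4.9], [4.8, 5.1, 5.0]]
k, n = len(groups), len(groups[0])
grand = statistics.mean(x for g in groups for x in g)

ms_within = sum(sum((x - statistics.mean(g))**2 for x in g)
                for g in groups) / (k * (n - 1))
ms_between = n * sum((statistics.mean(g) - grand)**2 for g in groups) / (k - 1)

# Equate observed mean squares to their expectations and solve:
# E(MS_within) = var_within; E(MS_between) = var_within + n*var_between.
var_within = ms_within
var_between = (ms_between - ms_within) / n

print(var_between < 0)   # the solved estimate is negative
```

Nothing is wrong with the arithmetic; the method-of-moments solution simply has no built-in nonnegativity constraint.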
Realizing this problem in G theory, Cronbach et al. (1972:57) suggested that
a plausible solution is to substitute zero for the negative estimate, and carry this zero forward as the estimate of the component when it enters any equation higher in the table of mean squares.
Notice that by setting negative estimates to 0, the researcher is implicitly saying that a reduced model provides an adequate representation of the data, thereby admitting that the original model was misspecified. Although solutions such as Cronbach et al.'s are reasonable, the sampling distributions of the (once negative) variance component, and of any variance component whose calculation includes it, become more complicated, and the modified estimates are biased. Brennan (e.g., 1983) provides an alternative algorithm that sets all negative variance components to 0. Each variance component, then, “is expressed as a function of mean squares and sample sizes, and these do not change when some other estimated variance component is negative” (Brennan, 1983:47). Brennan's procedure produces unbiased estimated variance components, except for the negative components set to 0.
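The two fixes can disagree. The sketch below applies both rules to an invented set of mean squares for a p × i × j random design in which the person × item component comes out negative; the mean squares, sample sizes, and resulting values are all hypothetical.

```python
n_i, n_j = 2, 3
ms = {"p": 3.0, "pi": 0.4, "pj": 0.9, "pij,e": 0.6}   # invented mean squares

# Brennan's rule: each component is a fixed function of the mean squares;
# any negative result is reported as 0 afterwards, leaving the rest unchanged.
b_pi = max(0.0, (ms["pi"] - ms["pij,e"]) / n_j)
b_pj = max(0.0, (ms["pj"] - ms["pij,e"]) / n_i)
b_p = max(0.0, (ms["p"] - ms["pi"] - ms["pj"] + ms["pij,e"]) / (n_i * n_j))

# Cronbach et al.'s rule: zero the negative estimate and carry that zero
# into the higher equation
# E(MS_p) = s2_res + n_j*s2_pi + n_i*s2_pj + n_i*n_j*s2_p.
c_res = ms["pij,e"]
c_pi = max(0.0, (ms["pi"] - c_res) / n_j)
c_pj = max(0.0, (ms["pj"] - c_res) / n_i)
c_p = (ms["p"] - c_res - n_j * c_pi - n_i * c_pj) / (n_i * n_j)

print(round(b_p, 4), round(c_p, 4))   # the universe-score estimates differ
```

Under Cronbach et al.'s carrying-forward rule the zeroed interaction changes the universe-score estimate; under Brennan's rule it does not, which is the point of his quoted remark.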

Bayesian methods provide a solution to the problem of negative variance-component estimates (e.g., Box and Tiao, 1973; Davis, 1974; Fyans, 1977; Shavelson and Webb, 1981). Consider a design with two sources of variation: within groups and between groups. The Bayesian approach includes the constraint that MS(between groups) is greater than or equal to MS(within groups), so that the between-groups variance component cannot be negative. Unfortunately, the computational complexities involved and the distributional-form assumptions make these procedures all but inaccessible to practitioners.
An attractive alternative that produces nonnegative estimates of variance components is maximum likelihood (Dempster et al., 1981). Maximum likelihood estimators are functions of every sufficient statistic and are consistent, asymptotically normal, and efficient (Harville, 1977). Although these estimates are derived under the assumption of a normal distribution, estimators so derived may be suitable even when the distribution is unspecified (Harville, 1977). Maximum likelihood estimates have not been used extensively in practice because they are not readily available in popular statistical packages. However, researchers at the University of California, Los Angeles (Marcoulides, Shavelson, and Webb), are examining a restricted maximum likelihood approach that, in simulations so far, appears to offer considerable promise in dealing with the negative variance component problem.
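A sketch of the restricted maximum likelihood idea for the simplest case, a balanced one-way design: the REML criterion is minimized subject to a nonnegativity bound on the between-groups component, so the estimate is pinned at zero where the ANOVA solution would go negative. The data, starting values, and bounds are invented; this is a toy illustration of the general approach, not the Marcoulides-Shavelson-Webb procedure.

```python
from math import log
from scipy.optimize import minimize

# Invented balanced one-way data in which the ANOVA estimate of the
# between-groups component would be negative.
groups = [[4.9, 5.1, 4.8], [5.0, 5.2, 4.9], [4.8, 5.1, 5.0]]
n, k = len(groups[0]), len(groups)
N = n * k
gmeans = [sum(g) / n for g in groups]
grand = sum(gmeans) / k
ssw = sum((x - m)**2 for g, m in zip(groups, gmeans) for x in g)
ssb = n * sum((m - grand)**2 for m in gmeans)

def neg2_reml(theta):
    """-2 x restricted log-likelihood (constants dropped), balanced one-way."""
    vw, vb = theta
    lam = vw + n * vb          # variance of the group-mean contrasts
    return (N - k) * log(vw) + (k - 1) * log(lam) + ssw / vw + ssb / lam

res = minimize(neg2_reml, x0=[0.1, 0.1],
               bounds=[(1e-8, None), (0.0, None)])
vw_hat, vb_hat = res.x
print(vb_hat < 1e-4)   # the between component sits at the zero bound
```

Because the optimization respects the bound, the nonnegativity constraint is part of the estimator itself rather than an after-the-fact patch.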
Assumption of Constant Universe Scores
Nearly all behavioral measurement theories assume that the behavior being studied remains constant over observations; this is the steady-state assumption made by both classical theory and G theory. Assessment of stability is much more complex when the behavior changes over time. Among those investigating time-dependent phenomena are Bock (1975), Bryk and colleagues (Bryk and Weisberg, 1977; Bryk et al., 1980), and Rogosa and colleagues (Rogosa, 1980; Rogosa et al., 1982, 1984).
Rogosa et al. (1984) consider generalizability theory as one method for assessing the stability of behavior over time. Their approach is to formulate two basic questions about stability of behavior: (1) Is the behavior of an individual consistent over time? (2) Are individual differences among individuals consistent over time?
For individual behavior, consistency is defined as absolutely invariant behavior over time. They characterize inconsistency in behavior in several ways: unsystematic scatter around a flat line, a linear trend (with or without unsystematic scatter), and a nonlinear trend (with or without scatter). Changing behavior over time has important implications in generalizability theory for the estimation of universe scores. When behavior changes systematically over time, the universe-score estimate will be time dependent.

The second, and more common, question about stability is the consistency of individual differences among individuals. Perfect consistency occurs whenever the trends for different individuals are parallel, whether the individuals' trends are flat, linear, or nonlinear.
A generalizability analysis with occasions as a facet is described by Rogosa et al. (1984) as one method for assessing the consistency of individual differences over time. The variance component that reflects the stability of individual differences over time is the interaction between individuals and occasions. A small component for the interaction (compared to the variance component for universe scores) suggests that individuals are rank-ordered similarly across occasions; that is, their trends are parallel. It says nothing about whether individual behavior is changing over time. As described above, the behavior of all individuals could be changing over time in the same way (a nonzero main effect for occasions). A relatively large value of the component for the individuals × occasions interaction (compared to the universe-score variance component) shows that individuals are ranked differently across occasions. This could be the result of unsystematic fluctuations in individual behavior over time, the usual interpretation made in G theory under the steady-state assumption. But it could also reflect differences in systematic trends over time for different individuals. The behavior of some individuals might systematically improve over time, while that of others might not. Furthermore, the systematic changes could be linear or nonlinear.
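The point about parallel trends can be seen in a small worked example. Below, every (invented) person gains the same amount each occasion, so individual trends are parallel even though behavior changes markedly; the person × occasion variance component then comes out zero while the universe-score component does not.

```python
# Invented p x o data: person baselines differ, but everyone gains 2.0
# per occasion, so the individual trends are exactly parallel.
n_o = 3
baselines = [2.0, 4.0, 7.0]
scores = [[b + 2.0 * o for o in range(n_o)] for b in baselines]
n_p = len(scores)

grand = sum(map(sum, scores)) / (n_p * n_o)
p_mean = [sum(row) / n_o for row in scores]
o_mean = [sum(scores[p][o] for p in range(n_p)) / n_p for o in range(n_o)]

# Residual (p x o interaction) mean square and persons mean square.
ss_po = sum((scores[p][o] - p_mean[p] - o_mean[o] + grand)**2
            for p in range(n_p) for o in range(n_o))
ms_po = ss_po / ((n_p - 1) * (n_o - 1))
ms_p = n_o * sum((m - grand)**2 for m in p_mean) / (n_p - 1)

var_po = ms_po                    # interaction component (confounded with error)
var_p = (ms_p - ms_po) / n_o      # universe-score variance component
print(round(var_po, 6), round(var_p, 6))
```

The interaction component is numerically zero, so persons are rank-ordered identically across occasions; nothing in it reveals that everyone's behavior is improving, which is exactly the limitation noted above.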
Clearly, it is necessary to specify the process by which individual military performance changes in order to model this change. Rogosa et al. provide excellent steps in that direction by describing analytic methods for assessing the consistency of behavior of individuals and the consistency of differences among individuals. At the least, their exposition is valuable for clarifying the limited ability of G theory to distinguish between real changes in behavior over time and random fluctuations over time that should be considered error.
Although the analytic models for investigating time-dependent changes in behavior are important, they do not relieve the investigator of the responsibility to define the appropriate time interval for observation. In studying the dependability of a measurement, it is necessary to restrict the time interval so that the observations of behavior can reasonably be expected to represent the same phenomenon.
There are other developments in the field that examine changing behavior over time, such as models of change based on Markov processes (e.g., Plewis, 1981). However, since these developments do not follow our philosophy of isolating multiple sources of measurement error, and do not provide much information about how measurement error might be characterized or estimated, they are not discussed here.

Dichotomous Data
Analysis of variance approaches to reliability, including G theory, assume that the scores being analyzed represent continuous random variables. When the scores are dichotomous, as they were in the earlier example with observers' “go-no go” scores for soldiers' performance on the revolver task, analysis of variance methods produce inaccurate estimates of variance components and reliability (Cronbach et al., 1972; Brennan, 1980). In analyses of achievement test data with dichotomously scored items, L. Muthén (1983) found that the ANOVA approach for estimating variance components tended to overestimate error components and underestimate reliability. She found that a covariance structure analysis model (see B. Muthén, 1978, 1983; Jöreskog, 1974), specifically designed to treat dichotomous data as a manifestation of an underlying continuum (B. Muthén, 1983), produced estimates of variance components and generalizability coefficients that were closer to the true values than those from the ANOVA.
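The attenuation is easy to reproduce in a toy simulation. The sketch below is not Muthén's covariance structure model; it simply generates continuous person × item scores, dichotomizes them at zero, and computes the ANOVA-based G coefficient for a crossed p × i design both ways. All distributions, sample sizes, and the seed are invented for illustration.

```python
import random

random.seed(7)
n_persons, n_items = 300, 10
true_scores = [random.gauss(0, 1) for _ in range(n_persons)]
cont = [[t + random.gauss(0, 1) for _ in range(n_items)] for t in true_scores]
dich = [[1.0 if x > 0 else 0.0 for x in row] for row in cont]

def g_pxi(scores):
    """ANOVA-based G coefficient for a crossed p x i random design."""
    np_, ni_ = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (np_ * ni_)
    pm = [sum(row) / ni_ for row in scores]
    im = [sum(scores[p][i] for p in range(np_)) / np_ for i in range(ni_)]
    ms_p = ni_ * sum((m - grand)**2 for m in pm) / (np_ - 1)
    ms_res = sum((scores[p][i] - pm[p] - im[i] + grand)**2
                 for p in range(np_)
                 for i in range(ni_)) / ((np_ - 1) * (ni_ - 1))
    var_p = max(0.0, (ms_p - ms_res) / ni_)
    return var_p / (var_p + ms_res / ni_)

print(g_pxi(cont) > g_pxi(dich))   # dichotomizing lowers the estimate
```

Discarding the information in the underlying continuum inflates the relative size of the residual component, which is the direction of bias L. Muthén reported.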
Concluding Comment
None of the foregoing limitations invalidates G theory when the theory is used wisely. They simply point to the care needed in designing G studies and interpreting their results.
In spite of its limitations, generalizability theory does what those seeking to determine the dependability of performance measures want a theory of behavioral measurement to do. G theory:
models the sources of error likely to enter into a performance measurement,
models the ways in which these errors are sampled,
provides information on where the major source of measurement error lies,
provides estimates of how the measurement would improve under alternative plans for sampling and thereby controlling sources of error variance, and
indicates when the measurement problem cannot be overcome by sampling, so that alternative revisions of the measurement (e.g., modifications in administration, training of observers, or both) might be considered.
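The fourth capability, projecting how the measurement would improve under alternative sampling plans, is the decision (D) study. A minimal sketch with invented variance components for a hypothetical persons × occasions × raters design:

```python
# Hypothetical variance components (invented numbers) for a p x o x r
# design: person, person x occasion, person x rater, and residual.
var_p, var_po, var_pr, var_res = 0.50, 0.15, 0.05, 0.30

def g(n_o, n_r):
    """Projected relative G coefficient for n_o occasions and n_r raters."""
    error = var_po / n_o + var_pr / n_r + var_res / (n_o * n_r)
    return var_p / (var_p + error)

for n_o, n_r in [(1, 1), (2, 1), (2, 2), (4, 2)]:
    print(n_o, n_r, round(g(n_o, n_r), 3))
```

With these invented components, adding occasions helps most because the person × occasion term dominates the error; identifying the most profitable facet to expand is precisely the guidance a D study supplies.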
REFERENCES
Bell, J.F. 1985 Generalizability theory: the software problem. Journal of Educational Statistics 10:19-30.
Bock, R.D. 1975 Multivariate Statistical Methods in Behavioral Research. New York: McGraw-Hill.

Box, G.E.P., and G.C. Tiao 1973 Bayesian Inference in Statistical Analysis. Reading, Mass.: Addison-Wesley.
Brennan, R.L. 1980 Applications of generalizability theory. In R.A. Berk, ed., Criterion-Referenced Measurement: The State of the Art. Baltimore, Md.: The Johns Hopkins University Press.
1983 Elements of Generalizability Theory. Iowa City, Iowa: American College Testing Publications.
Bryk, A.S., and H.I. Weisberg 1977 Use of the nonequivalent control group design when subjects are growing. Psychological Bulletin 84:950-962.
Bryk, A.S., J.F. Strenio, and H.I. Weisberg 1980 A method for estimating treatment effects when individuals are growing. Journal of Educational Statistics 5:5-34.
Cardinet, J., and L. Allal 1983 Estimation of generalizability parameters. Pp. 17-48 in L.J. Fyans, Jr., ed., Generalizability Theory: Inferences and Practical Applications. San Francisco: Jossey-Bass.
Cardinet, J., and Y. Tourneur 1974 The Facets of Differentiation [sic] and Generalization in Test Theory. Paper presented at the 18th congress of the International Association of Applied Psychology, Montreal, July-August.
1977 Le Calcul de Marges d'Erreurs dans la Theorie de la Generalizabilite. Neuchatel, Switzerland: Institut Romand de Recherches et de Documentation Pedagogiques.
Cardinet, J., Y. Tourneur, and L. Allal 1976a The generalizability of surveys of educational outcomes. Pp. 185-198 in D.N.M. DeGruijter and L.J. Th. van der Kamp, eds., Advances in Psychological and Educational Measurement. New York: Wiley.
1976b The symmetry of generalizability theory: applications to educational measurement. Journal of Educational Measurement 13:119-135.
1981 Extension of generalizability theory and its applications in educational measurement. Journal of Educational Measurement 18:183-204.
Cronbach, L.J. 1976 Research on Classrooms and Schools: Formulation of Questions, Design, and Analysis. Occasional paper, Stanford Evaluation Consortium. Stanford University (July).
Cronbach, L.J., G.C. Gleser, H. Nanda, and N. Rajaratnam 1972 The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: Wiley.
Davis, C.E. 1974 Bayesian Inference in Two-way Analysis of Variance Models: An Approach to Generalizability. Unpublished doctoral dissertation. University of Iowa.
Dempster, A.P., D.B. Rubin, and R.K. Tsutakawa 1981 Estimation in covariance components models. Journal of the American Statistical Association 76:341-353.
Erlich, O., and R.J. Shavelson 1976 The Application of Generalizability Theory to the Study of Teaching. Technical Report 76-9-1, Beginning Teacher Evaluation Study. Far West Laboratory, San Francisco.
Fyans, L.J. 1977 A New Multi-Level Approach for Cross-Cultural Psychological Research. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.
Hartley, H.O., J.N.K. Rao, and L. LaMotte 1978 A simple synthesis-based method of variance component estimation. Biometrics 34:233-242.

Harville, D.A. 1977 Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72:320-340.
Hill, B.M. 1970 Some contrasts between Bayesian and classical inference in the analysis of variance and in the testing of models. Pp. 29-36 in D.L. Meyer and R.O. Collier, Jr., eds., Bayesian Statistics. Itasca, Ill.: F.E. Peacock.
Jöreskog, K.G. 1974 Analyzing psychological data by structural analysis of covariance matrices. In D.H. Krantz, R.C. Atkinson, R.D. Luce, and P. Suppes, eds., Contemporary Developments in Mathematical Psychology, Vol. II. San Francisco: W.H. Freeman & Company.
Kahan, J.P., N.M. Webb, R.J. Shavelson, and R.M. Stolzenberg 1985 Individual Characteristics and Unit Performance: A Review of Research and Methods. R-3194-MIL. Santa Monica, Calif.: The Rand Corporation.
Muthén, B. 1978 Contributions to factor analysis of dichotomous variables. Psychometrika 43:551-560.
1983 Latent variable structural equation modeling with categorical data. Journal of Econometrics 22:43-65.
Muthén, L. 1983 The Estimation of Variance Components for Dichotomous Dependent Variables: Applications to Test Theory. Unpublished doctoral dissertation, University of California, Los Angeles.
Office of the Assistant Secretary of Defense (Manpower, Reserve Affairs, and Logistics) 1983 Second Annual Report to the Congress on Joint-Service Efforts to Link Standards for Enlistment to On-the-Job Performance. A report to the House Committee on Appropriations. U.S. Department of Defense, Washington, D.C.
Plewis, I. 1981 Using longitudinal data to model teachers' ratings of classroom behavior as a dynamic process. Journal of Educational Statistics 6:237-255.
Rogosa, D. 1980 Comparisons of some procedures for analyzing longitudinal panel data. Journal of Economics and Business 32:136-151.
Rogosa, D., D. Brandt, and M. Zimowski 1982 A growth curve approach to the measurement of change. Psychological Bulletin 90:726-748.
Rogosa, D., R. Floden, and J.B. Willett 1984 Assessing the stability of teacher behavior. Journal of Educational Psychology 76:1000-1027.
Rubin, D.B., reviewer 1974 The dependability of behavioral measurements: theory of generalizability for scores and profiles. Journal of the American Statistical Association 69:1050.
Shavelson, R.J. 1985 Evaluation of Nonformal Education Programs: The Applicability and Utility of the Criterion-Sampling Approach. Oxford, England: Pergamon Press.
Shavelson, R.J., and N.M. Webb 1981 Generalizability theory: 1973-1980. British Journal of Mathematical and Statistical Psychology 34:133-166.
Shavelson, R.J., N.M. Webb, and L. Burstein 1985 The measurement of teaching. In M.C. Wittrock, ed., Handbook of Research on Teaching, 3rd ed. New York: Macmillan.

Tourneur, Y. 1978 Les Objectifs du Domaine Cognitif. 2me Partie: Theorie des Tests. Ministere de l'Education Nationale et de la Culture Francaise, Universite de l'Etat a Mons, Faculte des Sciences Psycho-Pedagogiques.
Tourneur, Y., and J. Cardinet 1979 Analyse de Variance et Theorie de la Generalizabilite: Guide pour la Realization des Calculs. Doc. 790.803/CT/9. Universite de l'Etat a Mons, France.
U.S. Department of Labor 1972 Handbook for Analyzing Jobs. Washington, D.C.: U.S. Department of Labor.
Webb, N.M., and R.J. Shavelson 1981 Multivariate generalizability of general educational development ratings. Journal of Educational Measurement 18:13-22.
Webb, N.M., R.J. Shavelson, J. Shea, and E. Morello 1981 Generalizability of general educational development ratings of jobs in the United States. Journal of Applied Psychology 66:186-191.
Webb, N.M., R.J. Shavelson, and E. Maddahian 1983 Multivariate generalizability theory. Pp. 67-82 in L. J. Fyans, Jr., ed., Generalizability Theory: Inferences and Practical Applications. San Francisco: Jossey-Bass.
Wittman, W.W. 1985 Multivariate reliability theory: principles of symmetry and successful validation strategies. Pp. 1-104 in R.B. Cattell and J.R. Nesselroade, eds., Handbook of Multivariate Experimental Psychology, 2nd ed. New York: Plenum Press.