
4

Reliability of the Achievement Levels

Chapter 3 focused on the standard setting process used for NAEP. That process began with a general policy description of what each level is intended to represent (e.g., mastery over challenging subject matter) and a set of items that had been used to assess the knowledge and skills elaborated in the assessment frameworks.1 The standard setting process produces two key outcomes: the detailed achievement-level descriptors (ALDs) and cut scores, which are inextricably linked (see Chapter 3). In this chapter and the next, we focus on their reliability and validity.

When evaluating any standard setting, it is important to separate the standard setting process used to establish cut scores from the achievement-level descriptors used to provide meaning to those cut scores. Even though the two are intertwined, it is possible to set cut scores through procedures that meet current criteria for best practices and yet create ALDs that are not aligned with those cut scores, and vice versa.

In this chapter, we examine measures of internal consistency reported for the outcomes of the 1992 standard settings. We consider this information in light of the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1985, 1999, 2014) and the best practices that were in place in 1992 and that are now followed.

__________________

1 We note that this sequence does not reflect current best practices, which are that detailed ALDs should be developed at the same time as the framework or test blueprint and should guide test development. After the 1992 standard settings, NAEP adjusted its procedures to be more in line with best practices.


SOURCES OF EVIDENCE

The information in this chapter is drawn from two primary sources: (1) reports prepared by ACT that document the standard setting process and outcomes and (2) reports prepared under the auspices of the National Academy of Education (NAEd) by its Panel on the Evaluation of the NAEP Trial State Assessment Project.

The ACT-produced documentation includes four reports:

  1. Design Document for Setting Achievement Levels on the 1992 National Assessment of Educational Progress in Mathematics, Reading, and Writing: Final Version (ACT, Inc., 1992)
  2. Description of Mathematics Achievement Levels-Setting Process and Proposed Achievement-Level Descriptions: 1992 National Assessment of Educational Progress, Volumes 1 and 2 (ACT, Inc., 1993a, 1993b)
  3. Setting Achievement Levels on the 1992 National Assessment of Educational Progress in Reading: Final Report, Volumes 1 and 2 (ACT, Inc., 1993d, 1993e)
  4. Setting Achievement Levels on the 1992 National Assessment of Educational Progress in Mathematics, Reading, and Writing: A Technical Report on Reliability and Validity (ACT, Inc., 1993c)

These summary documents were prepared by ACT in its role as the contractor for the standard settings, but it is important to note that decisions about the actual design of the standard setting (choice of methods, selection of panelists, step-by-step procedures, described in Chapter 3) were made in consultation with the National Assessment Governing Board (NAGB) and its advisors.2 In addition, ACT appointed its own technical advisory team.

NAEd appointed a panel to lead the evaluations, headed by two cochairs (Robert Glaser and Robert Linn) and a principal investigator (Lorrie Shepard). All of the standard setting data collected by ACT were shared with this panel. In some cases, the NAEd evaluators reviewed and replicated analyses conducted by ACT; in other cases, they conducted new analyses of the data. The NAEd panel also designed and carried out its own studies.

__________________

2 NAGB’s staff led this work, with advice from board members and advisers, including the standing Achievement Levels Committee. ACT’s external Technical Advisory Committee on Standard Setting provided ongoing advice for the work.


Some of this work was carried out by researchers at the American Institutes for Research (AIR), overseen by a designated project director (George Bohrnstedt), and some by content-area specialists in reading and mathematics.3

The NAEd-produced reports consist of two volumes:

  1. Setting Performance Standards for Student Achievement: A Report of the National Academy of Education Panel on the Evaluation of the NAEP Trial State Assessment: An Evaluation of the 1992 Achievement Levels (National Academy of Education, 1993a)
  2. Setting Performance Standards for Student Achievement: Background Studies (National Academy of Education, 1993b), which included the following studies:
    • An evaluation of the 1992 reading achievement levels, by David Pearson and Lizanne DeStefano (1993), which had three parts:
      • Report 1: a commentary on the process
      • Report 2: an analysis of the achievement-level descriptions
      • Report 3: comparison of cutpoints for the 1992 achievement levels with those set by alternate means
    • Validity of the achievement-level setting process, by Donald McLaughlin (1993c)
    • Order of Angoff ratings in multiple simultaneous standards, by Donald McLaughlin (1993a)
    • Rated achievement levels of completed NAEP mathematics books, by Donald McLaughlin (1993b)
    • Expert panel review of the 1992 mathematics achievement levels, by Edward Silver and Patricia Ann Kenney (1993)
    • Comparison of teachers’ and researchers’ ratings of students’ performance in mathematics and reading with NAEP measurement of achievement levels, by Donald McLaughlin, Phyllis A. DuBois, Marian S. Eaton, Dey E. Ehrlich, Fran B. Stancavage, Catherine A. O’Donnell, Jin-Ying Yu; Lizanne DeStefano, David Pearson, Diane Bottomley, Cheryl Ann Bullock, Matthew Hanson, and Cindi Rucinski (1993)
    • Comparisons of student performance on NAEP and other standardized tests, by Elizabeth Hartka (1993)
    • Comparison of the NAEP trial state assessment results with the IAEP international results, by Albert Beaton and Eugenio Gonzalez (1993)

__________________

3 NAEd had originally been charged with evaluating the trial state assessment. Evaluation of the achievement levels was added to the charge in 1992, and the first report on this topic was issued in 1993 (Shepard et al., 1993). NAEd continued to monitor the trial state assessment and issued reports in 1990, 1992, and 1994 (see Glaser et al., 1997, for final report).


Unless otherwise noted, the results discussed in the rest of this chapter are from these reports; the committee did not do any first-hand analyses.

RELIABILITY

Reliability as a Measure

Reliability is a measure of the precision of test scores: that is, the degree to which test scores for a group of test takers are consistent over repeated administrations of the test. Reliability reflects the degree to which the scores are inferred to be dependable and consistent for an individual test taker and the degree to which scores are free of random errors of measurement for a given group, as specified in testing standards (American Educational Research Association et al., 2014, pp. 222-223).

In the context of standard setting, reliability can be thought of as the degree to which the same cut scores would be obtained if the process were repeated. Usually, one wants to know the extent to which results might vary if the process were simply repeated with the same panelists, the same items, and the same cut-score setting method. It is also typical to consider the stability of the results over different conditions, such as different panelists, different sets of items, and/or different cut-score setting methods. Using the terminology developed by Hambleton and Pitoniak (2006) and Hambleton et al. (2012), we refer to these indicators as evidence of “internal consistency.” Internal consistency evidence can be referred to as reliability, but some refer to it as evidence of validity (e.g., Hambleton and Pitoniak, 2006; Kane, 2001, 2012). As Kane stated (2001, p. 70):

Consistency in the results does not provide compelling evidence for the validity of the proposed interpretation of the cut score, but it does rule out one challenge to validity. Results that are not internally consistent do not justify any conclusions.

Hambleton and Pitoniak (2006, pp. 460-461) classified internal consistency evidence into three types: interpanelist consistency; intrapanelist consistency; and consistency within method (which we term consistency across replications). We discuss these below. First, however, we consider the nature of the judgment task that panelists perform with the modified Angoff procedure and the resulting data.

The Judgment Task

For the modified Angoff procedure that was used in the NAEP standard setting for dichotomous items, panelists make judgments about performance at the borderline of each achievement level: that is, at the borderline of Basic and Below Basic, the borderline of Proficient and
Basic, and the borderline of Advanced and Proficient. Panelists are asked to imagine a “hypothetical borderline examinee” and indicate the probability (between 0 and 1) that this student will correctly answer a given question.4 Panelists estimate the probability of a correct response for each question and for each of the borderlines.5 For each borderline, an individual panelist’s probabilities are summed, and the sum is treated as that panelist’s estimate of the total score needed to reach the achievement level (Jaeger, 1989; Morgan and Perie, 2004). The panelists’ cut-score estimates are then averaged, and the result is the cut score for that round of estimates. The standard deviation of panelists’ estimates around this average is also calculated and represents the extent of variability in the panelists’ cut-score judgments. Initially, the average is in the metric of a raw score (or percentage correct), but it is usually converted to a scale score for reporting to the panelists.
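To make the aggregation concrete, the following minimal sketch computes one round's cut score and the spread of panelists' judgments for a single borderline. The probability values are invented for illustration; they are not 1992 ratings.

```python
import numpy as np

# Minimal sketch of the modified Angoff aggregation for one borderline (e.g.,
# Basic/Below Basic).  The ratings are invented for illustration.
# ratings[p, i] = panelist p's probability that a borderline student answers item i correctly.
ratings = np.array([
    [0.80, 0.55, 0.35, 0.90, 0.60],   # panelist 1
    [0.70, 0.60, 0.40, 0.85, 0.50],   # panelist 2
    [0.75, 0.50, 0.30, 0.95, 0.65],   # panelist 3
])

# Each panelist's cut-score judgment is the sum of his or her probabilities
# (the expected raw score for the hypothetical borderline examinee).
panelist_cut_scores = ratings.sum(axis=1)

# The round's cut score is the average of the panelists' judgments; the standard
# deviation summarizes the variability of those judgments.
cut_score = panelist_cut_scores.mean()
spread = panelist_cut_scores.std(ddof=1)

print(f"panelist judgments (raw-score metric): {panelist_cut_scores}")
print(f"cut score for this round: {cut_score:.2f}, standard deviation: {spread:.2f}")
```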

Each panelist must make one judgment for each item and each cut score. Thus, if a test has 100 items and performance is divided into four levels (Below Basic, Basic, Proficient, and Advanced), three cut scores are needed, so each panelist must make 300 judgments. With the modified Angoff method,6 this process is usually done three times (or in three rounds). Between the first and second rounds and between the second and third rounds, panelists are given feedback that allows them to compare their cut-score estimates to those of other panelists and to the overall average and to discuss the differences.

In addition, panelists usually receive feedback called “consequence data,” which shows the percentages of students who would be placed into each achievement level based on the cut scores for that round. Panelists then work separately to review their probability estimates for each item and make changes as needed. Thus, the cut scores after round 1 reflect independent judgments, while those after rounds 2 and 3 reflect judgments informed by feedback and discussion. The cut scores that result from the third round are the ones recommended to NAGB, which has the authority to approve or modify them.

__________________

4 Another way to consider this task is to picture 100 borderline students and determine how many would answer each question correctly (Morgan and Perie, 2004, p. 11).

5 For example, panelists are asked to consider the hypothetical student on the borderline of Basic and Below Basic and then to estimate the probability that the student would answer each of the questions correctly. For a given panelist, the sum of the probabilities is his or her cut-score judgment for the Basic level. Panelists would then do the same thing for students on the borderline of Basic and Proficient, and, again, the sum of each panelist’s probabilities is his or her cut-score judgment for Proficient. Again, this task is repeated to obtain each panelist’s judgment about a cut score for the borderline between Proficient and Advanced.

6 The original Angoff method does not use this process of iterations and feedback.


The standard deviations for the average cut scores are given as much attention as the cut scores because they reflect the extent to which the panelists reached consensus.

INTERPANELIST CONSISTENCY

One expectation for a standard setting is that panelists’ judgments will begin to converge over the rounds and thus that the variation among their estimates will be less from one round to the next. Interpanelist consistency reflects the extent of agreement among panelists on their judgments for each of the cut scores. It is measured by calculating the standard deviation of the distribution of cut-score judgments (along with the means). There are no absolute standards for the extent of agreement that is considered acceptable. Instead, the analyses focus on two patterns: (1) the improvement in panelist agreement from one round to the next and (2) the extent of agreement among panelists by the final round of ratings. Although there is no expectation of perfect agreement among panelists, there should not be extreme disagreement.
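A minimal sketch of this round-to-round convergence check, again with invented judgments rather than the 1992 data:

```python
import numpy as np

# judgments_by_round[r][p] = panelist p's cut-score judgment (scale-score metric) in round r.
# Values are illustrative only.
judgments_by_round = np.array([
    [205., 241., 228., 190., 236.],   # round 1
    [210., 232., 226., 201., 230.],   # round 2
    [212., 228., 224., 208., 226.],   # round 3
])

means = judgments_by_round.mean(axis=1)
sds = judgments_by_round.std(axis=1, ddof=1)

for r, (m, s) in enumerate(zip(means, sds), start=1):
    print(f"round {r}: mean cut score = {m:.1f}, SD = {s:.1f}")

# The expectation described in the text is that the SD shrinks from round to round.
print("SD decreased every round:", bool(np.all(np.diff(sds) < 0)))
```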

The ACT reports we considered used several different statistics to estimate interpanelist consistency, which we determined were statistically robust. However, they were generated through technically complex procedures and reported in metrics that are difficult to explain to nonexperts.7 To make this information more accessible, the NAEd evaluators converted the ACT statistics to the NAEP scale score metric when reporting interpanelist consistency. We judged the NAEd metric to be more straightforward to explain, and we therefore draw on the NAEd reports for the mathematics and reading cut scores discussed in the rest of this section.

__________________

7 They were analyzed in two different ways and reported in two different metrics. First, the percentage correct estimates were placed on the IRT [item response theory] theta scale and then converted to a scale with a mean of 75 and a standard deviation of 15 (see ACT, Inc., 1993c). The documentation does not clearly explain why this scale was chosen or exactly how to interpret the standard deviations: it is described simply as a linear transformation. Second, generalizability analyses were used to estimate the variance components associated with panelists and items. This approach is based on an analysis-of-variance procedure. It is a robust though complex way to estimate the variance attributable to different sources (e.g., panelists and items). The details appear in the ACT technical report on validity and reliability (ACT, Inc., 1993c, Ch. 4). The results indicated that the variance associated with panelists decreased across rounds for both reading and mathematics, as did the error variance. This result was interpreted as evidence that panelists’ judgments were converging. The variance associated with items did not show the same pattern, which was interpreted as evidence that the panelists were becoming more attentive to differing item properties across rounds, a positive finding: see above section, “The Judgment Task” and following discussion.
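As a rough illustration of the variance-component idea described in footnote 7 (a sketch of the general approach, not a reproduction of ACT's analysis), the following code estimates panelist, item, and residual components for a fully crossed panelist-by-item design, using standard analysis-of-variance estimators and simulated ratings:

```python
import numpy as np

# Simulated panelist-by-item Angoff ratings for illustration only.
rng = np.random.default_rng(0)
n_panelists, n_items = 10, 30
panelist_effects = rng.normal(0, 0.05, size=(n_panelists, 1))
item_effects = rng.normal(0, 0.15, size=(1, n_items))
ratings = 0.6 + panelist_effects + item_effects + rng.normal(0, 0.05, size=(n_panelists, n_items))

grand = ratings.mean()
panelist_means = ratings.mean(axis=1, keepdims=True)
item_means = ratings.mean(axis=0, keepdims=True)

# Sums of squares and mean squares for the two-way crossed design.
ss_p = n_items * ((panelist_means - grand) ** 2).sum()
ss_i = n_panelists * ((item_means - grand) ** 2).sum()
ss_res = ((ratings - panelist_means - item_means + grand) ** 2).sum()

ms_p = ss_p / (n_panelists - 1)
ms_i = ss_i / (n_items - 1)
ms_res = ss_res / ((n_panelists - 1) * (n_items - 1))

# Standard estimators: E[MS_p] = var_res + n_items * var_p, etc.
var_res = ms_res
var_p = max((ms_p - ms_res) / n_items, 0.0)      # variance attributable to panelists
var_i = max((ms_i - ms_res) / n_panelists, 0.0)  # variance attributable to items

print(f"variance components: panelists={var_p:.4f}, items={var_i:.4f}, residual={var_res:.4f}")
```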


Mathematics

Table 4-1 shows the average cut scores for mathematics for each achievement level and grade at the end of each of the three rounds, and Table 4-2 shows the associated standard deviations. For the most part, the cut scores did not vary by more than a few points across the rounds, sometimes higher and sometimes lower. The exception was the cut score for the Advanced level in 4th grade, which decreased by 9.5 points from round 1 (292.9) to round 3 (283.4). In all cases but one, the standard deviations decreased across rounds, which suggests that agreement among panelists was increasing. (The only exception was the Advanced level in 12th grade, for which the standard deviation increased from round 1 to round 3, from 11.6 to 12.7.) However, the decreases were not large.

Overall, the standard deviations for third-round cut scores ranged from 12.7 to 18.5. On the NAEP scale, the within-grade standard deviations are approximately 40 points. Thus, the third-round values represent between 31.8 and 46.3 percent of the standard deviation for the test. To help interpret these values, we draw from an explanation in the NAEd report (Shepard et al., 1993, p. 69). A standard deviation of 14 points means that a range of 28 points (plus or minus 1 standard deviation) is needed to span the middle two-thirds of the panelists. Thus, the standards set by even the middle group of judges differed from one another by about as much as the difference in actual performance between students at roughly the 25th and 50th percentiles.

Table 4-3 shows the consequence data, that is, the percentage of students scoring at or above each of the recommended cut scores. As shown in the first column (Basic), more than 60 percent of the test takers scored at the Basic level or higher in each grade (61%, 63%, and 64%, respectively, for grades 4, 8, and 12). However, this also means that nearly 40 percent scored Below Basic. Relatively small percentages reached the Proficient level (18%, 25%, and 16%, respectively), and fewer than 5 percent scored in the Advanced category.
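The consequence-data computation itself is straightforward, as the sketch below illustrates. The cut scores are the round-3 grade 4 mathematics values from Table 4-1; the score distribution is simulated, so the printed percentages are illustrative rather than the values reported in Table 4-3.

```python
import numpy as np

# Simulated grade 4 scale scores (illustrative distribution, not NAEP data).
rng = np.random.default_rng(1)
scores = rng.normal(loc=220, scale=40, size=100_000)

# Round-3 grade 4 mathematics cut scores from Table 4-1.
cuts = {"Basic": 212.8, "Proficient": 251.1, "Advanced": 283.4}
for level, cut in cuts.items():
    pct = 100 * np.mean(scores >= cut)
    print(f"at or above {level} ({cut}): {pct:.0f}%")
print(f"Below Basic: {100 * np.mean(scores < cuts['Basic']):.0f}%")
```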

Reading

Table 4-4 shows the average cut scores for reading for each achievement level and grade at the end of each of the three rounds, and Table 4-5 shows the associated standard deviations. The cut scores varied across rounds slightly more than they did for mathematics. The largest change was for grade-4 Advanced, which decreased by 14 points (287.5 for round 1 to 273.4 for round 3). For grades 4 and 8, the cut scores decreased from round 1 to round 3, but they increased for grade 12. The standard deviations did not consistently show the desired pattern of decreasing across rounds.


TABLE 4-1 Mathematics Achievement-Level Cut Scores by Round

              Grade 4                     Grade 8                     Grade 12
Level         Round 1  Round 2  Round 3   Round 1  Round 2  Round 3   Round 1  Round 2  Round 3
Basic         215.8    207.4    212.8     248.3    250.4    256.4     285.3    288.8    292.4
Proficient    257.6    248.0    251.1     294.2    294.2    300.7     333.8    332.4    335.0
Advanced      292.9    280.5    283.4     329.5    329.3    337.1     366.6    363.4    365.9

NOTES: The cut scores are determined by averaging panelists’ judgments. The values shown are the means. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.8M). Reprinted with permission from the National Academy of Education.

TABLE 4-2 Mathematics Achievement-Level Cut-Score Standard Deviations by Round

              Grade 4                     Grade 8                     Grade 12
Level         Round 1  Round 2  Round 3   Round 1  Round 2  Round 3   Round 1  Round 2  Round 3
Basic         21.0     21.8     18.5      28.8     19.1     16.5      25.7     19.0     18.4
Proficient    16.9     16.9     14.7      20.7     17.5     16.6      14.0     14.1     13.9
Advanced      18.3     15.9     13.2      19.5     15.9     17.0      11.6     12.4     12.7

NOTE: The standard deviations are for the means in Table 4-1. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.9M). Reprinted with permission from the National Academy of Education.


TABLE 4-3 Percentages of 1992 Students Scoring at or Above Each Achievement-Level Cut Score on the NAEP Scale in Mathematics

                          Percentage of Students at or Above Each Achievement Level
Grade   Assessment Year   Basic        Proficient     Advanced
4       1992              61           18             2
8       1992              63           25             4
12      1992              64           16             2

SOURCE: Adapted from ACT, Inc. (1993a, Table 5, p. 54).

In five of the nine cases, they decreased (all of grade 4 and Basic and Advanced in grade 12). In the other four cases, they increased (all of grade 8 and the Proficient level of grade 12). These outcomes suggest somewhat less agreement among panelists in reading than in mathematics. Overall, the third-round standard deviations ranged from 7.3 to 17.1 and thus represented between 18.3 and 42.8 percent of the standard deviation for the test.

Table 4-6 shows the consequence data for reading, that is, the percentage of students scoring at or above each of the recommended cut scores. As with mathematics, the majority of students scored Basic or above (59%, 69%, and 75%, respectively, for the 4th, 8th, and 12th grades). Between about one-quarter and one-third scored Proficient or above (25%, 28%, and 37%). Only 5 percent or fewer scored in the Advanced category for the three grades. The percentage of students Below Basic was sizable but not quite as large as for mathematics: 41 percent for 4th grade, 31 percent for 8th grade, and 25 percent for 12th grade.

INTRAPANELIST CONSISTENCY

Intrapanelist consistency is a measure of the extent to which a given panelist’s judgments are consistent across various conditions and criteria. It is intended to reflect the extent to which panelists have “internalized” the ALDs and the assessment framework they are based on. That is, the cut scores implied by an individual panelist’s probability estimates should not be affected by incidental characteristics of the test questions: if panelists fully understand the framework and ALDs, they will apply their conception of borderline performance consistently, regardless of item type or difficulty. Again, absolute consistency is not expected, but each panelist’s judgments should not vary widely across different conditions.

Intrapanelist consistency can be examined in relation to a number of different features of the items, such as their difficulty, the order in which they are presented, the cognitive skills assessed, the type of response required, and context effects associated with the specific set of items.


TABLE 4-4 Reading Achievement-Level Cut Scores by Grade and Round

              Grade 4                     Grade 8                     Grade 12
Level         Round 1  Round 2  Round 3   Round 1  Round 2  Round 3   Round 1  Round 2  Round 3
Basic         212.5    206.1    207.0     242.4    239.6    240.8     254.8    260.6    264.8
Proficient    247.2    241.0    240.5     285.3    280.6    282.9     296.5    302.6    304.6
Advanced      287.5    275.7    273.4     337.5    328.5    330.5     349.2    347.1    350.3

NOTES: The cut scores are determined by averaging panelists’ judgments. The values shown are the means. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.8R, p. 68). Reprinted with permission from the National Academy of Education.

TABLE 4-5 Reading Achievement-Level Cut-Score Standard Deviations by Grade and Round

              Grade 4                     Grade 8                     Grade 12
Level         Round 1  Round 2  Round 3   Round 1  Round 2  Round 3   Round 1  Round 2  Round 3
Basic         19.6     12.9     9.4       15.5     14.7     17.1      13.5     12.7     12.8
Proficient    13.9     9.3      7.3       13.5     13.8     15.1      9.3      8.1      11.2
Advanced      17.5     15.5     14.2      13.9     11.8     14.7      24.2     11.5     16.3

NOTE: The standard deviations are for the means shown in Table 4-4. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.8R, p. 68). Reprinted with permission from the National Academy of Education.


TABLE 4-6 Percentages of Students Scoring at or Above Each Achievement-Level Cut Score on the NAEP Scale for Reading

        Percentage of Students at or Above Each Achievement Level
Grade   Basic        Proficient     Advanced
4       59           25             5
8       69           28             2
12      75           37             3

SOURCE: Adapted from ACT, Inc. (1993d, Table 4.4, pp. 4-19).

Often, panelists’ judgments about the difficulty of each item (probability of answering correctly) are compared with estimates derived from statistical analyses (e.g., the proportion of students answering each item correctly, or p value). ACT conducted studies to compare the panelists’ percent-correct estimates with the statistically based estimates of difficulty. NAEd conducted similar analyses to compare panelists’ consistency across easy and difficult items. ACT also examined the consistency of cut-score estimates in reading and mathematics across three conditions: (1) different sets of items; (2) the order in which items were presented to panelists; and (3) type of response format (multiple choice or extended response). For reading, ACT conducted an additional analysis that examined the consistency of cut scores across reading purpose (literary experience, practical, and informational).

Comparison of Panelists’ and Statistically Based Estimates

ACT conducted an empirical analysis to evaluate the degree of “match” between the judges’ ratings for each item and the expected student performance on that item. This study focused on comparing panelists’ ratings and response probabilities.8 The ACT researchers sorted the mathematics item pool into nine sets of items by grade and achievement level, using a response probability of 0.65 and a set of specific decision rules. If an item had a mean rating greater than 0.65 at the Basic level, it was classified as Basic.

__________________

8 A response probability is an estimate of the chance that a given student with a specified ability (theta) will answer a certain question correctly. A response probability of 0.65 means that an examinee at a given score level has a 65 percent chance of answering the item correctly. These estimates are derived using IRT item characteristic curves.
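A rough illustration of the response-probability idea in footnote 8, assuming a three-parameter logistic (3PL) IRT model with invented item parameters:

```python
import math

# Under an assumed 3PL model, the response probability for an examinee at ability
# theta depends on the item's discrimination (a), difficulty (b), and guessing (c)
# parameters.  All parameter values here are invented for illustration.
def response_probability(theta: float, a: float, b: float, c: float) -> float:
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# An examinee exactly at a hypothetical cut score (theta = 0.5) and a moderately
# difficult item: is the response probability at least 0.65?
p = response_probability(theta=0.5, a=1.0, b=0.2, c=0.20)
print(f"response probability = {p:.2f}; meets the 0.65 rule: {p >= 0.65}")
```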


TABLE 4-7 Comparison of Item Classifications Based on Panelists’ Judgments and Statistical Estimates

                 (1) Number of Items     (2) Number of Items     (3) Classification    (4)
                 Classified at Each      Classified at Each      Consistency (in %)    Correlations
                 Level Based on          Level Based on
                 Judges’ Estimates       Statistical Estimates
Grade 4
  Basic                 9                        9                      100                 .87
  Proficient           85                       75                       88                 .90
  Advanced             94                       73                       78                 .78
  Total               188                      157                       84                 .85
Grade 8
  Basic                33                       32                       97                 .92
  Proficient          123                      106                       86                 .92
  Advanced             62                       34                       59                 .85
  Total               218                      172                       79                 .90
Grade 12
  Basic                34                       33                       97                 .93
  Proficient          110                       98                       89                 .92
  Advanced             69                       54                       78                 .83
  Total               213                      185                       87                 .89
All Grades Combined
  Basic                76                       74                       97                 .91
  Proficient          318                      278                       87                 .91
  Advanced            225                      162                       72                 .82
  Total               619                      514                       83                 .88

NOTE: See text for discussion.

SOURCE: ACT, Inc. (1993c, Table 5.4, p. 5-25). Reprinted with permission from ACT, Inc.

For the remaining items, if an item had a mean rating greater than 0.65 at the Proficient level, it was classified as Proficient. All remaining items were classified as Advanced.

The researchers then compared the mean rating for the panelists for each item (judgment based) with the response probability (empirically based) for each item at the cut score for the achievement level. They used a decision rule based on a response probability value of 0.65: if the response probability for an item classified at a particular achievement level was 0.65 or higher at the cut score for that level, then the judges’ rating and the expected student performance matched. If the response probability value was less than 0.65, they did not match. The results appear in Table 4-7.
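The sketch below illustrates the two decision rules just described, classification by the judges' mean ratings and matching against response probabilities, using invented item values:

```python
# Illustrative items: mean_rating gives panelists' average Angoff ratings at each
# borderline, and rp_at_cut gives the response probability for a student exactly
# at the corresponding cut score.  All values are invented.
items = [
    {"id": "M01", "mean_rating": {"Basic": 0.82, "Proficient": 0.95, "Advanced": 0.99},
     "rp_at_cut": {"Basic": 0.78, "Proficient": 0.93, "Advanced": 0.98}},
    {"id": "M02", "mean_rating": {"Basic": 0.40, "Proficient": 0.72, "Advanced": 0.90},
     "rp_at_cut": {"Basic": 0.35, "Proficient": 0.60, "Advanced": 0.88}},
    {"id": "M03", "mean_rating": {"Basic": 0.20, "Proficient": 0.50, "Advanced": 0.70},
     "rp_at_cut": {"Basic": 0.15, "Proficient": 0.45, "Advanced": 0.72}},
]

RP = 0.65  # response-probability criterion used in the ACT analysis

def classify(item):
    # Classify by judges' ratings: the first level at which the mean rating exceeds 0.65.
    for level in ("Basic", "Proficient"):
        if item["mean_rating"][level] > RP:
            return level
    return "Advanced"

matches = 0
for item in items:
    level = classify(item)
    matched = item["rp_at_cut"][level] >= RP
    matches += matched
    print(f'{item["id"]}: classified {level}; matches statistical estimate: {matched}')

print(f"classification consistency: {100 * matches / len(items):.0f}%")
```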


The researchers reported that for all grades and achievement levels combined, 514 of the 619 items (83%) matched. Correlations for the judges’ ratings and response probabilities ranged from a low of .78 for grade-4 Advanced to a high of .93 for grade-12 Basic (see ACT, Inc., 1993c, pp. 5-15, 5-16). They compared the matches across achievement levels and grades and concluded that the “cutpoints on the NAEP score scale are consistent with the conceptualization of the achievement levels incorporated in the ALDs [achievement-level descriptions]” (ACT, Inc., 1993c, p. 5-16).

An analysis of cut-score estimates for items at different difficulty levels was available in the NAEd report. For this comparison, questions were separated into two categories, easy and difficult, on the basis of statistical estimates of their difficulty (p values). The Angoff procedure assumes that panelists can take the differences in item difficulty into account, estimating lower percentages correct for difficult items and higher percentages correct for easier items (for students at the borderline of an achievement level). To the extent that panelists do not fully adjust for these differences, the cut scores derived from the easy items will be lower and the cut scores derived from the difficult items will be higher. This is the pattern that is apparent in Tables 4-8 and 4-9. For instance, the cut score for 4th-grade mathematics at the Basic level was 187.4 when based on the easy items and 205.8 when based on the difficult items (see Table 4-8). This translates to percentages of students at or above Basic of 82 percent and 66 percent, respectively. For 4th-grade reading, the cut score at the Basic level was 179.6 when based on the easy items and 195.6 when based on the difficult items.

TABLE 4-8 Mathematics Achievement-Level Cut Scores, by Item Difficulty

                   Grade 4                      Grade 8                      Grade 12
Item Difficulty    Scale Score  % at or Above   Scale Score  % at or Above   Scale Score  % at or Above
Basic
  Easy             187.4        82              240.2        76              273.4        76
  Difficult        205.8        66              270.8        47              302.8        46
Proficient
  Easy             223.2        46              271.0        47              306.8        42
  Difficult        250.2        16              306.6        14              338.3        12
Advanced
  Easy             258.2        10              301.6        18              341.4        10
  Difficult        281.0         2              339.2         2              365.5         2

NOTE: See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.5M, p. 59). Reprinted with permission from the National Academy of Education.


TABLE 4-9 Reading Achievement-Level Cut Scores, by Item Difficulty

                   Grade 4                      Grade 8                      Grade 12
Item Difficulty    Scale Score  % at or Above   Scale Score  % at or Above   Scale Score  % at or Above
Basic
  Easy             179.6        85              202.8        93              228.2        96
  Difficult        195.6        74              231.4        79              246.2        90
Proficient
  Easy             213.5        57              254.0        59              273.8        70
  Difficult        240.6        27              285.0        25              304.7        35
Advanced
  Easy             242.8        25              286.8        23              314.1        25
  Difficult        266.8         8              320.8         3              342.5         5

NOTE: See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.5R, p. 59). Reprinted with permission from the National Academy of Education.

Again, the impact of these differences is evident in the consequence data: 85 percent of students would score Basic or above if based on a cut score of 179.6, while 74 percent of students would score Basic or above if based on a cut score of 195.6 (see Table 4-9).
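The sketch below illustrates this consistency check: deriving separate cut-score estimates from the easy items and from the difficult items. The ratings and p values are invented, and the mapping of the resulting raw cuts onto the NAEP scale (which the actual analyses required) is omitted.

```python
import numpy as np

# Invented data: item p-values (proportion of students answering correctly) and
# panelist-by-item Angoff ratings for one borderline.
p_values = np.array([0.85, 0.78, 0.72, 0.55, 0.41, 0.33])
ratings = np.array([
    [0.90, 0.85, 0.75, 0.60, 0.50, 0.45],
    [0.85, 0.80, 0.70, 0.55, 0.45, 0.40],
    [0.95, 0.80, 0.80, 0.65, 0.55, 0.50],
])

easy = p_values >= np.median(p_values)

def mean_expected_percent_correct(mask):
    # Each panelist's expected percent correct for the borderline student on the
    # selected items, averaged over panelists.
    return 100 * ratings[:, mask].mean(axis=1).mean()

print(f"expected percent correct, easy items:      {mean_expected_percent_correct(easy):.1f}")
print(f"expected percent correct, difficult items: {mean_expected_percent_correct(~easy):.1f}")
# If panelists fully adjusted for difficulty, the implied scale cut scores from the
# two subsets would be similar; systematic differences signal intrapanelist inconsistency.
```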

Consistency of Cut-Score Judgments Across Item Characteristics

ACT examined the consistency of cut-score estimates in reading and mathematics across three conditions: (1) different sets of items, (2) the order in which items were presented to panelists, and (3) type of response format (multiple choice or extended response). In reading, the researchers conducted an additional analysis that examined the consistency of cut scores across reading purpose (literary experience, practical, and informational).

These analyses revealed no statistically significant differences in cut-score judgments when based on different sets of items or when the items were presented to panelists in different orderings. However, statistically significant differences in average cut-score judgments were found for reading for different reading purposes. Statistically significant differences were also found when the items were grouped by response format, and this was true for both subject areas. The rest of this section details those two differences in cut-score judgments.


Reading Purpose

Comparisons of the average cut-score judgments for items grouped by reading purpose revealed differences at all grade levels. At grade 4, items related to literary experience passages were rated as more difficult than items related to practical and informational passages: that is, the literary items were judged to have lower expected passing rates for students at the borderline. At grade 8, items related to practical reading tasks were rated significantly less difficult than items related to reading for literary experience; items related to literary experience were rated the most difficult overall. At grade 12, panelists judged the practical passage items to be significantly less difficult than either the literary experience or the informational passage items (ACT, Inc., 1993d, Ch. 3).

Response Format

As described in Chapter 3, the 1992 assessments included some items that were scored either correct or incorrect (i.e., dichotomous) and other items that required an extended response and were scored on a four-point scale (i.e., polytomous), on which higher values indicated better performance. The modified Angoff cut-score setting method was used for the dichotomously scored items; a procedure called the boundary exemplars method was used for the extended-response items. The cut scores for all three achievement levels were higher when based solely on the polytomous items than when based solely on the dichotomous items.9

In the ACT reports, the results of these analyses were based on generalizability theory and analysis-of-variance techniques and, as such, are unnecessarily complex for the purposes of this report. Instead, we draw on the analyses in the NAEd reports, which are expressed in the NAEP scale score metric. Specifically, Shepard et al. (1993) report the average cut scores when derived separately for (1) dichotomous questions and polytomous questions, (2) items grouped by cognitive task, and (3) easy items as compared with difficult items.


__________________

9 For the boundary exemplars procedure, panelists were given a random sample of actual responses for each extended-response item prompt in the pool. The panelists selected from one to two papers for each prompt as representative of borderline performance for each achievement level (Basic, Proficient, and Advanced). These scores were averaged over panelists, and separate cut scores were derived for just the polytomous items. The considerable differences in the cut scores when based solely on the polytomous items and solely on the dichotomous items could be attributable to the item type or to the use of different cut-score setting procedures (see ACT, Inc., 1993c, p. 3-10). The data collection design did not allow for examination of the interaction of item type by cut-score method.


For the first comparison of dichotomous and polytomous questions, the cut scores based on extended-response questions were much higher than those based on multiple-choice questions. Tables 4-10 and 4-11 present these results for mathematics and reading, respectively. Each table compares the cut scores for each achievement level and grade when based only on the dichotomously scored items and only on the polytomously scored items. To show the effects of the different cut scores, the percentage of students who would score at each achievement level or above is reported.

As shown in Table 4-10, the cut score for Basic in 4th-grade mathematics would be 210.4 if based only on the dichotomous items and 266.5 if based only on the polytomous items, a difference of 56 scale points. This translates to a large difference in the percentage of students at or above the cut score: at a cut score of 210.4, 61 percent would score at or above Basic; at a cut score of 266.5, only 6 percent would.

As shown in Table 4-11, large differences in cut scores were also found for reading. For instance, for the Proficient level for 12th grade, at the cut score based on dichotomous items (293.6), 48 percent of students would score at Proficient or above; when based on polytomous items (362.6), only 1 percent would score at Proficient or above.

The second analysis allowed comparison of cut scores by response format, multiple choice or short answer, disentangling the effect of the cut-score setting method.

TABLE 4-10 Mathematics Achievement-Level Cut Scores by Level and Item Type: Dichotomously Scored or Extended Response

                      Grade 4                      Grade 8                      Grade 12
Level and Item Type   Scale Score  % at or Above   Scale Score  % at or Above   Scale Score  % at or Above
Basic
  Dichotomous         210.4        61.0            255.6        63.0            288.9        61.0
  Extended response   266.5         6.0            302.7        17.0            344.3         9.0
Proficient
  Dichotomous         250.0        16.0            297.5        21.0            333.9        16.0
  Extended response   304.4         0.2            345.5         1.0            374.0         1.0
Advanced
  Dichotomous         282.7         2.0            333.3         3.0            366.2         2.0
  Extended response   330.8         0.01           376.8         0.03           388.0         0.2

NOTES: Cut scores for the dichotomously scored items were set with the Angoff procedure. Cut scores for the extended-response items were set with the boundary exemplars procedure. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.3M, p. 55). Reprinted with permission from the National Academy of Education.


TABLE 4-11 Reading Achievement-Level Cut Scores, by Level and Item Type: Dichotomously Scored or Extended Response

                      Grade 4                      Grade 8                      Grade 12
Level and Item Type   Scale Score  % at or Above   Scale Score  % at or Above   Scale Score  % at or Above
Basic
  Dichotomous         190.6        78.0            232.5        78.0            249.9        88.0
  Extended response   281.1         3.0            290.1        20.0            329.1        12.0
Proficient
  Dichotomous         229.6        39.0            272.1        38.0            293.6        48.0
  Extended response   317.4         0.1            336.4         1.0            362.6         1.0
Advanced
  Dichotomous         259.9        12.0            311.0         7.0            336.8         7.0
  Extended response   356.2                        388.8                        393.6         0.01

NOTES: Cut scores for the dichotomously scored items were set with the Angoff procedure. Cut scores for the extended-response items were set with the boundary exemplars procedure. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.3R, p. 55). Reprinted with permission from the National Academy of Education.

For this analysis, cut scores were compared for multiple-choice and open-ended short-answer questions (all of which had been set using the Angoff method). Tables 4-12 and 4-13 show these results. In this case, differences between the cut scores remained but were not quite as large as those shown in the previous analysis.

For instance, for the Basic level in 4th-grade mathematics (see Table 4-12), the cut score based on multiple-choice items was 190.8, with 80 percent of students scoring Basic or above. The cut score based on the short-answer items was 218.3, with 52 percent of students scoring Basic or above: a difference of 28 percentage points in the percentage of students scoring at or above the Basic level. However, for some levels and grades, identical or nearly identical cut scores resulted: for example, see the 4th and 8th grades for the Proficient level.

The differences in cut scores for reading (see Table 4-13) were also smaller than those found in the previous analysis, though none of the pairs were identical. The smallest difference was for the Advanced level in grade 4: 255.4 for multiple-choice items and 263.4 for short-answer items, a difference of 8 points. The largest was for the Basic level in grade 8: 184.6 for the multiple-choice items and 240.0 for the short-answer items, a difference of 55 points.

Considerable attention was given to these findings, particularly to the differences in cut scores associated with different item response formats.


TABLE 4-12 Mathematics Achievement-Level Cut Scores, by Achievement Level and Item Type: Multiple Choice or Short Answer

                      Grade 4                      Grade 8                      Grade 12
Level and Item Type   Scale Score  % at or Above   Scale Score  % at or Above   Scale Score  % at or Above
Basic
  Multiple choice     190.8        80              246.3        71              278.1        72
  Short answer        218.3        52              257.4        61              307.3        41
Proficient
  Multiple choice     246.7        19              294.6        24              324.3        24
  Short answer        246.9        19              296.1        23              343.6         9
Advanced
  Multiple choice     282.8         2              330.3         4              358.4         3
  Short answer        275.8         3              334.7         3              369.2         1

NOTE: Cut scores for both item types were set with the Angoff procedure. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.4M, p. 57). Reprinted with permission from the National Academy of Education.

TABLE 4-13 Reading Achievement-Level Cut Scores, by Achievement Level and Item Type: Multiple Choice or Short Answer

                      Grade 4                      Grade 8                      Grade 12
Level and Item Type   Scale Score  % at or Above   Scale Score  % at or Above   Scale Score  % at or Above
Basic
  Multiple choice     175.9        87              184.6        97              219.7        98
  Short answer        199.7        70              240.0        72              258.7        83
Proficient
  Multiple choice     222.7        47              259.1        53              283.1        61
  Short answer        234.5        33              281.2        29              296.4        45
Advanced
  Multiple choice     255.4        15              296.2        15              327.8        13
  Short answer        263.4        10              322.2         3              337.9         7

NOTES: Cut scores for both item types were set with the Angoff procedure. See text for discussion.

SOURCE: Shepard et al. (1993, Table 3.4R, p. 57). Reprinted with permission from the National Academy of Education.

As noted in the ACT, Inc. (1993c, p. 12) technical report:

This difference was the subject of many discussions by the Technical Advisory Committee on Standard Setting, ACT’s Technical Advisory Team, NAGB’s Achievement Levels Committee, and the project staff of both ACT and NAGB. Several plausible hypotheses have been put forward and many analyses have been conducted. No definitive conclusion can be reached as to the actual cause(s) of the difference.


The differences may have been due to the particular sample of papers the panelists reviewed for each extended-response question. Alternatively, the panelists may simply have had higher expectations for student performance on the extended-response questions than on the multiple-choice questions. The extended-response and multiple-choice questions may have tapped different parts of the central construct, or there may have been issues of multidimensionality. The differences may also have been due to the use of two different standard setting methods, or to insufficient training in either method, or both.

The researchers concluded (ACT, Inc., 1993c, p. 3-8):

Further research will be required to determine the source of the differences in ratings between dichotomous and polytomous items. The analyses that can be conducted with the present data cannot produce conclusive results to determine why.

The NAEd evaluators suggested that NAGB not report results using achievement levels until researchers could offer an explanation for the results (Shepard et al., 1993). However, NAGB chose to average the cut scores for the extended-response questions with those for the multiple-choice and short-answer questions. The final cut score was then a weighted composite (ACT, Inc., 1993c).

Nearly a decade later, the issue was still unresolved. For instance, Haertel (2001, pp. 251-252) noted:

Better feedback procedures have reduced these differences somewhat, but they do not disappear even when panelists are asked to attend to the differences. The cognitive requirements of the different item types and the match to the ALDs may legitimately require differences in standards for the item types. The explanation for this phenomenon is still unknown, but it is a stable result over different content areas and does not seem to be an artifact of the standard setting procedure.

Even today, more than 20 years later, there is no definitive explanation of the differences in the results.

CONSISTENCY ACROSS REPLICATIONS

ACT organized the standard setting in order to estimate the consistency of cut scores across replications of the standard setting process. To do this, panelists were assigned to one of two groups, Group A or Group B. The groups followed identical procedures for making judgments about cut scores, but they worked with different sets of test questions. Thus, two cut scores were derived for each combination of grade and achievement level: one score reflected the average computed from Group A using half of the item pool in each grade and content area.


TABLE 4-14 Cut Scores Averaged for Two Groups of Panelists

Grade and Subject    Basic          Proficient     Advanced
Grade 4
  Mathematics        211 (1.87)     248 (4.06)     280 (3.98)
  Reading            212 (2.48)     243 (2.06)     275 (8.76)
Grade 8
  Mathematics        256 (2.40)     294 (5.72)     331 (4.84)
  Reading            244 (2.59)     283 (0.84)     328 (7.75)
Grade 12
  Mathematics        287 (4.18)     334 (0.25)     366 (0.72)
  Reading            269 (7.93)     304 (2.81)     348 (4.11)

NOTE: Numbers in parentheses are standard errors. See text for discussion.

SOURCE: Adapted from ACT, Inc. (1993a, Table 4, p. 53; 1993c, Table 4.3, p. 4-13; 1993d, Table 4.2, p. 4-18).

The other score reflected the average computed from Group B using the other half of the item pool. The two pool halves were constituted to be as equivalent as possible.
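A small sketch of the two-group logic follows. The Group A and Group B values are hypothetical (only the averaged cut scores were reported), and the half-difference standard-error estimator is an assumption on our part; the documentation does not spell out the exact formula ACT used.

```python
# Hypothetical cut scores from the two panelist groups (invented values).
group_a_cut = 212.4   # Group A, based on half of the item pool
group_b_cut = 209.6   # Group B, based on the other half of the pool

# The reported value is the average of the two replications; the standard error
# here is the assumed half-difference estimator from two independent replications.
average_cut = (group_a_cut + group_b_cut) / 2
standard_error = abs(group_a_cut - group_b_cut) / 2

print(f"averaged cut score: {average_cut:.1f}")
print(f"standard error:     {standard_error:.2f}")
```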

Table 4-14 shows the mean of the two sets of cut scores, as well as their standard errors.10 All values are expressed in NAEP scale score units.11 In assessing the magnitude of these standard errors, ACT compared them with the standard deviations for the tests. The average standard deviation in mathematics was approximately 40 points, considered over all five content areas of the assessment and the three grade levels (4, 8, and 12). In reading, the average standard deviation across grades was approximately 49 points. In mathematics, the standard errors of the mean cut points were, on average, only about 8 percent as large as the average standard deviation of the NAEP score scale; in reading, the ratio was approximately 9 percent (ACT, Inc., 1993c).

We note that the standard error can also be used to place a “confidence interval” around the cut score to reflect the likely range of cut scores if additional replications were conducted.

__________________

10 That is, the values shown are (mean of Group A + mean of Group B) / 2; the actual means for Groups A and B were not reported in the documentation.

11 As noted by ACT, Inc. (1993c, Ch. 3), treating these two groups (A and B) as independent estimators of the achievement-level standards had two advantages: (1) it provided a cross validation of the standard setting process and (2) it yielded a convenient, unbiased estimate of the standard error of the mean over both groups, without requiring statistical assumptions stronger than the central limit theorem. This estimator does not require any independence assumptions about how the panelists interacted within a group.


This range is calculated by subtracting and adding the standard error to the cut score: subtracting and adding 1 standard error yields a 68 percent confidence interval; subtracting and adding 2 standard errors yields a 95 percent confidence interval.

For example, the first row of Table 4-14 shows that the standard error for the cut score for the Basic level in 4th-grade mathematics was 1.87. Adding and subtracting 1.87 from the cut score (211) yields a range of 210-213 (rounded). This range means that if the standard setting were replicated numerous times, 68 percent of the time the cut score would be in the 210-213 range; 95 percent of the time it would be in the 207-215 range (211 plus or minus (1.87 × 2)). These ranges of 3 points for the 68 percent confidence interval and 8 points for the 95 percent confidence interval are quite good.

Other standard errors are much higher. For example, the highest standard error in the table is for the Advanced level in 4th-grade reading, at 8.76. For this level, the 68 percent confidence interval ranges from 266 to 284 (18 points), and the 95 percent confidence interval ranges from 257 to 293 (36 points).
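A quick check of this confidence-interval arithmetic, using the largest standard error in Table 4-14:

```python
# Grade 4 reading, Advanced: cut score 275, standard error 8.76 (Table 4-14).
cut, se = 275, 8.76
print(f"68% interval: {cut - se:.0f} to {cut + se:.0f}")      # about 266 to 284
print(f"95% interval: {cut - 2*se:.0f} to {cut + 2*se:.0f}")  # about 257 to 293
```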

It is important to note that the results in Table 4-14 reflect only two replications: the average of two sets of cut scores, those for Group A and those for Group B. Additional replications would be needed to obtain a more stable estimate of consistency across replications.

CONCLUSIONS

The materials ACT prepared to document the standard settings were very detailed and included the kinds of reliability information one would expect to see. A considerable number of in-depth analyses were conducted and reported, and they represent the kinds of analyses suggested by the Standards for Educational and Psychological Testing (hereafter referred to as Standards) and best practice at the time (American Educational Research Association et al., 1985). The Standards indicate the types of data and analyses that should be examined and reported, but they do not go so far as to specify acceptable values. Questions were raised about the findings, particularly about the indicators of intrapanelist consistency, but these questions were not answered before the achievement-level results were reported.

The documentation indicates that ACT and NAGB spent considerable time wrestling with these issues and solicited advice from their consultant teams. They concluded more research was needed, and, as noted above, the NAEd evaluators who studied this issue recommended additional studies before reporting achievement-level results.

NAGB did not accept the recommendation of the NAEd evaluators and proceeded with reporting of achievement-level results. NAGB chose
to lower the cut scores for mathematics by 1 standard error. They chose to accept the cut scores for reading. Both sets of results were reported in 1993. These decisions were the subject of much debate, some of which is characterized in publicly available reports (see, e.g., Bourque, 2009; National Research Council, 1999; Shepard et al., 1993; Vinovskis, 1998).

The committee queried NAGB about its rationale for adjusting the recommended cut scores for mathematics but not for reading, and NAGB provided further information in the form of excerpts from minutes for its quarterly board meetings held in 1992. These excerpts made it clear that there was substantial discussion involving many experts (technical, policy, and subject matter), but the committee does not believe that the excerpts capture the likely complexity of the full discussions and deliberations. Given that these events occurred more than 25 years ago, we doubt that it would be possible to fully characterize the deliberations behind the decisions. We are hesitant to make judgments about the rationale for decisions made long ago; at the same time, we acknowledge that some of these issues warranted further investigation.

CONCLUSION 4-1 The available documentation of the 1992 standard settings in reading and mathematics includes the types of reliability analyses called for in the Standards for Educational and Psychological Testing that were in place at the time and those that are currently in place. The evidence that resulted from these analyses, however, showed considerable variability among panelists’ cut-score judgments: the expected pattern of decreasing variability among panelists across the rounds was not consistently achieved, and panelists’ cut-score estimates were not consistent over different item formats and different levels of item difficulty. These issues were not resolved before achievement-level results were released to the public.
