Documentation of the process by which a standard setting is carried out is fundamental to establishing the validity of the resulting achievement levels. In the language of measurement, this documentation is referred to as “procedural evidence of validity.” Procedural evidence encompasses the selection and training of panelists, the selection of the method, the implementation of the method, and the panelists’ evaluation of the implementation. As noted in the Standards for Educational and Psychological Testing (hereafter referred to as Standards; American Educational Research Association et al., 2014), compilation of procedural evidence is a basic step in the standard setting process. Although procedural evidence cannot guarantee the validity of the resulting achievement levels, it can invalidate the results of the standard setting. As specified in the Standards (American Educational Research Association et al., 2014, p. 107, Standard 5.21): “When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly.”
In this chapter, we review the available documentation on the 1992 standard settings. We consider this information in relation to what was known at the time about best practices, as well as what is known now. Drawing on the advice of Hambleton et al. (2012) on the steps that should be followed in any standard setting, this chapter is organized in four major sections: method selection, panelist selection, achievement-level descriptors (ALDs), and method implementation.
In the 1980s, there were five commonly used methods for setting cut scores: Angoff, Ebel, Nedelsky, borderline groups, and contrasting groups methods. Other methods were available but were less frequently used (see Jaeger, 1989, pp. 498-499, Table 14.1). According to the available documentation (ACT, Inc., 1993c; Loomis and Bourque, 2001), the National Assessment Governing Board (NAGB) consulted with technical experts and measurement specialists and reviewed the available research before deciding on the Angoff method. NAGB’s sources included Berk (1986); Colton and Hecht (1981); Cross et al. (1984); Kane (1993); Klein (1984); Livingston and Zieky (1982); Meskauskas (1986); Mills and Melican (1988); and Smith and Smith (1988). On the basis of these consultations and research reviews, NAGB judged that the Angoff method had several desirable attributes, which influenced its choice (ACT, Inc., 1993c, p. 2-2):
- It is straightforward and easily implemented.
- It is flexible and adaptable to many test item formats and decision-making contexts.
- It can provide stable, consistent results. It yields relatively small standard errors for the passing scores, and because reliability is a necessary condition for validity, the sizes of the standard errors are relevant to questions of validity.
- It allows panelists to make use of more information relevant to item difficulty.
- The Angoff procedure seems to be the most popular method by far among those actually setting standards.
NAGB was also persuaded by Berk (1986, p. 147), who advised “the Angoff method appears to offer the best balance between technical adequacy and practicability” and by Meskauskas (1986, p. 199), who concluded, “the present method of choice for standard setting is the Angoff method.” In addition, the test development contractor that would be conducting the standard setting, ACT, was experienced in the method and indicated it would have been the first choice (ACT, Inc., 1993c, p. 2-2).
As discussed in Chapter 2, the choice of the Angoff method provoked considerable controversy even though it was then and is still widely used. Since 1992, a large number of cut-score setting methods have been developed, and research shows that the choice of method has an effect on the resulting performance standards. There are no firm decision rules to guide the choice of method, and measurement experts disagree about the strengths and weaknesses of the various methods. The guidance currently given is to select a method that is appropriate for the assessment and the
ways the results will be used, adjust it as needed for the situation, and carefully implement it following accepted practices.1
From the perspective of procedural evidence of validity, it is important that a representative, well-qualified group of panelists be recruited for a standard setting. Panelists should have a working knowledge of the content area for which cut scores are being recommended, as well as knowledge about the test takers who will be affected by the cut scores (Livingston and Zieky, 1982). Hambleton and Powell (1983) laid out a series of questions to guide the selection of panelists, including questions related to the desired demographic profile of the standard setting panel, the inclusion of certain constituencies, how many panelists to select, and how to select panelists. These questions remain a primary concern whenever a standard setting method is implemented. As Hambleton and Pitoniak noted (2006, p. 436): “The defensibility of the performance standard will ultimately depend on many factors including the acceptability of the composition of the standard setting panel.”
For NAEP, selection of an appropriately representative group of panel members was particularly critical, given that NAEP performance standards were to represent the judgments of educators across the country. NAGB specified the composition of the standard setting panels: 55 percent teachers, 15 percent nonteacher educators, and 30 percent general public representatives. In addition, NAGB specified that panelists should reflect a balance of gender, race and ethnicity, and geographic location.
NAGB, in collaboration with ACT, developed and implemented a process that involved multiple stages of selection and feedback. Initially, they laid out an overall plan for identifying and selecting panelists and distributed it to individuals, groups, and organizations likely to be interested in the process and to have a stake in the outcomes. In January 1992, a series of meetings was held to discuss these plans with stakeholders and interested groups: 17 groups sent representatives to those meetings (see ACT, Inc., 1993c, p. 2-6).
At the time, there were no existing lists from which a probability sample of panelists could be drawn so as to represent the population of eligible panelists in the country—which is still the case today. Thus, the plan focused on identifying a representative set of individuals to serve as “nominators” of standard setting panelists. The selection process involved three main steps: (1) identifying a representative sample
of school districts; (2) contacting individuals in certain positions in those school districts and inviting them to serve as nominators of panelists; and (3) selecting panelists.
Selecting School Districts
School districts served as the basic unit of sampling. Three stratified random samples of districts were drawn, one for each NAEP content area (reading, mathematics, and writing). In selecting the district samples, the plan stratified on geographic region, type of institution (public, private), type of community (socioeconomic status of the residents), and student enrollment size. Each sample included 40 districts for nominators of teachers, 80 districts for nominators of general public representatives, and 40 districts for nominators of nonteacher educators. In addition, a sample of 15 private schools was drawn, and the principals or heads of these schools were contacted to provide nominations for teacher panelists.
The goal for identifying nominators was to identify a group of individuals judged to be qualified to do the task, but to do it in a way that was broadly representative across the country. A large and diverse set of nominators was contacted.
For the nominators of teachers, the school district superintendent and the head of the bargaining or largest teacher organization (or both) were consulted. They were asked to submit names and contact information for potential nominators in their respective districts.
For the nominators of nonteacher educators, three sources were consulted: (1) nonclassroom personnel classified as career educators (e.g., curriculum specialists, counselors, principals) in each district in the sample; (2) from a sample of universities and colleges, the deans of education, liberal arts, or humanities; and (3) for each state in the sample, a state-level education officer (e.g., commissioner, assessment director, or curriculum director).
For mathematics, the plan resulted in the identification of 424 individuals who were asked to be nominators of panelists for the achievement-level setting process: 100 to serve as nominators of teachers, 180 to serve as nominators of nonteacher educators, and 144 to serve as nominators of representatives of the general public. For reading, a total of 353 nominators were identified: 117 to serve as nominators of teachers, 103 to serve
as nominators of nonteacher educators, and 133 to serve as nominators of representatives of the general public.
Once the list of nominators was identified, it was sent to key stakeholder organizations for the subject areas of reading and mathematics, the International Reading Association and the National Council of Teachers of Mathematics, respectively. The organizations were asked to provide feedback on the list of nominators. They could also contact nominators on the list and encourage them to submit nominations for panelists, and they could lobby the nominators to nominate specific candidates.
Each nominator was asked to submit the names of up to four individuals for each grade level. They were provided with instructions and general criteria for the individuals in each category:
- Teachers were required to have at least 5 years of overall teaching experience and a minimum of 2 years in the specific subject area (mathematics, reading, or writing) at the specified grade level. Nominees were to be persons judged to be “outstanding” in their professional performance by someone in the position to make that judgment.
- Nonteacher nominees were required to have familiarity with and professional experience in the subject matter and grade level of the assessment to which they were to be nominated. Nominees were also required to be judged “outstanding” in their professional performance by their nominator, and the nominator was asked to indicate why that designation was warranted. All nominators in this category were encouraged to volunteer (nominate themselves).
- Nominees from the general public were required to have familiarity with the subject matter at the specific grade level to which they were nominated to serve as a panelist. They could not be current or former educators (to avoid overlap with the two educator categories). Nominators in this category were also encouraged to nominate themselves.
The goal was to create six panels, one for each combination of subject area (reading, mathematics) and grade level (4, 8, and 12). Each panel was to have 20 primary members and 2 backup members. For the 20 primary members, 11 were to be teachers, 3 nonteachers, and 6 from the general public.
The final selection of panelists from the pool of nominees was intended to maximize the balance of gender and race and ethnicity and to ensure representation by geographic region, school affiliation, type of community, and enrollment size. The final selections were made by ACT and NAGB.

Table 3-1 shows the characteristics of the teacher and nonteacher educator nominees and panelists for the reading and mathematics standard settings.

TABLE 3-1 Descriptive Data for Standard-Setting Nominees, Panelists, and Nominators (in percentage)

Race and Ethnicity

Region of Nominator

Community Type of Nominator
|Low socioeconomic status|—|22.10|15.30|24.20|
|Not low socioeconomic status|—|63.20|57.70|67.70|

District Size of Nominator
|More than 50,000|—|41.20|30.90|33.90|
|Less than 50,000|—|44.10|42.10|58.10|
|No data/not applicable|—|14.70|27.00|8.00|
General Approaches to Developing Descriptors
ALDs define the knowledge, skills, and abilities of students at specific levels of achievement. It is now widely recognized that the descriptors are an essential part of the standard setting process. They are viewed as key to producing valid performance standards and for communicating the meaning of the performance standards (Bourque and Boyd, 2000; Egan et al., 2012; Hambleton et al., 2012; Huff and Plake, 2010; Perie, 2008; Plake et al., 2010).
In 1992, little guidance existed with regard to the development and use of achievement-level descriptions for standard setting, and they were rarely used during the actual process of setting cut scores (Bourque, 2000, cited in Egan et al., 2012). NAEP’s 1992 standard setting represented the first time that formal, written ALDs were produced to guide standard setting panelists (Bourque and Boyd, 2000, cited in Egan et al., 2012). Prior to 1992, standard setting panels were not provided with formal descriptors, nor did they create them. During the 1980s, panelists did spend time discussing the concept of minimally competent candidates, but they did not put these definitions in writing (see, e.g., Norcini and Shea, 1992; Norcini et al., 1987, 1988). After the 1992 standard setting, researchers began to formalize the process of writing achievement-level descriptions. Some researchers have reported that since panelists began using written achievement-level descriptions, the variability of their ratings has decreased (Mills et al., 1991; Mills and Jaeger, 1998; Plake et al., 1994).
As described below, NAEP developed multiple versions of the achievement-level descriptions. Initially, NAEP used policy descriptors followed by detailed operationalized descriptions. This approach was new at the time, but the value of having different versions of the achievement-level descriptions is now recognized. It is common practice to develop different versions of the descriptors to be used for different purposes. For example, Egan et al. (2012) distinguished among four types of descriptors:
- Policy descriptors are at a high level and are used to guide test development and conceptualization.
- Range descriptors define the content range and limits and are used to guide item writers.
- Target descriptors define performance at the lower end of each achievement range and are used for standard setting.
- Reporting descriptors describe the knowledge, skills, and abilities defined by the cut scores.
NAGB formulated the policy descriptors before the standard setting was done. The standard setting panelists drafted more detailed versions as part of the standard setting process. A series of other steps was also carried out to refine them for reporting (see Chapters 4 and 5).
In 1998, Mills and Jaeger (cited in Hambleton and Pitoniak, 2006, p. 453) outlined eight steps for developing ALDs (paraphrased below):
- Convene and orient a panel (much like a standard setting panel, and sometimes the standard setting panelists are involved in this process).
- Review the content specifications for the test and the specific content strands that serve as the basis for organizing the specifications. In the context of NAEP, this refers to the subject-area framework.
- Train the panelists in test content and scoring methods. Performance category descriptions need to be broad enough to reflect the proposed test content but not so broad that they go beyond it.
- Match test content to the content specifications.
- Present the policy descriptions of the performance categories. The task for the panel is to develop narrative descriptions of these performance categories in terms of the test content.
- Familiarize panel members with samples of candidate [test takers’] performance. At this step, panelists begin to look at candidate responses to the test items and link them if they can to the general descriptions of the performance categories.
- Begin to draft the performance category descriptions. At this point, individual or small groups of panelists begin writing out descriptions with specific content knowledge and skills that should be expected of candidates who are placed in each performance category.
- Develop consensus among panel members. At this step, all panelists come back together and try to reach consensus about the descriptions.
These steps are still appropriate, and we discuss the 1992 process in relation to these guidelines.
Pilot Test of Procedures
Before the actual standard setting took place, ACT and NAGB conducted a pilot study (held in St. Louis, Missouri, in February 1992) to test all aspects of the design and implementation of the process of setting the achievement levels and to identify any aspects requiring adjustment, elimination, or addition. Because it was the process that was being tested, the nominators used in the pilot study were not identified from a sample of districts. All other aspects of the design were implemented as outlined in the “Design Document for Setting Achievement Levels on the 1992 National Assessment of Educational Progress in Mathematics, Reading, and Writing” (ACT, Inc., 1993a, 1993b).
People in each of the designated “nominator” positions described above (see ACT, Inc., 1993c, App. C) were contacted in each of the school districts in St. Louis County. They nominated the panelists for the pilot study: 58 percent were teachers, 29 percent were nonteacher educators, and 13 percent were representatives of the general public; 39 percent were nonwhite; and 71 percent were female.
Documentation indicates that several changes were made as a result of the pilot study. One change, for example, was to increase the amount of time for developing achievement-level descriptions. Arriving at a mutual understanding of what each achievement level “meant” was critical to the success of the process. The panelists needed more time than had been planned to work on developing the achievement-level descriptions in order to feel comfortable with using them in rating items and setting achievement levels. Other technical changes were also made to the process, such as revising the format of presenting intrajudge consistency data to panelists and reversing the sequence of presenting interjudge and intrajudge data to panelists.
All of the standard setting panelists participated in the process of developing operational achievement-level descriptions. The process began by familiarizing panelists with the policy-based descriptors (see Box 2-1, in Chapter 2) along with samples of NAEP test questions. Panelists listened to a presentation intended to help them understand the difference between policy descriptors and operational descriptors. The presentation included an overview of the NAEP framework and a discussion of factors that influence item difficulty, including item type. Representatives from the major organizations for each discipline assisted in developing this overview: for mathematics, the National Council of Teachers of Mathematics and the Mathematical Sciences Education Board; for reading, the National Council of Teachers of English and the International Reading Association.
The presentation was designed to focus panelists’ attention on the framework, the test questions, and the scoring protocols for a given content area. Panelists were told that their judgments should reflect the content area as conveyed through the framework, not the entire domain of mathematics or reading. That is, panelists were told that they were to set achievement levels appropriate to the view of the discipline represented by the framework. The framework was taken to be the guide or template for all assessments administered under it for a particular content area.
All panelists completed an appropriate grade-level form of the test and compared their answers to the scoring guides. The purpose of this exercise was to familiarize panelists with the test content and scoring protocols, as well as to refresh their memories of test taking under time constraints.
Working in small groups of five or six, separated by content area and grade level, panelists generated a list of descriptors that reflected what they thought student performance should be at each achievement level, using the NAEP framework and their experience in taking the test. There were four groups per grade and each produced a list of content-based descriptors.
The lists were compiled across groups and distributed to panelists. Panelists were asked to identify, individually, five or six descriptors that best described what Basic, Proficient, and Advanced students at their grade level should be able to do. The grade-level descriptors for each achievement level chosen by a majority of panelists were compiled into a list, with 6-10 descriptors for each achievement level.
Panelists were then asked to identify how each descriptor fit in the NAEP framework. Descriptors that did not fit in the framework were eliminated. The lists of grade-level descriptors that remained were then discussed by the grade-level groups, and suggestions were made for modifying wording. In addition, descriptors viewed as important by panelists and that fit within the framework were added. The grade-level groups then reached general agreement that the final lists of descriptors represented what students should be able to do at each of the achievement levels.
During this session, panelists also individually reviewed the half of the item pool they would not be rating later and selected items they judged to be representative of Basic, Proficient, and Advanced performance. The purpose of this activity was to further familiarize the panelists
with the item pool and to help them internalize the relationship between their descriptors and items in the 1992 version of NAEP.2
ACT’s content experts then reviewed the lists of descriptors for consistency with the framework, for consistency and logical progression within and across grade levels, and for editorial quality, making changes as they deemed necessary. Panelists then discussed the final lists of descriptors, as amended by the content experts, and reached general agreement.
The major purpose for having panelists develop their own set of grade-specific content-based descriptions of Basic, Proficient, and Advanced was to ensure that, to the extent possible, all panelists would have both a common set of content-based referents to use during the item-rating process and a common understanding of borderline performance for each of the three achievement levels at the specified grade levels.
There are various ways to implement any standard setting method, and many decisions to make about the procedures. At the time of the 1992 standard settings, the following practices were recommended:
- An iterative process with multiple rounds of judgments (Shepard, 1976; Jaeger, 1982, as cited in Hambleton and Powell, 1983)
- Group discussion subsequent to each round (Berk, 1986)
- Response to the test questions or review of the test content (Livingston and Zieky, 1982)
- Discussion of borderline students—those who meet the minimal requirements to be placed in a specific performance category (Livingston and Zieky, 1982)
- Presentation of impact data—the percentage of students that would score at each level for a given set of cut scores (Hambleton and Powell, 1983; Livingston and Zieky, 1982)
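The impact-data computation named in the last practice above is straightforward to sketch. The following Python snippet is purely illustrative: the scores and cut points are made up, and the convention that a student scoring exactly at a cut point falls in the higher level follows the borderline classification described later in this chapter.

```python
from bisect import bisect_right

def impact_data(scores, cut_scores):
    """Percentage of students falling in each achievement level.

    cut_scores are the cut points (e.g., Basic, Proficient, Advanced)
    in ascending order; a student scoring at or above a cut point is
    classified into the higher level. Returns one percentage per level,
    from lowest (below the first cut) to highest.
    """
    cuts = sorted(cut_scores)
    counts = [0] * (len(cuts) + 1)
    for score in scores:
        # bisect_right counts the cut points at or below this score,
        # which is exactly the index of the student's level
        counts[bisect_right(cuts, score)] += 1
    return [100.0 * c / len(scores) for c in counts]

# Hypothetical scores and cut points, purely for illustration
print(impact_data([150, 200, 210, 250, 260, 300, 310, 100],
                  [200, 250, 300]))  # -> [25.0, 25.0, 25.0, 25.0]
```

Presenting such percentages to panelists lets them see the practical consequences of a candidate set of cut scores before finalizing their judgments.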
More recent guidance encourages the use of multiple subpanels (formed from the single panel) in order to estimate the generalizability of the recommended cut scores by providing the means for computing the standard error (Hambleton and Pitoniak, 2006; Kane, 2001). As Kane (2001, p. 71) points out, “the advantage of this design is that it provides us with a direct indication of how large the difference can be from one [standard setting] to another using the same general design.”

2 For each grade group, half the panelists rated half of the item pool, and the other half of the panelists rated the remaining items. The two halves of the item pool were balanced with respect to item difficulty, number of items, number of extended-response items, number of calculator blocks, and other characteristics.
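The split-panel logic can be sketched in a few lines. This is a hypothetical illustration, not a computation used in 1992: each subpanel produces its own cut score, and the standard error of the overall (mean) cut score is estimated from the variability across subpanels.

```python
import statistics

def subpanel_standard_error(subpanel_cuts):
    """Standard error of the mean cut score, estimated from the cut
    scores produced by independent subpanels of a single panel."""
    sd = statistics.stdev(subpanel_cuts)  # spread across subpanels
    return sd / len(subpanel_cuts) ** 0.5

# Four hypothetical subpanel cut scores on an arbitrary scale
print(round(subpanel_standard_error([250, 254, 252, 248]), 3))  # -> 1.291
```

The smaller this standard error, the more confident one can be that an independent replication of the standard setting would yield a similar cut score.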
By 1992, it was recognized that “not all standard setting methods can be used with every item format” (Hambleton and Powell, 1983, p. 7). The 1992 reading and mathematics NAEP assessments included several types of formats: multiple choice, short-answer constructed response, and extended-constructed response. As discussed above, NAGB used the Angoff procedure for the multiple-choice and short-answer constructed response questions (the ones that can be scored dichotomously—right or wrong). For the extended-constructed response questions, which are scored polytomously (i.e., with more than two gradations—correct, partially correct, incorrect), a procedure called the boundary exemplars method was used.
The concept of borderline performance is an integral part of the Angoff methodology.3 Panelists were instructed to envision 100 students whose performance was on the borderline for each performance level (Basic, Proficient, and Advanced). For the multiple-choice and short-answer items, panelists were asked to make a judgment as to how many of those 100 students, at each borderline achievement level, would answer the item correctly.
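In raw-score terms, the arithmetic behind these judgments reduces to summing each panelist's judged proportions and averaging across panelists. The sketch below is a generic illustration with made-up ratings; it is not NAEP's actual procedure, which mapped panelists' judgments onto the NAEP score scale rather than computing a raw-score cut.

```python
def angoff_cut_score(ratings):
    """Raw-score Angoff cut for one achievement level.

    ratings[p][i] is panelist p's judgment of how many of 100
    borderline students would answer item i correctly. Each panelist's
    expected raw score for a borderline student is the sum of those
    proportions; the cut is the mean over panelists.
    """
    per_panelist = [sum(r) / 100.0 for r in ratings]
    return sum(per_panelist) / len(per_panelist)

# Two hypothetical panelists rating two items at one level
print(angoff_cut_score([[60, 80], [40, 60]]))  # -> 1.2
```

Repeating the computation with the ratings for each achievement level yields the three cut scores (Basic, Proficient, and Advanced).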
For extended-response items, panelists were asked to review 20 to 25 actual student responses for mathematics (ACT, Inc., 1993a) and 24 responses for reading (ACT, Inc., 1993d) and select three papers, one for each achievement level, to typify student performance at the borderline of that level. Panelists participated in a practice item-rating session using items from the 1992 NAEP.
During any standard setting process, panelists should receive extensive training on the context for the standard setting and the tasks of the standard setting. In the case of NAEP, training is of particular importance because of the broad range of stakeholders who participate (Loomis, 2012). The importance of training to the standard setting process was well known in 1992 (see Berk, 1986; Hambleton and Powell, 1983), and NAEP paid particular attention to the timing, quantity, and frequency of training (Loomis, 2012). The training involved multiple elements, including the purpose of the standard setting; an overview of the process; the test; the NAEP framework; the scoring protocols; ALDs; and the rating task (ACT, Inc., 1993c). Extensive training is important to creating a transparent process for panelists.

3 Borderline refers to the “cut point” or minimal competency point separating any two achievement levels. For example, the borderline between Basic and Proficient is the point on the NAEP scale, or that level of performance as described in panelists’ descriptions, that separates Basic from Proficient student performance. All students scoring at or above the borderline would be classified as Proficient; all students below the borderline would be classified as Basic.
According to the available documentation, the strategy involved an iterative process in which whole-group general training sessions were held to ensure that every panelist received the same instructions presented in the same manner. The general sessions lasted 1-2 hours and included visual aids and practice examples. Lectures, visual aids, question-and-answer sessions, and practice were all used to provide panelists with the necessary instructions before they began the actual process of setting achievement levels.
Grade-level groups were led by experienced facilitators who were trained in the process and who had spent many hours of preparation on how best to implement the process. Procedures that had already been implemented were reviewed each day in general sessions and again in grade-level sessions. Training in new procedures was first presented, along with examples, in the general sessions; the procedures were then reviewed and discussed in grade-level groups.
The first session was to solidify the panelists’ understanding of the goals and the process to be followed to achieve those goals. During the first training session, the policy definitions and their role in the process were discussed in detail. The relative nature of the achievement levels was discussed, as well as the assumption that the knowledge and skills being discussed are cumulative across levels. Panelists were shown a graph plotting achievement levels and demonstrating the logical progression across achievement levels within each grade and from grade to grade within each achievement level.
Panelists next received approximately 2 hours of training in a modified Angoff item-rating process. Because the emphasis during development of ALDs was on what student performance should be, panelists were instructed to use their “should-based” descriptions and other information presented to them to rate the NAEP reading items, using their best judgment of how students at the borderline of each achievement level would perform on the items.
Panelists were led through a practice item-rating session using items from the assessment they had completed earlier. Panelists were encouraged to ask questions during this training session so that misconceptions or uncertainties could be addressed before round 1 of the item-rating process.
The Item-Rating Process4
This section presents a brief overview of the item-rating process; extensive detail is available in the reports documenting the standard settings. For details on the item-rating process for mathematics, see ACT, Inc. (1993a, pp. 23-27, and associated appendixes); for reading, see ACT, Inc. (1993d, pp. 2-10 to 2-14, and associated appendixes).
The item-rating process consisted of three rounds. To prepare for the first round, panelists responded to the set of assessment items they had been given and used the scoring keys and protocols to review their answers and score themselves. During round 1, panelists provided ratings for all items for all three achievement levels. A rating is a panelist’s judgment about the percentage of test takers on the borderline of each achievement level likely to respond correctly to a given item. That is, for the set of 100 items, each panelist made 300 ratings—one for each item at each achievement level. At the end of round 1, panelists’ item ratings were entered into a computer database for analysis and calculation of descriptive statistics, such as each panelist’s mean rating for each item at each achievement level.
Before round 2, panelists received feedback on the ratings. For each panelist, the mean of her/his item ratings was calculated and compared with the mean for the entire group and with the means for each of the other panelists. This interpanelist consistency information is useful for identifying outliers, such as those with extremely high or low mean ratings. Reasons for extreme mean ratings, including the possibility that some panelists misinterpreted the item-rating task, were discussed. The documentation notes that no effort was made to coerce panelists to change their ratings.
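The interpanelist feedback step can be illustrated with a simple outlier screen. The two-standard-deviation rule and the data below are assumptions for illustration only; the 1992 documentation does not specify a numeric flagging rule, and flagged panelists were discussed rather than required to change their ratings.

```python
import statistics

def flag_outlier_panelists(mean_ratings, k=2.0):
    """Indices of panelists whose mean item rating at one achievement
    level lies more than k group standard deviations from the group
    mean -- candidates for discussion, not forced changes."""
    mu = statistics.mean(mean_ratings)
    sd = statistics.stdev(mean_ratings)
    return [p for p, m in enumerate(mean_ratings) if abs(m - mu) > k * sd]

# Hypothetical mean ratings for six panelists at one level
print(flag_outlier_panelists([50, 52, 48, 51, 49, 80]))  # -> [5]
```

A screen of this kind surfaces panelists who may have misinterpreted the rating task, which is exactly the possibility the round-2 discussions were meant to address.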
Panelists also received item difficulty data (based on students’ actual performance on items). This information was presented as the percentage of students who scored “correct” or “incorrect” for each multiple-choice and short-answer item, and as the percentage of students receiving scores of 1, 2, 3, or 4 for the extended-response items. Panelists were told that this item difficulty information should be used as a reality check. For items on which item ratings differed substantially from the item difficulty value, panelists were asked to reexamine the item to determine whether they had misinterpreted the item or misjudged its difficulty.
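The reality check amounts to comparing each judged borderline percentage with the observed proportion correct. The threshold and data below are illustrative assumptions; the documentation describes the comparison qualitatively, without a fixed numeric criterion.

```python
def reality_check(item_ratings, p_values, threshold=0.25):
    """Indices of items a panelist should reexamine: items whose judged
    borderline percentage correct (0-100) differs from the observed
    proportion correct by more than the threshold."""
    return [i for i, (r, p) in enumerate(zip(item_ratings, p_values))
            if abs(r / 100.0 - p) > threshold]

# One panelist's hypothetical ratings vs. observed difficulty
print(reality_check([90, 50, 30], [0.40, 0.55, 0.35]))  # -> [0]
```

Note that a discrepancy is not automatically an error: borderline students are not typical students, so panelists were asked only to reexamine flagged items, not to match the observed values.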
During round 2, panelists reviewed the same set of items they rated in round 1. Using the interpanelist consistency information, the item difficulty information, and the information provided prior to round 1, panelists reviewed their ratings and decided whether any adjustments were needed.
Panelists’ round-2 ratings were entered and analyzed, and intrapanelist variability was examined. For each panelist, intrapanelist variability information highlighted those items that he or she had rated differently from items having similar difficulty levels. Panelists were asked to review each of these items and decide whether their round-2 ratings still accurately reflected their best judgments of the items.
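One plausible way to operationalize the intrapanelist-variability flag is to group items of similar empirical difficulty and look for ratings that stray from their peers. The binning rule, tolerance, and data below are assumptions for this sketch; the 1992 analysis may have used a different statistic.

```python
import statistics

def flag_inconsistent(ratings, difficulty, bin_width=10, tolerance=15):
    """For one panelist, group items by similar difficulty and flag any
    item rated far from that panelist's mean rating for its group."""
    bins = {}
    for i, d in enumerate(difficulty):
        bins.setdefault(d // bin_width, []).append(i)
    flagged = []
    for items in bins.values():
        if len(items) < 2:          # a lone item has no peers to compare with
            continue
        mean_r = statistics.mean(ratings[i] for i in items)
        flagged += [i for i in items if abs(ratings[i] - mean_r) > tolerance]
    return sorted(flagged)

# Hypothetical data: items 0-2 have similar difficulty (~60% correct),
# but item 2 was rated much lower than its peers.
ratings = [72, 70, 35, 55]
difficulty = [61, 63, 60, 85]
print(flag_inconsistent(ratings, difficulty))  # flags item 2
```

As with the other feedback, a flag under this kind of rule would invite the panelist to reconsider the item, not compel a change.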
For round 3, panelists reviewed the same set of items they rated in rounds 1 and 2 using both the intrapanelist variability information and the information made available during rounds 1 and 2. Panelists were advised that these data were for their information and that changes in ratings should be made only if reconsideration of the item, in its entirety, indicated a need to change the rating. Panelists were instructed that they could discuss, within their small groups, ratings of specific items about which they were unsure.
Evaluation by panelists is considered an important piece of evidence in support of procedural validity. Panelists’ evaluations are typically positive, with panelists indicating strong support for the cut scores they have recommended. Negative evaluations, by contrast, can undermine the recommended cut scores: if the panelists themselves think the cut scores they set are too high or too low, it is difficult for others to have faith in those cut scores.
For NAEP, panelists were continually asked to evaluate the standard setting process, including round-by-round evaluations and a summative evaluation. Table 3-2 shows the questions they responded to for the summative evaluation and the distributions of their responses (most of which were on a scale that ranged from 1 to 5). Panelists’ ratings were generally positive, and the majority indicated that they had confidence in the resulting achievement levels. The majority of panelists said they would definitely or probably be willing to sign a statement recommending the use of the achievement levels (question 42) that resulted from the process. For mathematics, 56 percent said they would definitely sign, and 43 percent said they probably would sign; for reading, 65 percent said they would definitely sign, and 28 percent said they would probably sign.
TABLE 3-2 Panelists’ Overall Evaluations of the NAEP Standard Setting

| Question | Statement | Response Anchors |
|---|---|---|
| 27 | I believe that the objectives of this meeting to establish levels on the 1992 NAEP assessment have been successfully achieved. | Completely / Partially / Not at All |
| 28 | The instructions on what I was to do during the rating sessions were: | Absolutely Clear / Somewhat Clear / Not at All Clear |
| 29 | My level of understanding of the tasks I was to accomplish during the rating session was: | Totally Adequate / Marginally Adequate / Totally Inadequate |
| 30 | The amount of time I had to complete the ratings during the rating sessions was: | Far Too Long / About Right / Far Too Short |
| 31 | The amount of time I had to complete the tasks I was to accomplish was generally: | Far Too Long / About Right / Far Too Short |
| 32 | The most accurate description of my level of confidence in the achievement-levels ratings I provided was: | Totally Confident / Somewhat Confident / Not at All Confident |
| 33 | I would describe the effectiveness of the achievement-levels setting process as: | Highly Effective / Somewhat Effective / Not at All Effective |
| 34 | During some of the discussions, I felt a need to defend the ratings I had made. | To a Great Extent / To Some Extent / Not at All |
| 35 | During the round-2 ratings, I felt coerced to modify my ratings from the previous round. | To a Great Extent / To Some Extent / Not at All |
| 36 | During the round-3 ratings, I felt coerced to modify my ratings from the previous rounds. | To a Great Extent / To Some Extent / Not at All |
| 37 | I feel that this NAEP Achievement Levels Study provided me an opportunity to use my best judgment in selecting papers to set achievement levels for an NAEP assessment. | To a Great Extent / To Some Extent / Not at All |
| 38 | I feel that this NAEP Achievement Levels Study would produce achievement levels that would be defensible. | To a Great Extent / To Some Extent / Not at All |
| 39 | I feel that this NAEP Achievement Levels Study would produce achievement levels that would generally be considered as reasonable. | To a Great Extent / To Some Extent / Not at All |
| 40 | I feel that the panel that rated items for the NAEP achievement levels was representative. | To a Great Extent / To Some Extent / Not at All |
| 41 | I feel that the panel that rated items for the NAEP achievement levels was credible. | To a Great Extent / To Some Extent / Not at All |
| 42 | I would be willing to sign a statement (after reading it, of course) recommending use of the achievement levels that resulted from this achievement levels-setting activity. | Yes Definitely (1) / Yes Probably (2) / Probably Not (3) / Definitely Not (4) |

NOTE: For all questions except the last (42), the scale for panelists’ answers ranged from 1 to 5. For question 42, the scale ranged from 1 to 4.

In addition to the strong positive response from panelists about recommending the process, ACT, Inc. (1993c, pp. 9-10) highlighted several other panelists’ ratings:
- The instructions were generally clear (90% or more).
- Their understanding of tasks was quite adequate (95% or more).
- The amount of time to complete ratings was about right (more than 60%).
- The amount of time to complete tasks was about right (more than 75%).
- They had a high degree of confidence in the achievement-level ratings (Item 32, means around 4.0).
- The achievement-level setting process was more than somewhat effective (mean ratings were 4.1 for each process).
- A majority said they were given the opportunity to use their best judgment in setting achievement levels to a great extent (Item 37, see Table 3-2).
- The process would produce achievement levels that would generally be considered reasonable and defensible: the mean ratings for the reasonableness of the achievement levels (Item 39, see Table 3-2) were 4.3 for reading and 4.4 for mathematics; the mean ratings for the defensibility of the achievement levels (Item 38) were 4.2 for reading and 4.3 for mathematics.
ACT concluded that these panelist evaluations provided positive evidence of procedural validity, stating (ACT, Inc., 1993c, p. 10):
Although the evaluations of panelists cannot provide definitive evidence for the success of the process, they must be given serious consideration. Panelists are uniquely well qualified to determine the extent to which the process ‘worked.’ These panelists, although not unanimous in their evaluations, generally reported that ‘it worked.’
We note that, although the evaluations are by and large positive, not all panelists were comfortable with the process or outcomes. Specifically,
- Between 35 and 44 percent of panelists were not definitely willing to sign a statement recommending the use of the achievement levels resulting from the study.
- Between 17 and 35 percent of panelists felt coerced to modify their ratings to at least some extent.
- Between 15 and 17 percent of panelists felt the process was only somewhat or less than somewhat effective.
- Between 13 and 15 percent of panelists felt that the process produced defensible cut scores to only some extent or even less.
Moreover, while panelists’ evaluations are important, it is also important to remember that the panelists had just completed a multiday exercise that involved considerable teamwork and discussion. Over the course of such an activity, participants become invested in the process and its results. Their evaluations may reflect that investment, as well as their objective judgments.
NAEP’s achievement levels were intended to represent the subject matter and skills that the nation wants its students to know and be able to do. The use of three achievement levels (Basic, Proficient, Advanced) provided a mechanism for tracking progress on those benchmarks, with Proficient as a primary goal. The achievement levels were to be defined so that they measured performance on challenging subject matter. The standard setting needed to be carried out in a way that would support these intended inferences.
Our conclusions are based on our examination of the process for setting achievement levels and on comparing it with guidance from the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1985) and the research and knowledge base, both in 1992 and presently.
- The process for selecting standard setting panelists was extensive and, in our judgment, likely to have produced a set of panelists that represented a wide array of views and perspectives.
- In selecting a cut-score setting method, NAGB and ACT chose one method for the multiple-choice and short-answer questions and another for the extended-response questions. This was novel at the time; it is now widely recognized as a best practice.
- NAEP’s 1992 standard setting represented the first time that formal, written ALDs were produced to guide standard setting panelists. This was also novel at the time and is now widely recognized as a best practice.
CONCLUSION 3-1 The procedures used by the National Assessment Governing Board for setting the achievement levels in 1992 are well documented. The documentation includes the kinds of evidence called for in the Standards for Educational and Psychological Testing in place at the time and currently and was in line with the research and knowledge base at the time.