At its heart, standard setting is a mechanism for capturing stakeholders’ judgments and reflecting them in the metric of an assessment. Standard setting is often characterized as a blend of judgment, psychometrics, and practicality (Hambleton and Pitoniak, 2006). In the words of the Standards for Educational and Psychological Testing (hereafter referred to as Standards; American Educational Research Association et al., 2014, p. 101), cut scores embody value judgments as well as technical and empirical considerations. For NAEP, the standards are intended to convey societal expectations for student achievement: What does the country want its children to know and be able to do?
NAEP’s standard setting process was intended to define multiple, ordered categories of performance. The policy statement of the National Assessment Governing Board (NAGB) specified three categories: Basic, Proficient, and Advanced. Relying on input from a wide spectrum of perspectives, NAGB developed definitions of the knowledge and skills necessary to perform at each of these levels. It then convened a formal meeting to set standards for each NAEP subject area.1
NAGB’s first standard setting in 1990, conducted for the 4th-, 8th-, and 12th-grade mathematics assessments, sparked disagreement and debate. Those discussions stimulated refinements in the procedures for the next standard settings, in 1992. The 1992 standard settings, conducted for both reading and mathematics, generated additional concerns and disagreements, which have helped foster further research that has in turn moved the field forward. Today, almost 25 years later, experts still disagree about many aspects of standard setting, but the knowledge base and the research base have greatly expanded. Much more is known about standard setting, and it is now widely used in education for achievement testing. NAGB has subsequently conducted standard settings in other subjects. However, with the exception of 12th-grade mathematics, the original 1992 cut scores for reading and mathematics are still being used to report results today.
1 NAGB sought feedback from a wide range of individuals in setting the standards: educators, administrators, subject-matter specialists, policy makers, stakeholder groups, professional organizations, and representatives of the general public. The process is discussed in detail in Chapter 3.
In this chapter, we give an overview of the events that led to achievement-level reporting for NAEP and discuss the issues that it sparked. Most of these issues are captured in reports from evaluations of the process and in documents that responded to those evaluations. This chapter summarizes these issues and also discusses some of the advances in the field since 1992. Because the committee’s charge was limited to examining achievement levels for reading and mathematics, we focus on the methods and evaluations of the original 1992 standard setting in these subjects. The changes made after 1992 in 12th-grade mathematics are discussed in Chapter 7.
NAEP was first administered in 1969 as a way to report on the academic performance and progress of the nation’s students. Reflecting concerns about potential federal intrusion into the country’s decentralized education system, NAEP was initially defined “more in terms of what it would not do than what it would do” (Bourque, 1999, p. 215). In an effort to preclude inappropriate uses and to assure that the assessment was not a “backdoor” pathway to a national curriculum, certain restrictions were imposed on the assessment. Results could be reported only at the national level. By law, NAEP could not report data on individual students, on classrooms or grades, on schools or districts, or on states or jurisdictions (Bourque, 1999; Jones and Olkin, 2004; Vinovskis, 1998). Any efforts to impose the NAEP frameworks on states as a national curriculum were strictly prohibited (National Assessment Governing Board, 1990).
Initially, results were reported at the item level as the percentages of students responding correctly to individual questions. In the early 1980s, changes were made to the sampling design and the reporting design: sampling by grade was added and results were reported for groups, not
just items. Performance was reported as scale scores for all students and by selected subgroups.
Several events led to calls for establishing performance levels on NAEP. Drawing on evidence of score declines on the Scholastic Aptitude Test (SAT) and the poor performance of U.S. students on international assessments and on tests of higher-order reading and mathematics skills, the National Commission on Excellence in Education (1983), in the much-cited A Nation at Risk, declared that students were not receiving the type of education necessary to meet the demands of a technological society or to maintain the nation’s economic position internationally.
In 1984, the U.S. Department of Education released a “wall chart” that used scores on the SAT and ACT—the only available cross-state measures—to compare student performance across states. The intent was to “hold state and local school systems accountable for education” (Vinovskis, 1998, p. 12). In response, state-based groups, including the Council of Chief State School Officers (CCSSO) and the National Governors Association (NGA), called for better state comparability data. In addition, the NGA argued that the nation, as well as the states and school districts, needed better report cards about what students know and can do (National Governors Association, 1991).
In 1986, then-Secretary of Education William Bennett appointed the Alexander/James Study Group to examine NAEP and its role in addressing education policy questions (Bourque, 1999). The report, The Nation’s Report Card (Alexander and James, 1987), recommended that an independent policy board for NAEP be created to set targets for achievement. Specifically, the panel called for “identifying feasible achievement goals for each of the age and grade levels to be tested” (Bourque, 2009, p. 4).
The recommendations of the Alexander/James Study Group formed the basis for the renewal of NAEP as part of the reauthorization of the Elementary and Secondary Education Act in 1988 (P.L. 100-297). The law authorized a voluntary Trial State Assessment Program to enable the use of NAEP for cross-state comparisons.2 It also created NAGB, a bipartisan and broadly representative body, to set policy for NAEP, set achievement goals, and establish guidelines for reporting and disseminating NAEP results.
As NAGB began to address its charge, President G.H.W. Bush and the nation’s governors were meeting at the 1989 Education Summit in Charlottesville, Virginia, where they set six broad goals for education to be reached by the year 2000. Two of the goals pertained to student subject-matter achievement and had implications for assessments of educational progress:
2 The “trial” designation for the state assessments was removed in the Improving America’s Schools Act of 1994.
Goal #3: All students will leave grades 4, 8, and 12 having demonstrated competency over challenging subject matter . . . so they may be prepared for responsible citizenship, further learning, and productive employment in our nation’s modern economy.
Goal #5: By the year 2000, United States students will be first in the world in mathematics and science achievement.
The governors adopted the goals in early 1990 and shortly thereafter created the National Education Goals Panel (NEGP) to measure progress toward these goals.3 In 1994, Congress specified that “the purpose of [NAEP] is to provide a fair and accurate presentation of educational achievement in reading, writing, and the other subjects included in the third National Education Goal, regarding student achievement and citizenship” (P.L. 103-382, Sec. 411 (b)(1)).
The legislation creating NAGB gave limited guidance for setting achievement goals for NAEP. NAGB turned to the recommendations of the National Academy of Education’s review of the Alexander/James report (cited in Glaser, 1987, p. 58):
. . . to the maximal extent technically feasible, NAEP should use descriptive classifications as its principal reporting scheme in future assessments. For each content area, NAEP should articulate clear descriptions of performance levels, descriptions that might be analogous to such craft rankings as novice, journeyman, highly competent, and expert. Descriptions of this kind would be extremely useful to educators, parents, legislators, and an informed public.
In its 1990 policy statement, NAGB established three benchmarks—Basic, Proficient, and Advanced—which would be referred to as achievement levels. These levels, and the definition of performance for each level, are generic standards that are applied across NAEP assessments and grade levels.4 They are often called “policy definitions” or “policy standards.” In 1993, NAGB adopted a new policy statement in which the policy definitions were revised, in part to reflect issues raised during evaluations of the standard setting in 1990 and 1992. The two versions appear in Box 2-1 and Box 2-2. As can be seen, the revised version is considerably shorter, and the references to readiness for the next grade or for college and life were removed.
3 The six goals, along with two additional goals, were codified in federal law in the 1994 Educate America Act (P.L. 103-227).
4 NAGB revised the policy definitions in 1993 and again in 1995. The 1995 definitions are currently in use.
NAEP is not a static program. The use of achievement levels with NAEP has continued to evolve since the initial work. NAGB has now set standards for assessments in nine subject areas. A new mathematics framework was developed for the 2005 assessment, and the standards and score scale for 12th grade were reset. A new reading framework was developed for the 2009 assessment, and the achievement-level descriptors (ALDs) were revised, although the cut scores were not reset. At the same time, the mathematics framework for grade 12 was also adjusted and changes were made to the ALDs.
Beginning in 1996, NAGB began allowing test accommodations for English-language learners and students with disabilities. The list of approved accommodations has changed over time. Results are routinely reported for these two broad groups at the national, state, and district levels.5
5 For more information, see http://nces.ed.gov/nationsreportcard/about/inclusion.aspx [January 2016].
In 2002, the Trial Urban District Assessment (TUDA) was implemented as a multiyear study of the feasibility of a district-level NAEP. It was carried out in selected urban districts, which volunteered to participate, with federal support authorized under the No Child Left Behind Act. The first TUDA took place in conjunction with the 2002 state NAEP reading and writing assessments. It was implemented again in 2003 and every odd year since through 2015.6
Over the past few years, there has been widespread interest in ensuring that U.S. students are ready for college and work. Beginning in 2008, NAGB responded by undertaking research in this area and subsequently chose scale scores for the 12th-grade mathematics and reading assessments to estimate the percentage of students academically prepared for college. These scores are regarded as predictors of college performance. As such, they represent a move toward developing achievement levels (or benchmarks) that inform a given purpose. That is, they use an external criterion measure to give meaning to the achievement levels. As part of this work, the frameworks for 12th-grade mathematics and reading were revised to reflect indicators of college readiness.
NAEP is now transitioning some of the assessments from paper-and-pencil administration to computer-based administration. In 2009, a digital-based assessment with interactive tasks was used for science. More recently, a computerized writing assessment was administered to grades 8 and 12 in 2011 and will be given to grade 4 in 2017.
Standard setting has a long history in the assessment field, with some authors tracing it to a proposal by John Stuart Mill in the 19th century for a law requiring that every citizen be educated to a specific set of standards (Madaus, 1981, cited in Jaeger, 1989). Standard setting has long been used in the context of professional licensing and certification testing, where the focus is primarily on determining the cutoff score between “pass” and “fail.” In the context of education, standard setting developed as part of the criterion-referenced testing movement through the 1960s and 1970s (see below).
There are several detailed accounts of the evolution of standard setting, including a set of eight papers in a special issue of the Journal of Educational Measurement in 1978, a set of papers in a special issue of Educational Measurement: Issues and Practice in 1991, Jaeger’s chapter in the 1989 edition of Educational Measurement, Zieky’s paper from the 1995 joint conference on standard setting (see Crocker and Zieky, 1995a, 1995b), and Zieky’s chapters in the two volumes edited by Cizek (2001, 2012). We drew from these and other sources for the brief summary in this section on the evolution of standard setting in the context of achievement testing for kindergarten through high school (K-12).
Our brief summary does not in any way do justice to the extensive literature on the subject; we highlight certain developments as a foundation for understanding the committee’s discussion in later chapters. It is important to stress that the various histories make it clear that there have been and are differences of opinion about standard setting.
As noted above, the concept of a performance standard arose in the criterion-referenced testing movement. This movement reflects a desire to establish absolute standards for student achievement; the approach gives meaning to a test result by comparing it with a defined assessment domain. For NAEP, the reference criterion is the framework. This approach stands in contrast to the more traditional norm-referenced approach, in which a student’s score is interpreted in a relative sense, by comparing it with the scores of other students.
Glass (1978) summarized the early history of criterion-referenced testing. He noted that the first known use of the term criterion-referenced test was made by Glaser and Klaus in 1962 in a paper on assessing human performance. Subsequent work led to the idea that test scores should be informative about behavior rather than merely about relative standing on a dimension assumed to lie behind a test score. This is the primary distinction between criterion-referenced and norm-referenced testing. Criterion-referenced tests came to be understood as “tests that relate performance to absolute standards rather than the performance of others” (Shepard, 1997, p. 3).
Gradually, the notion of a single cutoff point that distinguishes competence from incompetence or mastery from nonmastery evolved into a notion of a continuum of knowledge acquisition, ranging from no proficiency at all to perfect performance. The degree of competence along the continuum is what is assessed, with the objective of understanding where a given student’s performance is on that continuum. Early writings made little or no explicit mention of performance standards per se, but the idea of establishing one or more points on the continuum was implicit in these discussions. The desire for finding ways to describe behavior along the continuum grew. To do so, key points on the continuum would need to be identified and described in behavioral terms. Gradually, there was increased focus on how to establish and describe these points.
The articles in the 1978 issue of the Journal of Educational Measurement focus more on whether to set a cut score or performance standard than on how to do it. At issue was the notion of subjectivity. For instance, Glass starts and ends his paper with cautions about the arbitrariness of setting standards. At the end, he writes (Glass, 1978, p. 258):
[E]very attempt to derive a criterion score is either blatantly arbitrary or derives from a set of arbitrary premises. Arbitrariness is no bogeyman, and one ought not to shrink from necessary decisions because they may be arbitrary. However, arbitrary decisions often entail substantial risks of disruption and dislocation. Less arbitrariness is safer.
The judgment-based aspect of standard setting has remained a constant theme in the field and often the subject of intense debate. There is wide acceptance that the process is inherently judgmental: the debate is about the various choices that need to be made and the effects of those choices. Jaeger (1989, p. 403) notes that one of the primary choices is that of which method to use and how to implement it: “[A]ll standard setting requires judgment. Only the foci of judgment and the procedures used to elicit judgments differ across [methods].”
In the same chapter, Jaeger compared the results of using different standard setting methods with the same test under similar conditions (see Table 14.1 in Jaeger, 1989). He concluded that there is little consistency in the results of using different standard setting methods under seemingly identical conditions, and there is even less consistency in the comparability of methods across settings (Jaeger, 1989, p. 500).
There has been a great deal of work in developing and researching standard setting methods, and a multitude of methods are available (see Cizek, 2012; Glass, 1978; Berk, 1986; Jaeger, 1989). Over the past 10 to 15 years, much of the work has been guidance on best practices: how to implement standard setting in structured ways that are likely to yield standards considered to be reliable and valid. Possibly as a result of these debates, the field gradually acknowledged that there was no “true” standard. The process relies on judgments—and judgments may differ from one setting to the next—and none is more “right” than any other (if the procedure is properly implemented).
Minimum Competency Testing
Some of the first uses of standard setting in K-12 education were to set graduation requirements, sometimes known as minimum competency testing. In the late 1970s and early 1980s, many states implemented graduation tests and used standard setting to establish pass/fail cutoff scores. Passing a graduation test was required to receive a high school diploma and was intended to document that graduates were competent in “basic skills.” At the peak of the requirement, 37 states had implemented graduation tests. Such tests did not last, in part because of concerns that the “minimum” was becoming the “maximum”: that is, students needed only to cross the pass/fail threshold to graduate, and little or no attention was given to gradations in performance above the passing score.
Although the appetite for minimum competency testing dissipated, it had given states experience with standard setting methods to determine graduation requirements. The idea of setting standards for K-12 assessments had caught on. “The nation witnessed an unprecedented level of effort at the national, state, and local levels to set more rigorous academic standards and design more challenging assessments” (National Education Goals Panel, 1999, p. 3).
Given the potential value to the nation of setting achievement levels for NAEP and the importance of “getting it right,” the procedures and
results have received considerable scrutiny. The committee reviewed six evaluation reports that focused on standard setting for mathematics in 1990 and for mathematics and reading in 1992: Stufflebeam et al. (1991), Linn et al. (1992a), and Koretz and Deibert (1995/1996) focused on 1990 mathematics; the U.S. General Accounting Office (1992) focused on 1992 mathematics; and a panel of the National Academy of Education (Shepard et al., 1993) focused on 1992 mathematics and reading.
The first evaluation—of the 1990 mathematics assessment—highlighted a number of serious concerns, and ultimately, this standard setting was redone (see Linn et al., 1992a; Stufflebeam et al., 1991; Koretz and Deibert, 1995/1996; Burstein et al., 1993; Bourque and Byrd, 2000). Evaluations of the 1992 standard settings also identified numerous concerns (U.S. General Accounting Office, 1992; National Academy of Education, 1993a, 1993b), and the evaluators recommended that the achievement levels not be used for reporting the 1992 results: see Shepard et al. (1993, p. xxv, recommendation #2). In addition, a report of the U.S. General Accounting Office (1992, p. 2) noted:
NAGB improved its standard setting procedures substantially in 1992, but the critical issue of validity of interpretation—an issue in NAGB’s approach—remains unresolved. GAO therefore concluded that NAGB’s approach is unsuited for NAEP.
Among the concerns expressed by the National Academy of Education Panel were the following (Shepard et al., 1993, pp. 76-78):
- In reading, the initial achievement level descriptors and item judgments were inappropriately influenced more by panelists’ personal experiences and opinions than by the Reading framework.
- The process for developing the descriptions in reading and mathematics was inadequate because it did not ensure that final descriptions were agreed upon before attempting to set cut scores.
- The 1992 cut scores set in reading and mathematics are indefensible because of large internal inconsistencies.
- The Angoff procedure (Angoff, 1984) is fundamentally flawed because it depends on cognitive judgments that are virtually impossible to make.
- The process used by NAGB did not facilitate the development of consensus, either in developing descriptions or in setting cut scores.
- The decision to adjust cut scores in mathematics by “one standard error” was misleading because doing so implies statistical precision in the cut scores that is unwarranted.
The report’s conclusions were not universally accepted in 1993. The report stimulated discussion, debate, and research that continue today.
For instance, shortly after these evaluations, in October 1994, NAGB and the National Center for Education Statistics (NCES) jointly convened a 3-day conference on standard setting. The primary purpose of the conference was “to provide a forum to address technical and policy issues relevant to setting standards for large-scale educational assessments at the national, state, and local levels” (Crocker and Zieky, 1995a, p. 6). The proceedings from the conference included 19 commissioned papers addressing historical, theoretical, methodological, application, and policy issues. Three years later, NAGB sponsored another workshop on standard setting, with seven commissioned papers (Bourque, 1998). Another standard setting workshop was convened by the National Research Council in 1997 as part of its evaluation of NAEP (see National Research Council, 1999), which resulted in a special issue of Applied Measurement in Education (1998). Standard setting has also been the topic of a multitude of empirical studies that have compared the results for different methods, procedures, instructions to panelists, feedback to panelists, numbers of rounds, numbers of panelists, and other variables. Examples include
- studies of specific standard setting methods (Buckendahl et al., 2002; Giraud et al., 2000; Hambleton et al., 2012; Impara and Plake, 1997; Lewis and Haug, 2005; Mills et al., 2000; Norcini, 2003; Schulz et al., 2005; Subkoviak et al., 2002).
- studies of the panelists (judges) who participate in the standard setting (Clauser et al., 2009; Ferdous and Plake, 2005; Girard et al., 2005; Impara and Plake, 1998; Jaeger, 1991; Plake and Impara, 2001; Plake et al., 1991; Skorupski and Hambleton, 2005).
- considerations in adopting cut scores (Geisinger and McCormick, 2010; Giraud et al., 2000).
- strategies for developing achievement-level descriptors (Egan et al., 2012; Hambleton and Slater, 1995; Plake et al., 2010; Zenisky et al., 2009).
- procedures for validating achievement levels (Kane, 1994, 2001).
There are now numerous guides on how to conduct a standard setting and how to implement a specific method. The documents listed below are conference proceedings, compendia of studies, and edited volumes that have been published since 1992 on this topic:
- Bourque and Byrd (2000), Student Performance Standards on NAEP: Affirmations and Improvements, edited volume
- Cizek (2001), Setting Performance Standards: Concepts, Methods, and Perspectives, edited volume
- Cizek (2012), Setting Performance Standards: Foundations, Methods, and Innovations, edited volume
In addition to compilations, many academic researchers and standard setting practitioners have continued to flesh out best practices in the field, such as Bourque (2009); Hambleton and Pitoniak (2006); Hambleton et al. (2012); Kane (2001); Mehrens (1995); Cizek et al. (2004); Hansche (1998); Hambleton et al. (2000b); and Zieky et al. (2008).
Using achievement levels to summarize assessment results for large-scale educational assessments has come to be routine: the results of nearly all assessment programs administered in K-12 education are reported using achievement levels. Consequently, there is now a large body of research on standard setting procedures and a large number of researchers and practitioners well versed in these procedures.
It is now widely accepted that standard setting is a judgment-based activity, and, as such, there are no right or wrong standards or methods for setting them, except for the need to select a method that is appropriate for the type of questions and response format. In the 1992 standard setting, NAEP used two different methods. The modified Angoff method was used for the multiple-choice questions and short open-ended questions that could be scored dichotomously (right or wrong).7 The Angoff method does not work with questions that are scored polytomously (e.g., on a 4-point scale), and so a different method was used for setting standards on the extended response questions. The method, called the boundary exemplars method, involves looking at samples of student responses to identify those representative of performance at the cut score (see additional details in Chapters 3 and 4).
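To make the basic arithmetic of the (modified) Angoff method concrete, the sketch below aggregates panelist judgments into a cut score on the raw number-correct scale: each panelist estimates, for each dichotomous item, the probability that a borderline examinee would answer correctly; a panelist’s implied cut score is the sum of those probabilities, and the panel-level cut score is typically the mean across panelists. The numbers are hypothetical and the simple averaging rule is only illustrative; operational settings such as NAEP’s involve multiple rounds, feedback, and mapping to the reporting scale.

```python
# Illustrative sketch of the core (modified) Angoff computation.
# Each row holds one panelist's judged probabilities that a
# borderline examinee answers each of five dichotomous items
# correctly. All values are hypothetical.
judgments = [
    [0.90, 0.75, 0.60, 0.40, 0.25],  # panelist 1
    [0.85, 0.70, 0.65, 0.45, 0.30],  # panelist 2
    [0.95, 0.80, 0.55, 0.35, 0.20],  # panelist 3
]

# A panelist's implied cut score is the expected number-correct
# score of the borderline examinee: the sum of the probabilities.
panelist_cuts = [round(sum(row), 2) for row in judgments]

# The panel-level cut score is typically the mean across panelists.
cut_score = round(sum(panelist_cuts) / len(panelist_cuts), 2)

print(panelist_cuts)  # [2.9, 2.95, 2.85]
print(cut_score)      # 2.9
```

In practice, the resulting raw-scale cut is then translated onto the assessment’s reporting scale, and panelists revise their judgments over iterative rounds after seeing feedback such as the spread of the panel’s cut scores.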
As noted above, research shows that the different methods yield different results (e.g., see Jaeger, 1989, Table 14.1, pp. 498-499). Thus, the integrity of the resulting performance standards relies on making wise choices about methods and implementing them properly.
7 The basic task of the Angoff and modified Angoff methods is the same (Angoff, 1984), but the modified method adds a process of iterations and feedback. Since 1992, there have been additional modifications of the procedures (see Cizek, 2012). Throughout this chapter and the rest of the report, we use the terms interchangeably.
Moreover, experts now emphasize that there is no such thing as a “true” cut score. That is, standard setting does not result in the “best” estimate of a population parameter. To set a standard is to develop a policy, and policy decisions are not right or wrong. They can be wise or unwise, effective or ineffective, appropriate or inappropriate, but the goal of standard setting is not to uncover the true cut score, as has been discussed by Kane (2001) and Hambleton et al. (2012), among others.
The evaluations of the 1992 standard settings harshly criticized the modified Angoff method, noting that it presented panelists with an unreasonable cognitive task. One measurement expert described the complexity of the task and the objections to it (Haertel, 2001, pp. 260-261):
In carrying out the core judgmental process that defines the Angoff procedure, panelists are asked to: imagine a hypothetical group of minimally competent examinees, inspect a test item, infer the knowledge and skills required to answer that test item correctly, compare those skill requirements to the skill profile of the hypothetical borderline examinees, and state the proportion of such examinees who would answer the item correctly. . . .
[T]he hypothetical borderline examinees not only do not exist, but also do not even resemble any real-world examinees. They possess the set of skills in the borderline description derived from the achievement level description. Moreover, the fact that the hypothetical borderline examinee does not possess any pattern of proficiencies found in the real world implies that panelists’ experience with real students, however extensive, cannot suffice to inform their judgments.
But an alternate view was expressed by a researcher who helped conduct the standard settings (Reckase, 2001, p. 252):
Teachers, nonteacher educators, and members of the general public do cognitively complex tasks every day. However, cognitive complexity does not bear on whether or not the tasks can be done. The cognitive complexity does bear on how much training and background are needed to do the task. The ALD panelists were selected because of their experience with the educational system and the student population as well as their content expertise. They have the background to do the task. They were also given extensive training in every aspect of the process. Their background and training led to their own confident appraisal that they did do the required tasks well. There is no solid evidence that this was not the case.
Despite the harsh criticism, the Angoff procedure continues to be widely used in professional certification and licensure contexts, and it is still used for some K-12 standard settings. NAGB used the modified Angoff procedure until the 2005 NAEP grade-12 mathematics standard
setting, when a new approach (Mapmark) was adopted.8 NAGB has not used the Angoff procedure since then, although this is due, in part, to the use of item formats (e.g., extended constructed responses) that do not work well with the Angoff procedure. Today, it is widely recognized that setting standards involves making probabilistic judgments. Training panelists to perform these tasks is key to obtaining reliable and valid results.
In 1992, little guidance existed with regard to the development and use of ALDs for standard setting, and they were rarely used during the actual process of setting cut scores (Bourque, 2000, cited in Egan et al., 2012). NAEP’s 1992 standard setting represented the first time that formal, written descriptions were produced to guide panelists in standard setting (Bourque, 2000, cited in Egan et al., 2012). Prior to 1992, standard setting panelists were not provided with formal ALDs nor did they create them. During the 1980s, panelists did spend time discussing the concept of minimally competent candidates, but they did not formally write those definitions (see, e.g., Norcini et al., 1987, 1988; Norcini and Shea, 1992).
While the elements considered important in 1992 remain largely the same, they are now couched in different terms that relate panelists, methods, and implementation to the validity argument that supports (or undermines) the cut scores (e.g., procedural, internal, and external evidence; Kane, 2001). These changes are evident from the language and guidance in each new edition of the Standards (American Educational Research Association et al., 1985, 1999, 2014). A review of those three editions of the Standards illustrates the ways in which the practice of standard setting has evolved over time.
Table 2-1 displays the standards pertaining to standard setting across the three editions. As the table shows, the number of standards pertaining to standard setting increased from 1985 to 2014, and the nature of the guidance changed substantively. Although the 1985 Standards gives a nod to the need to describe the training, experience, and qualifications of experts (Standards 1.7 and 6.9), it is not until the 1999 Standards that more specifics are given regarding the use of experts and their training, qualifications, and experience (Standards 1.7 and 4.21). There were no standards in the 1985 version that addressed the process of standard setting, and all three editions are relatively quiet on the development of ALDs.
Standard 4.9 in the 1999 version and Standard 5.5 in the 2014 version and their associated comments, however, discuss the importance of evidence to support interpretations such as “a child scoring above a
TABLE 2-1 Comparison of Standard Setting Guidance in Successive Editions of Standards for Educational and Psychological Testing
| Topic | 1985 Edition | 1999 Edition | 2014 Edition |
|---|---|---|---|
| Validity | When subject-matter experts have been asked to judge whether items are an appropriate sample of a universe or are correctly scored, or when criteria are composed of rater judgments, the relevant training, experience, and qualifications of the experts should be described (Standard 1.7). | When a validation rests in part on the decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The descriptions of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may influence one another should be set forth (Standard 1.7). | When a validation rests in part on the decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The descriptions of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may influence one another should be set forth (Standard 1.9). |
| Reliability and Precision | Where cut scores are specified for selection or classification, the standard errors of measurement should be reported for score levels at or near the cut score (Standard 2.10). | Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score (Standard 2.14). | Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score (Standard 2.14). |
| Decision Consistency | If specific cut scores are recommended for decision making (for example, in differential diagnosis), the user’s guide should caution that the rates of misclassification will vary depending on the percentage of individuals tested who actually belong in each category (Standard 1.24). | When a test or combination of measures is used to make classification decisions, estimates should be provided of the percentage of test takers who would be classified in the same way on two replications of the procedure, using the same form or alternate form procedures (Standard 2.15). | When a test or combination of measures is used to make classification decisions, estimates should be provided of the percentage of test takers who would be classified in the same way on two replications of the procedure (Standard 2.16). |
| Interpretation Guidance | — | When raw score or derived score scales are designed for criterion-referenced interpretations, including the classification of examinees into separate categories, the rationale for recommended score interpretations should be clearly explained (Standard 4.9). | When raw score or derived score scales are designed for criterion-referenced interpretations, including the classification of examinees into separate categories, the rationale for recommended score interpretations should be clearly explained (Standard 5.5). |
| Documentation of Standard Setting Procedures | When a specific cut score is used to select, classify, or verify test takers, the method and rationale for setting that cut score, including any technical analyses, should be presented in a manual or report. When cut scores are based primarily on professional judgment, the qualifications of the judges also should be documented (Standard 6.9). | When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be clearly documented (Standard 4.19). | When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly (Standard 5.21). |
| Selection and Training of Judges | — | When cut scores defining pass-fail or proficiency categories are based on direct judgments about the adequacy of item or test performances or performance levels, the judgmental process should be designed so that judges can bring their knowledge and experience to bear in a reasonable way (Standard 4.21). | When cut scores defining pass-fail or proficiency levels are based on direct judgments about the adequacy of item or test performances, the judgmental process should be designed so the participants can bring their knowledge and experience to bear in a reasonable way (Standard 5.22). |
| Relationships with Relevant Criteria | When a test is designed or used to classify people into specified alternative treatment groups (such as alternative occupational, therapeutic, or educational programs) that are typically compared on a common criterion, evidence of the test’s differential prediction for this purpose should be provided (Standard 1.23). | When feasible, cut scores defining categories with distinct substantive interpretations should be established on the basis of sound empirical data concerning the relation of test performance to the relevant criteria (Standard 4.20). | When feasible and appropriate, cut scores defining categories with distinct substantive interpretations should be informed by sound empirical data concerning the relation of test performance to the relevant criteria (Standard 5.23). |
| Test Documentation | Organizations offering automated test interpretation should make available information on the rationale of the test, and a summary of the evidence supporting the interpretations given. This information should include the validity of the cut scores or configural rules used and a description of the samples from which they were derived (Standard 5.11). | When statistical descriptions and analyses that provide evidence of the reliability of scores and the validity of their recommended interpretations are available, the information should be included in the test’s documentation. When relevant for test interpretation, test documents ordinarily should include item level information, cut scores and configural rules, information about raw scores and derived scores, normative data, the standard errors of measurement, and a description of the procedures used to equate multiple forms (Standard 6.5). | Test documentation should summarize test development procedures, including the results of the statistical analyses that were used in the development of the test, evidence of the reliability/precision of scores and the validity of their recommended interpretations, and the methods for establishing performance cut scores (Standard 7.4). |
| Performance/Achievement-Level Descriptors | Results from certification tests should be reported promptly to all appropriate parties, including students, parents, and teachers. The report should contain a description of the test, what is measured, the conclusions and decision that are based on the test results, the obtained score, and information on how to interpret the reported score and any cut score used for classification (Standard 8.6). | When score reporting includes assigning individuals to categories, the categories should be chosen carefully and described precisely. The least stigmatizing labels, consistent with accurate representation, should always be assigned (Standard 8.8). | When score reporting assigns scores of individual test takers into categories, the labels assigned to the categories should be chosen to reflect intended inferences and should be described precisely (Standard 8.7). |
| Reporting Test Results | — | In educational settings, score reports should be accompanied by a clear statement of the degree of measurement error associated with each score or classification level and information on how to interpret scores (Standard 13.14). | In educational settings, score reports should be accompanied by a clear presentation of information on how to interpret the scores, including the degree of measurement error associated with each score or classification level, and by supplementary information related to group summary scores (Standard 12.18). |
NOTE: See text for discussion.
SOURCE: Adapted from American Educational Research Association et al. (1985, 1999, and 2014).
certain score point can successfully apply a given set of skills” (p. 103 in the 2014 Standards). It is not until the 1999 Standards that the setting of more than one cut score is discussed. In addition, the 1999 Standards indicated that “the least stigmatizing labels, consistent with accurate representation, should always be assigned” (Standard 8.8), whereas the 2014 Standards states that labels assigned to performance levels should “be chosen to reflect intended inferences and should be described precisely” (Standard 8.7).
Overall, the 1985 Standards focused on psychometric considerations in setting standards, while the 1999 and 2014 Standards addressed value considerations in addition to psychometric ones. All three editions of the Standards discuss the need to report the standard error of measurement for the cut scores and the rates of misclassification.
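The decision-consistency guidance quoted in Table 2-1 (Standards 2.15 and 2.16) calls for estimating the percentage of test takers classified the same way on two replications of the procedure. A minimal computational sketch of that estimate follows; the function name and the example classifications are hypothetical illustrations, not drawn from the Standards or from NAEP data:

```python
# Illustrative sketch of decision consistency (cf. Standards 2.15/2.16):
# the proportion of examinees placed in the same achievement category
# on two replications of a classification procedure.

def decision_consistency(rep1, rep2):
    """Proportion of examinees assigned the same category on both replications."""
    if len(rep1) != len(rep2):
        raise ValueError("both replications must classify the same examinees")
    agreements = sum(a == b for a, b in zip(rep1, rep2))
    return agreements / len(rep1)

# Hypothetical classifications of ten examinees into NAEP-style levels
# on two alternate forms of an assessment.
form_a = ["Basic", "Basic", "Proficient", "Advanced", "Basic",
          "Proficient", "Proficient", "Basic", "Advanced", "Basic"]
form_b = ["Basic", "Proficient", "Proficient", "Advanced", "Basic",
          "Proficient", "Basic", "Basic", "Advanced", "Basic"]

print(decision_consistency(form_a, form_b))  # 0.8 (8 of 10 classified consistently)
```

In practice such estimates are often corrected for chance agreement (e.g., with a kappa-type coefficient), but the raw agreement proportion is the quantity the Standards language directly describes.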
The 1999 and 2014 versions acknowledge the need for criterion-related evidence, which did not appear in the 1985 version: Standards 4.20 (1999) and 5.23 (2014) address the need to use “sound empirical data concerning the relation of test performance to relevant criteria” when setting cut scores that define categories with distinct substantive interpretations.
Besides the Standards (American Educational Research Association et al., 1985, 1999, 2014), the edited volume Educational Measurement, issued under the guidance of the National Council on Measurement in Education, provides a historical account of the psychometric considerations associated with cut scores. Feldt and Brennan (1989) discussed two types of indices appropriate for pass/fail classifications: one that considers whether examinees are consistently classified as pass or fail, and another based on the squared difference between an examinee’s score and the cut score, which subsumes both measurement error and classification error. An early variant of the latter index, reflecting both measurement error and classification error, was mentioned in a chapter on reliability (Stanley, 1971) in the second edition of Educational Measurement.
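To make the second type of index concrete, one well-known squared-error-loss agreement coefficient from this family can be written as below; this is offered only as an illustrative member of the family, not necessarily the specific variant Feldt and Brennan or Stanley described. Here $X$ is the observed score, $T_X$ the true score, $C$ the cut score, $\mu_X$ the mean observed score, and $\sigma^2(\cdot)$ a variance:

```latex
% Illustrative squared-error-loss agreement index: departures of the
% score distribution from the cut score C inflate both numerator and
% denominator, so the index exceeds the classical reliability
% coefficient sigma^2(T_X)/sigma^2(X) whenever mu_X differs from C.
k^2(X, T_X) \;=\; \frac{\sigma^2(T_X) + (\mu_X - C)^2}{\sigma^2(X) + (\mu_X - C)^2}
```

When scores cluster near the cut score, the index approaches the classical reliability coefficient; when the group mean lies far from the cut score, classifications are consistent even in the presence of measurement error, and the index rises toward 1.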
Many measurement experts participated in developing and carrying out the standard setting process for NAEP. Subsequent evaluations by other measurement experts raised questions about the integrity and validity of the process. The criticisms and recommendations were controversial and provoked considerable debate within the measurement field; at the same time, the debate was productive in leading to advances in the field of standard setting.
NAGB moved forward with achievement-level reporting for the 1992 results, but it took steps to address the criticisms. NAGB and NCES sponsored research conferences, sought advice from experts in standard setting, commissioned research, formed standing advisory groups, held training workshops, and published materials on standard setting. These efforts have helped to identify some best practices for standard setting.
Much more is known about standard setting now than in 1992, and it is more widely used in the context of K-12 achievement tests. There are still disagreements among measurement experts about many aspects of standard setting; these disagreements have fostered research that, in turn, has expanded knowledge about standard setting.
Setting performance standards for NAEP was a large undertaking. Although the standards in the then-current edition of the Standards provided guidance for some aspects of 1992 standard setting, many of the procedures used were novel and untested in the context of K-12 achievement testing. In particular, setting multiple performance standards had not been done in the past, and the use of different standard setting methods for multiple-choice items and constructed-response items was new. Subsequent revisions of the Standards provide more explicit guidance and standards for using these methods in achievement testing.