administered and scored. NCLB requires that states develop science achievement standards by the 2005–2006 school year, but does not require states to set cut scores until after they administer their first science assessments in 2007–2008.
Current methods for setting achievement levels fall into two categories: test-based methods and student-based methods. Test-based methods are those in which judges closely examine individual items or tasks in order to refine their understanding of the performance of students who fall near the border between two achievement levels. Student-based methods, by contrast, are procedures in which judgments are made about the skills or knowledge (or both) displayed by sample groups of students, generally by teachers who know them well. Methods that combine the test-based and student-based approaches have also been developed (Haertel and Lorie, 2004; Wilson and Draney, 2002). Each method is likely to yield somewhat different results, and no single method is recognized as the best for all circumstances (Jaeger, 1989).
While the language of standards-based reform has focused on setting high standards and helping all students move toward those levels, definitions of proficiency are not consistent from state to state or, in some cases, from district to district. For example, one study found that definitions of proficiency range from the 70th percentile or higher to as low as the 7th percentile. In other words, in one state, 30 percent of students in a particular grade may be identified as proficient in a subject, while in a neighboring state, 93 percent are identified as proficient. Such stark contrasts are likely to indicate that expectations—and perhaps the purposes of testing—are different in the two states, not that students in one state are vastly more competent in the assessed domain than students in the other.

Judgment is a key part of the standard-setting process, and the variability this introduces must be factored into planning. Different groups of human beings will not produce exactly comparable results using the same process, and this source of error variance must be taken into account in the process by which the results are validated (Linn, 2003). In one study in which the standards set by independent but comparable panels of judges were evaluated, the percentages of students identified as failing ranged from 9 to 30 percent on the reading assessment and from 14 to 17 percent on the mathematics assessment (Jaeger, Cole, Irwin, and Pratto, 1980).
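The arithmetic behind the 70th-versus-7th-percentile contrast can be made explicit. The short sketch below (the helper name is hypothetical, not from the study cited above) shows how a cut score placed at a given percentile of the score distribution translates into the share of students labeled proficient, assuming percentiles are computed on the same distribution used to set the cut.

```python
def percent_at_or_above(cut_percentile):
    """Share of students at or above a cut score placed at the
    given percentile of the score distribution (0-100 scale)."""
    return 100 - cut_percentile

# A cut at the 70th percentile labels 30 percent of students proficient;
# a cut at the 7th percentile labels 93 percent proficient.
print(percent_at_or_above(70))
print(percent_at_or_above(7))
```

The sketch illustrates why such comparisons reflect where each state placed its cut, not differences in student competence.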
The method chosen to set achievement standards also influences where the standards are set. An early study in which independent samples of teachers set standards using one of four methods showed considerable variability in the levels set (Poggio, Glasnapp, and Eros, 1981). On a 60-item reading test, the percentage of students who would have failed ranged from 2 to 29 percent across the standard-