tucky Instructional Results System program and those from the National Assessment of Educational Progress and the American College Testing Program (ACT) provides one example of a case in which a state’s assessment was not consistent with other indicators (Koretz and Baron, 1998). The Kentucky results showed dramatic upward trends, while the two national assessments showed modest improvement or level performance. Such contrasts raise questions about whether the state test results reflect real learning or merely the effects of test preparation or teaching to the test, and about whether the national tests were adequate measures of the Kentucky curriculum.
In another example, California’s strong accountability system in reading and mathematics resulted in impressive initial improvement in test scores, with the majority of elementary schools meeting their target goals. Years 2 and 3 of the program saw diminishing returns, with substantially fewer schools reaching their goals, and results from 2004 showed no consistent trends (Herman and Perry, 2002). Some observers believe that patterns such as these illustrate the limits of what can be achieved primarily through test preparation, and that continuing improvement over the long term will require meaningful changes in the teaching and learning process. These findings suggest the need for states to continuously validate their gains and the meaning of their science scores over time.
Ensuring the reliability of scores is another challenge facing those who monitor school performance from year to year. All test scores are fallible. Individual test scores reflect actual student capability, but they are also subject to error introduced by variations in students’ motivation and state of health on the day of the test, in how attentive they are to the cues and questions in the test, in how well prepared they are for a particular test format, and in other factors. Test scores at the school level similarly reflect an amalgam of students’ actual knowledge and skills and error. Error can be introduced by unpredictable events, such as loud construction near the school, waves of contagious illness, and other factors that affect which students are actually tested. In addition, there is inevitably substantial volatility in scores from year to year that has nothing to do with student learning and more to do with variations in the population of students assessed, particularly for smaller schools and schools with high transiency rates. This volatility means that year-to-year changes in test scores must be interpreted in light of these potential sources of measurement error. For example, in an analysis of Colorado’s reading and mathematics assessments, Linn and Haug (2002) found that less than 5 percent of the state’s schools showed consistent growth on the Colorado Student Assessment Program of at least 1 percentage point per year from 1997 to 2000, even though schools on average showed nearly a 5 percent increase over the three-year period in the number of