time, aligned assessments enable the public to determine student progress toward the standards.
A study conducted for the committee found that alignment may be difficult to achieve (Wixson et al., 1999). The study examined assessments and standards in elementary reading in four states. Using a method developed by Webb (1997), the researchers analyzed the cognitive complexity of both the standards and the assessment items, and estimated the extent to which the assessment actually measured the standards.
The study found that, of the four states, two had a high degree of alignment, one was moderately aligned, and one was poorly aligned. One of the two highly aligned states, State A, achieved its alignment at least in part because it had relied on a commercial norm-referenced test to develop its standards; its standards were also the least cognitively complex of any of the states analyzed. By contrast, State B, whose standards were at the highest level of cognitive complexity, had the lowest degree of alignment: only 30 percent of its objectives were measured by the state-developed test.
The other two states administered two tests to measure reading. In State C, which had a high degree of alignment, the state-developed comprehension test measured essentially the same content and cognitive levels as the norm-referenced test. In State D, however, the second test, an oral-reading test, measured standards the other test did not and thus affected alignment. Even so, that state's assessments and standards were only moderately aligned overall.
The Wixson study suggests a number of possible reasons why attaining alignment is difficult. One has to do with the way states went about building their assessments. Unless a state deliberately designed a test to measure its standards, or developed standards to match the test, as State A did in the study, the test and the standards are unlikely to be aligned, particularly if the state uses an off-the-shelf test. Commercial off-the-shelf tests are deliberately built to sell in many states; since standards vary widely from state to state, such tests are unlikely to line up with any single state's standards. Thus states using commercial tests are likely to find gaps between the tests and their standards.
But even when states set out to develop a test to measure their standards, they are likely to find gaps as well. In large part, this is because a single test is unlikely to tap all of a state's standards, particularly the extensive lists of standards some states have adopted. In addition, the ability of tests to tap standards may be limited by practical constraints, such as testing time and cost. Time constraints have forced some states to limit tests to a few hours in length; as a result, such tests can seldom include enough items to measure every standard sufficiently. Financial constraints, meanwhile, have led states to rely more heavily on machine-scored items, rather than items that are scored by hand. And at this point, many performance-based tasks—which measure