Opportunities for Better Assessment
The present moment—when states are moving toward adopting common standards and a federal initiative is pushing them to focus on their assessment systems—seems to present a rare opportunity for improving assessments. Presenters were asked to consider the most promising ways for states to move toward assessing more challenging content and providing better information to teachers and policy makers and to identify the elements that would need to be in place. Laurie Wise enumerated some of the possible opportunities, Ronald Hambleton described some of the challenges of supporting cross-state comparisons in a new era of innovative testing, and Rebecca Zwick concluded with some historical perspective on cross-state comparisons and suggestions for moving forward.
Laurie Wise began with a reminder of the goals for improvement that had already been raised. Assessments need to support a wide range of policy uses. Current tests have limited diagnostic value and are not well integrated with instruction or with interim assessments. At the same time, however, they are not providing optimal information for accountability purposes because they cover only a limited range of what is and should be taught. Improvements are also needed in validity, reliability, and fairness. For current tests, he observed, there is little evidence that they are good indicators of instructional effectiveness or good predictors of students’ readiness for subsequent levels of instruction. Their reliability is limited because they are generally targeted to very broad
content specifications, and limited progress has been made in assessing all students accurately. Improvements such as computer-based testing and automated scoring need to become both more feasible in the short run and more sustainable in the long run.
In Wise’s view, widespread adoption of common standards might help with these challenges in two ways: (1) by pooling their resources, states could get more for the money they spend on assessment, and (2) by collaborating across states, they could conduct deeper cognitive analysis of standards and objectives for student performance than is possible with separate standards.
The question of how much states could save by collaborating on assessment begins with the question of how much they are currently spending to do it on their own. Savings would likely be limited to test development, since many per-student costs for administration, scoring, and reporting would not be affected, so Wise focused on an informal survey he had done of development costs (Wise, 2009). (Wise’s sample included 15 state testing programs and a few test developers, and included only total contract costs, not internal staff costs.) The results are shown in Table 5-1.
Perhaps most notable was the wide range in what states are spending, as shown in the minimum and maximum columns. Wise also noted that on average the states surveyed were spending well over a million dollars annually to develop assessments that require human scoring and $26 per student to score them.
TABLE 5-1 Average State Development and Administration Costs by Assessment Type
A total of $350 million will be awarded to states through the Race to the Top Initiative. That money, plus the savings that winning states or consortia could achieve by pooling their resources and the potential savings from new efficiencies such as computer delivery, would likely give a number of states as much as $13 million each to spend on ongoing development without increasing their current costs, Wise calculated.
Improved Cognitive Analyses
Wise noted that the goal for the common core standards is that they will be better than existing state standards—more crisply defined, clearer, and more rigorous. They are intended to describe the way learning should progress from kindergarten through 12th grade to prepare students for college and work readiness. Assuming that the common standards meet these criteria, states could collaborate to conduct careful cognitive analysis of the skills to be mastered and how they might best be assessed. Working together, states might have the opportunity to explore the learning trajectories in greater detail, for example, in order to pinpoint both milestones and common obstacles to mastery, which could in turn guide decisions about assessment. Clear models for the learning that should take place across years and within grades could support the development of integrated interim assessments, diagnostic assessments, and other tools for meeting assessment goals.
The combination of increased funding for assessments and improved content analyses would, in turn, Wise suggested, support the development of more meaningful reporting. The numerical scales commonly used now offer very little meaningful information beyond identifying students above and below an arbitrary cut-point. A scale that was linked to detailed learning trajectories (which would presumably be supported by the common standards and elaborated through further analysis) might identify milestones that better convey what students can do and how ready they are for what comes next.
Computer-adaptive testing would be particularly useful in this regard since it provides an easy way to pinpoint an individual student’s understanding; in contrast, a uniform assessment may provide almost no information about a student who is performing well above or below grade level. Thus, reports of both short- and long-term growth would be easier, and results could become available more quickly. Results that were available more quickly and also provided richer diagnostic information could also improve teacher engagement. Another benefit would be in the increased potential for establishing assessment validity. Test results that closely map onto defined learning trajectories could support much stronger inferences about what students have mastered than are
possible with current data, and they could also better support inferences about the relationship between instruction and learning outcomes.
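The adaptive idea Wise described can be illustrated with a toy sketch. This is not an operational computer-adaptive testing algorithm (real systems use item response theory for item selection and maximum-likelihood ability estimation); the function names, the step-halving update, and the simulated student are all invented here for illustration. The point it shows is the one in the text: by repeatedly choosing items matched to the current estimate, an adaptive test quickly homes in on a student well above or below the starting level.

```python
def adaptive_test(respond, item_bank, n_items=10):
    """Toy adaptive test: administer the available item whose difficulty
    is closest to the current ability estimate, then move the estimate
    up or down by a shrinking step. Illustrative only; operational CAT
    uses IRT-based selection and estimation."""
    theta, step = 0.0, 1.0
    available = sorted(item_bank)
    for _ in range(n_items):
        b = min(available, key=lambda d: abs(d - theta))
        available.remove(b)
        theta += step if respond(b) else -step
        step = max(step / 2, 0.125)  # shrink steps as the test proceeds
    return theta

# Deterministic simulated student: answers correctly exactly when the
# item is easier than the student's true ability of 1.3.
true_theta = 1.3
estimate = adaptive_test(lambda b: b < true_theta,
                         item_bank=[i / 4 for i in range(-12, 13)])
```

After ten items the estimate sits close to the simulated student's true ability, even though the test started at the middle of the scale; a fixed-form test targeted at grade level would give far less information about such a student.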
It is clear that common standards can support significant improvements in state assessments, Wise said. The potential cost advantages are apparent. But perhaps more important is that concentrating available brain power and resources on the elaboration of one set of thoughtful standards (as would be possible if a number of states were all assessing the same set of standards) would allow researchers to work together and yield faster and better results.
One important reason to have common standards would be to establish common learning objectives for students regardless of where they live. But unless all students are also assessed with exactly the same instruments, comparing their learning across states is not so easily accomplished. Nevertheless, the capacity to make cross-state comparisons remains important to policy makers and educators. Ron Hambleton described some of the complex technical issues that surround this challenge, known to psychometricians as “linking”: placing test results on the same score scale so that they can be compared.1
The basic issue that linking procedures are designed to address is the need to determine, when results from two tests differ, whether the difference means that one test is easier than the other or that one group of students is more able in some way. There are several different statistical procedures for linking the results of different tests (see National Research Council, 1999a, 1999b; see also http://nces.ed.gov/pubs98/linking/c3.asp [June 2010]). If the tests were developed to measure precisely the same constructs, to meet the same specifications, and to produce scores on the same scale, linking procedures are relatively straightforward. They are nevertheless very important, since users count on the results of, say, the SAT to mean exactly the same thing year after year. Other, less straightforward, linking approaches are necessary when the tests are developed to different frameworks and specifications or yield scores on different scales. A common analogy is the formula for converting temperatures between degrees Fahrenheit and Celsius, but because educational tests measure complex cognitive activities, procedures for linking test scores are far more involved.
In general, Hambleton explained, to compare the achievement of different groups of students it is necessary to have some comparability not only in the standards to be measured against, but in the tests and the curricula to which the students have been exposed. While much is in flux at the moment, Hambleton judged that it appears likely that states will continue to use a variety
of approaches to assessment, including varying combinations of paper-and-pencil format, computer-based assessment, and various kinds of innovative assessments. This approach may be effective for supporting instruction at the state and district levels, he pointed out, but it will tremendously complicate the task of producing results that can be compared across states. Innovative item types are generally designed to measure new kinds of skills, but the more detailed and numerous the constructs measured become, the greater the challenge of linking the results of one assessment to those of another.
To demonstrate the challenge, Hambleton began with an overview of the complexity of linking traditional assessments. Even in the most strictly standardized testing program, it is nearly impossible to produce tests from year to year that are so clearly equivalent that scores can be compared without using linking procedures to make statistical adjustments. Even for the SAT, possibly the most thoroughly researched and well-funded testing program in the world, psychometricians have not been able to get around the need to make statistical adjustments in order to link every different form of the test every year. The basic procedure for linking two test forms is, first, to ensure either that some number of people take both forms or that some number of items appear in both forms, and, second, to use statistical procedures to adjust (in one of several ways) for any differences in difficulty between the two forms that become apparent (see Hambleton, 2009).
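The adjustment step Hambleton described can be sketched in its simplest form: a linear linking for an equivalent-groups design, in which Form Y scores are rescaled to have Form X's mean and standard deviation. The function name and the data are invented for illustration, and operational equating uses far more elaborate designs (common-item nonequivalent groups, equipercentile and IRT methods) than this sketch.

```python
from statistics import mean, stdev

def linear_link(scores_x, scores_y):
    """Return a function mapping Form Y scores onto the Form X scale so
    that linked scores take on Form X's mean and standard deviation."""
    mx, sx = mean(scores_x), stdev(scores_x)
    my, sy = mean(scores_y), stdev(scores_y)
    return lambda y: mx + (sx / sy) * (y - my)

# Scores from two equivalent groups; Form Y ran about two points harder,
# so raw Form Y scores sit two points below the Form X scores.
form_x_scores = [18, 20, 22, 24, 26]
form_y_scores = [16, 18, 20, 22, 24]
to_x_scale = linear_link(form_x_scores, form_y_scores)
# A raw 20 on the harder Form Y links to 22 on the Form X scale.
```

The sketch makes Hambleton's point concrete: the linked score depends entirely on the statistical assumption that the two groups are comparable, which is exactly what differences in curricula, timing, and administration conditions across states undermine.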
The same must also be done if comparisons are to be made across states, Hambleton noted. This is easiest if the states share the same content standards and proficiency standards and also administer the same tests, as is the case with the New England Common Assessment Program (NECAP). However, even when these three elements are the same, it is still possible that items will perform differently across states because of variations in teaching methods, curricula, or other factors, and this possibility must be checked. Any changes to the nature or administration of a test may affect the way items perform. Thus, the addition of new item types, separate sections to measure an individual state’s desired content, changes in the positioning of items, or conversion of certain parts of an assessment to computer delivery, may affect equating (see National Research Council, 1999a, 1999b). The timing of the test is also important: even if states have similar curricula, if they administer a common test at different points in the year it will affect the quality of the linking.
Since common items are necessary to perform cross-state linking procedures, some of each type must be given to each group of test takers, and therefore scoring of constructed-response items must be precisely consistent across states. The testing conditions must be as close to identical as possible, including such operational features as test instructions, timing, and test security procedures, as well as larger issues, such as the stakes attached to test results. Even minor differences in answer sheet design or test book layout have the
potential to interfere with item functioning (see National Research Council, 1999a, 1999b).
Computer-based testing adds to the complexity by potentially allowing administrations to vary in new ways (in time, in context, etc.) and also by requiring that all items be linked in advance. States and districts are likely to use different operating systems and platforms, which may affect seemingly minor details in the way tests look to students and the way they interact with them (e.g., how students scroll through the test or show their work), and these differences would have to be accounted for in the linking procedures. If there are not enough computers for every student to take an assessment at the same time, the testing window is usually extended, but doing that may affect both the linking and security.
Hambleton stressed that all of these constraints apply to comparisons between just two test forms or states or within a consortium that is using the same assessment. If the comparison is extended across consortia, more significant sources of variation come into play, including multiple different sets of curricula, test design, performance standards, and so forth.
The essential principle of linking, for which Hambleton credited psychometrician Albert Beaton, is “if you want to measure change or growth over time, don’t change the measure.” But this approach is completely impractical in the current context of educational testing, where change—in the content to be tested and in test design—is almost constant.
Moreover, Hambleton explained, most linking procedures rest on the idea that a test is unidimensional (i.e., that it measures skills in a single domain), but new forms of testing are multidimensional (they measure skills in more than one domain). So not only would linking have to account for each of the dimensions assessed, it would have to account for changing combinations of dimensions as tests evolve. Hambleton advised that, at least in assessments used for summative purposes, it will be necessary to place some constraints on these factors to make linking possible.
Hambleton believes that states’ capacity to maintain high-quality linking procedures is already being stretched. With new sorts of assessments it will be critical to take seriously the need to expend resources to sustain the psychometric features that are needed to answer the questions policy makers ask. New research on questions such as vertical scaling and linking tests that use multiple assessment modes will be necessary to support the current goals for assessment reform. A related, but equally important, challenge will be that of displaying and reporting new kinds of data in ways that users can easily understand and use. In addition, he said, the number of issues involved will complicate the task of supporting valid comparisons of 21st-century assessments, and Hambleton joked that “no psychometrician will be left behind.” He acknowledged, however, that they “can’t just sit in their ivory towers and take five years to solve a problem.”
PERSPECTIVES: THE PAST AND THE FUTURE
The importance of comparisons across states is seldom questioned now, but Rebecca Zwick pointed out that this was not always the case. When the National Assessment of Educational Progress (NAEP) was first developed in the 1960s, she noted, it was not designed to support such comparisons (Zwick, 2009). At that time, the public was very wary of federal involvement in education. Federal enforcement of school desegregation and the passage of the Civil Rights Act of 1964 and the Elementary and Secondary Education Act in 1965 were viewed as the limit of federal involvement in education that the public would accept, and state-by-state comparisons were viewed as too polarizing.
By the 1980s, however, the idea of promoting academic excellence through competition was becoming increasingly popular. President Ronald Reagan’s 1984 State of the Union address called for comparisons of academic achievement among states and schools, arguing that “without standards and competition there can be no champions, no records broken, no excellence…” (Zwick, 2009). In response, the U.S. Department of Education first developed “wall charts,” which visually displayed comparisons of the resources states committed to education and measures of performance, such as average SAT and ACT scores. These comparisons were widely viewed as unilluminating at best, since the college admissions tests reflected only the performance of college-bound students and were not aligned to curriculum and instruction in the states. By the late 1980s, NAEP had developed its Trial State Assessment, which compared the mathematics performance of 8th graders in 37 states. By the mid-1990s the state assessments were fully operational. Later in the decade, the Clinton administration proposed a “voluntary national test” as a way of collecting comparative information on individual students, but political controversy doomed the project before it was implemented.
The NAEP state assessments received an additional boost when the No Child Left Behind Act (NCLB) required states receiving Title I funding to participate, Zwick noted. (Participation is voluntary, and NAEP has occasionally struggled to secure adequate representation for its matrix-sampling design.) NCLB also required states to meet various requirements for their own assessments and accountability provisions. Comparisons began to be made among states on the basis of the results of their own assessments, even though the assessments used different designs, proficiency standards, and so on. Studies of states’ results have shown that states’ definitions of proficiency vary significantly and tend to be lower than those used in NAEP (see National Research Council, 2008), and this variation was one of the reasons enthusiasm has grown for the common core standards.
As Laurie Wise discussed, the common core standards are likely to help states achieve economies of scale, and their decisions to form consortia as part of the Race to the Top competition will do so as well. With more money to spend on assessment, states should be able to produce tests that are better aligned
to standards, contain richer and more innovative items, and are more reliable. Results are likely to be reported more quickly and provide more useful information for diagnosing individual students’ needs and guiding instruction. And states could more easily take advantage of a growing body of knowledge and experience as they collaborate to develop new approaches. Funding for research to improve cognitive analyses of test content, validity, and many other issues could feed the development of more innovative assessments that are also feasible and affordable. And, as the NECAP example has demonstrated, when standards and assessments are shared, states are both somewhat shielded from public reaction to disappointing results and correspondingly less likely to lower their performance standards to achieve nominally better results.
However, as Hambleton made clear, there are significant challenges to making sure that state-to-state comparisons are valid, fair, and useful. One premise of the common core standards is that they will allow states to measure their students against the same expectations, but the rules permit states to supplement the core with up to 15 percent additional content of their own choosing. To illustrate, Zwick suggested quantifying the standards and assuming that the 10 states in a consortium share 85 standards and that each has 15 additional unique standards. In that case, out of the total of 85 + 10(15) = 235 standards, only 85/235(100), or 36.2 percent, would be shared among the 10 states. Since no single assessment is likely to cover all of a state’s standards at once, the proportion of content shared across the 10 state assessments could be significantly lower.
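Zwick's arithmetic can be checked directly. This short sketch simply restates her illustration (the variable names are invented here):

```python
# Zwick's illustration: a consortium of 10 states shares 85 standards,
# and each state adds 15 unique standards of its own.
shared_standards = 85
n_states = 10
unique_per_state = 15

total = shared_standards + n_states * unique_per_state  # 235 standards
shared_pct = 100 * shared_standards / total             # about 36.2%
```

Even under full adoption of the common core, then, just over a third of the combined pool of standards would be common to all ten states in this scenario.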
As Hambleton explained, every source of variation among state tests can impair the ability to compare results and report them on the same scale. Curriculum and instruction will not be shared, even if states share standards. States are likely to find it very difficult to adhere to common procedures for every aspect of testing (preparation, administration, booklet or format design, accommodations, etc.). States also differ in terms of demographic characteristics, per-pupil expenditures, teacher training, class size, and many other features likely to affect test results.
Zwick offered several suggestions for helping states support meaningful comparisons across states.
Provide as much context as possible: Document differences across states in order to provide a context for comparisons.
Make the test instrument and testing policies as similar as possible: Collaborate on scoring and reporting methods and make sure that format, instructions, and protocols for each phase of assessment (including preparation) are uniform. Put unique state items in a separate section to minimize their influence on students’ performance on the common items. Develop an adequate pool of items in each area of interest.
Invest in research and professional development: Provide training to school administrators and teachers in the interpretation and uses of test scores.
Zwick also advocated fostering collaboration among psychometricians, educators, and policy makers, who, despite their differences in perspective, are likely to be pursuing the same ultimate goal—the improvement of teaching and learning in U.S. classrooms.
Some participants suggested that many of the problems with state comparisons could be solved if states opted to use common assessments, but others reiterated that the many sources of variation among states still remain. It was noted that other countries have had success with common curricula and common tests and for some that seems like the best route. Others pointed out that NAEP already provides state comparisons, though without any particular alignment to the standards to which states are working and without scores for individual students.