Challenges of Developing New Assessments
The many opportunities that innovative, coherent assessment systems seem to offer were clearly inspiring to many participants, but the challenges of developing a new generation of assessments that meet the goals in a technically sound manner were also apparent. Rebecca Zwick provided an overview of some of the technical issues, and Ron Hambleton looked in depth at issues related to ensuring that student performance can be compared across states.
TECHNICALLY SOUND INNOVATIVE ASSESSMENTS
“We ask a lot of our state assessments these days,” noted Zwick, and she enumerated some of the many goals that had been mentioned. Tests should be valid and reliable and support comparisons at least within consortia and across assessment years. They should also be fair to ethnic, gender, and socioeconomic groups, as well as to English language learners and students with disabilities. They should be conducive to improved teaching and useful for cognitive diagnosis. They should be developmentally appropriate and well aligned with standards, curriculum, and instruction, and they should be engaging to students. They should also provide data quickly enough to be useful in current lessons, according to the specifications in the Race to the Top grant application information.1
How easily might innovative assessments used for summative, accountability purposes meet these goals? First, Zwick observed, many aspects of the current vision of innovative assessment (e.g., open-ended questions, essays, hands-on science problems, computer simulations of real-world problems, and portfolios of student work) were first proposed in the early 1990s, and in some cases as far back as the work of E.F. Lindquist in the early 1950s (Lindquist, 1951; also see Linn et al., 1991). She cited as an example a science test that was part of the 1989 California Assessment Program. The assessment consisted of 15 hands-on tasks for 6th graders, set up in stations, which included developing a classification system for leaves and testing lake water to see why fish were dying (see Shavelson et al., 1993). The students conducted experiments and prepared written responses to questions. The responses were scored using a rubric developed by teachers.
This sort of assessment is intrinsically appealing, Zwick observed, but it is important to consider a few technical questions. Do the tasks really measure the intended higher-order skills? Procedural complexity does not always guarantee cognitive complexity, she noted, and, as with multiple-choice items, teaching to the test can undermine the value of its results. If students are drilled on the topics of the performance assessments, such as geometry proofs or writing 20-minute essays, it may be that, when tested, they would not need to use higher-order thinking skills to do these tasks because they have memorized how to do them.
Another question is whether the results can be generalized across tasks. Can a set of hands-on science tasks be devised that could be administered efficiently and from which one could generalize broad conclusions about students’ science skills and knowledge? Zwick noted that a significant amount of research has shown that for real-world tasks, the level of difficulty a task represents tends to vary across test takers and to depend on the specific content of the task. In other words, there tend to be large task-by-person interactions. In a study that examined the California science test discussed above, for example, Shavelson and his colleagues (1993) found that nearly 50 percent of the variability in scores was attributable to such interactions (see also Baker et al., 1993; Stecher and Hamilton, 2009).
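The kind of analysis behind such findings can be sketched with a small generalizability-style variance decomposition. In a two-way person-by-task layout with one score per cell, the residual mean square absorbs the person-by-task interaction (confounded with error). The score matrix below is invented for illustration; it is not the Shavelson data.

```python
import numpy as np

# Hypothetical scores: rows = persons, columns = tasks (illustrative data only).
scores = np.array([
    [1.0, 2.0, 1.0, 2.0],
    [4.0, 3.0, 4.0, 3.0],
    [2.0, 4.0, 1.0, 3.0],
    [3.0, 1.0, 4.0, 2.0],
    [4.0, 4.0, 3.0, 4.0],
])
n_p, n_t = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
task_means = scores.mean(axis=0)

# Two-way ANOVA without replication: partition the total sum of squares.
ss_p = n_t * ((person_means - grand) ** 2).sum()
ss_t = n_p * ((task_means - grand) ** 2).sum()
ss_tot = ((scores - grand) ** 2).sum()
ss_res = ss_tot - ss_p - ss_t  # person-by-task interaction + error

ms_p = ss_p / (n_p - 1)
ms_t = ss_t / (n_t - 1)
ms_res = ss_res / ((n_p - 1) * (n_t - 1))

# Variance-component estimates under a random-effects model
# (negative estimates are clamped to zero).
var_res = ms_res                        # interaction + error
var_p = max((ms_p - ms_res) / n_t, 0)   # person (true-score) variance
var_t = max((ms_t - ms_res) / n_p, 0)   # task difficulty variance

total = var_p + var_t + var_res
print(f"share of variance from person-by-task interaction: {var_res / total:.0%}")
```

With these made-up numbers the interaction term accounts for roughly two-thirds of the variance, the same qualitative pattern Zwick described: much of what the scores reflect depends on which particular tasks a student happened to get.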
Yet another question is whether such tests can be equitable. As is the case with multiple-choice tests, Zwick noted, performance tests may inadvertently measure skills that are irrelevant to the construct—if some students are familiar with a topic and others are not, for example, or if a task requires students to write and writing skills are not the object of measurement. Limitations in mobility and coordination may impede some students’ access to hands-on experiments at stations or their capacity to manipulate the materials. Some students may have anxiety about responding to test items in a setting that is more public than individual paper-and-pencil testing. Almost any content and format could pose this sort of issue for some students, Zwick said, and research has shown
that group differences are no less of a problem with performance assessment than they have been with multiple-choice assessment (see, e.g., Dunbar et al., 1991; Linn et al., 1991; Bond, 1995).
Reliability is generally lower for performance items scored by human raters than for multiple-choice items (see Lukhele et al., 1994). Achieving acceptable reliability may require extensive (and expensive) efforts for each task, including development and refinement of rubrics and training of raters. Zwick noted a number of challenges in trying to develop assessments that provide results that are comparable across years and from state to state. Performance tasks tend to be longer and more memorable than multiple-choice items, so security concerns would dictate that they not be repeated, making linking difficult (as discussed below). Because performance tasks are more time consuming, students will generally complete fewer of them, another reason why reliability can be low and linkages difficult to establish.
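The connection between the number of tasks and reliability can be made concrete with the Spearman-Brown prophecy formula, which projects the reliability of a test assembled from parallel tasks. The single-task reliability of 0.30 below is an arbitrary illustrative value, not a figure from the workshop.

```python
def spearman_brown(rho_one_task: float, n_tasks: float) -> float:
    """Projected reliability of a test built from n_tasks parallel tasks,
    given the reliability (rho_one_task) of a single task."""
    return n_tasks * rho_one_task / (1 + (n_tasks - 1) * rho_one_task)

# Illustrative: suppose a single performance task has reliability 0.30.
single = 0.30
for n in (1, 2, 4, 8):
    print(f"{n} tasks -> projected reliability {spearman_brown(single, n):.2f}")
```

With these illustrative numbers, four tasks roughly double the reliability of one, which is why a format in which students complete only a few time-consuming tasks tends to produce less reliable scores.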
Zwick also addressed some of the challenges of using computerized adaptive tests. She acknowledged the many benefits others had already identified, such as flexible administration, instant reporting, and more precise estimates of proficiency. However, she noted, development of these tests requires significant resources. Very large numbers of items are needed—for example, the Graduate Management Admission Test (GMAT) had a pool of 9,000 items when it converted to this format in 1997 and has steadily increased its pool since then (Rudner, 2010). A large pool of items is needed to cover a range of difficulty levels. Security is another concern, particularly in the case of high-stakes exams, such as the GMAT and Graduate Record Examination (GRE), where there is ample motivation for people to memorize items for use in test preparation. Thus, most programs use both multiple rotating pools of items and algorithms that control test takers’ exposure to particular items.
With these concerns in mind, Zwick offered several recommendations. Psychometrics should be an integral part of the planning from the inception of a program, she said. That is, effective scoring and linking plans cannot be developed after data have been collected. Pilot testing of every aspect of an assessment is also very important, including test administration, human and machine scoring, and score reporting. And finally, she highlighted the importance of taking advantage of the lessons others have already learned, closing with a quotation from a 1994 paper on the Vermont Portfolio Assessment: “The basic lesson … is the need for modest expectations, patience, and ongoing evaluation in our national experimentation with innovative large-scale … assessments as a tool of educational reform” (Koretz et al., 1994).
One important reason to have common standards would be to establish common learning objectives for students regardless of where they live. But
unless all students are also assessed with exactly the same instruments, comparing their learning across states is not so easily accomplished. Nevertheless, the capacity to make cross-state comparisons remains important to policy makers and educators. Ron Hambleton described some of the complex technical issues that surround this challenge, known to psychometricians as “linking”: placing test results on the same score scale so that they can be compared.2
The basic issue that linking procedures are designed to address is the need to determine, when results from two tests appear to be different, whether that difference means that one of the tests is easier than the other or that one of the groups of students is more able than the other. There are several different statistical procedures for linking the results of different tests (see National Research Council, 1999a, 1999b).3 If the tests were developed to measure precisely the same constructs, to meet the same specifications, and to produce scores on the same scale, the procedures are relatively straightforward. They are still very important, though, since users count on the results of, say, the SAT (formerly, the Scholastic Aptitude Test) to mean exactly the same thing, year after year. More complex approaches are necessary when the tests are developed to different frameworks and specifications or yield scores on different scales. A common analogy is the formula for converting temperatures between degrees Fahrenheit and Celsius, but because of the complexity of the cognitive activities measured by educational tests, procedures for linking test scores are significantly more complex.
In general, Hambleton explained, in order to compare the achievement of different groups of students it is necessary to have some comparability not only in the standards that guide the assessment, but also in the tests and the curricula to which the students have been exposed. While much is in flux at the moment, Hambleton suggested that it appears likely that states will continue to use a variety of approaches to assessment, including combinations of paper-and-pencil format, computer-based assessment, and various kinds of innovative assessments. This multifaceted approach may be an effective way to support instruction at the state and district levels, he pointed out, but it will complicate tremendously the task of producing results that can be compared across states. Innovative item types are generally designed to measure new kinds of skills, but the more detailed and numerous the constructs measured become, the greater the challenge of linking the results of one assessment to those of another.
To demonstrate the challenge, Hambleton began with an overview of the complexity of linking traditional assessments. Even in the most strictly standardized testing program, it is nearly impossible to produce tests from year to year that are so clearly equivalent that scores can be compared without using statistical linking procedures. Even for the SAT, possibly the most thoroughly researched and well funded testing program in the world, psychometricians have not been able to get around the need to make statistical adjustments in order to link every different form of the test every year. There are two basic procedures involved in linking two test forms: first, to make sure either that some number of people take both test forms or that some number of items appear in both test forms, and, second, to use statistical procedures to adjust (in one of several ways) for any differences in difficulty between the two test forms that become apparent (see Hambleton, 2009).
2Hambleton credited two reports from the National Research Council (1999a, 1999b) for much of his discussion.
3See also information from the National Center for Education Statistics at http://nces.ed.gov/pubs98/linking/c3.asp [accessed August 2010].
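The two-step logic of this procedure (overlap via common persons or common items, then a statistical adjustment) can be sketched with a simplified mean-sigma linear adjustment on anchor scores. The scores below are invented for illustration; operational programs use far more elaborate designs and much larger samples.

```python
import statistics

# Hypothetical anchor design: the same anchor items appear on Form X and
# Form Y. Each list holds one group's anchor scores (illustrative numbers).
anchor_on_x = [12, 15, 14, 10, 13, 16, 11, 14]  # group that took Form X
anchor_on_y = [10, 13, 12, 8, 11, 14, 9, 12]    # group that took Form Y

mu_x, sd_x = statistics.mean(anchor_on_x), statistics.pstdev(anchor_on_x)
mu_y, sd_y = statistics.mean(anchor_on_y), statistics.pstdev(anchor_on_y)

def link_y_to_x(y_score: float) -> float:
    """Mean-sigma linear adjustment: re-express a Form Y score on the
    Form X scale (a slope and an intercept, like converting Celsius
    to Fahrenheit)."""
    return (sd_x / sd_y) * (y_score - mu_y) + mu_x

# A raw anchor score of 12 earned alongside Form Y, on the Form X scale:
print(link_y_to_x(12.0))  # prints 14.0
```

In this toy example the Form Y group scored uniformly lower on the anchor, so the adjustment is a simple shift; real linking must also handle differences in spread, and the caveats Hambleton raised (item position, timing, administration conditions) all threaten the assumption that the anchor items behave the same way in both groups.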
The same must also be done if comparisons are to be made across states, Hambleton noted. This is easiest if the states share the same content standards and proficiency standards and also administer the same tests, as is the case with the New England Common Assessment Program (NECAP). However, even when these three elements are the same, it is still possible that items will perform differently across states because of variations in teaching methods, curricula, or other factors, and this possibility must be checked. Any changes to the nature or administration of a test may affect the way items perform. Thus, the addition of new item types, separate sections to measure an individual state’s desired content, changes in the positioning of items, or conversion of certain parts of an assessment to computer delivery, may affect equating (see National Research Council, 1999a, 1999b). The timing of the test is also important: even if states have similar curricula, if they administer a common test at different points in the year it will affect the quality of the linking.
In addition, Hambleton noted, since common items are necessary to perform cross-state linking procedures, some of each type must be given to each group of test takers, and therefore scoring of constructed-response items must be precisely consistent across states. The testing conditions must be as close to identical as possible, including such operational features as test instructions, timing, and test security procedures, as well as larger issues, such as the stakes attached to test results. Even minor differences in answer sheet design or test book layout have the potential to interfere with item functioning (see National Research Council, 1999a, 1999b).
Computer-based testing adds to the complexity by potentially allowing administrations to vary in new ways (in time, in context, etc.) and also by requiring that all items be linked in advance. States and districts are likely to use different operating systems and platforms, which may affect seemingly minor details in the way tests look to students and the way they interact with them (e.g., how students scroll through the test or show their work), and these differences would have to be accounted for in the linking procedures. If there are not enough computers for every student to take an assessment at the same
time, the testing window is usually extended, but doing that may affect both the linking and security.
Hambleton stressed that all of these constraints apply whether the desired comparison is between just two test forms or state assessments or among a number of states in a consortium that use the same assessment. If the comparison is extended across consortia, more significant sources of variation come into play, including multiple different sets of curricula, test design, performance standards, and so forth.
The essential principle of linking, for which Hambleton credited psychometrician Albert Beaton, is “if you want to measure change or growth over time, don’t change the measure.” But this approach is completely impractical in the current context of educational testing, where change—in the content to be tested and in test design—is almost constant.
Moreover, Hambleton noted, most linking procedures rest on the idea that a test is unidimensional (i.e., that it measures skills in a single domain), but new forms of testing are multidimensional (they measure skills in more than one domain). So not only would linking have to account for each of the dimensions assessed, it would have to account for changing combinations of dimensions as tests evolve. Hambleton suggested that, at least in assessments used for summative purposes, it would be necessary to place some constraints on these factors to make linking possible.
Hambleton suggested that states’ capacity to maintain high-quality linking procedures is already being stretched. With new sorts of assessments it will be critical to take seriously the need to expend resources to sustain the psychometric features that are needed to answer the questions policy makers ask. New research on questions about vertical scaling and linking tests that use multiple assessment modes, for example, will be necessary to support the current goals for assessment reform. A related, but equally important, challenge will be that of displaying and reporting new kinds of data in ways that users can easily understand and use. The number of issues that are involved will ensure, Hambleton joked, that “no psychometrician will be left behind.” He acknowledged, however, that they “can’t just sit in their ivory towers and take 5 years to solve a problem.”
PERSPECTIVES: PAST AND FUTURE
The importance of comparisons across states is seldom questioned now, but Zwick pointed out that this was not always the case. When the National Assessment of Educational Progress (NAEP) was first developed in the 1960s, she reminded participants, it was not designed to support such comparisons (Zwick, 2009). At that time, the public was very wary of federal involvement in education. Federal enforcement of school desegregation and the passage of the Civil Rights Act of 1964 and the Elementary and Secondary Education Act in 1965 were
viewed as the limits of federal involvement in education that the public would accept, and state-by-state comparisons were viewed as potentially polarizing.
By the 1980s, however, the idea of promoting academic excellence through competition was becoming increasingly popular. President Ronald Reagan’s 1984 State of the Union address called for comparisons of academic achievement among states and schools, arguing that “without standards and competition there can be no champions, no records broken, no excellence…” (Zwick, 2009). In response, the U.S. Department of Education first developed “wall charts,” which displayed comparisons of the resources states committed to education and measures of performance, such as average SAT and ACT (American College Test) scores. These comparisons were widely viewed as unilluminating at best, since the college admissions tests reflected only the performance of college-bound students and were not aligned to curriculum and instruction in the states. By the late 1980s, NAEP had developed its Trial State Assessment, which compared the mathematics performance of 8th graders in 37 states. By the mid-1990s, the state assessments were fully operational. Later in the decade, the Clinton administration proposed a “voluntary national test” as a way of collecting comparative information on individual students, but it was never implemented.
The NAEP state assessments received an additional boost when the No Child Left Behind (NCLB) Act made states’ participation a condition for the receipt of Title I funding, Zwick noted. (Participation in NAEP is technically voluntary, and the program has in the past occasionally struggled to secure adequate representation for its matrix-sampling design.) NCLB also required states to meet various requirements for their own assessments and accountability provisions. Comparisons began to be made among states on the basis of the results of their own assessments, even though the assessments used different designs, proficiency standards, and so on. Studies of states’ results have shown that states’ definitions of proficiency vary significantly and tend to be lower than those used in NAEP (see National Research Council, 2008). This variation has contributed to the growing enthusiasm for common core standards.
The common core standards are likely to help states achieve economies of scale, as will their decisions to form consortia as part of the Race to the Top competition, Laurie Wise noted. With more money to spend on assessment, states should be able to produce tests that are better aligned to standards, contain richer and more innovative items, and are more reliable. Results are likely to be reported more quickly and provide more useful information for diagnosing individual students’ needs and guiding instruction. And states could more easily take advantage of a growing body of knowledge and experience as they collaborate to develop new approaches. Funding for research to improve cognitive analyses of test content, validity, and many other issues could feed the development of more innovative assessments that are also feasible and affordable. The NECAP example has suggested to some that when standards and assessments are shared, states may be somewhat shielded from public
reaction to disappointing results and correspondingly less likely to lower their performance standards to achieve nominally better results.
However, Hambleton made clear that there are significant challenges to making sure that state-to-state comparisons are valid, fair, and useful. One premise of the common core standards is that they will allow states to measure their students against the same expectations, but the rules permit states to supplement the core standards with up to 15 percent of additional content they value. To illustrate, Zwick suggested quantifying the standards and assuming that the 10 states in a consortium share 85 standards, and that each has its own 15 additional unique standards. In that case, of the total of 85 + 10(15) = 235 standards, only 85/235(100), or 36.2 percent, would be shared among the 10 states. Since no single assessment is likely to cover all of a state’s standards, the content shared by each of the 10 state assessments could be significantly lower.
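The arithmetic in that illustration is straightforward to reproduce:

```python
# Numbers from the illustration: a 10-state consortium sharing a common core.
shared = 85            # standards common to all states in the consortium
unique_per_state = 15  # each state's own additional standards
n_states = 10

total = shared + n_states * unique_per_state  # distinct standards overall
shared_fraction = shared / total

print(f"total distinct standards: {total}")            # prints 235
print(f"fraction shared by all states: {shared_fraction:.1%}")  # prints 36.2%
```

As the text notes, this is an upper bound on shared assessed content: since no single assessment covers all of a state’s standards, the overlap among the 10 state assessments could be lower still.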
Hambleton reiterated that every source of variation among state tests can impair the ability to produce comparable results and report them on the same scale. Moreover, he noted, curriculum and instruction will not be shared, even if states share standards. States are likely to find it very difficult to adhere to common procedures for every aspect of testing (preparation, administration, booklet or format design, accommodations, etc.). States also differ in terms of demographic characteristics, per-pupil expenditures, teacher training, class size, and many other features that are likely to affect test results.
Zwick had several suggestions for supporting meaningful comparisons across states:
Provide as much context as possible: Document differences across states in order to provide a context for comparisons.
Make the test instrument and testing policies as similar as possible: Collaborate on scoring and reporting methods and make sure that format, instructions, and protocols for each phase of assessment (including preparation) are uniform. Put unique state items in a separate section to minimize their influence on students’ performance of the common items. Develop an adequate pool of items in each area of interest.
Invest in research and professional development: Provide training to school administrators and teachers in the interpretation and uses of test scores.
Zwick also advocated fostering collaboration among psychometricians, educators, and policy makers, who, despite their differences in perspective, are likely to be pursuing the same goal—the improvement of teaching and learning in U.S. classrooms.
Some participants suggested that many of the problems with state com-
parisons could be solved if states opted to use common assessments, but others responded that the many sources of variation among states still remain. Several commented that other countries have had success with common curricula and common tests, and for some that seems like the best route. Others pointed out that NAEP already provides state comparisons, though without any particular alignment to the standards to which states are working and without scores for individual students.