Political Experiences and Considerations
A recurring theme at both workshops was the need to weigh tradeoffs. Many of the discussions highlighted that political considerations are a very important aspect of almost all decisions about assessment and accountability, and that they have played a critical role in the history of states’ efforts with innovative assessments. Veterans of three programs—the Maryland School Performance Assessment Program (MSPAP), the Kentucky Instructional Results Information System (KIRIS), and the Minnesota Comprehensive Assessment in Science—reflected on the political challenges of implementing these programs and the factors that have affected the outcomes for each.
MSPAP was a logical next step for a state that had been pursuing education reform since the 1970s, Steve Ferrara explained. Maryland was among the first to use school- and system-level data for accountability, and the state also developed the essay-based Maryland Writing Test in the 1980s. That test, one of the first to use human scoring, set the stage for MSPAP. MSPAP was controversial at first, particularly after the first administration yielded pass rates as low as 50 percent for 9th graders. However, the test had a significant influence on writing instruction and scores quickly rose.
The foundation for MSPAP was a Commission on School Performance established in 1987 by then-governor William Schaefer, which outlined ambitious goals for public education and recommended the establishment of content standards that would focus on higher order thinking skills. The commission also explicitly recommended that the state adopt an assessment that would not rely solely on multiple-choice items and that could provide annual report cards for schools.
Ferrara said that the governor’s leadership was critical in marshalling the support of business and political leaders in the state. Because the Maryland governor appoints members of the state board of education, who, in turn, appoint the superintendent of public instruction, the result was “a team working together on education reform.” This team was responding to shifting expectations for education nationally, as well as a sense that state and district policy makers and the public were demanding assurances of the benefits of their investment in education. Ferrara recalled that the initial implementation went fairly smoothly, owing in part to concerted efforts by the state superintendent and others to communicate clearly with the districts about the goals for the program and how it would work and to solicit their input.
Most school districts were enthusiastic supporters, but several challenges complicated the implementation. The standards focused on broad themes and conceptual understanding, and it was not easy for test developers to design tasks that would target those domains in a way that was feasible and reliable. The way the domains were described in the standards led to complaints that the assessment did not do enough to assess students’ factual knowledge. The schedule was also exceedingly rapid, with only 11 months between initial planning and the first administration. The assessment burden was great—for example, 9 hours over 5 days for 3rd graders. There were also major logistical challenges posed by the manipulatives needed for the large number of hands-on items.
The manipulatives not only presented logistical challenges, they also revealed a bigger challenge for teachers. For example, teachers who had not even been teaching science were asked to lead students through an assessment that included experiments. Teachers were also being asked more generally to change their instruction. The program involved teachers in every phase—task development, scoring, etc.—and Ferrara said that the teachers’ involvement was one of the most important ingredients in its early success.
As discussed in an earlier workshop session (see Chapter 3), there is evidence that teachers changed their practice in response to MSPAP (Koretz et al., 1996; Lane et al., 1999). Nevertheless, many people in the state began to oppose the program. Criticisms of the content and concern about the lack of individual student scores were the most prominent complaints, and the passage of the No Child Left Behind (NCLB) Act in 2002 made the latter concern urgent. The test was discontinued in that year.
For Ferrara, several key lessons can be learned from the history of MSPAP:
Involving stakeholders in every phase of the process was very valuable; doing so both improved the quality of the program and built political acceptance.
It paid to be ambitious technically, but it is not necessary to do everything at once. For example, an assessment program could have a significant influence on instruction without being exclusively composed of open-ended items. If one makes a big investment in revolutionary science assessment, there will be fewer resources and less political capital for other content areas.
It was short-sighted to invest the bulk of funds and energy in test development at the expense of ongoing professional development.
Brian Gong explained that the 1989 Kentucky Supreme Court decision that led to the development of KIRIS was intended to address both stark inequities in the state’s public education system and the state’s chronic performance at or near the bottom among the 50 states. The resulting Kentucky Education Reform Act (KERA) was passed in 1990, with broad bipartisan and public support. It was one of the first state education reform bills and included innovative features, such as a substantial tax increase to fund reform; a restructuring of education governance; the allocation of 10 paid professional development days per year for teachers; and a revamped standards, assessment, and accountability system.
KERA established accountability goals for schools and students (proficiency within 20 years) and called for an assessment system that would be standards based, would rely primarily on performance-based items, and would be contextualized in the same way high-quality classroom instruction is. The result, KIRIS, met these specifications, but the developers faced many challenges. Most critically, the court decision had actually identified the outcomes to be measured by the assessment. Those outcomes were the basis for the development of academic expectations and then core content for assessment, but the definitions of the constructs remained somewhat elusive. Educators complained that they were not sure what they were supposed to be doing in the classroom, Gong said, and in the end it was the assessment that defined the content that was valued, the degree of mastery that was expected, and the way students would demonstrate that mastery. But developing tasks to assess the standards in valid ways was difficult, and the assessment alone could not provide sufficient information to influence instruction.
There were other challenges and adaptations, Gong noted. It was difficult to collect data for standard setting using the complex, multifaceted evidence of students’ skills and knowledge that KIRIS was designed to elicit. Equating the results from year to year was also difficult: most of the tasks were very memorable and could not be repeated. Alternate strategies—such as equating KIRIS to assessments in other states, judgmental equating, and equating to multiple-choice items—presented both psychometric challenges and practical disadvantages. KIRIS also initially struggled to maintain standards of accuracy and reliability in scoring the constructed-response items and portfolios, and adaptations and improvements were made in response to problems.
Guidelines for the development of portfolios had to be strengthened in response to concerns about whether they truly reflected students’ work. With experience, test developers also gradually moved from holistic scoring of portfolios to analytic scoring in order to obtain more usable information from the results. The state also faced logistical challenges, for example, with field testing and providing results in time for accountability requirements.
Nontechnical challenges emerged as well. The state was compelled to significantly reduce its assessment staff and to rely increasingly on consultants. School accountability quickly became unpopular, Gong explained, and many people began to complain that the aspirations were too high and to question the assertion that all students could learn to high levels. Philosophical objections to the learning outcomes assessed by KIRIS also emerged, with some people arguing that many of them intruded on parents’ prerogatives and invaded students’ privacy. The so-called math and reading “wars”—over the relative emphasis that should be given to basic skills and fluency as opposed to broader cognitive objectives—fueled opposition to KIRIS. Finally, there were changes in the state’s political leadership that decreased support for the program, and it did not survive an increasingly contentious political debate; KIRIS was ended in 1998.
For Gong there are several key lessons from the Kentucky experience:
Clear definitions of the constructs to be measured and the purposes and uses of the assessment are essential. No assessment can compensate for poorly defined learning targets.
The design of the assessment should be done in tandem with the design of the intended uses of the data, such as accountability, so that they can be both coherent and efficient.
The people who are proposing technical evaluations and those who will be the subject of them should work together in advance to consider both intended and unintended consequences, particularly in a politically charged context.
Anyone now considering innovative assessments for large-scale use should have a much clearer idea of how to meet technical and operational challenges than did the pioneering developers of KIRIS in the 1990s.
Current psychometric models, which support traditional forms of testing, are inconsistent with new views of both content and cognition and should be applied only sparingly to innovative assessments. The field should invest in the development of improved models and criteria (see, e.g., Shepard, 1993; Mislevy, 1998).
Dirk Mattson explained that Minnesota’s Comprehensive Assessment Series II (MCA-II) was developed in response to NCLB, so the state was able to benefit from the experiences of states that had already initiated innovative assessments. Some existing assessments in some subjects could be adapted, but the MCA-II in science, implemented in 2008, presented an opportunity to do something new. State assessment staff were given unusual latitude to experiment, Mattson said, because science had not been included in the NCLB accountability requirements.1
The result is a scenario-based assessment delivered on computers. Students are presented with realistic representations of classroom experiments, as well as phenomena that can be observed. Items—which may be multiple choice, short or long constructed response, or figural (i.e., students interact with graphics in some way)—are embedded in the scenario. This structure provides students with an opportunity to engage in science at a higher cognitive level than would be possible with stand-alone items.
Mattson emphasized that the design of the science assessment grew out of the extensive involvement of teachers from the earliest stages. Teachers rejected the idea of an exclusively multiple-choice test, but a statewide performance assessment was also not considered because a previous effort had ended in political controversy. The obvious solution was a computer-delivered assessment, despite concerns about the availability of the necessary technology in schools. The developers had the luxury of a relatively generous schedule: conceptual design began in 2005, and the first operational assessment was given in 2008. This schedule allowed time for some pilot testing at the district level before field testing began in 2007.
A few complications have arisen. First, Minnesota statute requires regular revision of education standards, so the science standards were actually being revised before the prior ones had been assessed, but assessment revision was built into the process. Second, in 2009 the state legislature, facing severe budget constraints, voted to make the expenditure of state funds on human scoring of assessments illegal.2 More recently, the state has contemplated signing on to the “common core” standards and is monitoring other changes that may become necessary as a result of the Race to the Top initiative or reauthorization of the federal Elementary and Secondary Education Act.
1 The technical manual and other information about the program are available at http://education.state.mn.us/MDE/Accountability_Programs/Assessment_and_Testing/Assessments/MCA/TechReports/index.html [accessed April 2010].
2 State money could still be used for machine scoring of assessments. Because Minnesota has no board of education, the legislature is responsible for overseeing the operations of the department of education; federal dollars were used to support human scoring.
In reflecting on the process so far, Mattson noted that despite the examples of other innovative assessment programs, many MCA elements were new and had to be developed from scratch. These elements included operations, such as means of conveying what was needed to testing contractors, estimating costs, and supporting research and development efforts. The new elements also included parts of the fundamental design, and Mattson noted that often the content specialists were far ahead of the software designers in conceptualizing what could be done. Technical challenges—from designing a test security protocol to preparing schools to load approximately 300-475 megabytes of test content onto their servers—required both flexibility and patience. A Statewide Assessment Technology Work Group helped identify and address many of the technical challenges, and Mattson pointed to this group as a key support.
For Mattson, it is important that the developers were not deterred by the fact that there were no paths to follow in much of the development process. The success of the assessment thus far, in his view, has hinged on the fact that the state was able to think ambitiously. The leaders had enthusiastic support from teachers, as well as grant funding and other community support, which allowed them to sustain their focus on the primary goal of developing an assessment that would target the skills and knowledge educators believed were most important. The flexibility that has also been a feature of the MCA since the beginning—the state assessment staff’s commitment to working with and learning from all of the constituencies concerned with the results—should allow them to successfully adapt to future challenges, Mattson said.
The three state examples, suggested Lorraine McDonnell, highlight the familiar tension between the “missionaries” who play the important role of seeking ways to improve the status quo and those who raise the sometimes troublesome questions about whether a proposed solution addresses the right problem, whether the expected benefits will outweigh the costs, and whether the innovation can be feasibly implemented. She distilled several policy lessons from the presentations.
First, for an assessment to be successful, it is clear that the testing technology has to be well matched to the policy goals the assessment is intended to serve. Accurate educational measurement may be a technical challenge, but assessment policy cannot be clearly understood independent of its political function. Whether the function is to serve as a student- or school-level accountability device, to support comparisons across schools or jurisdictions, or to influence the content and mode of classroom instruction, what is most important is to ensure that the goals are explicitly articulated and agreed on. McDonnell observed that in many states test developers and politicians had not viewed the function of the state assessment in the same way. As a result, test developers could not meet the policy makers’ expectations, and support for the assessment weakened. When policy makers expect the test to serve multiple purposes, the problem is most acute—and policy makers may not agree among themselves about the function of an assessment.
It is also very important that the testing system is integrated into the broader instructional system, McDonnell said. This idea was an element of the argument for systemic reform that was first proposed during the 1990s (Smith and O’Day, 1991) and has been prominent in reform rhetoric, but it has not played a major role in the current common standards movement. She pointed out that although the conditions that truly support effective teaching and learning should be central, they “appear once again to have been relegated to an afterthought.” Support is needed to strengthen instructional programs, as well as assessment programs. Like many workshop participants, McDonnell highlighted the importance of a comprehensive and coherent system of standards, instruction, and assessment.
States and the federal government have tended, she suggested, to select instruments that were easy to deploy—such as tests—and to underinvest in such measures as curricula and professional development that could help to build schools’ capability to improve education. Yet unless teachers are provided with substantial opportunities to learn about the deeper curricular implications of innovative assessments and to reshape their instruction in light of that knowledge, the result of any high-stakes assessment is likely to be more superficial test preparation, which McDonnell called “rubric-driven instruction.” This conflict between policy pressure for ambitious improvements in achievement and the weak capability of schools and systems to respond was an enduring dilemma during the first wave of innovation, in the 1990s, and McDonnell suggested that it has not been resolved.
Yet another lesson, McDonnell said, is that policy makers and test designers need to understand the likely tradeoffs associated with different types of assessments and the need to decide which goals they want to advance and which ones they are willing to forgo or minimize. The most evident tension is between using tests as accountability measures and using them as a way to guide and improve instruction. As earlier workshop discussions showed, McDonnell said, these two goals are not necessarily mutually exclusive, but pursuing both with a single instrument is likely to make it difficult to obtain high-quality results.
This lesson relates to another, which may be obvious, McDonnell suggested, but appears to be easily overlooked: if a process is to be successful, every constituency that will be affected by an assessment must have ample opportunity to participate throughout its development. The differing perspectives of psychometricians and curriculum developers, for example, need to be reconciled if an assessment system is to be successful, but parents, teachers, and other interests need to be involved as well. If developers fail to fully understand and take into account public attitudes, they may encounter unexpected opposition to their plans. Conveying the rationale for an assessment approach to policy makers and the public, as well as the expected benefits and costs, may require careful planning and concerted effort.
It is often at the district level that the critical communications take place, and too often, McDonnell said, district leaders have not been involved in or prepared for this important aspect of the process. The benefit of clear and thorough communication is that stakeholders are more likely to continue to support a program through technical difficulties if they have a clear understanding of the overall goals.
Finally, McDonnell stressed, people need to remember that the implementation of innovative assessments takes time. It is very important to build in an adequate development cycle that allows for gradual implementation and for adaptation to potential problems. In several of the experiences discussed at the workshop, rushed implementation led to technical problems, undue stress on teachers and students, and a focus on testing formats at the expense of clear connections to curriculum. In several states, testing experts acquiesced to political pressure to move quickly in a direction that the testing technology could not sustain. Programs that have implemented innovative features gradually, without dismantling the existing system, have had more flexibility to adapt and learn from experience.
These policy lessons, as well as a growing base of technical advances, can be very valuable for today’s “missionaries,” McDonnell said. However, although past experience provides lessons, it may also have left a legacy of skepticism among those who had to deal with what were in some cases very disruptive experiences. Fiscal constraints are also likely to be a problem for some time to come, and it is not clear that states will be able to sustain new forms of assessment that may be more expensive than their predecessors after initial seed funding is exhausted. She also noted that the common standards movement and the Race to the Top Initiative have not yet become the focus of significant public attention, and there is no way to predict whether they will become the objects of ideological controversies, as have past education reforms. None of these are reasons not to experiment with new forms of assessment, McDonnell concluded, but “they are reasons for going about the enterprise in a technically more sophisticated way than was done in the past and to do it with greater political sensitivity and skill.”