Best Practices for State Assessment Systems, Part I: Summary of a Workshop

1 Introduction

Educators and policy makers in the United States have relied on tests to measure educational progress for more than 150 years. During the 20th century, technical advances, such as machines for automatic scoring and computer-based scoring and reporting, have supported states in a growing reliance on standardized tests for statewide accountability.

The history of state assessments has been eventful. Education officials have developed their own assessments, have purchased ready-made assessments produced by private companies and nonprofit organizations, and have collaborated to share the task of test development. They have tried minimum competency testing; portfolios; multiple-choice, brief and extended constructed-response items; and more. They have responded to calls for tests to answer many kinds of questions about public education and have encountered controversies related to literacy, student privacy, international comparisons, accountability, and even property values. State assessment data have been cited as evidence for claims about many achievements of public education, and the tests have also been blamed for significant failings.

Most recently, the implementation of the No Child Left Behind Act of 2001 has had major effects on public education: some of those effects were intended, and some were not; some have been positive, and some have been problematic. States are now considering whether to adopt the “common core” academic standards that were developed under the leadership of the National Governors Association and the Council of Chief State School Officers, and they are competing for federal dollars from the Department of Education’s Race to the
Top initiative.1 Both of these efforts are intended to help make educational standards clearer and more concise and to set higher standards for students. As standards come under new scrutiny, so, too, do the assessments that measure their results—to be eligible for Race to the Top funds, a state must adopt internationally benchmarked standards and also “demonstrate a commitment to improving the quality of its assessments” (U.S. Department of Education, 2009).

1 The Race to the Top initiative is a pool of federal money set aside for discretionary grants. States are competing to receive the grants on the basis of their success in four areas: standards and assessments, data systems, improving the teacher work force, and improving the lowest-achieving schools (see http://www2.ed.gov/programs/racetothetop/index.html [March 2010]).

The goal for this workshop, the first of two, was to collect information and perspectives on assessment that could be of use to state officials and others as they review current assessment practices and consider improvements, as Diana Pullin indicated in her opening remarks. In organizing the workshop, the Committee on Best Practices for State Assessment Systems identified four questions for consideration:

1. How do the different existing tests that have been or could be used to make comparisons across states—such as the National Assessment of Educational Progress (NAEP), the advanced placement (AP) tests, the SAT Reasoning Test (SAT, formerly the Scholastic Aptitude Test and the Scholastic Assessment Test), ACT (formerly American College Testing), and the Program for International Student Assessment (PISA)—compare to each other and to the existing state tests with their associated content and performance standards? What implications do the similarities and differences across these tests have for the state comparisons that they can be used to make?

2. How could current procedures for developing content and performance standards be changed to allow benchmarking to measures and predictions of college and career readiness and also promote the development of a small set of clear standards? What options are there for constructing tests that measure readiness with respect to academic skills? Are there options for assessing “21st century” or “soft” skills that could provide a more robust assessment of readiness than a focus on academic skills alone?

3. What does research suggest about best practices in running a state assessment system and using the assessment results from that system to improve instruction? How does this compare to current state capacity and practices? How might assessment in the context of revised standards be designed to move state practices to more closely resemble best practices?

4. How could assessments that are constructed for revised standards be used for accountability? Are there important differences in the use of assessments for accountability if those assessments are based on standards that are (1) shared in common across states, (2) designed to be fewer and clearer, or (3) focused on higher levels of performance?

For this workshop, held in December 2009, the committee focused on lessons to be drawn both from the current status of assessment and accountability programs and from the results of past efforts to innovate, and this report describes the presentations and discussion. Chapter 1 describes current approaches to assessment in the United States and some of the recent developments that have shaped them. Chapter 2 explores possibilities for changing the status quo, by changing both standards and assessments with the goal of improving instruction and learning, as well as some of the technical challenges of implementing innovative approaches on a large scale. Chapter 3 examines practical lessons from past and current efforts to implement innovative assessment approaches, and Chapter 4 focuses on the political considerations that have affected innovative assessment programs. Chapter 5 explores the possibilities presented by the current policy environment. The final chapter discusses the research needed to support states’ efforts to make optimal use of assessments and to pursue innovation in assessment design and implementation. The report of the second workshop, held in April 2010, is expected to be published in the summer of 2010.

CONTEXT

Standards-based accountability is a widely accepted framework for public education. Every state now has education standards, although the current array of accountability approaches is characterized by significant variation, as Diana Pullin noted.
A previous set of workshops illuminated the extent to which content and performance standards vary in what they formally expect students to know and be able to do at each grade and the extent to which the implementation of state and district standards varies across schools and even classrooms (National Research Council, 2008). By design, assessments play a key role in standards-based accountability, but because standards are not working exactly as they were intended to, Pullin suggested, “assessments can be more powerful in driving what happens in schools than standards themselves.” Recent research has indicated that the influence of assessments on curriculum and instruction has increased since 2002, when the No Child Left Behind (NCLB) Act was passed, and that tests themselves have changed in significant ways (see, e.g., Jennings and Rentner, 2006; McMurrer, 2007; Lai and Waltman, 2008;
Sunderman, 2008). Several presenters provided perspectives on the history and current status of assessment and accountability systems.

The idea that assessments should be used to evaluate not just individual students’ progress, but also the quality of instruction and the performance of educators more generally, is one with longstanding roots, Joan Herman noted. Edward Thorndike, who published pioneering books on educational measurement in the first decades of the 20th century, viewed his work as useful in part because it would provide principals and teachers with a tool for improving student learning. Ralph Tyler, known for innovative work in educational evaluation, in the 1940s posed the idea that objectives ought to drive curriculum and instruction and that new kinds of assessments (beyond paper-and-pencil tests) were needed to transform learning and the nature of educational programs. Other contributions to thinking about evaluation include Benjamin Bloom’s 1956 taxonomy of educational objectives, the development of criterion-referenced testing in the 1950s, mastery learning in the 1960s and 1970s, minimum competency testing in the 1970s and 1980s, and performance assessment in the 1990s. All of these, Herman suggested, have been good ideas, but they have not had the effects that might have been hoped for.

Recent Changes in Tests

Most recently, NCLB has had a very clear impact on many aspects of the system, as Scott Marion described. Prior to 2002, for example, states were required to test at one grade each in the elementary, middle, and high school years. NCLB required testing in grades 3 through 8 as well as in high school. Marion argued that this increased testing resulted in some improvements to state standards.
The new requirement compelled states to define coherent content standards by grade level, rather than by grade span, and to articulate more precisely what the performance standards should be for each grade. Testing at every grade has also opened up new possibilities for measuring student achievement over time, such as value-added modeling.2

2 “Value-added modeling” is statistical analysis in which student data collected over time are used to measure the effects of teachers or schools on student learning.

The design of states’ test forms has also been affected, most notably in the almost complete elimination of matrix sampling designs because they do not provide data on individual students. For a long time test designers have used matrix sampling to produce data about the learning of large groups of students (such as all children in a single grade) across a broad domain. With this sampling approach, tests are designed so that no one student answers every question (which would require too much testing time for complete coverage), but there is sufficient overlap in tested content from student to student to support inferences about how well the group as a whole has learned each aspect of the domain tested. One advantage of matrix sampling is that each student faces fewer test items—because student-level scores are not produced—and thus there is time to include more complex item types. This approach makes it possible to better assess a broad academic domain because the inclusion of more complex item types is likely to yield more generalizable inferences about students’ knowledge and skills.

The types of test questions commonly used have also changed, Marion suggested, with developers relying far less on complex performance assessments and more on multiple-choice items. He cited evidence from the U.S. Government Accountability Office (2009) that the balance between multiple-choice and open-ended items (a category that includes short items) has shifted significantly in favor of the multiple-choice format as states have responded to the NCLB requirements. Many states still use items that could be described as open ended, but use of this term disguises important differences between a short constructed-response item worth a few points and an extended, complex performance task that contributes significantly to the overall score. To illustrate the difference, he showed sample items—a four-page task from a 1996 Connecticut test that called for group collaboration and included 16 pages of source materials, contrasted with mathematics items from a 2009 Massachusetts test that asked students to measure an angle and record their result or to construct a math problem and solve it. Marion was not suggesting that the shorter items—or others like them—are necessarily of inferior quality. Nevertheless, he views this shift as reflecting an increased focus on breadth at the expense of depth.
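For readers who want to see the mechanics of matrix sampling, the idea can be sketched in a few lines of code. The sketch below is purely illustrative and is not drawn from the workshop: the function name, parameters, and assembly strategy are invented for this example, and operational test assembly would also balance content coverage and item difficulty across forms.

```python
import random

def build_matrix_forms(item_pool, num_forms, items_per_form):
    """Assign a pool of test items to multiple forms so that every item
    appears on at least one form (full domain coverage for the group),
    no single form contains the whole pool, and forms overlap enough to
    be linked. Illustrative sketch only."""
    if num_forms * items_per_form < len(item_pool):
        raise ValueError("too few form slots to cover the item pool")
    if items_per_form >= len(item_pool):
        raise ValueError("each form must be shorter than the item pool")
    shuffled = random.sample(item_pool, len(item_pool))
    forms = [[] for _ in range(num_forms)]
    # Round-robin pass: guarantees every item lands on some form.
    for i, item in enumerate(shuffled):
        forms[i % num_forms].append(item)
    # Top-up pass: fill each form to the target length with items from
    # the rest of the pool, creating the student-to-student overlap that
    # supports group-level inferences about the whole domain.
    for form in forms:
        extras = [item for item in item_pool if item not in form]
        form.extend(random.sample(extras, items_per_form - len(form)))
    return forms
```

For example, distributing a 30-item pool across 5 forms of 10 items each means any one student answers only a third of the pool, while the group as a whole covers all 30 items.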
The nature of post-NCLB state assessments signals that the most important goal for teachers is to ensure that students have an opportunity to learn a broad array of content and skills, even if the coverage is superficial. It is important to ask, if this is the case, whether the types of processes students use to solve multiple-choice items are truly the basis for the 21st century “college- and work-ready” skills that policy makers stress. For example, he pointed out, the 1996 extended item begins by asking the students to break into small groups and discuss their approach to the task—a challenge much closer to what is expected in many kinds of work than those that are posed by most test items. States have also changed their approaches to high school testing, though the situation is still in flux. There has been a modest increase in the number of states using end-of-course examinations (given whenever students complete a particular course)—as opposed to survey tests that all students take at a certain point (such as at the end of particular grades). A few states have begun using college entrance tests as part of their graduation requirements.
Interim Assessments

Another development has been a marked increase in the use of interim assessments, particularly at the district level, as Margaret Goertz also pointed out. These are assessments that measure students’ knowledge of the same broad curricular goals that are measured in annual large-scale assessments, but they are given more frequently and are designed to give teachers more data on student performance to use for instructional planning. Interim assessments are often explicitly designed to mimic the format and coverage of state tests and may be used not only to guide instruction, but also to predict student performance on state assessments, to provide data on a program or approach, or to provide diagnostic information about a particular student. Researchers stress the distinction between interim assessments and formative assessments, however, because the latter are typically embedded in instructional activities and may not even be recognizable as assessments by students (Perie, Marion, and Gong, 2007).3 Districts have vastly increased their use of interim assessments in the past 10 years, Goertz noted (see Stecher et al., 2008), and draft guidelines for the Race to the Top Assessment Program encourage school districts to develop formative or interim assessments as part of comprehensive state assessment systems. However, there have been very few studies of how interim assessments are actually used by individual teachers in classrooms, by principals, and by districts, or of their impact on student achievement. Moreover, Goertz pointed out, many of the studies that are cited in their favor were actually focused on formative assessments. Studies are also needed to provide technical and other validity evidence to support inferences made from interim assessments.
In surveys, teachers have reported that interim test results helped them monitor student progress and identify skill gaps for their students, and led them to modify curriculum and instruction (Clune and White, 2008; Stecher et al., 2008; Christman et al., 2009). Goertz also noted a study of how teachers used curriculum-based interim assessments in elementary mathematics in two districts, which showed that teachers did use the data to identify weak areas or struggling students and to make instructional decisions (Goertz, Olah, and Riggan, 2009). The study also showed that teachers varied in their capacity to interpret the interim assessment data and to use it to modify their teaching. However, the researchers found that few of the items in the interim assessments provided information that teachers could readily use, and few teachers actually changed their practice even as they retaught material that was flagged by the assessment results. For example, many teachers focused on procedural rather than conceptual sources of error.

3 Formative assessments are those designed primarily to provide information that students can use to understand the progress of their learning and that teachers can use to identify areas in which students need additional work. Formative assessments are commonly contrasted with summative assessments, those designed to measure the knowledge and skills students have attained by a particular time, often after the instruction is complete, for the purpose of reporting on progress. Interim assessments, which may be formative or summative, are given at intervals to monitor student progress.

Marion noted that the limited research available provides little guidance for developing specifications for interim assessments or for support and training that would help teachers use them to improve student learning. There is tremendous variability in the assessments used in this way, and there is essentially no oversight of their quality, he noted. He suggested that interim assessments provide fast results and seem to offer jurisdictions eager to respond to the accountability imperative an easy way to raise test scores.

Multiple Purposes for Assessment

Another significant change, Goertz pointed out, is that as demands on state-level assessments have increased in a time of tight assessment budgets, tests are increasingly used for a number of different purposes. Table 1-1 organizes some common testing purposes by goal and by the focus of the information collected. In practice these uses may overlap, but the table illustrates the complexity of the roles that assessments play.
TABLE 1-1 Uses of Assessment

Diagnosis
  Student: Instructional decisions, placement, allocation of educational services
  Teacher: Professional development and support
  School: Resource allocation, technical assistance

Inform Teaching and Learning
  Teacher: Focus, align, redirect content; instructional strategies
  School: Instructional focus; align curriculum to skills or content; school improvement and planning

Evaluation
  Student: Certification of individual achievement
  Teacher: Teacher preparation programs, teacher pay
  School: Program evaluation

Public Reporting
  Student: Transcripts
  School: Parent or community action

External Accountability
  Student: Promotion, high school graduation
  Teacher: Renewal, tenure, pay
  School: Sanctions and rewards

SOURCE: Goertz (2009).
To clarify the implications of putting a single test to multiple uses, Goertz highlighted the design characteristics that are most important for two of these uses: informing instruction and learning, and external accountability. For informing instruction and learning, a test should be designed to provide teachers with information about student learning on an ongoing basis, which they can easily interpret and use to improve their instruction. For this purpose, an assessment needs to provide information that is directly relevant to classroom instruction and is available soon after the assessment is given. Ideally, this kind of assessment would provide continuous information—if it is embedded in instruction, it can provide continuous feedback. For this purpose, statistical reliability is not as important as relevance and timeliness.

When test data are to be used for external accountability, the assumption is that information about performance will motivate educators to teach well and students to perform to high standards. Therefore, incentives and sanctions based on test results are often used to stimulate action: this raises the stakes for the tests as well as for students and educators. So, when accountability is the goal, several test characteristics are of critical importance: alignment of test questions to standards; standardization of the content, test administration, and scoring to support fair comparisons; and the fairness, validity, and reliability of the test itself.

The tension between these two purposes is at the heart of many of the problems that states have faced with their assessment programs, and it is a key challenge for policy makers to consider as they weigh improvements to accountability systems. The growing tendency to use assessments for multiple purposes may be explained in part by the loss of assessment staff in a time of tight education budgets.
Marion reported that most states have seen an approximately three-fold increase in testing requirements without a corresponding increase in personnel (U.S. General Accounting Office, 2003; Toch, 2006). As a result, many states have moved from internal test design and development to outside vendors, and, he suggested, remaining staff have less time to work with vendors and to think about innovative approaches to testing.

A number of other factors help explain recent changes in the system, Marion suggested. NCLB required rapid results, and the “adequate yearly progress” formula put a premium on a “head-counting” methodology (that is, measuring how many students meet a particular benchmark by a particular time, rather than considering broader questions about how well students are learning). However, the law did not, in his view, provide adequate financial support for ongoing operational costs. He also said that there has been insufficient oversight of technical quality, with the result that the validity of particular assessments for particular purposes has received inadequate attention. The fact that results for multiple-choice and open-ended items are well correlated has been mistaken for evidence that they are interchangeable. Marion said that an era of tight funding has made it easy for policy makers to conclude that open-ended items and other innovative approaches are not worth their higher cost, without necessarily understanding that such items make it possible to assess skills and content that cannot be assessed with multiple-choice items.

THE CURRENT SYSTEM

This outline of some of the important recent changes in assessment systems provided the foundation for a discussion of strengths and weaknesses of the current system and targets for improvement. Goertz and Marion presented very similar messages, which were endorsed by discussant Joan Herman.

Strengths

Attention to Traditionally Underserved Student Populations

Including all students in assessments, to ensure that schools, districts, and states are accountable for their results with every group, was a principal goal of NCLB. As a result, although much work still needs to be done, assessment experts have made important progress in addressing the psychometric challenges of testing students with disabilities and English-language learners. Progress has been made in thinking about the validity of assessments for both of these groups—which are themselves very heterogeneous. Test designers have paid more attention to specifying the constructs to be measured explicitly and to removing construct-irrelevant variance from test items (for example, by reducing the reading burden in tests of mathematics so that the measure of students’ mathematics skills will not be distorted by reading disabilities). As policy makers and psychometricians have worked to strike an appropriate balance between standardization and technical quality, more assessments are available to measure academic skills—not just functional skills—for special populations.
Improved understanding of patterns of language acquisition and of the relationship between language skills and academic proficiency has supported the development of better tools for assessing English-language learners across multiple domains.

Increased Availability of Assessment Data

The premise of NCLB is that if states and districts had more data to document their students’ mastery of educational objectives, they would use that information to improve curricula and instructional planning. States and districts have indeed demonstrated a growing sophistication in the analysis and use of data. Improved technology has made it easier for educators and policy makers to have access to data and to use it, and more people are using it. However, the capacity to draw sound inferences from the copious data to which most policy makers and educators now have access
depends on their training. As discussed below, this capacity has in many cases lagged behind the technology for collecting data.

Improved Reporting

The combination of stricter reporting requirements under NCLB and improved technology has led states and districts to pay more attention to their reporting systems since 2002. Some have made marked improvements in presenting data in ways that are easy for users to understand and use to make effective decisions.4

Weaknesses

Greater Reliance on Multiple-Choice Tests

In comparison with the assessments of the 1990s, today’s state assessments are less likely to measure complex learning. Multiple-choice and short constructed-response items that are machine-scorable predominate. Though valuable, these item types assess only a limited portion of the knowledge and skills in current standards.

More Focus on Tests than on Standards

Particularly in low-performing schools, test-based accountability has focused attention on standards, especially the subset of academic standards and content domain that is covered by the tests. Although this focus has had some positive effects, it has also had negative ones. States and districts have narrowed their curricula, placing the highest priority on tested subjects and on the content in those subjects that is covered on tests. Research indicates that the result has been an emphasis on lower-level knowledge and skills and very thin alignment with the standards. For example, Porter, Polikoff, and Smithson (2009) found very low to moderate alignment between state assessments and standards—meaning that large proportions of content standards are not covered on the assessments (see also Fuller et al., 2006; Ho, 2008).
More Narrow Test Preparation

Because of the considerable pressure to make sure students meet minimum requirements on state assessments, many observers have noted an increased focus on so-called “bubble kids,” those who are scoring just below cutoff points. A focus on drilling these children to get them above the passing level may often come at the expense of other kinds of instruction that may be more valuable in the long run. Discussants suggested that this test-preparation focus is particularly prevalent in schools serving poor and traditionally low-performing students and that the emerging result is a dual curriculum, in which already underserved children are not benefiting from the rigorous curriculum that is the ostensible goal of accountability (see, e.g., McMurrer, 2007). It was noted that this approach often results in less attention to the needs of both high- and low-performing students.

4 Marion cited the Colorado Department of Education’s website as a good example of innovative data reporting (http://www.schoolview.org/ [January 2010]).

Insufficient Rigor

Current assessments are regarded as insufficiently rigorous. Analyses of their cognitive demand suggest that they focus on the lower levels of cognitive demand defined in standards and that they are less difficult than, for example, NAEP (see, e.g., Ho, 2008; National Research Council, 2008; Cronin et al., 2009).

Challenges

Many of the challenges that the presenters and discussant identified as most pressing mirrored the strengths and weaknesses. They identified opportunities not only to address the weaknesses, but also to build on many of the strengths in the current system.

Purposes of Testing

For Goertz, any plans for improving assessment and accountability systems should begin with clear thinking about several questions: “What do we want to test and for what purpose? What kinds of information do we want to generate and for whom? What is the role of a state test in a comprehensive assessment system? What supports will educators need?”

Assessments, whatever their nature, communicate goals to students and teachers. They signal what is valued, in terms of the content of curriculum and instruction and in terms of types of learning. All who play a part in the system listen to the signal that testing sends, and they respond accordingly. Goertz suggested that current approaches may be coherent, in a sense, but many are sending the wrong signals to students and teachers.

The System as a Whole

A theme of the discussion was that if tests are bearing too much weight in the current system, it is logical to ask whether every element of an accountability system must be represented by a test score.
Measures of students’ opportunity to learn and student engagement, as well as descriptive measures of the quality of the teaching and learning, may provide a valuable counterbalance to the influence of multiple-choice testing. It is important to balance the need for external accountability against the other important goals for education. Thus, in different ways, Marion, Goertz, and Herman each suggested that it is important to evaluate the validity of the entire system, seeking evidence that each element of the system serves its intended purpose. The goal for an accountability system should be to provide appropriate evidence for all intended users and to ensure that those users have the capacity and resources to use the information. The key question, as Herman phrased it, is: “Can we engineer the process well enough that we minimize the negative and maximize
the positive consequences?” Clearly, she stressed, it does not make sense to rely on an annual assessment to provide all the right data for every user—or to measure the breadth and depth of the standards. Looking at the system as a whole will entail not only consideration of intended and unintended consequences, but also a clear focus on the capacity of each element of the system to function as intended. But Goertz pointed out that a more innovative assessment system—one that measures the most important instructional goals—cannot by itself bring about the changes that are desired. Support for the types of curriculum and instruction that research indicates will foster learning, as well as for such critical elements as teacher quality, is also needed.

Staff Capacity to Interpret and Act

Many at the workshop spoke about the importance of developing a “culture of data use.” Even as much more data has become available, insufficient attention has been paid to developing teachers’ and administrators’ capacity to interpret it accurately and use it to support their decision making. Ideally, a user-friendly information management system would focus teachers’ attention on the content of assessment results so they can easily make correct inferences (e.g., diagnose student errors) and connect the evidence to specific instructional approaches and strategies. Teachers would have both the time to reteach content and skills students have not mastered and the knowledge of effective strategies to target the gaps.

System Capacity

Looking more broadly at the capacity issue, Marion noted that there has been a “three- or four-fold increase in the number of tests that are given” without any corresponding increase in assessment personnel. Yet performance assessments and other kinds of innovative assessments require more person-hours at most stages of the process than do multiple-choice assessments.
These issues were discussed in the next session of the workshop, described in Chapter 2.

Reporting of Results

Although there have been improvements in reporting, it has generally received the least attention of any aspect of the assessment system. NCLB has specific reporting requirements, and many jurisdictions have better data systems and better technology as a result. Nevertheless, even the best reports are still constrained by the quality of the data and the capacity of the users to turn these data into information, decisions, and actions.