ASSESSMENT IN SUPPORT OF INSTRUCTION AND LEARNING

The Criteria in Context

The steering committee began the process of planning the workshop by considering the characteristics of an assessment system in which classroom and large-scale assessments work together to support learning. It agreed with earlier committees that, to be effective, assessment systems must do more than provide valid data. They must also be designed so that the information produced can be used to improve both the educational system and the teaching and learning process. In such a system a single assessment does not function in isolation but rather within a coordinated system in which the state, the district, the school, and the classroom each play a role.

The specific criteria the committee identified are listed and briefly described here. They are elaborated further in the discussion of workshop presentations later in this report. The steering committee made no attempt to evaluate the relative importance of each of the criteria, nor did it use the criteria to evaluate programs. Rather, the intent was to use the experiences of workshop presenters as a vehicle for thinking about the ways in which each of the criteria can contribute to the establishment of a coherent system. The following are the ideal characteristics of assessment systems that the committee identified:¹

· Comprehensive: A comprehensive system is one in which a range of measurement approaches are used to provide a variety of evidence to support educational decision-making. A well-designed system includes both formative (to support students' ongoing learning and help teachers make instructional decisions) and summative (to evaluate students' level of achievement at the completion of a phase of learning) assessments that move students toward a manageable and clearly articulated set of outcomes. Measures might also include those that assess the quality of instruction, and provide evidence that improvements in tested achievement represent actual gains in learning as opposed to improved test-taking skills, for example.

· Coherent: A coherent system is one in which the conceptual base or models of learning underlying the assessments used at all levels (large-scale or classroom) are compatible. Furthermore, the content, processes, and skills measured by different assessments across the system are compatible. For a system to be coherent, alignment is needed among standards, curriculum, instruction, and professional development so that each element contributes to a common set of learning goals.

· Continuous: In a coordinated system, assessments measure student progress over time (for example, over a school year, over several grades, or over a student's school career). Assessments are ongoing and seamlessly integrated into instruction.

· Integrated: An assessment system is integrated if it is carefully designed to fit into a larger, coherent educational system that provides resources and professional development to ensure that teachers have the capacity to do what is expected of them based on the standards in place.

· Includes High-Quality Assessments: All of the assessments included in the system should be of high quality, by which is meant, first, that they must adhere to relevant professional standards. To further illustrate what high quality means, the committee has identified a set of specific characteristics that large-scale and classroom assessments can exhibit, which are summarized in Boxes 2-1 and 2-2.

¹ The first three criteria are adapted from Knowing What Students Know (NRC, 2001c); the last two were distilled from other reports listed in the Introduction.

These criteria address the educational assessment environment as a whole, and certainly it is not possible to talk about the relative effectiveness of large-scale or classroom systems without considering the contexts in which they are designed to operate. Nevertheless, there are many choices of approach for assessing students, and the workshop began with an overview of current thinking about both large-scale and classroom assessments. The discussion was grounded in professional thinking on the purposes that each kind of assessment serves best, and offered an overview of their potential, as well as their limitations.

THE IDEAL

While no current assessment programs have been identified that satisfy all of the attributes described above, some can be seen as making significant progress in implementing specific features of a high-quality program. To explore what it might be like to teach and learn in a coherent and balanced assessment environment, where assessments, curriculum, instruction, and professional development are fully aligned with standards, the committee invited Gail Burrill, a teacher and teacher educator at Michigan State University and former president of the National Council of Teachers of Mathematics, to inaugurate the workshop by simulating such a situation for the workshop audience.

Describing an array of embedded, formative assessment techniques, Burrill illustrated for the workshop participants how assessment can help to shape learning and direct instruction. Examples from Japan, the Netherlands, and China helped to illustrate the ways in which assessments can circumscribe both what is taught and how it is learned. Burrill used these international examples to make the point that educators in the United States are often leery of expecting students to transfer their knowledge to new contexts. In the examples she discussed, assessments were more challenging in that they called on students to use cognitive processes on unfamiliar material, but she argued that U.S. students could handle this kind of challenge.

To be sure that there is correspondence between what is taught and what is valued, Burrill suggests, input from many sources is necessary. Subject area experts, curriculum developers, researchers, teachers, cognitive scientists, and assessment developers need to work together to develop the standards and the assessments that will be used to measure student mastery of the specified competencies. Key for Burrill is that teachers be able to make choices as they implement a curriculum, and that assessments serve as an appropriate guide to what is taught. Coherent assessments will foster coherent curriculum and effective instruction; lack of coherence leads to unfocused learning and shallow understanding.

LARGE-SCALE ASSESSMENTS

While large-scale assessments can be controversial, and are easily misused, they are an important way of obtaining certain kinds of extremely valuable information about students. Large-scale assessments, those that are designed to provide evidence about large numbers of students, are the primary means by which accountability evidence is obtained in the United States. Indeed, there is little dispute that accountability (the provisions made for those who use, fund, and oversee public education to review and evaluate its effectiveness) is a crucial element in the continued success of public education.

As Lorrie Shepard of the School of Education, University of Colorado, Boulder, outlined at the workshop, there are three particular uses for which large-scale tests are essential. The first is program diagnosis.
Assessments that make it possible to compare the performance of a large number of students can be used to identify patterns of strengths and weaknesses that are in turn critical for identifying any needed improvements in curriculum or instruction. Assessments developed for large-scale use, to provide evidence about district- or statewide performance, can also exemplify, as Shepard termed it, the educational goals described in standards and curriculum documents. In other words, assessment tasks and examples of student work make concrete just what students will actually know or be able to do if they meet defined standards. Large-scale assessments are also useful for one-time certification or screening; for example, to identify students who are not ready for grade-level work in reading and who need follow-up targeted assessment to determine their specific needs for remediation.

Shepard also noted that large-scale assessments often provide teachers an opportunity for effective professional development. Development of tests, scoring, curriculum development, and standards-based professional development are all occasions when efforts to improve classroom assessment strategies can be woven into the program. Shepard argues that more could be gained through these opportunities if teachers had improved access to materials that model teaching for understanding, such as extended instructional activities, formative assessment tasks, and scoring rubrics with summative assessments built into them.

While the value of large-scale assessments for these purposes is clear, it is equally clear that they are not useful for many other important educational purposes, particularly that of providing detailed understanding of individual students' performance. Professional standards are firm on the point that it is not a test itself that can be established as valid, but particular inferences that may be made from the test data (see National Science Education Standards (NSES) Standard 13.2, NRC, 1996). Nevertheless, administrators who are pressed for both time and resources are often tempted to find tests that can serve more than one purpose. While this can be done, it necessarily entails compromises. Noting, "Ironically, the questions that are of most use to the state officer are of the least use to the teacher" (NRC, 2001c, p. 224), the Committee on the Foundations of Assessment framed the problem as a trade-off in assessment design between supporting accountability for schools and systems and supporting the need for specific guidance about individual students. As Shepard stated, "The best way to help policy makers understand the limitations of an external, once-per-year test for instruction is to recognize that good teachers should already know so much about their students that they could fill out the test booklet for them." Shepard listed some of the contrasts, shown in Box 2-3, between large-scale and classroom assessments that make clear why different instruments are usually needed for different purposes.

Many large-scale assessments are what psychometricians call "norm-referenced," which means that one of their functions is to provide evidence of how students compare to one another. The resulting scores can be used to spread students' performance out along a scale. The SAT is a good example of such a test: it is designed not to assess particular knowledge or content, but to provide college and university admissions officials with a means of ranking students based on their potential to succeed at college-level work. The questions are carefully selected, based on pretesting results, to present a range of difficulty, so that very few students are likely to succeed at either all or none of them, and so that the students will be spread out along the scale. Performance on such tests is often expressed in terms of percentiles, with a particular score reflecting performance that is better than that of a certain percentage of other test takers.

Other assessments are called "criterion-referenced" because their scoring "refers" not to the past performance of other students but to a fixed body of knowledge. Good examples of this kind of testing include professional licensure tests, which often identify minimum acceptable levels of mastery. With such tests, it does not matter how well other students have done; it matters only that a prospective airline pilot or surgeon has mastered a particular body of knowledge deemed essential. Assessments used with K-12 students can be of either type, and in some cases may blend the two. For example, states that use tests developed by national companies, which are often norm-referenced and offer the state the opportunity to determine how its students compare to those of the same age across the country, may also wish to assess their students' knowledge of particular aspects of their standards.
A state may add sections to the norm-referenced portion or make other modifications to adapt the test to the multiple purposes it has identified, though, as noted above, such an approach entails compromise.

CLASSROOM ASSESSMENTS

Discussion of classroom assessments has been somewhat less tidy, in part because the definition of such assessments is less precise, and the range that the term covers was evident at the workshop. Teachers make assessments of their students' learning every day, by noting the misconceptions or insights that underlie a question, for example, or observing the way a student makes use of materials provided for a task. They also assess them more formally, with particular questions in mind, and it is through the teacher's aim in assessing that presenter Dylan Wiliam, professor of education at King's College in London, defines classroom assessment, or, in his phrase, assessment for learning. That is, if the aim of the assessment is to improve the student's learning in some direct way, rather than to rank, evaluate, or certify some aspect of performance, then it is properly in the realm of classroom assessment.

For Wiliam, it is the feedback provided to the student that is critical to the success of this enterprise, and he describes it as a three-part process. First the teacher must find out where the student is in relation to the goals for the class; next, he or she must clearly convey to the student what those goals are. Perhaps most important, the teacher must then help the student in concrete ways to move toward those goals.

Assessments that are intended primarily to provide feedback to students and to shape their learning are often called formative assessments, and distinguished from summative ones, which are intended primarily to evaluate students. This mode of categorizing assessments shares some aspects with the dichotomy between classroom and large-scale assessments that is the subject of this report, but it is important to remember that a large-scale assessment could serve formative purposes, just as a classroom assessment can serve summative purposes.

Presenter Jan de Lange, professor and director of the Freudenthal Institute at the University of Utrecht, The Netherlands, addressed the issue of classroom assessments used in teaching mathematics, using a description of a project carried out in Philadelphia and Milwaukee by the Freudenthal Institute to highlight several points. The project's goal was to influence the quality of learning and instruction by changing classroom assessment methods, and it used an Assessment Pyramid to depict the different levels of mathematical competencies that students display. In the pyramid, level 1 covers reproduction and facts, level 2 is making connections and simple problem solving, and level 3 is complex problem solving and mathematical reasoning. Teachers involved in the project were given a variety of supports, including both assessment materials and training, through which they could help their students think more deeply about mathematics. At the same time, teachers' thinking about what constitutes effective classroom assessment, scoring, and other issues was expanded.
The pyramid was the basis for defining expectations for student performance, for structuring instruction, and for giving students useful feedback in relation to learning goals and competency levels.

The pyramid was derived from the framework used in the Organisation for Economic Co-operation and Development's Programme for International Student Assessment (PISA) (PISA's assessment program is described in Chapter 4). De Lange argued that the alignment between the pyramid used in the classroom and the large-scale PISA demonstrated for teachers that a comprehensive, coherent, and continuous assessment is possible. At the same time, by working with the pyramid the teachers became skilled at recognizing and analyzing quality assessment. Through the two-year study, de Lange explained, teachers changed their approaches to both classroom assessment and the teaching of mathematics in significant ways.

For committee chair J. Myron Atkin, professor at the Center for Educational Research, Stanford University, the key is the teacher's unique capacity to monitor students' progress over time. In his presentation, which focused on the way classroom assessment functions in science education, Atkin asked workshop participants to consider the many different opportunities a teacher has to assess what students know and can do in the course of a project that takes place over several weeks or months.

As an example, Atkin cited a project in which a group of students monitored the state of a pond near their school and investigated the nature and possible causes for an algal bloom that occurred in the course of their study. Not only were they conducting original research, in the sense that no scientists had previously studied that particular pond, the students were also able to respond to unpredictable events. The project afforded them many opportunities to demonstrate their capacity to bring prior knowledge and experience to bear on a problem, their proficiency with available methods and tools, and their resourcefulness in drawing on available sources of data and interpretation. Their teacher was able to monitor their progress through formal output, such as field notes and reports, as well as in countless informal interchanges that revealed the students' thinking and their development over time.

This project exemplified for Atkin how a teacher can develop an "assessment culture," in which the focus is on inquiry, a key element of both the content and skill standards included in the NSES (NRC, 1996). The teacher was able to assess students on skills and knowledge that are deemed essential by NSES, and yet are impossible to measure using a one-time performance assessment. The challenge Atkin identified is to find more ways to make systematic use, for purposes of accountability beyond the classroom, of the information about students that teachers are in a unique position to obtain.