Overview

Many states and school districts are implementing reforms of science education aimed at helping all students acquire deep knowledge of scientific concepts and advanced thinking skills. Often these reforms involve new teaching methods that encourage students to explore key science topics in depth, work on realistic and complex problems, think like scientists, and integrate knowledge from several domains. In many states, science education reforms are linked to broader efforts to improve the entire educational system and to set high standards for student learning.

Assessment has become an increasingly critical part of these reforms. States and districts are revamping their assessment systems to make them more compatible with new science and math education standards and with innovative teaching methods. Interest is growing in alternative forms of assessment that ask students to demonstrate their performance by solving open-ended problems, doing projects and presentations, and collecting portfolios of their work. Researchers are working with classroom teachers to pilot techniques—such as encouraging students to assess themselves or teachers to formalize their judgments—that can make assessment a more integral part of everyday teaching and learning.

These developments in science assessment raise many technical and policy issues about the reliability, validity, fairness, and cost of new forms of assessment, and whether these new assessments are feasible and suitable for applications on a wide scale or with high stakes attached.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





Purpose of the NRC Conference

To explore the connections between new approaches to science education and new developments in assessment, the Board on Testing and Assessment (BOTA) of the National Research Council (NRC) sponsored a two-day conference on February 22 and 23, 1997. Participants included BOTA members, other measurement experts, and educators and policymakers concerned with science education reform.

The conference encouraged the exchange of ideas between those with measurement expertise and those with creative approaches to instruction and assessment. Participants presented recent research on science assessment, teaching, and learning and discussed efforts to improve assessment practice and classroom practice. They debated the role of assessment in the systemic reform of science education. The conference also featured a poster session in which participants could see firsthand some of the new science assessment tools being used in several schools.

The conference gave particular attention to these questions:

- What are the various roles that assessment is expected to play in science education? How well do existing assessment tools align with the kinds of scientific knowledge and skills that students should have? Which kinds of new assessment strategies are needed to support new instructional approaches?
- What are some innovative approaches to assessment in science education, and what effects do they have in classrooms?
- What evidence is available about the technical quality and feasibility of alternative assessments? How much do they cost?
- How well can new assessment methods meet the need for accountability in educational programs? Are they suitable for widespread use or for high-stakes situations?
- What practical and policy issues are raised by alternative approaches to science assessment? Are the alternative approaches credible and understandable to parents, the public, and policymakers? How soon can they be implemented?

This report summarizes the presentations, papers, and group discussions from the conference. In the four main sections below, the report (1) describes several alternative assessment methods presented at the conference, (2) explores major issues related to their design and use, (3) discusses how alternative assessments are being used to improve instruction, and (4) examines the implications of using various forms of assessment to promote systemic reform and accountability. A concluding section looks at avenues for meeting the complex demands for new kinds of assessments that serve instructional and accountability purposes.

Purposes of Science Assessment

Assessment has multiple purposes. One purpose is to monitor educational progress or improvement. Educators, policymakers, parents, and the public want to know how much students are learning compared to standards of performance or to their peers. This purpose, often called summative assessment, is becoming more significant as states and school districts invest more resources in educational reform.

A second purpose is to provide teachers and students with feedback. Teachers can use this feedback to revise their classroom practices, and students can use it to monitor their own learning. This purpose, often called formative assessment, is also receiving greater attention with the spread of new teaching methods.

A third purpose is to drive changes in practice and policy by holding people accountable for achieving the desired reforms. This purpose, called accountability assessment, is very much in the forefront as states and school districts design systems that attach strong incentives and sanctions to performance on state and local assessments.

As testing experts so often stress, certain kinds of assessments are better for some purposes than for others. Good testing practice emphasizes the importance of using a test only for the purpose for which it was designed.
Because good testing practice also means properly interpreting and reporting test data, the audience for the assessment data is another important consideration in test development. In addition to students and teachers, the major audiences for published test scores include parents, school administrators, policymakers, researchers, and the general public. Particular care must be taken when tests are used for high-stakes decisions about people or institutions, such as student promotion, college attendance, teacher career paths, or school funding allocations. Tests used for high-stakes applications, for accountability to an external authority, or for publication of scores to a wide audience must meet strict psychometric criteria (technical properties established through measurement science). For example, these tests should demonstrate adequate reliability (scores are consistent and generalizable to the broader universe of knowledge and skills) and validity (the test really measures what it is supposed to and can form a basis for appropriate inferences). They should also be fair (not biased on the basis of factors such as ethnicity or gender).

A critical issue for the development of innovative approaches to science assessment is the usefulness of these approaches for the various purposes of testing. Standardized multiple-choice tests continue to prevail in large-scale assessment programs used for accountability and other high-stakes purposes largely because they meet the technical criteria of reliability, validity, and fairness and are inexpensive and efficient to administer and score. But many educators and researchers question the broader validity of traditional multiple-choice tests, arguing that they do not fully capture the kinds of conceptual knowledge and thinking skills embodied in new standards that outline desirable learning goals. Because of this, existing tests might not adequately support reforms in science education and might discourage teachers from adopting effective strategies for teaching advanced knowledge and skills.

The developers of new assessment approaches are trying to overcome the limitations of multiple-choice tests by broadening the available types of assessment tasks and methods and by tying them more directly to classroom practices. But innovative assessments have their own drawbacks. They are generally more expensive and complicated to administer and score. Most have not achieved the technical quality desired for accountability and other large-scale or high-stakes purposes. Developing innovative assessments has proved to be an exceedingly difficult problem that will require expertise from both measurement science and educational practice.
Innovative Forms of Science Assessment

The conference highlighted a variety of efforts under way to develop and implement innovative approaches to assessing science learning. The designers of these assessments are motivated by several goals. They want assessments that better measure advanced knowledge and skills, are more closely aligned with state and local standards, and will tell people how students are progressing toward standards. Many innovators are also trying to influence instruction by making assessments more compatible with new teaching strategies and by using tasks that model desirable classroom practices. Some also seek to involve students more in their own assessments—and their own learning—by giving them meaningful tasks and encouraging them to reflect on what they are doing.

Many of these alternative assessments involve open-ended items, complex problems, and performance-based tasks. Many also entail nontraditional methods of response and scoring. Several take a longer-term view of assessment, requiring students to do in-depth experiments or projects or to collect examples of their work over time in a portfolio. In contrast to traditional assessment, which often takes place at the end of learning when it is too late to make a difference, many innovative assessments encourage teachers and students to engage in continuous cycles of problem solving, reflection, discussion, and revision.

Performance Assessment

Many of the alternative approaches presented at the conference fall within the broad category of performance assessment, which refers to any assessment that requires students to demonstrate their knowledge or skills by creating an answer or product. Performance assessment includes a wide range of activities, from responding to a short-answer essay question to conducting a semester-long project. Performance assessment aims to provide richer information about student capabilities, such as planning and conducting an experiment and working collaboratively, that are not captured by traditional test formats. It is particularly focused on assessing what one can do, as distinct from what one knows. For example, as Bert Green of Johns Hopkins University observed, someone might know how to play volleyball but no longer be able to do it physically.

The Scientific and Measurement Arenas for Refining Thinking (SMART) Project, a middle school curriculum described at the conference by Susan Goldman and James Pellegrino of Vanderbilt University, uses various kinds of performance assessment for both formative and summative ends.
SMART instructional activities include working on familiar, attention-grabbing problems of water quality as a way to develop students' deep knowledge of scientific concepts and their ability to monitor their own understanding. Students design and discuss problem-solving strategies, receive guided support from teachers, obtain feedback from classmates and Internet buddies, and eventually conduct hands-on water testing at local rivers. As with many new curricular projects, assessment is integrated into the ongoing teaching and learning activities that make up the substance of the program. Assessment occurs through presentations, projects, discussions, and paper-and-pencil tests.

For example, as a final SMART task and summative assessment, students make a group oral presentation to their classmates. Students score each other using class-developed criteria. The presentations are also videotaped, and the tapes are submitted to a central team of teachers, who review them and send some to the scientists at the state water quality department. In evaluating the presentations, the teachers and experts look at criteria such as how well students identify the main issues and how completely they understand key concepts and scientific relationships. The project also includes some performance items in its written exams used for pre- and post-testing. For example, students are asked to make a drawing explaining the concept of dissolved oxygen; those who have learned the desired concepts will produce more sophisticated drawings showing molecular structures.

Performance assessment can also be administered as part of a large-scale standardized test. The 1995 Third International Mathematics and Science Study (TIMSS), described at the conference by Maryellen Harmon of Boston College, was the largest and most technically sophisticated international achievement test ever conducted, involving half a million students in 41 countries. A much smaller sample of 6,000 students from various countries took a set of 12 performance items in science and math, each either 15 or 30 minutes long, that were designed to measure skills that could not be assessed with the multiple-choice items on the basic TIMSS test. Some tasks measured a single skill, such as using a thermometer, while others involved several steps to replicate complex real-world situations, such as determining and explaining the effect of exercise on one's own heart rate.

Portfolio Assessment

Portfolio assessment measures performance through collections of student work that represent what students know and can do.
One effort to grapple with the challenges of portfolio assessment was presented by Elizabeth Stage, co-director of the New Standards Project science initiative. In this project, several states and urban districts are working together to design an assessment system that will measure how well students are doing compared with the voluntary content standards emerging from national professional organizations in science, math, and other disciplines. The New Standards assessment system includes three related elements: (1) performance standards in various subjects based on the voluntary content standards; (2) external examinations that measure student performance in reference to these standards, using a combination of multiple-choice and performance tasks; and (3) a portfolio assessment system. The last two elements are still under development.

Each New Standards portfolio contains entries from an individual student. These entries include papers, experiments, projects, and commentaries on books read, which demonstrate that the student has met certain performance standards. Maintaining consistency of portfolios across students and classrooms is a particular challenge, which the New Standards Project has tried to address with specific criteria for portfolio entries. In science, the entries must illustrate the student's understanding of key concepts in physical, life, earth, and space sciences and in scientific connections and applications; entries must also demonstrate skills in scientific thinking, communication, investigation, and use of tools and technologies. For instance, a student might demonstrate her understanding of the properties of matter (a key physical sciences concept) by entering in her portfolio a write-up of a laboratory investigation of how volume and mass are related to density. This write-up might also demonstrate good communication skills by showing that the student can write clear instructions for an experiment.

To help teachers and students understand what high-quality student work looks like, the New Standards Project is collecting pieces of authentic work that exemplify high performance at different grade levels, along with explanations of how that work was evaluated. To make the point that students in all types of classroom situations can meet high standards, the project is trying to amass diverse examples of high-quality work from a variety of classrooms with different instructional programs.

The New Standards Project sees several potential benefits of portfolio assessment. It can make performance expectations more explicit and more connected to what students do in the classroom. It allows educators to assess knowledge and skills that require extended time and opportunities for revision (in contrast to knowledge and skills that students should be able to demonstrate on demand).
Portfolio assessment enables teachers and students to have sustained conversations about the quality of students' work and to share responsibility for the outcome. Although optimistic about the instructional value of portfolios, Stage contends that portfolio assessment is a long way from achieving the technical quality necessary for large-scale assessment (e.g., consistency of scoring), and thus makes more sense as one of several components of an assessment system.

Concept Maps

A concept map is a graph in which nodes represent key concepts, lines linking the nodes represent relationships between these concepts, and labels on the lines describe how the concepts are related. A pair of concepts and a labeled linking line are called a proposition, the fundamental unit of a concept map. A concept map is intended to document an important structural aspect of an individual's knowledge about a content domain, for example, whether a student understands the process of photosynthesis and the relationships of carbon dioxide and oxygen to life cycles. An accepted instructional tool, concept maps have recently become the subject of research about their usefulness as assessment tools; pertinent studies include work by conference presenters Maria Ruiz-Primo of Stanford University and Harry O'Neil of the University of Southern California. A concept map becomes an assessment tool when it is used as a format for students to give evidence of what they know about a particular domain and when it is accompanied by a scoring system.

According to the presenters, concept maps have some apparent assessment advantages. They can measure certain kinds of conceptual understanding in more depth than multiple-choice tests. They require less student writing than some other types of performance assessments and often cost less to administer and score. Students can collaborate on constructing concept maps, which means the maps can be used to assess teamwork skills, such as leadership, decision making, and communication, as well as individual skills. A disadvantage is that concept maps can be difficult for parents and the public to understand, and thus are not as credible as more familiar kinds of assessment.

Assessments based on concept maps can vary considerably in their tasks, response modes (oral, written, or computerized), and scoring systems. Tasks can be designed to provide test takers with varying degrees of support. Some concept map assessments ask students to generate their own concepts for a particular domain, while others give students a list of concepts from which to construct their map. Still others give students a map with concepts already filled in and ask them to label the connecting lines.
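To make the structure concrete, a concept map's propositions can be represented directly in code. The sketch below is illustrative only: the map format, the sample propositions, and the fraction-of-expert-links scoring rule are assumptions for demonstration, not the scoring systems studied by the presenters.

```python
# Illustrative sketch only: the propositions and scoring rule below are
# invented for demonstration; they are not the presenters' actual systems.

def score_against_expert(student_map: set, expert_map: set) -> float:
    """One simple 'compare with an expert map' strategy: credit the
    fraction of the expert's propositions the student reproduced."""
    if not expert_map:
        return 0.0
    return len(student_map & expert_map) / len(expert_map)

# A proposition is (concept, link label, concept) -- the fundamental
# unit of a concept map.
expert = {
    ("sunlight", "drives", "photosynthesis"),
    ("photosynthesis", "produces", "oxygen"),
    ("plants", "absorb", "carbon dioxide"),
}
student = {
    ("sunlight", "drives", "photosynthesis"),
    ("plants", "absorb", "carbon dioxide"),
    ("oxygen", "produces", "sunlight"),  # invalid proposition: earns no credit
}

print(round(score_against_expert(student, expert), 2))  # 2 of 3 expert links matched
```

A fuller scorer would also weigh map features such as the number of nodes and the accuracy of each proposition, the other scoring strategies mentioned in the text.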
The research of Ruiz-Primo indicates that the first two formats—letting students generate their own concepts or giving them a list—produce equivalent student scores, both in the accuracy of students' propositions and in the proportion of students' propositions that are valid. The two formats also produce similar reliability and validity coefficients. Providing students with lists of concepts makes scoring easier, however, and is therefore more feasible for large-scale assessment (although assessment designers must be careful to choose appropriate concepts).

Concept maps can be scored by checking whether the map includes specific features, such as numbers of nodes and accurate propositions; by comparing the student's map with a map developed by an expert; or by combining these two strategies. According to the presenters, research to date indicates that concept maps are a feasible form of assessment and can provide good estimates of student competence. Ruiz-Primo compared concept map scores with scores on multiple-choice tests and found that correlations between the two are moderately high (r = .57 on average); this suggests that these two types of assessments measure overlapping yet somewhat different aspects of knowledge. O'Neil's work with collaborative concept maps found a correlation of r = .70 between concept maps and an essay task. In short, the technique is promising, but further research is needed before concept maps become a proven assessment tool. Among the issues to be addressed are: How large a sample of concepts needs to be assessed to capture a student's knowledge structure? Are different response modes interchangeable?

Cognitively Guided Assessment

Another trend in assessment is to apply advances from cognitive science about how people think and learn to the process of designing and evaluating tests. As described by Robert Glaser of the University of Pittsburgh, the aim is to produce an assessment that effectively measures complex cognitive skills, such as the ability to integrate knowledge and make inferences, select and execute strategies for solving problems, adjust one's own performance, and offer coherent explanations for the strategies used. Under this approach, a test developer determines the construct validity of a test by paying attention to its cognitive validity, that is, by looking at whether the test is a valid measure of both the content knowledge and the cognitive skills required by a particular domain, and whether it engages students in the kinds of cognitive activities that underlie competent performance. When a test has cognitive validity, students who score well on the test are those who demonstrate high-quality cognitive activities. Assessment tasks can draw upon many combinations of content knowledge and process skills that place varying demands on the test taker.
Robert Glaser and Gail Baxter at the University of Michigan have developed a content-process space model that can be used to analyze the cognitive demands of existing assessments. In this model, the task demands for content knowledge are placed on a scale from rich (tasks that require an in-depth understanding of subject matter) to lean (tasks that depend on information given at the time of the test rather than on prior knowledge). The task demands for process skills are placed on a scale from open (students are given a minimum of explicit direction and must decide themselves how to solve the problem) to constrained (e.g., students are given step-by-step instructions).
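The two axes can be pictured as a simple classification scheme. The sketch below is only a toy rendering of the content-process space; the data structure, labels, and example tasks (paraphrased from the conference discussion) are assumptions, not Glaser and Baxter's actual instrument.

```python
# Illustrative sketch only: a toy rendering of the content-process space.
from dataclasses import dataclass

@dataclass
class TaskDemands:
    name: str
    content: str  # "rich" (needs deep prior knowledge) or "lean" (information supplied with the task)
    process: str  # "open" (student devises the strategy) or "constrained" (steps are prescribed)

def classify(task: TaskDemands) -> str:
    """Place a task in one quadrant of the content-process space."""
    return f"content-{task.content}, process-{task.process}"

# Example tasks paraphrased from the conference discussion of the model
plant_energy = TaskDemands("explain energy in plant growth", "rich", "constrained")
maple_seed = TaskDemands("design maple-seed flight experiments", "rich", "open")

print(classify(plant_energy))  # content-rich, process-constrained
print(classify(maple_seed))    # content-rich, process-open
```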

How a task is classified on these dimensions reveals much about the nature and extent of the cognitive activity that underlies performance. For example, a content-rich, process-constrained task might ask students to describe the possible forms of energy and types of materials involved in growing a plant and to explain how they are related. To succeed in this task, a student would need an understanding of photosynthesis and the ability to develop a coherent explanation, but would not need to plan a solution strategy. A content-rich, process-open activity might ask students to design and carry out experiments with a maple seed to explain its flight to a friend who has not studied physics. To complete this task, students would need knowledge of force and motion, the ability to design a controlled experiment, and skills in model-based reasoning. Test designers can adjust the content and process demands to align assessments with performance objectives.

Glaser and Baxter are studying several trial state performance assessments and have conducted observations of students as they complete science tasks and solve problems in different ways and contexts. The research goal is to determine whether these tests actually measure the cognitive capabilities associated with various levels of student achievement, whether the tasks accurately reflect the performance objectives, and whether the scores reflect the quality of the cognition exhibited by test takers. The researchers have concluded that it is not easy to translate educational goals into test objectives, and then into assessment situations that maintain the integrity of those goals. Looking at the cognitive qualities of student competence, however, is a critical step in this process.

The current push to apply findings from cognitive science research—about the structure of content domains and the development of students' competence in those domains—to the development of educational assessments is an important area that was not addressed in great depth at this conference. Questions to consider include: What is the connection between theories of domain knowledge and its long-term acquisition, on one hand, and performance assessment schemes, on the other?

Designing and Using Innovative Assessments

The conference explored several issues concerning the design and use of innovative assessments.

Technical Issues

A major issue in designing alternative assessments is the need for empirical evidence of reliability, validity, and fairness. Several measurement experts at the conference felt that innovative science assessments are "not ready for prime time," as Bert Green noted. These participants expressed the view that, although these assessments might be useful for improving instruction, they do not have sufficient validity and reliability to be suitable for accountability decisions and other high-stakes uses in U.S. schools today. When these criteria are applied, certain types of performance assessment are closer than others to being ready for use in accountability. The TIMSS performance assessment tasks were developed according to technical criteria typically applied to large-scale standardized assessments, and concept maps also show promise for wider use.

Scoring Issues

The degree of consistency in scoring is a primary concern related to the reliability of performance assessments. Most performance assessments must be scored by people (in contrast to assessments using multiple-choice items, which can be scored by machine). Raters of performance assessments are asked to make complex judgments, which opens the potential for variability across raters and bias in scoring. To achieve consistency, designers of performance assessments develop careful scoring criteria, provide extensive training to raters, usually use more than one rater (at least on a subset of student responses, to check reliability), and develop procedures for cross-checking discrepancies.

Based on cognitive research and field trials, the TIMSS designers developed an extensive scoring system for the performance items. This scoring system was based on descriptions of typical student responses that exemplified different levels of student performance and on common types of misconceptions that students were likely to demonstrate.
One TIMSS item, for example, asked students to determine the effect of temperature on the rate at which an Alka-Seltzer tablet dissolves and to explain their hypotheses. A student who referred to the greater energy levels of hot water molecules, and thus the faster speed with which the tablet dissolves, would get a relatively high score, and an even higher one if she adequately explained how she planned her investigation. Because test designers could not foresee all possible student answers, the scoring system also allowed raters some latitude to make judgments about responses that did not fit any of the descriptions; these judgment scores were double-checked by groups of experts in every country.
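Consistency across raters is commonly quantified with an agreement statistic. The sketch below is illustrative only: it computes Cohen's kappa (agreement between two raters, corrected for chance) on invented rubric scores; the actual TIMSS reliability analyses were more elaborate.

```python
# Illustrative sketch only: Cohen's kappa on invented rater data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal score distribution
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters scoring ten responses on a 0-3 rubric (invented data)
a = [3, 2, 2, 1, 0, 3, 2, 1, 1, 2]
b = [3, 2, 1, 1, 0, 3, 2, 2, 1, 2]
print(round(cohens_kappa(a, b), 2))  # 0.71: substantial but imperfect agreement
```

Raw percent agreement (here 80%) overstates consistency because two raters using similar score distributions will agree sometimes by chance; kappa discounts that.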

[...] plan more effective instruction. A long-term goal of the projects is to develop documentation methods that could be informative to parents, administrators, and others outside the classroom and perhaps could be aggregated and summarized. In one project, kindergarten teachers wrote down key things that children said about their drawings of light and shadow. In another, teachers charted changes in children's answers to such questions as, "What are some things you've noticed lately about the caterpillars?" Through careful observation, teachers can identify what children already know, what misconceptions they have, and where they have gaps in understanding. Teachers reported being surprised at what students who are lagging academically actually know.

Assessment as Professional Development

These new forms of assessment depend heavily on implementation by teachers. Many teachers have not been trained to use assessment in the ways envisioned by science education reforms, and few have in-depth knowledge of evaluation issues. Margaret Brown found that more than half of the teachers studied emphasized summative assessment over formative assessment; indeed, many considered formative assessment an unnecessary addition to their workload. These teachers taught according to the prescribed curriculum and did not adjust their teaching based on assessment-related feedback. But another group of teachers—about one-third of those studied—did conduct regular formative assessment, even when it was not required. Some of these teachers relied on observation, note taking, and discussion, while others organized distinct assessments. All used the feedback they collected within a short time to plan classroom work for subsequent weeks. The top teachers in the study, as gauged by student gains, focused on assessment all the time.

Several projects discussed at the conference use assessment expressly as a vehicle for the professional development of teachers.
Margaret Brown found that, with the right kind of professional development, it is possible to encourage teachers to use formative assessment more often and more effectively. Teachers are trained to assess at the beginning, middle, and end of a topic. Over time, they internalize the assessment criteria and integrate assessment more fully into their teaching. Teachers develop enthusiasm for this kind of assessment, as long as it is not being used for high-stakes purposes.

Many of the new science instruction methods also require teachers to interact differently with students and manage a more open classroom. But many teachers do not have the experience to lead this kind of instruction, nor a clear vision of what it should look like. In addition, to teach to high standards, many teachers will need a better mastery of content knowledge. As Joan Baron of the Connecticut Department of Education noted, "If we want students to internalize standards, then teachers must be comfortable with them." Several projects presented at the conference are attempting to use assessment not only to build teachers' evaluation skills, but also to develop their pedagogical skills and deepen their understanding of the content they are teaching.

Collegial Judgment of Student Work

The move to embed assessment in classroom practice has increased teachers' need for more formal and consistent strategies for judging student work and using those judgments to improve instruction. As explained by Baron, the Connecticut Department of Education has used Goals 2000 funds to encourage teachers to develop a process for holding frequent and sustained conversations about students' work. In one project, teachers looked at student work alone and then discussed it in groups, identifying the main mathematical ideas they were trying to teach and the major areas where students were having problems. After reviewing promising curricula and scoring techniques, the teachers worked in pairs to develop activities to address areas of student difficulty. The teachers taught their units, making revisions as needed. As students revisited tasks that had given them trouble earlier, teachers tracked the work of three students at different achievement levels and periodically reviewed it in group meetings as a means of documenting changes in their teaching. In this project, university and state education department researchers studied the consequences of the participating teachers' conversations about student work.
They concluded that teachers needed a specific set of criteria to keep them focused on the students' understanding of mathematical ideas, rather than simply telling stories about what the students did. The researchers also found that soliciting external feedback from teachers in other districts was an effective motivator and kept conversations from becoming too ingrown. Furthermore, they concluded that combining teachers from different grade levels encouraged teachers to explore the vertical articulation of the curriculum and helped them identify incorrect ideas that students were retaining over time.

Jan Hawkins of the Center for Children and Technology described the Center's projects to help teachers learn to make reliable and useful judgments about student work. Teachers in one project judged three kinds of student performance tasks: a final product, an oral explanation, and a process record (e.g., a journal or log). Originally the project gave teachers a set of rubrics for judging student work, but the researchers realized that a more effective approach was to provide teachers with a wide variety of student examples and allow them to arrive at their own rubrics through collective deliberation and experimentation. Over time, the teachers designed their own guidelines for making judgments, and they learned how to moderate scores collegially and cite evidence for their judgments.

Hawkins has identified conditions that seem to affect consistency in evaluations by teachers. First, teachers' judgments are better, more consistent, and more reliable when teachers cite specific evidence for them. Second, the moderation process by which several teachers reach a uniform judgment—especially its public nature and social dimension—is very important and cannot be omitted. Third, the type of task being judged is a critical factor; some tasks are more amenable to collaborative judgment than others. Fourth, individuals who have more content expertise tend to be more consistent with each other, and they can be helpful when paired with someone with less expertise.

Issues in Classroom-Based Assessment

Discussion at the conference raised some critical issues related to using classroom formative assessments for improving instruction.

Effectiveness in Changing Instruction

To what extent are these assessment approaches actually making a difference in the classroom? According to conference presenters, teacher-based assessment projects help teachers become more systematic in collecting their observations and more reliable in judging student work. By seeing and discussing exemplary work, they begin to internalize standards and adjust instruction based on their assessments. When teachers achieve these goals, students benefit because they receive more regular and useful feedback, as well as models for self-assessment.
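The consistency among teachers' judgments discussed above can, in principle, be measured. The report does not prescribe any particular statistic, but one common chance-corrected measure of inter-rater agreement is Cohen's kappa; the sketch below, with invented rubric scores, shows how two teachers' ratings of the same student work might be compared.

```python
# Illustrative only: the conference report names no agreement statistic.
# Cohen's kappa is one standard measure of inter-rater agreement that
# corrects for agreement expected by chance. All scores are invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical scores."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreements.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater scored independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical teachers scoring the same ten pieces of work on a 1-4 rubric.
teacher_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
teacher_2 = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]
print(round(cohens_kappa(teacher_1, teacher_2), 2))  # 0.71
```

A kappa near 1 indicates the kind of consistency a successful moderation process aims for; values near 0 mean agreement no better than chance, a signal that criteria or evidence citation need attention.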
Jan Hawkins said that her project on collegial judgment did improve teaching, but slowly. Rather than adopting the rubrics that came out of the project, participating teachers used their new knowledge to redesign their classroom activities and modify their judgments about student projects. It was critical to involve students in this process, too, so they understood the kinds of judgments being brought to their work.

Some participants felt that the effects of formative assessments in changing classroom practice tend to be overstated. Others believed that new assessment methods would be more likely to drive change if the results had consequences and if they could be incorporated into accountability systems, as discussed below.

Scaffolding for Teachers

The projects discussed in this section suggest that teachers need appropriate scaffolding, too. This means providing teachers with a great deal of support and guidance during the early stages, as they are learning to use new forms of assessment, followed by diminishing support as teachers gain competence in the new techniques. Several projects achieved this by providing teachers with some type of framework or information for analyzing assessment data. Having a framework encourages teachers to work systematically rather than intuitively, and it also raises teacher professionalism by exposing them to new techniques. But the framework should not be too prescriptive; as Hawkins's experience indicates, teachers learned a great deal by working through the process themselves.

Other Teacher Supports

Participants noted that teachers who are implementing new approaches to assessment and instruction need other forms of support and capacity building to help their efforts succeed. First, teachers need time in the school day for collegial discussions and other activities related to formative assessment. Second, a school should have a learning culture that encourages teachers to work together and supports new modes of student-teacher interaction. Third, assessment alone is not enough to help teachers gain the knowledge and skills necessary to teach to high standards. There is a critical need for professional development in content and pedagogy at all stages of a teacher's preparation and career. This is a major undertaking that will require political support and significant resources.
Assessments for Accountability and Systemic Reform

Standards-based reforms place increasing demands on assessment. Policymakers and educators expect assessments to signal—and if possible clarify—what is important for teachers to teach and students to learn. Reformers expect assessments both to drive changes in curricula and instruction and to measure the effects of those changes. Assessment is further intended to form the backbone of new state and local accountability systems. Conference participants discussed whether and how these expectations could be met with various forms of assessment.

Aligning Assessments with Standards

A primary motive for revising assessment is to improve the alignment between what schools and teachers are held accountable for and what students are supposed to learn. Developing assessments aligned with standards is a daunting task. The ambiguity, vagueness, and wide variation in standards make it difficult to construct compatible assessments. As noted by Eva Baker of the University of California, Los Angeles, the standards being adopted by states tend to describe broad areas of content emphasis and general cognitive demands, so that an array of performance tasks can be inferred from any given standard. Most state standards are not in a form that provides sufficient focus or substance to make the connections from standards to assessment to everyday instruction, yet such connections are needed to provide a solid basis for evaluating the effects of standards-based reforms. The diffusion of education authority and political power in the United States, among 50 states and thousands of school districts, makes it still more difficult to align standards and assessments and negotiate systemic reform, as pointed out by William Clune of the University of Wisconsin. This diffusion contributes to the dominance of standardized multiple-choice tests, because they are a generic instrument that states and localities can use to test a wide range of curricula.

Audrey Champagne of the State University of New York, Albany, described the challenge of writing a practically oriented performance task to assess a state standard for eighth graders. This particular standard called on students to generate mathematical models of actual situations and apply them to other situations.
Champagne's group devised a task that asked students to calculate the minimum amount of a solvent necessary to clean a paintbrush, then write a memo applying their findings to other industrial situations. The problem could be solved using sixth-grade arithmetic, but it also required students to integrate knowledge from several disciplines, extract key information from a written passage, organize a fairly complex task, and communicate how they reached their solutions. A review committee agreed that the task corresponded well to the standard. But when Champagne and other researchers piloted the task with a group of college students, they discovered that many students did not really understand its point. Several complained about the amount of reading involved, and many did not know the basic facts needed to solve it, such as the number of ounces in a quart. Some students used creative responses to avoid addressing the mathematics. The reactions of the college students show the challenges involved in designing practical tasks with literacy demands that test math and science, and the complexities of scoring diverse student responses. They also give a sense of the job teachers face in preparing students to accomplish this kind of performance task.

Designing Assessments for Accountability

In aligning assessments with standards, many states and local districts hope to produce assessments that can be used for various external and accountability purposes, such as reporting progress to the public or attaching consequences to the performance of classrooms, schools, and perhaps individuals. Tests for accountability must meet relatively high technical standards and be feasible to administer on a large scale. They must also meet practical concerns, such as being understandable and credible to parents, policymakers, and the public. And they should encourage students to learn things that are important and useful. Tension between the uses of assessments for accountability and for instructional improvement was a major topic of discussion during the conference.

Experience of Other Countries

Paul Black examined how other countries try to resolve the tension between accountability and instructional improvement. The resolutions these countries reach often depend on how much their society trusts teachers' assessment of their own pupils. In countries in which teachers have high social status, assessments are more diverse, and teachers play a central role in all functions of assessment.
Countries in which teachers are less trusted tend to place more emphasis on external control and high-stakes testing, with less latitude for teachers to make their own decisions. Germany and, to a lesser extent, Sweden exemplify the first view; England and some aspects of the system in France illustrate the second.

Countries differ in their use of teacher-conducted assessments for accountability purposes. Some countries use teacher assessment as the sole source of accountability. In some of these, there might be calibration of teachers' results at the national level, or some external check on them. Other countries combine teacher-conducted assessment with external tests to produce a single result, or use both kinds of assessments but not in a combined way. Finally, in other countries, including the United States, teacher-conducted assessments are not generally used for accountability, except when they help pupils learn in preparation for external tests.

Not all assessments for accountability in other countries have high reliability. In England, for example, the reliability of some national tests is not very high. Only in the United States are assessments considered unsuitable for accountability if they are not as reliable as multiple-choice tests. According to Black, American policymakers and the public tend to undervalue what teachers do and overvalue external standardized tests. Other factors in the United States, such as college admission policies, also drive accountability assessments toward traditional test formats and away from certain kinds of alternative assessments.

Classroom Assessments and Accountability Testing

Where should U.S. policy be headed? Some conference participants expressed caution about the use of performance and classroom-based assessments for accountability and external purposes, noting that most did not have the requisite features or technical properties. For example, some performance assessments do not yield a score that can be aggregated. Some participants made the point that formative assessments are designed primarily to assess what is taught in the classroom and inform revisions in teaching, and are less concerned with testing a child's overall knowledge and skills. For purposes of certification, college admission, and other uses, the object of assessment is to find out what students actually know, regardless of what they were taught, and to measure their knowledge rather than the quality of instruction. Other participants felt that collaboration between psychometricians and educators could lead to assessment methods that would legitimately serve both purposes and give teachers' contributions a greater role in accountability.
As Clune observed, teachers already gather considerable performance information, and teacher-conducted assessment constitutes a large, ongoing testing system. Yet there is not much empirical information about the relationship of teacher-conducted tests to external exams. Nor is there well-refined information about the strengths and weaknesses of various types of assessments that would enable reformers to make intelligent choices. Instead, the debate deteriorates into one of "good" versus "bad" tests, or "old-fashioned" versus "new" tests. It would also be useful, Clune observed, to have more empirical evidence about the conditions under which high-stakes testing produces negative educational consequences—such as teachers' focusing instruction only on topics that are on the test, or districts' excluding certain children from testing—and whether performance assessment can help to reduce some of these problems.

Several participants suggested that performance-based assessments could be combined with multiple-choice items and other formats to produce a balanced assessment program. Various kinds of assessments could be given throughout the year, rather than one time only, to achieve a better balance.

Issues in Assessments for Accountability

Developing new approaches to assessments for accountability requires attention to practical and political implications, as well as psychometric ones. As Ed Reidy of the Kentucky Department of Education noted, Americans often try to turn value-laden issues into technical issues to be resolved by technical experts, instead of debating them in a public forum with advice from the research, education, and policymaking communities.

Defining Performance Thresholds

A critical unresolved issue in designing assessments for accountability is: What level of performance is adequate? Passing scores have to be set high enough to stimulate students and teachers to improve, but not so high as to discourage students who initially fail to reach them from pursuing science and math courses. Nor do we know whether or how various performance increments are linked to meaningful social outcomes, such as success in the labor market.

Curriculum and Accountability

Another unresolved issue is how to allow for differences in curriculum, such as a vocational education curriculum, within a common accountability system, and whether we should even have a common system. Many science and math education reformers emphasize advanced academic courses for all students, although an applied math and science curriculum might work better for some students.
Can a limited core of knowledge and skills be defined that would allow a student to have some degree of deep understanding and scientific literacy without covering a large number of topics or advancing very far into formal training?

Student Motivation

Conference participants held diverse views on the relevance of student motivation to various kinds of assessments. In general, students are more motivated to do well on a test when it "counts," but what does that really mean? Consequences that seem to matter to students are promotion, graduation, and college acceptance. Tests also have a major impact when they are part of a student's grade. But are there other motivators, such as participation in team projects? Students might be more motivated to learn when they have some personal responsibility for their own assessment, which is what several alternative assessments are trying to generate. Bruce Alberts, president of the National Academy of Sciences, gave an example from a medical school that introduced essay questions into major examinations that had previously been multiple-choice. The school found that students made a greater effort to learn for understanding, instead of for short-term retention.

Many current approaches to standards-based reform place the primary locus of accountability on the school or classroom. In these systems, it is the teacher or principal who is primarily held responsible for student performance; the motivating consequences are not as directly tied to the student. Diverse views were expressed at the conference about whether accountability systems should also include more meaningful consequences for students in order to be effective and have a real impact.

Public Engagement

Large-scale assessments, especially those used for accountability, must be credible, meaningful, and understandable to parents, policymakers, and the public. In many states and school districts, debates over assessment have been subject to increased scrutiny and involvement of policymakers and the public.
Although an open process is important, it sometimes has made these debates more complex and politically charged, allowing researchers and assessment designers "few protected enclaves in which to try and fail," as Eva Baker noted. Multiple-choice tests are familiar and comfortable to the public. Educators report being frustrated in their efforts to improve science instruction and assessment by parents who might appreciate their children's having new learning opportunities, but still prefer them to be prepared for traditional tests. As Bruce Alberts observed, if forced to choose between their child's learning something in high school and getting into a good college, some parents would prefer them to get into a good college and learn later. Other participants felt that what parents really care about is what their children know and can do, not which instructional method gets them there.

The public wants some fairly simple things, said Baker: tests that provide a quick summary of individual student performance and progress, and that provide a basis for comparisons among students and programs. The experts' admonition that not all tests can be used for all purposes does not make sense to many policymakers and parents. To them, a good test serves multiple purposes, and if such a test does not exist, it should be invented. But this desire may be unrealistic; some purposes of assessment might not be reconcilable in a single instrument.

Educators and test designers have not developed a strategy for educating the public about assessment issues. Convincing parents, policymakers, and the public that unfamiliar forms of tests are important and can work in a student's favor will require a major public engagement effort. Educators must be prepared to explain to the public what student scores on performance assessments mean. For instance, performance assessment results are typically reported in relation to public goals or standards, rather than in terms of the normative comparisons the public is more accustomed to, so new reporting systems need to be designed that make sense to the typical parent. Support for standards-based reforms could well disappear if new assessments do not yield achievement information that parents and the public can understand and find useful.

Timetable for Change

Education is under pressure to show positive results soon after reforms are agreed to—often before they are really in place. Right now the education and research communities are spreading their energies across an astonishing number and variety of assessment-related projects.
With so many reform efforts going on at once, it will be difficult to produce adequate empirical data in a timely way, and it may be impossible to convey to the public any sense of progress. Educators and test designers do not have to immediately reinvent every component of an assessment and accountability system. The deadlines for action suggest a need for models that combine existing and new assessments. Reformers can begin by using assessments that are available now, and phase in changes by grade or subject area. Such a phased-in approach would demonstrate the cost and effectiveness of a new system. Reform could focus on accomplishing a manageable number of changes, rather than diffusing efforts across too many priorities.

Conclusion

The exciting developments in the field of science assessment represent a golden opportunity to better integrate assessment with curriculum and to improve teaching and learning. Paul Black echoed the view of many conference participants when he expressed a desire for stronger bridges among instructional innovators, psychometricians, and classroom teachers. Instructional innovators could contribute a vision for integrating assessment and instruction with some high-quality ideas for how to get there. Psychometricians could seek ways to make classroom-oriented assessments more compatible with accountability needs. Teachers could play a stronger role in assessment development to ensure that new approaches are credible and feasible. All three groups must communicate with the larger polity of parents, policymakers, and the public to build support for changes that will have a lasting positive impact on classroom practice.