Classroom Assessment and the National Science Education Standards

4 The Relationship Between Formative and Summative Assessment—In the Classroom and Beyond

This chapter discusses the relationships between formative and summative assessment, both in the classroom and externally. In addition to teachers, site- and district-level administrators and decision makers are target audiences; external test developers also may be interested. Teachers inevitably are responsible for assessment that requires them to report on student progress to people outside their own classrooms. In addition to informing and supporting instruction, assessments communicate information to people at multiple levels within the school system, serve numerous accountability purposes, and provide data for placement decisions. As they juggle these varied purposes, teachers take on different roles. As coach and facilitator, the teacher uses formative assessment to help support and enhance student learning. As judge and jury, the teacher makes summative judgments about a student's achievement at a specific point in time for purposes of placement, grading, accountability, and informing parents and future teachers about student performance. Often in our current system, the purposes and elements of assessment are not all mutually supportive and can even be in conflict: what seems effective for one purpose may not serve, or even be compatible with, another (see Table 2-1 in Chapter 2). The previous chapters have focused primarily on the ongoing formative assessment that teachers and students engage in on a daily basis to enhance student learning. This chapter briefly examines summative assessment, which is usually prescribed by a local, district, or state agency, as it occurs regularly in the classroom and as it occurs in large-scale testing. The chapter specifically looks at the relationship between formative and
summative assessment and considers how inherent tensions between the different purposes of assessment may be mitigated.

HOW CAN SUMMATIVE ASSESSMENT SERVE THE STANDARDS?

The range of understanding and skill called for in the Standards acknowledges the complexity of what it means to know, to understand, and to be able to do in science. Science is not solely a collection of facts, nor is it primarily a package of procedural skills. Content understanding includes making connections among the various concepts with which scientists work and then using that information in specific contexts. Scientific problem-solving skills and procedural knowledge require working with ideas, data, and equipment in an environment conducive to investigation and experimentation. Inquiry, a central component of the Standards, involves asking questions; planning, designing, and conducting experiments; analyzing and interpreting data; and drawing conclusions. If the Standards are to be realized, summative as well as formative assessment must change to encompass these goals. Assessment for a summative purpose (for example, grading, placement, and accountability) should provide students with the opportunity to demonstrate conceptual understanding of the important ideas of science, to use scientific tools and processes, to apply their understanding of these important ideas to solve new problems, and to draw on what they have learned to explain new phenomena, think critically, and make informed decisions (NRC, 1996). The varied dimensions of knowing in science require equally varied assessment strategies, as different types of assessments capture different aspects of learning and achievement (Baxter & Glaser, 1998; Baxter & Shavelson, 1994; Herman, Gearhart, & Baker, 1993; Ruiz-Primo & Shavelson, 1996; Shavelson, Baxter, & Pine, 1991; Shavelson & Ruiz-Primo, 1999).
FORMS OF SUMMATIVE ASSESSMENT IN THE CLASSROOM

As teachers fulfill their different roles as assessors, tensions between the formative and summative purposes of assessment can be significant (Bol and Strange, 1996). However, teachers often are in a position to tailor assessments for both summative and formative purposes.

Performance Assessments

Any activity undertaken by a student provides an opportunity for an assessment of the student's performance. Performance assessment often implies a more formal assessment of a student as he or she engages in a performance-based activity or task. Students are often provided with apparatus and are expected to design and conduct an investigation and communicate findings within a specified period of time. For example, students may be given the appropriate materials and asked to investigate the preferences of sow bugs for light or dark and dry or damp environments (Shavelson, Baxter, & Pine, 1991). Or a teacher could observe while students design and conduct water-quality tests on a given sample of water to determine what variables the students measure, what those variables indicate to them, and how they explain variable interaction. Observations can be complemented by assessing the resulting products, including data sheets, graphs, and analyses. In some cases, computer simulations can replace actual materials, and journals in which students record results, interpretations, and conclusions can serve as proxies for observers (Shavelson, Baxter, & Pine, 1991). By their nature, these types of assessments differ in a variety of ways from conventional assessments. For one, they provide students with opportunities to demonstrate different aspects of scientific knowledge (Baxter & Shavelson, 1994; Baxter, Elder, & Glaser, 1996; Ruiz-Primo & Shavelson, 1996). In the sow bug investigation, for example, students have the opportunity to demonstrate their ability to design and conduct an experiment (Baxter & Shavelson, 1994). The investigation of water quality highlights procedural knowledge as well as the content knowledge necessary to interpret tests, recognize and explain relationships, and provide analysis. Because of the numerous opportunities to observe students at work and examine their products, performance assessments can be closely aligned with curriculum and pedagogy.
Portfolios

Duschl and Gitomer (1997) have conducted classroom-based research on portfolios as an assessment tool used to document progress and achievement and to contribute to a supportive learning environment. They found that many aspects of the portfolio and the portfolio process provided assessment opportunities that contributed to improved work through feedback, conversations about content and quality, and other assessment-relevant discussions. The collection also can serve to demonstrate progress and to inform and support summative evaluations. The researchers document the challenges as well as the successes of building a learning environment around portfolio assessment. They suggest that the relationship between assessment and instruction requires reexamination so that information gathered from student discussions can be used for instructional purposes. For
this purpose, a teacher's conception and depth of subject-matter knowledge need to be developed and cultivated so that assessment criteria derive from what is considered important in the scientific field being studied, rather than from poorly connected pieces of discrete information. Researchers at Harvard's Graduate School of Education (Seidel, Walters, Kirby, Olff, Powell, Scripp, & Veenema, 1997) suggest that the following elements be included in any portfolio system: a collection of student work that demonstrates what students have learned and understand; an extended time frame that allows progress and effort to be captured; structure or organizing principles that help organize as well as interpret and analyze the work; and student involvement not only in the selection of the materials but also in the reflection and assessment. An example of the contents of a portfolio for a science project could be as follows: the brainstorming notes that led to the project concept; the work plan, with its time line, that the student followed; the student log that records successes and difficulties; a review of the actual research results; a photograph of the finished project; and a student reflection on the overall project (p. 32).

Using Traditional Tests Differently

Certain kinds of traditional assessments that are used for summative purposes contain useful information for teachers and students, but these assessments are usually too infrequent, come too late for action, and are too coarse-grained. Some of the activities in these summative assessments provide questions and procedures that might, in a different context, be useful for formative purposes. For example, rescheduling summative assessments can contribute to their usefulness to teachers and students for formative purposes.
Tests given before the end of a unit can provide both teacher and student with useful information on which to act while there is still opportunity to revisit areas where students were not able to perform well. Opportunities to revise tests or any other type of assessment give students another chance to work through, think about, and come to understand an area they did not fully understand or clearly articulate the previous time. In reviewing for a test, or in preparing for essay questions, students can begin to make connections between aspects of subject matter that they may not previously have related to one another. Sharing designs before an experiment gets under way, during a peer-assessment session, gives each student a chance to comment on and improve their own investigation as well as
those of their classmates. When performed as a whole class, reviewing helps make explicit to all students the key concepts to be covered. Selected-response and written assessments, homework, and classwork all serve as valuable assessment activities in a teacher's repertoire if used appropriately. The form that an assessment takes should follow careful consideration of its intended purpose. Again, how the data generated by the assessment are used is important, so that they feed back into teaching and learning. As shown in Table 4-1, McTighe and Ferrara (1998) provide a useful framework for selecting assessment approaches and methods. The table accents the range of common assessments available to teachers. Although their framework serves all subject-matter areas, the wide variety of assessments and assessment-rich activities could be applicable for assessments in a science classroom.

TABLE 4-1 Framework of Assessment Approaches and Methods: How Might We Assess Student Learning in the Classroom?

Selected-Response Format: multiple-choice; true-false; matching; enhanced multiple choice.

Constructed-Response Format:
- Brief Constructed Response: fill in the blank (word(s), phrase(s)); short answer (sentence(s), paragraphs); label a diagram; "show your work"; visual representation.
- Performance-Based Assessment (products): essay; research paper; story/play; poem; portfolio; art exhibit; science project; model; video/audiotape; spreadsheet; lab report.
- Performance-Based Assessment (performances): oral presentation; dance/movement; science lab demonstration; athletic skill performance; dramatic reading; enactment; debate; musical recital; keyboarding; teach-a-lesson.
- Process-Focused Assessment: oral questioning; observation ("kid watching"); interview; conference; process description; "think aloud"; learning log.

SOURCE: McTighe and Ferrara (1998).
GRADING AND COMMUNICATING ACHIEVEMENT

One common summative purpose of assessment facing most teachers is the need to communicate information on student progress and achievement to parents, school board officials, members of the community, and college admissions officers. In addition to scores from externally mandated tests, teacher-assigned grades traditionally serve this purpose. A discussion in Chapter 2 defends the use of descriptive, criterion-based feedback as opposed to numerical scoring (8/10) or grades (B). A study cited there (Butler, 1987) showed that the students who demonstrated the greatest improvement were those who received detailed comments, and only comments, on their returned work. However, grading and similar practices are the reality for the majority of teachers. How might grading be used to best support student learning? Though they are the primary currency of our current summative-assessment system, grades typically carry little meaning because they reduce a great deal of information to a single letter. Furthermore, there is often little agreement about the difference between an A and a B, a B and a C, or a D and an F, or about what is required for a particular letter grade (Loyd & Loyd, 1997). Grades may symbolize achievement, yet they often incorporate other factors as well, such as work habits, which may or may not be related to the level of achievement. They are often used to reward or motivate students to display certain behaviors (Loyd & Loyd, 1997). Without a clear understanding of the basis for the grade, a single letter often will provide little information on how work can be improved. As noted previously, grades will only be as meaningful as the underlying criteria and the quality of the assessment that produced them.
A single-letter grade or the score on an end-of-unit test does not make student progress explicit, nor does either provide students and teachers with information that might further their understandings or inform their learning. A “C” on a project or on a report card indicates that a student did not do exemplary work, but beyond that, there is plenty of room for interpretation and ambiguity. Did the student show thorough content understanding but fall short in presentation? Did the student not convey clear ideas? Or did the student not provide adequate explanation of why a particular phenomenon occurred? Without any information about these other dimensions, a single-letter grade does not provide specific guidance about how work can be improved.
Surrounded by ambiguity, a letter grade, without discussion and an understanding of what it constitutes, does little to provide useful information to the student or even to give an indication of the level of performance. Thus, when teachers establish criteria for individual assessments and make them explicit to students, they also need to do so for grading criteria. The criteria also should be clear to those who must interpret them, such as parents and future teachers, and should incorporate priorities and goals important to science as a school subject area. Careful documentation can allow formative assessments to be used for summative purposes. The manner in which summative assessments are reported helps determine whether they can be easily translated for formative purposes—especially by the student, teacher, and parents. In the vignette in Chapter 3, a middle school science teacher confers with students as they engage in an ongoing investigation. She keeps written notes of these exchanges as well as of the observations she makes of the students at work. When it is time for this teacher to assign student grades for the project, she can refer to these notes to provide concrete examples as evidence. Using ongoing assessments to inform summative evaluations is particularly important for inquiry-based work, which cannot be captured in most one-time tests. Many teachers give students the opportunity to make test corrections or provide other means for students to demonstrate that they understand material previously not mastered. Documenting these types of changes over time will show progress and can be used as evidence of understanding for summative purposes.
Teachers face the common challenge that assigning classroom grades and points can drive classroom activity to the detriment of other, often more informative and useful, types of assessment that foster standards-based goals. Grading practices can be modified, however, so that they adhere to acceptable standards for summative assessments and at the same time convey important information that can be used to improve work in a way that is relatively easy to read and understand. Mark Wilson and colleagues at the University of California, Berkeley, have devised one such plan in the assessment system designed for the SEPUP (Science Education for Public Understanding Program) middle school science curriculum (Wilson & Sloane, 1999; Roberts, Wilson, & Draney, 1997; Wilson & Draney, 1997). The SEPUP assessment system serves as an example of a possible alternative to the traditional single-letter grading scheme. As shown in Table 4-2, the SEPUP assessment blueprint indicates that no single assessment will capture all of the skills and content desired in any particular curricular unit. Teachers therefore do not need to be concerned about getting all the assessment information they need at a single time from any single assessment.
TABLE 4-2 SEPUP Assessment Blueprint, Teacher's Guide Part 1: Water Usage and Safety

Variables and elements assessed: Designing and Conducting Investigations (Designing Investigation; Selecting and Recording Procedures; Organizing Data; Analyzing and Interpreting Data) and Evidence and Tradeoffs (Using Evidence; Using Evidence to Make Tradeoffs).

1. Drinking-Water Quality
2. Exploring Sensory Thresholds
3. Concentration
4. Mapping Death
5. John Snow (A: Using Evidence, p. 52)
6. Contaminated Water (√: Designing Investigation, p. 61)
7. Chlorination (A: All elements, p. 66)
8. Chicken Little, Chicken Big
9. Lethal Toxicity (√: Organizing Data, p. 94)
10. Risk Comparison (√: Analyzing and Interpreting Data, p. 109)
11. Injection Problem (√: Both elements, p. 120)
12. Peru Story (A: Organizing Data and Analyzing and Interpreting Data, p. 130; A: Both elements, p. 132)

SOURCE: Science Education for Public Understanding Program (1995).
TABLE 4-2 (continued) Sections A and B

Variables and elements assessed: Understanding Concepts (Recognizing Relevant Content; Applying Relevant Content), Communicating Scientific Information (Organization; Technical Aspects), and Group Interaction (Time Management; Role Performance/Participation; Shared Opportunity). Activities 1, 6, 7, and 11 have no entries for these variables.

2. √: Both elements (p. 16); Measurement and Scale★
3. √: Applying Relevant Content (p. 28); Measurement and Scale★
4. √: Time Management; Shared Opportunity (p. 38)
5. A: Both elements (p. 52)
8. √: Shared Opportunity (p. 76)
9. A: Applying Relevant Content (p. 97); Measurement and Scale★
10. √: Applying Relevant Content (p. 111); Measurement and Scale★
12. A: Both elements (p. 132)

★Indicates content concepts assessed.

SOURCE: Science Education for Public Understanding Program (1995).
By using the same scale for the entire unit, the SEPUP assessment system allows teachers to obtain evidence about students' progress. Without the context or criteria that the SEPUP scoring guide (Table 4-3) provides, a score of "2" on an assessment could be interpreted as inadequate, even if the scale is 0-4. However, as the scoring guide indicates, in this example a "2" represents a worthwhile step on the road to earning a score of "4". In practice, the specific areas that need additional attention are conveyed in the scoring guide; thus a student could receive a "2" as feedback and know what they need to do to improve the piece of work. The scoring guide also can provide the basis for summative assessments at any given point.

TABLE 4-3 SEPUP Scoring Guide

Scoring Guide: Evidence and Tradeoffs (ET) Variable

Using Evidence: Response uses objective reason(s) based on relevant evidence to argue for or against a choice.
Using Evidence to Make Tradeoffs: Response recognizes multiple perspectives of an issue and explains each perspective using objective reasons, supported by evidence, in order to make a choice.

Score 4
- Using Evidence: Accomplishes Level 3 AND goes beyond in some significant way, e.g., questioning or justifying the source, validity, and/or quantity of the evidence.
- Using Evidence to Make Tradeoffs: Accomplishes Level 3 AND goes beyond in some significant way, e.g., suggesting additional evidence beyond the activity that would influence choices in specific ways, OR questioning the source, validity, and/or quantity of the evidence and explaining how it influences choice.

Score 3
- Using Evidence: Provides major objective reasons AND supports each with relevant and accurate evidence.
- Using Evidence to Make Tradeoffs: Uses relevant and accurate evidence to weigh the advantages and disadvantages of multiple options, and makes a choice supported by the evidence.

Score 2
- Using Evidence: Provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
- Using Evidence to Make Tradeoffs: States at least two options AND provides some objective reasons using some relevant evidence, BUT reasons or choices are incomplete and/or part of the evidence is missing; OR only one complete and accurate perspective has been provided.

Score 1
- Using Evidence: Provides only subjective reasons (opinions) for choice; uses unsupported statements; OR uses inaccurate or irrelevant evidence from the activity.
- Using Evidence to Make Tradeoffs: States at least one perspective BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.

Score 0
- Using Evidence: Missing, illegible, or offers no reasons AND no evidence to support the choice made.
- Using Evidence to Make Tradeoffs: Missing, illegible, or completely lacks reasons and evidence.

Score X: Student had no opportunity to respond.

SOURCE: Science Education for Public Understanding Program (1995).
The SEPUP assessment system provides one such example, but teachers can employ other forms of assessment that capture progress as well as achievement at a specific point in time. Keyed to standards and goals, such systems can be rich in meaning for teachers and students and still convey information to different levels of the system in a relatively straightforward manner that is readily understood. Teachers can use the standards or goals to help guide their own classroom assessments and observations and also to help them support work or learning in a particular area where sufficient achievement has not yet been reached. Devising a criterion-based scale to record progress and make summative judgments poses difficulties of its own. Determining the level of specificity involved in subdividing a domain, so that the separate elements together represent the whole, is a crucial and demanding task (Wiliam, 1996). This is an issue for both performance assessments and ongoing assessment data, and the criteria need to be articulated before students engage in the activities (Quellmalz, 1991; Gipps, 1994). Specific guidelines for the construction and selection of test items are not offered in this document. Test design and selection are certainly important aspects of a teacher's assessment responsibility and can be informed by the guidelines and discussions presented in this document (see also Chapter 3). Item-writing recommendations and other test specifications are the topics of a substantial body of existing literature (for practitioner-relevant discussions, see Airasian, 1991; Cangelosi, 1990; Cunningham, 1997; Doran, Chan, and Tamir, 1998; Gallagher, 1998; Gronlund, 1998; Stiggins, 2001).
Appropriate design, selection, interpretation, and use of tests and assessment data were emphasized in the joint effort of the American Federation of Teachers (AFT), the National Council on Measurement in Education (NCME), and the National Education Association (NEA) to specify the pedagogical skills necessary for effective assessment (AFT, NCME, & NEA, 1990).

VALIDITY AND RELIABILITY IN SUMMATIVE ASSESSMENTS

Regardless of what form a summative assessment takes or when it occurs, teachers need to keep in mind validity and reliability, two important technical elements of both classroom-level assessments and external or large-scale assessments (AERA, APA, & NCME, 1999). These concepts also are discussed in Chapter 3. Validity and reliability are judged using different criteria, although the two are related. Validity has different
dimensions, including content (does the assessment measure the intended content area?), construct (does the assessment measure the intended construct or ability?), and instructional (was the material on the assessment taught?). It is important to consider the uses of assessment and the appropriateness of the resulting inferences and actions as well (Messick, 1989). Reliability has to do with generalizing across tasks (is this a generalizable measure of student performance?) and can involve variability in performance across tasks and between settings, as well as the consistency of scoring or grading. What these terms mean operationally varies slightly between the kinds of assessments that occur each day in the classroom and externally designed exams. For example, ongoing classroom assessment that relies on immediate feedback provides different opportunities for follow-up than a typical testing situation, where follow-up questioning for clarification, or to ensure proper interpretation on the part of the respondent, usually is not possible (Wiliam & Black, 1996). The dynamic nature of day-to-day teaching affords teachers opportunities to make numerous assessments, take relevant action, and amend decisions and evaluations if necessary and with time. Wiliam and Black (1996) write that in "the fluid action of the classroom, where rapid feedback is important, optimum validity depends upon the self-correcting nature of the consequent action" (pp. 539-540). With a single test score, especially from a test administered at the end of the school year, a teacher does not have the opportunity to follow a response with another question, either to determine whether the previous question had been misinterpreted or to probe misunderstandings for diagnostic reasons.
With a standardized test, where on-the-spot interpretation of a student's response by the teacher and follow-up action are impossible, the context in which responses are developed is ignored. Measures of validity are decontextualized, depending almost entirely on the collection and nature of the actual test items. More important, all users of assessment data (teachers, administrators, and policy makers) need to be aware of what claims they make about a student's understanding, and of the consequent actions, based on any one assessment. Relying on a variety of assessments, in both form and what is being assessed, will go a long way toward ensuring validity. Much of what is called for in the Standards, such as inquiry, cannot be assessed with the multiple-choice, short-answer, or even two-hour performance assessments that are currently employed. Reliability, though more straightforward, may be more difficult to ensure than validity. On external tests, even when scorers
are carefully calibrated (or scoring is done by machine), variations in a student's performance from day to day, or from question to question, pose threats to reliability. Viable systems that command the same confidence as the current summative system, but that are free of many of its inherent conflicts and contradictions, are necessary if decisions are to be psychometrically sound. The confidence that any assessment can command will depend, in large part, on both reliability and validity (Baron, 1991; Black, 1997). As Box 4-1 indicates, there are some basic questions to be asked of both teacher-made and published assessments. Teachers need to consider the technical aspects of the summative assessments they use in the classroom. They also should look for evidence that disproves earlier judgments and make necessary accommodations. Likewise, they should be looking for further assessment data that could help them to support their students' learning.

BOX 4-1 Applying Validity and Reliability Concerns to Classroom Teaching

- What am I interested in measuring? Does this assessment capture that?
- Have the students experienced this material as part of their curriculum?
- What can I say about a student's understandings based on the information generated from the assessment? Are those claims legitimate?
- Are the consequences and actions that result from this performance justifiable?
- Am I making assumptions or inferences about other knowledge, skills, or abilities that this assessment did not directly assess?
- Are there aspects of this assessment not relevant to what I am interested in assessing that may be influencing performance?
- Have I graded consistently?
- What could be unintended consequences associated with this assessment?

LARGE-SCALE, EXTERNAL ASSESSMENT—THE CURRENT SYSTEM AND NEED FOR REFORM

Large-scale assessments at the district, state, and national levels are conducted for different purposes: to formulate policy, to monitor the effects of policies and enforce them, to make comparisons, to monitor progress toward goals, to evaluate programs, and for accountability (NRC, 1996). As a key element in the success of education-improvement systems, accountability has become one of the most important issues in educational policy today (NRC, 1999b). Accountability is a means by which policy makers at the state and district levels—and parents and taxpayers—monitor the performance of students and schools. Most states use external assessments for accountability purposes (Bernauer & Cress, 1997). These
standardized, externally designed tests are either norm-referenced tests (NRTs), criterion-referenced tests (CRTs), or some combination of the two. A "standardized" test is one that is carried out in the same way for all individuals tested, scored in the same way, and interpreted in the same way (Gipps, 1994). NRTs are developed by test publishers to measure student performance against a norm. Results from these tests describe what students can do relative to other students and are used for comparing groups of students. The norm is a rank, the 50th percentile. For national tests, the norm is constructed by testing students all over the country. (It also is the score that test-makers call "at grade level" [Bracey, 1998].) On a norm-referenced test, half of all students in the norm sample will score at or above the 50th percentile, or above grade level, and half will score below the 50th percentile, or below grade level. These tests compare students to other students rather than measuring student mastery of content standards or curricular objectives (Burger, 1998). Increasingly, states and districts are moving toward criterion-referenced tests, usually developed by state departments of education and districts, which compare student performance to a set of established criteria (for example, district, state, or national standards) rather than to the performance of other students. CRTs allow all students who have acquired the relevant skills and knowledge to receive high scores (Burger, 1998). A well-designed and appropriately used standardized test can generate data that inform different parts of the system and assess a range of understandings and skills. Currently, however, standardized tests generally concentrate on the knowledge most amenable to scoring in multiple-choice and short-answer formats.
These formats most easily capture factual knowledge (Shavelson & Ruiz-Primo, 1999) and are the least expensive in terms of the resources necessary for test development, administration, and scoring (Hardy, 1995). Although many of the current standardized tests are intended to assess student achievement, too often they are used instead to stimulate competition among students, teachers, or schools, or to make other judgments that are not justified by student scores on such tests. The lack of coherence among the different levels of assessment within the system often leaves teachers, schools, and districts torn between mandated external testing policies and practices and their responsibility to use assessment in the service of learning. These large-scale tests, which often command greater esteem than classroom assessments, create a tension between formative and summative assessment and a challenge for exemplary classroom
practice (Black, 1997; Frederiksen, 1984; Smith & Rottenberg, 1991). Teachers are left facing serious dilemmas.

BUILDING AN EXTERNAL STANDARDS-BASED SUMMATIVE ASSESSMENT SYSTEM

The foundations for a standards-based summative assessment system are assessments that are systemically valid: aligned to the recommendations of the national standards, grounded in the educational system, and congruent with the educational goals for students. Alignment of assessment to curriculum and standards ensures that the assessments match the learning goals embodied in the standards and enables students, parents, teachers, and the public to determine student progress toward the standards (NRC, 1999b). Assessment and accountability systems cannot be isolated from their purpose: to improve the quality of instruction and ultimately the learning of students (NRC, 1999b). They also must be well understood by the interested parties and based on standards acceptable to all (Stecher & Herman, 1997). An effective system will provide students with the opportunity to demonstrate their understanding and skills in a variety of ways and formats. The form the assessment takes must follow its purpose. Multiple-choice tests are easy to grade and can quickly assess some forms of science-content knowledge. Other areas may be better tapped through open-ended questions or performance-based assessments, in which students demonstrate their abilities and understandings, such as through an actual hands-on investigation (Shavelson & Ruiz-Primo, 1999). Assessing inquiry skills may require extended investigations, which can be documented through portfolios of the work as it unfolds.
When considering, implementing, and evaluating large-scale assessment systems, educators need to be cautious, deliberate, and aware of the strong influence of high-stakes external tests on classroom practice, particularly on what is emphasized in instruction and how it is assessed (Frederiksen, 1984; Gifford & O'Connor, 1992; Goodlad, 1984; Popham, 1992; Resnick & Resnick, 1991; Rothman, 1995; Shepard, 1995; Smith et al., 1992; Wolf et al., 1991). No assessment form is immune from negative influences. Messick (1994) concludes:

It is not just that some aspects of multiple-choice testing may have adverse consequences for teaching and learning, but that some aspects of all testing, even performance testing, may have adverse as well as beneficial educational consequences. And if both positive and negative aspects, whether intended or unintended, are not meaningfully addressed in the validation process, then the concept of validity loses its force as a social value. (p. 22)
Even well-designed assessments will need to be augmented by other assessments. Most criterion-referenced tests are multiple-choice or short-answer tests. Although they may align closely to a standards-based system, other assessment components, such as performance measures, in which students demonstrate their understanding by doing something educationally desirable, also are necessary to measure standards-based outcomes. A long-term inquiry that constitutes a genuine scientific investigation, for example, cannot be captured in a single test or even in a performance assessment allotted a single class period.

LEARNING FROM CURRENT REFORM

Beyond a Single Test

Several states and districts are making strides in expanding external assessment beyond traditional notions of testing to include more teacher involvement and to better align classroom and external summative assessments, so as to better support teaching and learning. The state of Vermont (VT) was one pioneer. The state sought to develop an assessment system that served accountability purposes as well as generated data that would inform instruction and improve individual achievement (Mills, 1996). The system had three components: Students and teachers gathered work for portfolios, teachers submitted a “best piece” sample for each student, and students took a standardized test. Scoring rubrics and exemplars were used by groups of teachers around the state to score the portfolios and student work samples. Despite the different pieces in place (which also included professional development), the VT experiment met with mixed results and is still evolving. The scoring of the portfolios and student work samples lacked adequate reliability (in the technical sense) to be used for accountability purposes (Koretz, Stecher, Klein, & McCaffrey, 1994).
Many teachers saw a positive impact on student learning, due in part to the focus and feedback on specific pieces of student work that teachers provided during the collection and preparation process (Asp, 1998), but they also acknowledged the additional time needed for portfolio preparation (Koretz, Stecher, Klein, McCaffrey, & Deibert, 1993). Kentucky (KY) is another state that made changes to its system and faced similar challenges. The portfolio and performance-based assessment system in that state also did not achieve consistently reliable scores (Hambleton et al., 1995). Both states demonstrate that consistency across scores for samples of work requires training and time. Research on performance assessments in large-scale systems shows that variability in student performance across tasks also can be significant (Baron, 1991).
Involving Teachers

Teachers who are privy to student discussions and able to make ongoing observations are in the best position to assess many of the educational goals, including areas such as inquiry. Therefore, teachers need to become more involved in summative assessments for purposes beyond reporting on student progress and achievement to others in the system. Practices within the United States and in other countries suggest ways to better tap teachers' summative assessments to augment or complement external exams. In Queensland, Australia, for example, the state moved away from its statewide examination and placed the certification of students in the hands of teachers (Butler, 1995). Teachers meet in regional groups to exchange results and assessment methods with colleagues. They justify their assessments and deliberate with colleagues from other schools to help ensure that the different schools are holding their students to comparable standards and levels of achievement. Additional examples of the role of teacher judgment in external assessment in other countries are discussed in the next chapter. Accountability efforts that exclude teachers from assessing their students' work are often justified on the grounds that teachers could undermine reliability by injecting undue subjectivity and personal bias. This argument has some support based on the results of efforts in VT and KY. However, when teachers engage in deliberation and discussion, as in Queensland (a procedure called moderation), steps are taken that mitigate the possible loss of reliability. To help ensure consistency among different teachers in moderation sessions, teachers exchange samples of student work and discuss their respective assessments of the work.
These deliberations, in which the standards for judging quality work are discussed, have proved effective in developing consistency in scoring among teachers. Moderation also serves as an effective form of professional development, because teachers sharpen their perspectives about the quality of student work that might be expected, as is illustrated in the next chapter. In the United States, teacher-scoring committees for Advanced Placement exams follow this model. Moderation is expensive, however, and not always practical. There are other ways to maintain reliability and involve teachers in summative assessments that serve accountability and reporting purposes. In Connecticut, the science portion of the state assessment system has teachers select from a list of tasks and use them in conjunction with their own curriculum and contexts. The state provides the teachers with exemplars and criteria, and the teachers are responsible for scoring
their own student work. Teachers can use the criteria in other areas of their curriculum throughout the year. Douglas County Schools in Colorado rely heavily on teacher judgments for accountability purposes (Asp, 1998). Teachers collect a variety of evidence of student progress towards district standards, guided by teacher-developed materials that include samples of work, evaluation criteria, and possible assessment tasks. The county uses these judgments to communicate to parents and to district-level monitors and decision makers. Such examples, together with research, can inform large-scale assessment models so that systems produce useful data that serve the necessary purposes while not creating obstacles for quality teaching and learning. Policy and decision makers must look to and learn from reforms underway. After examining large-scale testing practices, Asp (1998) offers keys to building compatibility between classroom and large-scale summative assessment systems. His recommendations include the following: make large-scale assessment more accessible to classroom teachers; embed large-scale assessment in the instructional program of the classroom in a meaningful way; and use multiple measures at several levels within the system to assess individual student achievement (pp. 41-42). When data on individual achievement are not the desired aim (as is often the case when accountability concerns focus on an aggregate level, such as the school, district, or region), sampling procedures that test fewer students or test less frequently are options. The assessment systems and features discussed above are not flawless, yet there is much to learn from the experiences of these reforms. Current strategies and systems need to be modified without compromising the goal of a more aligned system.
Changes of any kind will require support from the system and resources for designing and evaluating options, informing and training teachers and administrators, and educating the public.

KEY POINTS

Tensions between formative and summative assessment do exist, but there are ways in which these tensions can be reduced. Productive steps include relying on a variety of assessment forms and measures and considering the purposes for an assessment and the form the assessment and its reporting subsequently take.

Test results should be used appropriately, not to make judgments that are not justified by student scores on such tests.
A testing program should include criterion-referenced exams and reflect the quality and depth of curriculum advocated by the standards.

For accountability purposes, external testing should not be designed in ways that are detrimental to learning, such as by limiting curricular and teaching activities.

A teacher's position in the classroom provides opportunities to gain information useful for both formative and summative assessments. These teacher assessments need to be developed and tapped so that the information only teachers possess can augment even the best-designed paper-and-pencil or performance-based test.

System-level changes are needed to reduce tensions between formative and summative assessments.