Chapter 1: Introduction
Assessment has long been viewed as a tool for effecting change in classrooms, and consequently as a factor that can influence the learning of mathematics (Resnick & Nolan, 1995; Romberg, Zarinnia, & Williams, 1990). Since the release of the Curriculum and Evaluation Standards for School Mathematics (National Council of Teachers of Mathematics [NCTM], 1989), many individuals and organizations have been working to develop assessments that represent the kind of mathematics and support the kind of mathematical teaching and learning envisioned by those standards. Now, after 10 years, some of that work has come to fruition with the availability of some examples of large-scale standards-based assessments. In both purpose and design, these assessments are very different from traditional norm-referenced tests. Much has been learned about standards-based assessment during the development process.
This booklet is intended to highlight a few lessons that have relevance and implications for the classroom. It is directed toward an audience of supervisors of mathematics teachers, mathematics teachers, designers of mathematics tasks and assessments, and administrators. Particular issues arise with assessments designed to support "reformist" approaches to the teaching and learning of mathematics. The implications of these issues for task development and assessment design, classroom practice, and assessment policy will be discussed here. Task designers should be able to draw upon the ideas presented here in creating balanced and equitable assessments for all students. Mathematics teachers and their supervisors will be able to use these lessons in designing and administering classroom assessments. The materials will help administrators provide teachers with the support necessary for enhancing the teaching and learning of mathematics. This chapter begins by defining standards-based assessment and other key terms, and then compares standards-based assessment with traditional norm-referenced tests. The remainder of the chapter describes the organization of the booklet.
In this booklet, experience and research from two assessment development projects, Balanced Assessment and New Standards, will be drawn upon to offer guidance about the development and implementation of assessments. Throughout, four particularly important ideas will be addressed: balance, opportunity to perform, opportunity to learn, and alignment. The intent is not to describe all of the details associated with large-scale assessment development but rather to discuss how these four basic ideas can be used to inform the development and implementation of innovative assessment instruments.
Standards-based assessments
The assessments we are concerned with here are those that are necessary in a standards-based system of education. In such a system, assessments need to be designed to assess whether students meet publicly negotiated and agreed-upon standards. The standards to which an assessment is referenced might be those of a state, a district, or the National Council of Teachers of Mathematics (i.e., NCTM, 1989). It does not necessarily matter what or whose standards are chosen, but in our view what is important is that the selected standards promote a broad and balanced approach to learning mathematics, where conceptual understanding and mathematical skills are both emphasized.
Standards will be more likely to have a positive effect on mathematics learning if assessment, curriculum, and instruction are aligned with them (Webb, 1997). Assessments that are aligned to standards are referred to as standards-based assessments. When curriculum, instruction, and assessment are aligned to the same set of standards, those standards lay out what a student should know and be able to do. Curriculum and instruction provide opportunities for students to learn the mathematics that the standards ask them to learn and to acquire the know-how for them to show what they have learned. Assessments provide opportunities for students to perform and allow inferences to be made about what students know and can do in mathematics.
When instruction and assessment are aligned to standards, the role of the teacher is central. What goes on in the classroom is crucial to the enhancement of learning mathematics (Black & Wiliam, 1998). The teacher is a critical interpreter in this effort. She must interpret what the standards are asking students to do and translate these expectations into learning opportunities for students. Through classroom assessment, she must interpret how students are learning what they are supposed to learn. Then, in the light of this assessment, she must give students feedback and create subsequent learning opportunities for the whole range of students under her care. An important aspect of the teacher's role is to enable all students, not just the most academically talented, to learn the mathematics that is specified.
Unlike traditional tests, standards-based assessments will be tightly rather than loosely linked to the curriculum. One example of assessments that are tightly linked to curriculum is the system of Advanced Placement (AP) Examinations designed by the College Board. For these exams, a syllabus or course guide specifies what students are to learn, and then the end-of-course examinations assess how well they have learned that material. Most other countries have exams that are based directly on syllabi that often comprise two years of study. These high-stakes tests include very specific and often demanding questions that require a variety of types of responses. The U.S., in contrast, depends predominantly on multiple-choice tests.
In a standards-based system, the student is actively involved in the learning process. The student can strengthen opportunities to learn by working to understand mathematics rather than being concerned only with completing the work and getting a grade. Students need to learn that standards-based assessments are designed so students may succeed through hard work. It is then the student's job to learn how to do that hard work.
Differences between norm-referenced and standards-based assessments
Purpose.
Traditional norm-referenced tests are intended to sort students by rank or to compare the performance of one cohort of students against that of another cohort. Thus, on a norm-referenced test, half the students are expected to have scores that are below average. Standards-based assessments, on the other hand, are intended to assess whether students can do what a specified set of standards asks them to do. On a standards-based assessment, then, it is possible for all students to meet the standard.
Norm-referenced tests report how a student or group of students compare to the "norm." In contrast, standards-based assessments are criterion-referenced, in the sense that test designers set an absolute level of performance and report whether a student has met that level. The standards, then, are the criteria against which student performance is measured.
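The contrast between the two kinds of report can be made concrete with a small sketch. The scores, the cut score, and the function names below are hypothetical and chosen only for illustration; an actual assessment program would set its reference population and performance level through its own procedures.

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Norm-referenced report: percent of the reference group scoring below `score`."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)  # number of scores strictly below `score`
    return 100.0 * below / len(ordered)


def meets_standard(score, cut_score):
    """Criterion-referenced report: has the student reached the fixed level?"""
    return score >= cut_score


# Hypothetical reference population and cut score (assumptions, not real data).
reference_scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]
CUT_SCORE = 24

student_score = 25
norm_report = percentile_rank(student_score, reference_scores)
criterion_report = meets_standard(student_score, CUT_SCORE)
print(f"Scored above {norm_report:.0f}% of the reference group")
print(f"Meets the standard: {criterion_report}")
```

The first report depends on how other students performed, so it changes whenever the reference group changes. The second depends only on the fixed level of performance, which is why every student in a group can, in principle, meet the standard.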
Preparation.
One of the crucial differences between traditional norm-referenced tests and standards-based assessments is that students are expected to study for standards-based assessments, and students who work hard to meet clearly defined learning expectations should be able to do well. Traditional norm-referenced tests are descended from the earliest intelligence tests, for which subjects were never expected to study. The expectation that students should succeed on standards-based assessments through hard work has important implications for their design, and the Assessment Standards for School Mathematics (NCTM, 1995) goes a long way toward defining explicit standards for such assessments.
Range of skills.
Standards-based assessments need to be designed to assess a much broader range of mathematics than has traditionally been assessed. If assessment is to drive a renewed approach to teaching and learning, it needs to incorporate a broad and significant range of mathematics to provide all students with the opportunities to solve a variety of worthwhile problems, reason mathematically, understand concepts, develop technical skills, make connections among mathematical ideas, and communicate about mathematics (NCTM, 1995, p. 11).
Variety of tasks.
To assess a considerably broader range of mathematics in depth, a standards-based assessment needs to include short or elaborate constructed-response items, in which students must formulate and perhaps explain their answers. For this reason, standards-based assessments call for a shift away from reliance on multiple-choice and other selected-response items, for which the student chooses the correct answer from a list. Unfortunately, the term open-ended is sometimes used to describe constructed-response tasks. This can be confusing, because open-ended better describes one specific type of constructed-response task. Three types of constructed-response tasks are described as follows:

Closed: these are usually short, with one obvious solution path that leads to a single correct answer.

Open-middle: these have more than one solution path, although they all lead to a single correct answer.

Open-ended: these have multiple solution paths that lead to many different answers.
Closed and open-middle constructed-response tasks are useful in creating an examination that must be standardized across an entire state or district. Open-ended items are more useful for assessment that does not need to be entirely standardized across a whole cohort of students.
Variety of student products.
This greater variety of task types brings in its wake a greater variety of student products. In traditional tests, student products usually conform to a single type: the selection of a correct response from among a number of choices. In contrast, in broader and more balanced standards-based assessments, students are asked to construct the correct responses to some tasks and to select the correct response to others.
Standards-based assessment calls for a greater variety of student products in order to give students the opportunity to show the full range of what they know and can do. In making this change, an assessor is embracing the challenge not only of assessing those aspects of mathematics that are easy to assess, but also of finding ways to assess those aspects of mathematics that are essential to assess.
Balancing the instrument.
Besides these essential differences between task types and student products, there are important differences in the design and construction of standards-based assessments and traditional tests. Briefly, to construct a traditional norm-referenced test, a set of items is drawn from a large pool of multiple-choice items. Earlier field trials of this large pool allow a p-value to be assigned to each item. (An item's p-value is the proportion of students who select the correct response.) Items are then selected for a test such that the distribution of p-values approximates a normal distribution and so that the domain to be assessed is covered. In constructing standards-based assessments, such statistical analysis plays a lesser role. Instead, what is key is the selection of tasks that reflect the depth, range, and structure of mathematics represented in the student expectations (or standards) on which the assessment is based. Each task is developed to ensure that it has a reasonable score distribution across populations and that it does not unfairly disadvantage or advantage any one particular student group. Each task is then scored using a rubric (a scoring guide) referenced to an explicitly stated set of standards.
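As an illustration of the item p-value defined above, the following sketch computes the proportion of students who select the correct response to each item. The answer key, the field-trial responses, and the function name are hypothetical; the sketch shows only the arithmetic of the definition, not the procedure of any particular test publisher.

```python
def item_p_values(responses, answer_key):
    """Return, for each item, the proportion of students who chose the keyed answer.

    `responses` holds one list of selected options per student, in item order.
    """
    n_students = len(responses)
    p_values = []
    for item_index, correct_option in enumerate(answer_key):
        n_correct = sum(
            1 for student in responses if student[item_index] == correct_option
        )
        p_values.append(n_correct / n_students)
    return p_values


# Hypothetical field-trial data: four students, three multiple-choice items.
answer_key = ["B", "D", "A"]
responses = [
    ["B", "D", "C"],
    ["B", "A", "A"],
    ["B", "D", "B"],
    ["C", "D", "A"],
]

p_values = item_p_values(responses, answer_key)
for number, p in enumerate(p_values, start=1):
    print(f"Item {number}: p-value = {p:.2f}")
```

An item with a p-value near 1 was answered correctly by almost everyone in the field trial, and an item with a p-value near 0 by almost no one.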
Performance sampling.
These differences in design of standards-based assessments and traditional norm-referenced tests have implications for performance sampling, that is, choosing a reasonable set of items from among the many that assess part of the domain to be tested. Virtually all assessments entail a sampling of student performance, unless the goals are few and are narrowly specified. Norm-referenced tests purport to cover the essential ingredients by using as many selected-response items as possible. A 50-minute test can include as many as 40 items because students are given only a short time to select a response to each multiple-choice item. Standards-based assessments, in contrast, seek to sample the essential mathematics by carefully selecting a balanced set of tasks. Students need considerably longer than a few minutes to construct a response to a worthwhile complex task, and so a 50-minute assessment might be made up of only a few items. But a few complex tasks cannot sample the domain in the same way as a large number of multiple-choice items. A judicious way of handling the dilemma of performance sampling is to create an assessment that contains a combination of elaborate constructed-response tasks, selected-response items, and short constructed-response exercises. The specific combination of tasks that would then be placed on an examination would be selected to reflect the depth, range, structure, and balance of the standards to which the examination was referenced. The inclusion of a variety of task types can also allow assessment of aspects of mathematics, such as mathematical communication, that are difficult to assess with an exam that consists only of multiple-choice items.
Scoring as value judgments.
Another important difference between norm-referenced and standards-based assessments is the location and visibility of the value judgments necessarily associated with assessment. All assessment involves value judgments about what is important to assess and how important each aspect is. When a student's performance on an assessment task is evaluated, feedback can be given or a score can be assigned. Both of these involve value judgments, and in this sense, evaluation of student performances defines what counts as important. In standards-based assessments, these value judgments can be explicit and can be shared among teachers and students. In fact, making the necessary value judgments can become the responsibility of the professional peer group. In contrast, with norm-referenced tests, the necessary value judgments remain the responsibility of test designers because they are embedded in the design and choice of items and their distracters (the available incorrect responses), and are less visible to teachers and students.
Score reporting.
Norm-referenced and standards-based assessments also differ in the kinds of data produced. Norm-referenced data report, often through a percentile score, how a student or group of students compare to a reference population. Standards-based assessments, on the other hand, report data that are referenced to some agreed-upon standard. The sets of scores that are generated by the task rubrics are aggregated to provide a total mathematics score and perhaps some subscores (subsets of the total score). When the assessment is standards-based, an individual's aggregate scores can be given meaning in terms of the standards. For example, nonoverlapping ranges of aggregated scores can be designated as exceeds the standard, meets the standard, nearly meets the standard, or far below the standard. Such score classification is based on professional judgments, tasks, rubrics, and score distributions, and so is arrived at by a process that includes judgmental, standards-referenced, and normative elements.
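The aggregation and classification just described might be sketched as follows. The rubric scores and the score ranges are hypothetical; in practice the boundaries between performance levels would be set through the professional judgment process described above, not chosen by formula.

```python
# Hypothetical lower bounds for each nonoverlapping range of total scores,
# listed from lowest to highest (illustrative values, not real cut scores).
PERFORMANCE_LEVELS = [
    (0, "far below the standard"),
    (10, "nearly meets the standard"),
    (16, "meets the standard"),
    (22, "exceeds the standard"),
]


def total_score(rubric_scores):
    """Aggregate the rubric scores from individual tasks into a total score."""
    return sum(rubric_scores)


def classify(total):
    """Return the label of the highest range whose lower bound `total` reaches."""
    label = PERFORMANCE_LEVELS[0][1]
    for lower_bound, name in PERFORMANCE_LEVELS:
        if total >= lower_bound:
            label = name
    return label


# Hypothetical rubric scores for one student on five tasks.
rubric_scores = [4, 3, 5, 2, 4]
total = total_score(rubric_scores)
print(f"Total score {total}: {classify(total)}")
```

Because the ranges do not overlap, each total score maps to exactly one performance level, and the reported level can be read directly in terms of the standards.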
A hidden danger
If the function of standards-based assessments is to raise standards, care must be taken to ensure that they do not have a narrowing effect on curriculum and instruction (Schoenfeld, 1988). Indeed, the optimism surrounding the alignment of instruction and assessment is tainted by concern that when student learning expectations are defined by overly narrow assessments with consequences for teachers, students, or both, the impact on instruction and learning will not be uniformly positive (Romberg, Zarinnia, & Williams, 1990).
There is reason to worry that the pressure of consequences attached to high-stakes assessments will lead teachers and students to seek the most efficient way they can conceive of to reach their specified targets. Unfortunately, this often results in an enormous investment of classroom time in preparation for the test. For example, the New York Times recently carried this report:
The Bronx Division of High Schools asked Kaplan to train its teachers to help students tackle the state's new English Regents exam, which is being introduced in June and will become a condition for graduation next year. (Hartocollis, 1999, p. B1)
If the test is too narrow or omits important aspects of mathematical learning, there is little doubt that so much focus on test-taking strategies will create a learning environment antithetical to the wider educational goals that are envisaged in a standards-based system. If, however, teachers were asked about the most efficient strategies for preparing students for worthwhile and challenging assessments, it is quite likely they would stress that there is no alternative to learning the mathematics that is specified. It is this quality that standards-based assessments are intended to cultivate.
Organization of this booklet
Chapter 2 outlines a model for standards-based assessment. The model incorporates elements that we feel are essential if the assessment is to enhance instruction and learning. The main thrust of this chapter is that assessments must be balanced. To be balanced, assessments must assess those aspects of mathematics that are important, rather than confining themselves to those aspects that might be easy to assess. The sense of balance that informs our model follows from standards and principles advanced by the Mathematical Sciences Education Board (MSEB) and NCTM (NCTM, 1989, 1995, 1998; NRC, 1989, 1993b).
The model presented in Chapter 2 was developed with the benefit of many years of assessment development experience of a large number of professional designers and teachers. Other models have been developed and used, although many of the differences are only in the details. The process, the experience, and the research from which this model is derived are presented in Chapter 3.
Chapter 3 draws upon work with hundreds of teachers and thousands of students to construct a practical guide to assessment development. Here, the reader can come to understand the experience and research that have guided our process. This chapter analyzes and illustrates those aspects of task design that have been found to detract from providing students with real opportunities to perform. It also describes how the theory and practice of assessment development have reached new understandings of the interactions between students and assessment tasks. When seemingly good tasks fail to produce good results in field trials and the source of the failure is extraneous to the mathematics, the task may often be revised in ways that maintain the important mathematical ideas that the task was intended to assess. When students' performances on assessment tasks are scored, the results lead to inferences about what they do or do not know. Essentially, this chapter deals with issues that are relevant to the validity of these inferences.
In Chapter 4, the focus shifts away from the task on its drawing board and toward the task as seen in the social milieu of the classroom. Here, the main focus is on issues that influence opportunity to learn. The chapter illustrates just how difficult students find nonroutine assessment tasks when they do not work on them regularly in class. Students often make tasks more difficult when they try to force the use of the mathematics that they have learned most recently. And students' own evaluations of assessment tasks often indicate that they have very clearly defined notions about what counts as appropriate behavior for the mathematics classroom.
Chapter 4 also covers two central problems teachers confront:

how little mathematics many of their students seem to know, and,

how much content they, as teachers, are expected to cover.
These two pressures have led many teachers to adopt various coping strategies, even though such strategies are considered to be far from perfect. For example, their students' lack of mathematical knowledge has led some teachers to concentrate on basic skills rather than broaden the aspects of mathematics that they teach in their classrooms. The pressure to cover an exceedingly large amount of material also has led many teachers to make less than satisfactory choices about their curriculum. Paul Black and Dylan Wiliam's recent article Inside the Black Box: Raising Standards Through Classroom Assessment (1998) is used to identify the additional support that teachers must be given if they are to create opportunities for students to learn mathematics in a balanced way. Black and Wiliam urge those interested in raising standards to open up the black box of the classroom. They argue persuasively for the efficacy of formative assessment in raising standards. If formative assessment is to be effective, Black and Wiliam note, it must be integrated into rather than bolted on top of current instructional practices.
Finally, Chapter 5 addresses the issue of aligning instruction and assessment with standards. The presentation here is substantially informed by the recent and comprehensive monograph Criteria for Alignment of Expectations and Assessments in Mathematics and Science Education (Webb, 1997). Webb leaves little doubt that, for assessment to be effective, instruction must be aligned with standards.