Appendix A: Assessment
While many themes are woven into the proposed SERP research agenda, assessment of the outcomes of learning and instruction merits special attention. High-quality evidence that permits practitioners, researchers, and policy makers to ask and answer critical questions about the outcomes of learning and instruction—what students know and are able to do—is critical to advancing a SERP R&D agenda in any domain of instruction.
THREE BROAD PURPOSES
There are three broad purposes for educational assessment:
- Formative assessment for use in the classroom to assist learning. As described in the chapter, teachers need assessment data on their students to guide the instructional process.
- Summative assessment for use at the classroom, school, or district level to determine student attainment levels. These include tests at the end of a unit or a school year to determine what individual students have achieved.
- Assessment for program evaluation used in making comparisons across classrooms or schools. These assessments include standardized tests to determine the outcomes from different instructional programs or to compare performance across schools.
The first requirement for developing quality assessments is that the concepts and skills that signal progress toward mastery of a subject be understood and specified. In various areas of the curriculum, such as early reading, early mathematics, and high school physics, substantial work has already been done in this regard. In some cases researchers have capitalized on such knowledge to develop the elements of an assessment strategy, although that work has generally concentrated on the development of materials for formative assessment.1 In contrast, research and theory have not been used to develop similarly valid assessment tools for many other areas of mathematics, for reading comprehension, or for numerous aspects of elementary and middle school science. These design and development activities constitute part of the prospective R&D agendas we have outlined for each separate content domain.
Assessment of the overall outcomes of instruction—summative assessment—is important to the R&D agenda because it allows testing of program effectiveness. But it is important more broadly because the content of those assessments can drive instructional practice. The Force Concept Inventory in physics (see Chapter 4) illustrates the potential for a summative assessment tool based on cognitive and instructional research to have a powerful, positive impact on the redesign of instruction. It has served simultaneously as an evaluation tool to determine the effectiveness of a new instructional approach. In most instructional areas, however, little progress has been made in developing assessment tools that support instruction in this way.
Assessment of the impact of long-term programs of R&D, such as those that would be supported by SERP, is also important. For decision-making purposes, public policy makers need information to determine the return on investing in an enterprise such as SERP.2
1 This work includes assessment of components of early reading (see Chapter 2), development of the Number Knowledge Test and integration into the Number Worlds instructional program (see Chapter 3), and work on conceptual understanding in physics, which is incorporated into the Diagnoser software tool (see Chapter 4).
2 Although we focus in this report on learning outcomes, for public policy purposes data are also needed on the costs of achieving those outcomes. The point of bringing together work on teaching, learning, organization, and policy in the SERP context is to ensure that knowledge is available in all these domains to support decision making.
The education system generally fails to distinguish the requirements of formative, summative, and program evaluation assessments. What is needed is not only greater sophistication in designing assessments to better serve specific purposes, but also coordination within and between the levels of assessment. A well-designed assessment system would allow for the bidirectional flow of information among the levels.
Current large-scale standardized tests used by most states to assess academic achievement fall short in important respects (National Research Council, 2001c). The models of learning and measurement underlying such tests are generally shallow, raising doubts about the quality of the evidence they can provide about student learning or the impact of instructional programs. If it is to support student learning and provide reliable measures of program effectiveness, SERP must undertake research on, and development of, informative and coordinated assessment systems. The SERP program affords a unique opportunity to pursue research and development on integrated assessment systems because it will involve projects and individuals who are concerned with the range of assessment purposes. Research and development initiatives appear in the agenda for all three subjects. This appendix discusses the common elements of those initiatives.
ELEMENTS OF AN ASSESSMENT R&D AGENDA
Regardless of their purpose, quality assessment instruments depend on the same three components: (1) theories and data about content-based cognition that indicate the knowledge and skills that should be tested, (2) tasks and observations that can provide information on whether the student has mastered the knowledge and skills of interest, and (3) qualitative and quantitative techniques for scoring responses that capture fairly the differences in knowledge and skill among the students being tested. Research and development related to each of the three components is needed in order for assessments to provide reliable indicators of student achievement. For example, researchers have developed sophisticated models of student cognition in various areas of the curriculum, but in many cases this has not been translated into sets of tasks that can be used for assessment purposes. Even in subject domains for which characteristics of expertise have been identified, the understanding of patterns of growth required for assessment purposes, which would enable one to identify landmarks on the way to competence, is often lacking.
To develop assessments that are fair—that are comparably valid across different groups of students—it is crucial that patterns of learning for different populations of students be studied. Much of the development of cognitive theories has been conducted with restricted groups of students (i.e., mostly middle-class whites). In many cases it is not clear whether current theories of developing knowledge and expertise apply equally well with diverse populations of students, including those who have been poorly served in the education system, underrepresented minority students, English-language learners, and students with disabilities. While there are typical learning pathways, often there is not a single pathway to competence. Furthermore, students will not necessarily respond in similar ways to assessment probes designed to diagnose knowledge and understanding. These kinds of natural variations among individuals need to be better understood through empirical study and incorporated into the cognitive models of learning that serve as a basis for assessment design.
Sophisticated models of learning must be paired with methods of eliciting responses from students that effectively reveal what they know, as well as tools for comparing and scoring those responses. Current measurement methods offer greater potential for drawing inferences about student competence than is often realized (National Research Council, 2001c). It is possible, for example, to characterize student achievement in terms of multiple aspects of proficiency rather than a single score; chart students’ progress over time, instead of simply measuring performance at a particular point in time; deal with multiple paths or alternative methods of valued performance; model, monitor, and improve judgments based on informed evaluations; and model performance not only at the level of students, but also at the levels of groups, classes, schools, and states.
Much remains to be done, however, to improve the use of assessment in practice. Iterative cycles of research and development will be required to capture critical dimensions of knowledge in assessment tools and protocols that can be used effectively by those who have limited psychometric expertise. Research must explore ways that teachers can be assisted in integrating new forms of assessment into their instructional practices and how they can best make use of the results from such assessments. It is particularly important that such work be done in close collaboration with practicing teachers who have varying backgrounds and levels of teaching experience.
This iterative work on new forms of assessment must explore their accessibility to teachers, their practicality for classroom use, and their efficiency in large-scale testing contexts. For policy purposes, it is particularly important to study how new forms of assessment affect student learning, teacher practice, and education decision making. Also to be studied are the ways in which school structures (e.g., class length, class size, and the opportunity for teachers to work together) affect the feasibility of implementing new types of assessments and the effectiveness of those assessments. A SERP network of field sites makes the pursuit of such an agenda possible.