In this chapter, we turn to the question of how to design a full assessment system and consider the components that should be included to adequately evaluate students’ science achievement. The assessment system we envision builds on discussion in the previous chapters of the report.
Chapter 2 explores the assessment challenges associated with evaluating students’ proficiency on the performance expectations of the Next Generation Science Standards (NGSS) and emphasizes that because of the breadth and depth of those expectations, students will need multiple opportunities to demonstrate their proficiencies using a variety of assessment formats and strategies. The chapter also discusses the types of tasks that are best suited to assessing students’ application of scientific and engineering practices in the context of disciplinary core ideas and crosscutting concepts, as well as simultaneously assessing the connections across concepts and disciplines. The committee concludes that tasks composed of multiple interrelated questions would best serve this purpose. Chapter 3 describes approaches to developing these types of tasks so that they provide evidence to support the desired inferences. Chapters 4 and 5 present examples and discuss strategies for developing assessments for use in the classroom and for monitoring purposes, respectively.
We propose that an assessment system should be composed both of assessments designed to support classroom teaching and learning (Chapter 4) and those designed for monitoring purposes (Chapter 5). In addition, the system should include a series of indicators to monitor that students are provided with adequate opportunity to learn science in the ways laid out in A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas (National Research Council, 2012a, Chapter 11, hereafter referred to as “the framework”) and the Next Generation Science Standards: For States, By States (NGSS Lead States, 2013, Appendix D). Such a system might take various forms and would include a range of assessment tools that have each been designed and validated to serve specific purposes and to minimize unintended negative consequences. Our intention is not to prescribe a single design for such a system, but to offer guidance for ensuring that any given system design supports the vision for science learning and the student proficiency envisioned in the framework and the NGSS.
We begin with the rationale for a systems approach to assessment, describing how an assessment system influences student learning and curriculum and instruction directly and indirectly and discussing the influence that accountability goals can have on the design of an assessment system. In the last section we describe a set of components and the characteristics that an effective assessment system should have and recommend strategies for developing such a system.
As discussed throughout this report, the purposes for which information about student learning is needed should govern the design and use of assessments. These purposes may include
- guiding and informing teachers’ day-to-day instructional decisions;
- providing feedback to students, as well as their parents and teachers, on students’ academic progress;
- illustrating sound instructional and assessment activities that are consistent with the framework and the NGSS;
- monitoring the science achievement of students across schools, districts, states, and/or the nation to inform resource allocations, identify exemplary practices, and guide educational policy;
- contributing to the valid evaluation of teachers, principals, and schools;
- determining whether students meet the requirements for a high school diploma; and
- evaluating the effectiveness of specific programs (e.g., new science curricula and professional development to support the transition to the NGSS).
Implicit in each assessment purpose are one or more mechanisms through which the assessment is intended to have some beneficial effect. That is, assessments are a means to an end, not an end in themselves. For example, an assessment that periodically informs students and their parents about student progress might be intended to stimulate students’ motivation so that they study more, provide feedback students can use to focus their studies on their weak areas, and engage parents in student learning so they can provide appropriate supports when students are not making the expected level of progress. Similarly, providing teachers with quick-turnaround feedback on student learning might be intended to help them pace instruction in an optimal way, highlight individual learning difficulties so they can provide individualized remediation, and guide ongoing refinement of curriculum and instructional practices. Assessments that provide overall information about student learning might be used to evaluate the quality of instruction at the school, district, or state level in order to determine where to focus policy interventions. Assessments used for accountability purposes may be designed to hold teachers or schools and their principals accountable for ensuring that students achieve the specified level of progress.
Some of these action mechanisms are direct, in that the information provided by the test scores is used to inform decisions, such as to guide instruction or to make decisions about student placement. Other mechanisms are indirect, in that the testing is intended to influence the behavior of students, teachers, or educational administrators by providing them with incentives to improve test performance (hence, achievement). Assessments can provide teachers and administrators with examples of effective assessment practices that can be incorporated into instruction. Systems that involve teachers in the assessment design and scoring process provide them with an opportunity to learn about the ways students learn certain concepts or practices and about the principles and practices of valid assessment. Similarly, students who must pass an examination to receive a high school diploma may work harder to learn the content to be tested than they would without that requirement. Elementary grade teachers might invest more time and effort in science teaching if science test results were among the factors considered in accountability policies.1 Other action mechanisms are even more indirect. For example, the content and format of testing send signals to textbook writers, teachers, and students about what it is important to learn (Haertel, 2013; Ho, 2013). Test questions that are made public and media reports of student results may help educate both educational professionals and the broader public about science learning and its importance.
1We acknowledge that both of these uses are controversial.
Clearly, no single assessment could possibly serve the broad array of purposes listed above. Different assessment purposes require different kinds of assessment data, at different levels of detail, and produced with different frequency. Teachers and students, for example, need fine-grained, ongoing information unique to their classroom contexts to inform immediate instructional decision making; policy makers need more generalized data both on student learning outcomes and on students’ opportunities to learn.
The arguments for the value of an assessment system have been made before (e.g., National Research Council, 2001). A systems approach to science assessment was advocated and described in considerable detail in Systems for State Science Assessment (National Research Council, 2005) and is reinforced in the new framework (National Research Council, 2012a). More recently, a systems approach was recommended in connection with the implementation of the Common Core State Standards in Criteria for High-Quality Assessments (Darling-Hammond et al., 2013). These reports all call for a balanced, integrated, and coherent system in which varied assessment strategies, each intended to answer different kinds of questions and provide different degrees of specificity, produce results that complement one another. In particular, the framework makes clear that an effective system of science assessment will include both assessments that are grounded in the classroom and assessments that provide information about the effectiveness of instruction and the overall progress of students’ science learning.
The challenges of covering the breadth and the depth of the NGSS performance expectations amplify the need for a systems approach. The selection and design of system components should consider the constructs and purpose(s) each measure is to serve and the ways in which the various measures and components will operate to support the improvement of student learning. There are many ways to design an effective assessment system, but all should begin with careful consideration of the ways the assessment data will be used, the types of information needed to support those uses (in the form of a menu of different types of reports), and the ways the various components of the system will work together.
It is important to point out that no assessment system operates in a vacuum. As argued in previous reports (National Research Council, 2001, 2005; Darling-Hammond et al., 2013), an assessment system should be designed to be coherent with instruction and curriculum.2 The committee believes that curriculum design decisions should precede assessment design decisions. That is, decisions about which practices, crosscutting concepts, and core ideas will be taught need to be made before one can determine what will be assessed and how it will be assessed.
The NGSS present an extensive set of performance expectations at every grade level. As we note in Chapter 2, it is unrealistic to suppose that each possible combination of the three dimensions will be addressed. Thus, in designing curricula, difficult decisions will have to be made and priorities set for what content to teach and assess.
In the United States, curricular decisions are made differently in each state: in some states, these decisions are made at the state level; in others, they are made at the district level or school level. Although the NGSS imply certain approaches toward curriculum design, education policy makers in different jurisdictions will make different decisions about what is the optimal curriculum for addressing the framework. Different curricula will likely reflect different priorities and different decisions about what to include.
These state differences pose a challenge for external assessments3 when the assessment purpose is to compare performance across different jurisdictions, such as across states that have adopted different curricula or across schools and districts in states with local control over curricula. When external assessments are used to make comparisons, they will need to be designed to be valid, reliable, and fair despite the fact that students have been exposed to different curricula and different combinations of scientific practices, crosscutting concepts, and disciplinary core ideas. Students who have been exposed to any curriculum that is intended to be aligned with the framework and the NGSS should be able to show what they know and can do on assessments intended to support comparative judgments.
Devising assessments that can produce comparable scores that reflect complex learning outcomes for students who have studied different curricula is always a challenge. Test content needs to be neither too unfamiliar nor too familiar if it is to measure the intended achievement constructs. The challenge is to limit and balance the ways in which curriculum exposure may bias the results of an assessment that is to be used to make comparisons across student groups. These challenges in assessment design are not unique to science assessment. Test developers in the United States have long had to deal with the challenge of developing external assessments that are fair, reliable, and valid for students who have studied different curricula. However, covering the full breadth and depth of the NGSS performance expectations is an additional challenge and will require a careful and methodical approach to assessment design.
2“Curriculum” refers to the particular material through which students learn about scientific practices, crosscutting concepts, and core ideas.
3We use the term to mean assessments developed outside of the classroom, such as by the state or the district. External assessments are generally used for monitoring purposes (see Chapter 5).
The science assessments developed to measure proficiency on the NGSS performance expectations will likely be used for accountability purposes, so it is important to consider how accountability policies might affect the ways in which the assessments operate within the system. The incentives that come with accountability can serve to support or undermine the goals of improving student learning (National Research Council, 2011b; Koretz, 2008). It is likely that whoever is held accountable in a school system will make achieving higher test scores a major goal of science teaching.
In practice, accountability policies often result in “teaching to the test,” so that testing tends to drive curriculum and instruction, even though the avowed intention may be for curriculum and instruction to drive testing (Koretz, 2005, 2008). The result of accountability testing, too often, has been a narrowing of the curriculum to match the content and format of what is to be tested, which has led to coverage of superficial knowledge at the expense of understanding and inquiry practices that are not assessed (Dee et al., 2013). Schools and classrooms serving students with the greatest educational needs are often those presented with the most ambitious challenges for improvement and thus also face the greatest pressure to “teach to the test.” Thus, it is extremely important that the tests used for accountability purposes measure the learning that is most valuable.
As we have discussed in Chapters 4 and 5, the three-dimensional learning described in the framework and the NGSS cannot be well assessed without some use of the more extended engagements that are really only possible in a classroom environment. We emphasize that the assessments used for monitoring purposes will need to include both on-demand and classroom-embedded assessment components (see Chapter 5).4 Thus, if accountability policies are part of the science education system, it will be important that they incorporate results from a variety of types of assessments. When external, on-demand assessments dominate in an assessment system and are the sole basis for accountability, curriculum and instruction are most likely to become narrowed to reflect only the material and testing formats that are represented on those assessments (Koretz, 2005, 2008).
4These two types of assessments were discussed in Chapter 5. We use them to mean the following. On-demand assessments are external assessments mandated by the state (such as the statewide large-scale assessments currently in place); they are developed and/or selected by the state and given at a time determined by the state. Classroom-embedded assessments are external assessments developed and/or selected by the state or the district; they are given at a time determined by the district or school. See Chapter 5 for additional details about our use of these terms.
There is very limited evidence that accountability policies to date, which focus largely—if not solely—on external (large-scale) assessments, have led to improved student achievement (National Research Council, 2011b). In contrast, numerous studies document the positive effects on learning from the use of classroom assessment to guide teaching and learning (Black and Wiliam, 1998; Kingston and Nash, 2011; National Research Council, 2007). Assessment that closely aligns with a curriculum that engages students in three-dimensional science learning will return the focus to what is most important—the direct support of students’ learning.
A key consideration in developing an assessment system is the design of reports of assessment results. The reporting of assessment results is frequently taken for granted, but consideration of this step is critical. Information about students’ progress is needed at all levels of the system. Parents, teachers, school and district administrators, policy makers, the public, and students need clear, accessible, and timely information. In a systems approach, many different kinds of information need to be available, but not all audiences need the same information. Thus, questions about how various kinds of results will be combined and reported to different audiences, and how reporting can support sound, valid interpretations of results, need to be considered early in the design of an assessment system.
Reporting of assessment results can take many forms—from graphical displays to descriptive text and from a series of numbers to detailed analysis of what the numbers mean. Depending on the needs of different audiences, results can be presented in terms of individual standards (or performance expectations) or in terms of clusters of standards. Results can describe the extent to which students have met established criteria for performance, and samples of student work can be provided.
The types of assessments we advocate will generate new kinds of information. If the information is not presented in a way that is accessible and easy to use for those who need it, it will not serve its intended purpose. For example, if a series of complex performance tasks results in a single reported score, users will not receive the information the assessment was designed to produce. Thus, it is important that the reporting of assessment results be designed to meet the needs of the intended audiences and the decisions they face, and to address all of the specifications that guided the design and development of the assessment. For example, to be useful to teachers, assessment results should address instructional needs. Assessment reports should be linked to the primary goals of the framework and the NGSS so that users can readily see how the specific results support intended inferences about important goals for student learning. It is also important that the reports provide clear guidance about the degree of uncertainty associated with the reported results.
The topic of developing reports of assessment results has been explored by a number of researchers: see, for example, Deng and Yoo (2009); Goodman and Hambleton (2004); Hambleton and Slater (1997); Jaeger (1996); National Research Council (2006); Wainer (2003).
The committee concludes that a science assessment system should include three components: (1) assessments designed for use in the classroom as part of day-to-day instruction, (2) assessments designed for monitoring purposes that include both on-demand and classroom-embedded components, and (3) a set of indicators designed to monitor the quality of instruction to ensure that students have the opportunity to learn science as envisioned in the framework. The first two components are only briefly considered below since they are the focus of extended discussion in Chapters 4 and 5. We emphasize below the third component—a set of indicators of opportunity to learn.
The approach to science assessment that we envision is different from those that are now commonly used (although it is an extension and coordination of aspects of many current assessment systems). For instance, classroom-generated assessment information has not typically been used for monitoring science learning in the United States. Adopting an assessment system that includes a classroom-embedded component will require a change in the culture of assessment, particularly in the level of responsibility entrusted to teachers to plan, implement, and score assessments. In Chapter 5, we discuss ways to enhance the comparability of assessment information gathered at the local level by using moderation strategies5 and methods for conducting audits to ensure that the information is of high quality. In addition, it will be important to routinely collect information to document the quality of classroom instruction in science, to monitor that students have had the opportunity to learn science in the way called for in the new framework, and to ensure that schools have the resources needed to support that learning. Documentation of the quality of classroom instruction is one indicator of opportunity to learn (see below).
The changes in science education envisioned in the framework and the NGSS begin in the classroom. Instruction that reflects the goals of the framework and the NGSS will need to focus on developing students’ skills and dispositions to use scientific and engineering practices to progress in their learning and to solve problems. Students will need to engage in activities that require the use of multiple scientific practices in developing a particular core idea and will need to experience the same practices in the context of multiple core ideas. The practices have to be used in concert with one another, for example, supporting an explanation with an argument or using mathematics to analyze data from an investigation.
Approaches to classroom assessment are discussed in detail in Chapter 4. Here, we emphasize their importance in an assessment system. As noted above, assessment systems have traditionally focused on large-scale external assessments, often to the exclusion of the role of classroom assessments. Achieving the goals of the framework and the NGSS will require an approach in which classroom assessment receives precedence. This change means focusing resources on the development and validation of high-quality materials to use as part of classroom teaching, learning, and assessment, complemented with a focus on developing the capacity of teachers to integrate assessments into instruction and to interpret the results to guide their teaching decisions.
5Moderation is a set of processes designed to ensure that assessments are administered and scored in comparable ways. The aim of moderation is to ensure comparability; that is, that students who take the same subject in different schools or with different teachers and who attain the same standards will be recognized as being at the same level of achievement.
In Chapter 4, we highlight examples of the types of classroom assessments that should be part of a system, and we emphasize that it is possible to develop assessment tasks that measure three-dimensional learning as envisioned in the framework and the NGSS. It is worth noting, however, that each example is the product of multiple cycles of development and testing to refine the tasks, the scoring systems, and their interpretation and use by teachers. Thus, the development of high-quality classroom assessment that can be used for formative and summative purposes should be treated as a necessary and significant resource investment in classroom instructional supports, curriculum materials, and professional development for teachers.
In Chapter 5, we discuss assessments that are used to monitor or audit learning and note that it is not feasible to cover the full breadth and depth of the NGSS performance expectations for a given grade level with a single external (large-scale) assessment. The types of assessment tasks that are needed take time to administer, and several will be required in order to adequately sample the set of performance expectations for a given grade level. In addition, some practices, such as demonstrating proficiency in carrying out an investigation, will be difficult to assess in the conventional formats used for on-demand external assessments. Thus, states need to rely on a combination of two types of external assessment strategies for monitoring purposes: on-demand assessments (those developed outside the classroom and administered at a time mandated by the state) and classroom-embedded assessments (those developed outside the classroom and administered at a time determined by the district or school that fits the instructional sequence in the classroom).
A primary challenge in designing any monitoring assessment is in determining how to represent the domain to be assessed, given that (1) it will be difficult to cover all of the performance expectations for a given grade level without some type of sampling and (2) the monitoring assessments will be given to students who will have studied different curricula. There are various options: each has certain strengths but also some potential drawbacks.
One option is to sample the standards but not reveal which performance expectations will be covered by the assessment. This option encourages teachers to cover all of the material for a given grade, but it could lead to a focus on covering the full breadth of the material at the expense of depth. Another option is to make teachers and students aware of which subset of the performance expectations will be assessed in a particular time frame. Although this option encourages teachers to cover some performance expectations in depth, it also gives teachers an incentive to ignore areas that are not to be assessed. A third option is to make the sample choices public and to rotate the choices over time. This option helps to ensure that certain performance expectations are not consistently ignored, but it creates churn in instructional planning and also complicates possibilities for making comparisons across time.
It would also be possible to offer schools constrained choices from the full range of performance expectations, perhaps through attempts to prioritize the performance expectations. For example, schools might be encouraged to cover at least some particular number of disciplinary core ideas from given domains or offered a menu of sets of core ideas (perhaps with associated curriculum supports) from which to choose. Giving schools a constrained set of choices could allow for more flexibility, autonomy, and perhaps creativity. Providing them with a menu could also make it easier to ensure coherence across grade levels and to provide curriculum materials aimed at helping students meet key performance expectations.
Each option brings different advantages and disadvantages. Selecting the best option for a given state, district, or school context will depend on at least two other decisions. The first is how to distribute the standards to be tested between the classroom-embedded and on-demand components of the monitoring assessment: that is, which performance expectations will be covered in each component. The second is the extent to which the state, districts, or local schools control which performance expectations to cover. There is no strong a priori basis on which to recommend one option over the others, and thus states will need to use other decision criteria. We suggest two key questions that could guide a choice among possible strategies for representing the standards: Will the monitoring assessment be used at the school, district, or state level? Which components of the monitoring assessment system (classroom embedded and on demand) will have choices associated with them?
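To make the sampling and rotation options above concrete, the sketch below shows one way a rotating, publicly announced sample of performance expectations might be scheduled. It is purely illustrative: the performance-expectation labels, the block structure, and the rotation rule are hypothetical choices made for this sketch, not features of the NGSS or a recommendation of this report.

```python
# Illustrative only: schedule a rotating public sample of performance
# expectations (PEs) so that no PE is consistently left unassessed.
# PE labels and all parameters are hypothetical.

def rotation_schedule(pes, n_blocks, blocks_per_year, n_years):
    """Split the PE list into interleaved blocks and rotate which blocks
    are announced and assessed each year."""
    blocks = [pes[i::n_blocks] for i in range(n_blocks)]
    schedule = {}
    for year in range(n_years):
        # Advance the starting block each year so coverage rotates.
        start = (year * blocks_per_year) % n_blocks
        chosen = [(start + j) % n_blocks for j in range(blocks_per_year)]
        schedule[year] = sorted(pe for b in chosen for pe in blocks[b])
    return schedule

# Example: 12 hypothetical PEs split into 4 blocks, 2 blocks assessed per
# year; every PE is then assessed at least once in any 2-year window.
pes = [f"PE-{i:02d}" for i in range(12)]
schedule = rotation_schedule(pes, n_blocks=4, blocks_per_year=2, n_years=4)
```

A real schedule would also need to respect the decision, discussed above, about which expectations belong to the classroom-embedded component and which to the on-demand component.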
The work of identifying indicators of progress toward major goals for education in science, technology, engineering, and mathematics (STEM) is already underway and is described in a recent report, Monitoring Progress Toward Successful K-12 STEM Education (National Research Council, 2012b). The report describes a proposed set of indicators for K-12 STEM education that includes the goal of monitoring the extent to which state science assessments measure core concepts and practices and are in line with the new framework. The report includes a number of indicators that we think are key elements of a science accountability system: program inspections, student and teacher surveys, monitoring of teachers’ professional development, and documentation of classroom assignments and students’ work. These indicators would document such variables as the time allocated to science teaching, the adoption of instructional materials that reflect the NGSS and the framework’s goals, and classroom coverage of the content and practices outlined in these documents. Such indicators would be a critical tool for monitoring the equity of students’ opportunities to learn.
A program of inspection of science classrooms could serve an auditing function, with a subset of schools sampled for an annual visit. The sample of schools could be randomly chosen, following a sampling design that accurately represents state-level science program characteristics. Schools with low scores on monitoring tests (or with low test scores relative to the performance expected based on other measures, such as achievement in other subject areas, socioeconomic status, etc.) would be more heavily sampled. Inspection would include documentation of resources (e.g., science space, textbooks, budgets for expendable materials), teacher qualifications, and time devoted to science instruction, including opportunities to engage in scientific and engineering practices. Peer review by highly qualified teachers (e.g., teachers with subject certification from the National Board for Professional Teaching Standards), who have had extensive training in the appropriate knowledge and skills for conducting such reviews, could be a component of an inspection program. These inspections would have to be designed not to be used in a punitive way, but to provide findings that could be used to guide schools’ plans for improvement and support decisions about funding and resources.6 We note that if such a program of inspection is implemented, forethought must be given to how recommendations for improvement can be supported.
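The oversampling of low-scoring schools described above amounts to a weighted random draw. The fragment below is a minimal sketch of such a draw, under assumptions of ours: the school names, the scores, and the particular inverse-score weighting rule are hypothetical illustrations, not a proposed state policy.

```python
import random

def draw_inspection_sample(school_scores, n_visits, seed=0):
    """Draw an annual inspection sample in which schools with lower scores
    on the monitoring assessment receive higher selection weight.
    `school_scores` maps school name -> score (higher = better)."""
    rng = random.Random(seed)
    names = list(school_scores)
    top = max(school_scores.values())
    # Weight inversely to score; the +1 keeps every school eligible, so the
    # draw remains a probability sample of all schools, not only low scorers.
    weights = [top - school_scores[name] + 1 for name in names]
    sample = set()
    while len(sample) < min(n_visits, len(names)):
        sample.add(rng.choices(names, weights=weights, k=1)[0])
    return sample
```

In practice a state would likely also stratify the draw (e.g., by region or school size) so that, as noted above, the sample accurately represents state-level science program characteristics.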
6Accreditation systems in the United States and other countries already use many of these strategies. For information about an organization that operates such a system in the United States and elsewhere, AdvancED, see http://www.advanc-ed.org/ [September 2013].
Surveys of students and teachers could provide additional information about classrooms, as well as other variables such as students’ level of engagement or teachers’ content knowledge. The results of surveys used at selected grade levels together with data collected through a large-scale system component could also provide valuable background information and other data, and such surveys could be conducted online. Student surveys would have to be individually anonymous: they would not include names but would be linked to schools. Student surveys could also be linked to teachers or to student demographic characteristics (e.g., race and ethnicity, language background, gender). If parallel versions of some questions are included on teacher and student questionnaires, then those responses could be compared. Questions could probe such issues as the amount of time spent on science instruction; opportunities for constructing explanations, argumentation, discussion, reasoning, model building, and formulation of alternative explanations; and levels of students’ engagement and interest. Surveys for teachers could include questions about time spent in professional development or other professional learning opportunities.
The time provided for teacher collaboration and quality professional development designed to improve science teaching practices could also be monitored. Monitoring strategies could include teacher surveys completed at professional development events focused on science or school reporting of time and resources dedicated to supporting teachers’ learning related to science.
Documentation of curriculum assignments or students’ work might include portfolios of assignments and student work that could also provide information about the opportunity to learn (and might also be scored to provide direct information about student science achievement). The collected work could be rated for purposes of monitoring and improvement. Alternatively, the work could be used to provide an incentive for teachers to carefully consider aspects of the NGSS and the three-dimensional learning described in the framework (see Mitchell et al., 2004; Newmann et al., 1998; Newmann and Associates, 1996). Such a system of evaluation of the quality and demand of student assignments was used in Chicago and clearly showed that levels of achievement were closely tied to the intellectual demands of the work assigned to students (Newmann et al., 1998).
As stated, a comprehensive science assessment system will include some measures that are closely linked to instruction and used primarily in classrooms for both formative and summative purposes (see Chapter 4). It will also include some measures designed to address specific monitoring purposes (see Chapter 5), including some that may be used as part of accountability policies. We recognize that adopting this approach would be a substantial change from what is currently done in most states and would require some careful consideration of how to assemble the components of an assessment system so that they provide useful and usable information for the wide variety of assessment purposes.
External on-demand assessments are more familiar to most people than other types of assessments, and moving from reliance on a single test to a comprehensive science assessment system that meets the NGSS goals is a substantial change. It will require policy makers to reconsider the role that assessment plays in the system: specifically, they will need to consider which purposes require on-demand assessments given to all students in the state and which do not. We note that, for many purposes, there is no need to give the same test to all students in the state: matrix sampling, as discussed in Chapter 5, is a legitimate, viable, and often preferable option. For other purposes, assessments that are more closely connected to classrooms and a specific curriculum are likely to be better choices than on-demand assessments.
Several connected sets of questions can guide thinking about the components of an assessment system:
- What is the purpose of the system, and how will it serve to improve student learning?
  - For what purposes are assessment components needed?
  - How will the assessments and the use of their results help to improve student learning?
- What results will be communicated to the various audiences?
  - How will the results be used, by whom, and what decisions will be based on them?
  - How will the results from different components relate to each other?
- What role will accountability play in the system?
  - Who will be held accountable for what?
  - How will accountability policies serve to improve student learning?
- Given the intended use of each of the assessment components in the system, at what levels (i.e., individual or group) will scores be needed?
  - Will information be needed about individuals or about groups, such as students taught by particular teachers or attending particular schools?
  - Do all students in the state need to take the same assessment component, or can sampling of students and/or content be used?
- What level of standardization of the different components is needed to support the intended uses?
  - Do these uses require that certain assessment components be designed, administered, and scored by the state so that they are standardized across all school systems in the state?
  - Can school systems be given some choice about the exact nature of the assessment components, such as when they are given and how they will be scored?
- What procedures will be used to monitor the quality of instruction and assessment in the system to ensure that students have access to high-quality instruction and the necessary resources?
The answers to these interrelated questions will help policy makers design an assessment system that meets their priorities.
In the following two sections we present a rough sketch of two alternative models for an assessment system.7 As context for considering these alternative assessment models, it is useful to note the ways in which they differ from the current system used in most states, the type of system that most students in this country experience. Currently, in most states, a single external (large-scale) assessment—designed or selected by the state—is given for monitoring purposes once in each grade span in elementary, middle, and high school. The assessment is composed predominantly of questions that assess factual recall. The assessment is given to all students and used to produce individual scores. Scores are aggregated to produce results at the group level. Classroom assessment receives relatively little attention in the current system, although this may vary considerably across schools depending on the resources available.
This is only a general sketch of the typical science assessment system in this country, but it is not the type of system that we are recommending. In our judgment, this “default” system serves the purpose of producing numbers (test scores) that can be used to track science achievement on a limited range of content, but it cannot be used to assess learning in alignment with the vision of science learning in the framework or the NGSS.
7. These examples draw upon a presentation by Kathleen Scalise at the Invitational Research Symposium on Science Assessment sponsored by the Educational Testing Service, available at http://www.k12center.org/events/research-meetings/science.html [November 2013].

As discussed above, the design of an assessment system should be based on a carefully devised plan that considers the purpose of each of the system components and how they will serve to improve student learning. The design should consider the types of evidence that are needed to achieve the intended purposes and support the intended inference, and the types of assessment tasks needed to provide this evidence. In conceptualizing the system, we consider four critical aspects:
- The system should include components designed to provide guidance for classroom teaching and learning.
- It should include components designed for monitoring program effectiveness.
- It should have multiple and convergent forms of evidence for use in holding schools accountable for meeting learning goals.
- The various components should signify and exemplify important goals for student learning.
In the default system sketched above, results from large-scale standardized tests are used both for monitoring student learning and for program evaluation. The questions on those tests signal the kinds of tasks students are expected to be able to answer, and those tasks are not aligned with the science learning envisioned in the framework and the NGSS. Test scores provide little information to guide instructional decision making. The examples in the next two sections provide only a rough sketch of two alternative systems, not all of the details that would need to be developed and worked out prior to implementation, but their differences from the current default model are readily apparent.
In Chapter 5, we describe two approaches to on-demand assessments (mixed-item formats with written responses and mixed-item formats with performance tasks) and three approaches to classroom-embedded assessment that could be used for monitoring purposes (replacement units, collections of performance tasks, and portfolios of work samples and work projects). In the system examples below, we explore ways to make use of these options in designing the monitoring assessment component of a system.
We assume that the assessment system would incorporate the advice offered in Systems for State Science Assessment (National Research Council, 2006) for designing a coherent system. That is, the system should be horizontally, vertically, and developmentally coherent. Horizontally, the curriculum, instruction, and assessment are aligned with the standards, target the same goals for learning, and work together to support students’ developing science literacy (National Research Council, 2006, p. 5). Vertically, all levels of the education system—classroom, school, school district, and state—are based on a shared vision of the goals for science education, the purposes and uses of assessment, and what constitutes competent performance. Developmentally, the system takes account of how students’ science understanding develops over time and the scientific content knowledge, abilities, and understanding that are needed for learning to progress at each stage of the process. (For further details about developing a comprehensive, coherent science assessment system, see National Research Council, 2006.)
We also assume that states and local education agencies would adopt NGSS-aligned curricula that incorporate the vision of science education conceptualized in the framework and would ensure that the system includes high-quality instructional materials and resources (including classroom assessments), that they would design suitable means of reporting the results of the assessments to appropriate audiences, and that teachers and administrators would receive comprehensive professional development so that they are well prepared for full implementation of a new system. Furthermore, we assume that available resources and professional development support the use of formative assessment as a regular part of instruction, relying on methods such as those described in Chapter 4. These features should be part of all science assessment systems. In the descriptions below, we focus on strategies for making use of the types of classroom and monitoring assessment strategies discussed in Chapters 4 and 5 of this report.
Example 1

In this model, the monitoring assessment would be given once in each grade span (elementary, middle, and high school, e.g., grades 4, 8, and 10) and would consist of two components. The first would be one of the on-demand assessment options we suggest in Chapter 5: a test that makes use of mixed-item formats, including some constructed-response tasks, such as those currently used for the New England Common Assessment Program and the New York state assessments, or those used in the past for the Maryland School Performance Assessment Program (see Chapter 5). The second component would include several classroom-embedded assessments incorporated into replacement units (see Chapter 5).
For this model, the on-demand component would be administered in a way that makes use of both the fixed-form and matrix-sampling administration approaches. All students at a tested grade would take a common test form that uses selected-response and constructed-response questions (including some technology-enhanced questions, if feasible). Every student would also complete one of several performance assessment tasks, administered through a matrix-sampling design. The common, fixed-form test would yield score reports for individual students; the matrix-sampled portion would provide school-level scores.
Both parts of the monitoring assessment would be developed by the state. The state would determine when the on-demand assessment is given, but the district (or other local education agency) would make decisions about when the classroom-embedded assessment components would be scheduled and could select from among a set of options for the topics. Both parts of the monitoring assessment would be scored at the state level, although the state might decide to use teachers as scorers.
Although the assessments in the classroom-embedded component could be administered in a standardized way, one complication of this design is that it would be difficult to keep the assessments secure since they would be administered at different times of the school year. Thus, they would need to be designed in such a way that prior exposure to the assessment tasks would not interfere with measuring the intended constructs (performance expectations). In addition, further work would be needed on the best ways to combine results from the classroom-embedded component and the on-demand component.
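The report leaves open how results from the two components would be combined. As a purely illustrative sketch (the weights, the standardization scheme, and the function names here are assumptions, not anything the report prescribes), one simple approach is to standardize each component's scores and form a weighted composite:

```python
# Hypothetical sketch: combining an on-demand score and a classroom-embedded
# score into one monitoring composite. The 60/40 weighting and the z-score
# standardization are illustrative assumptions only; the report notes that
# how best to combine such results is an open research question.
from statistics import mean, pstdev

def standardize(scores):
    """Convert raw scores to z-scores so components on different
    scales can be combined on a common metric."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]

def composite(on_demand, embedded, w_on_demand=0.6):
    """Weighted composite of the standardized component scores."""
    z_on_demand = standardize(on_demand)
    z_embedded = standardize(embedded)
    w_embedded = 1.0 - w_on_demand
    return [w_on_demand * a + w_embedded * b
            for a, b in zip(z_on_demand, z_embedded)]

# Four students' raw scores on each component (different scales).
scores = composite([52, 61, 70, 85], [3.0, 2.5, 3.5, 4.0])
```

Because each set of z-scores sums to zero, the composite is centered on zero by construction; real systems would need far more careful weighting, scaling, and reliability analysis than this sketch implies.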
Another decision would involve which performance expectations should be covered in the on-demand component and which ones would be covered in the classroom-embedded component. For example, the on-demand component could use currently available standardized tests for the disciplinary core ideas, adding in a set of complex tasks that also address a sampling of the scientific and engineering practices and crosscutting concepts. The classroom-embedded component could then assess a broader sample of the scientific and engineering practices and crosscutting concepts in the context of certain disciplinary core ideas.
In addition to the tasks used for the monitoring assessment, the state (or possibly a collaboration of states) would develop collections of tasks that could be used in the classroom to support formative and summative assessment purposes. The tasks would be designed to align with the NGSS performance expectations and would be available in the classroom for a variety of purposes, such as to enliven instruction or to track progress (of course, the same tasks should not be used simultaneously for both). Teachers would be trained to score these tasks, and the tasks would serve as models for teachers as they develop their own assessments for classroom and instructional purposes.
Accountability policies would be designed to include indicators of opportunity to learn as discussed above, such as evidence that teachers have access to professional development, quality curricular materials, and administrative supports, that they are implementing instruction and assessment in ways that align with the framework, and that all students have access to appropriate materials and resources.
Thus, in this example system, the classroom assessment component includes banks of tasks associated with specific performance expectations that demonstrate the learning goals for students and that are available for use in the classroom for instructional decision making. The monitoring component includes classroom-embedded and on-demand elements that allow for judgments about students’ learning and for evaluation of program effectiveness. Results from the monitoring assessments, as well as indicators of opportunity to learn, would be used for holding districts and schools accountable for progress in meeting learning goals. The consistency of the information from the different parts of the assessment system would be used to monitor the system for variation in science learning outcomes across districts and schools.
Example 2

For this example, the on-demand component would consist of the mixed-item types option described in Chapter 5, which makes use of some selected-response questions and some short-answer and extended constructed-response questions (such as the question formats on the Advanced Placement biology test discussed in Chapter 5 or some of the formats included in the taxonomy in Figure 5-6 in Chapter 5). The on-demand component would be administered as a fixed-form test that produces scores for individuals. Instead of replacement units, the classroom-embedded component would involve portfolios assembled to include examples of work in response to tasks specified by the state. The state would be in charge of scoring the assessments, including the portfolios, although it would be best if teachers were involved in the scoring.
This example shares some of the same complications as Example 1. Decisions will be needed as to which performance expectations will be covered in the on-demand assessment and which ones would be covered in the portfolios. It would also be difficult to maintain the security of the portfolio tasks if they are completed over the course of several weeks. In addition, assembling portfolios and evaluating the student work included in them is time and resource intensive. A research and development effort would be needed to investigate the best way to combine scores from the two types of assessments.
In addition to the monitoring assessment, portfolios could be used at each grade level to document students’ progress. States or districts might collaborate to determine appropriate portfolio assignments and scoring rubrics; alternatively, an item bank of tasks and scoring rubrics could be developed to support classroom assessment. Decisions about the exact materials to be included in the portfolios would be determined by the state, the district, or the school. The portfolios would be scored at the district level by teachers who had completed training procedures as prescribed by the state for the monitoring assessment. The portfolios could be used as part of the data for assigning student grades.
As in Example 1, above, accountability would rely on results from the monitoring assessments as well as indicators of opportunity to learn. Samples of portfolios would be sent to the state for review of the quality of the assignments given to the students and the feedback teachers give them, providing one measure of opportunity to learn that could be combined with others, such as evidence that teachers have access to professional development and quality curricular materials and administrative supports, that they are implementing instruction and assessment in ways that align with the framework, and that all students have access to appropriate materials and resources.
Thus, in this system, the descriptions of materials to be included in portfolios exemplify the learning goals for students and are available to use in the classroom for instructional decision making. The external assessment allows for monitoring students’ learning and evaluating program effectiveness. Results from the monitoring assessments as well as indicators of opportunity to learn would be used for holding schools accountable for meeting learning goals.
In this chapter, we have discussed the importance of a systems approach to developing science assessments and described the system components that will be needed to adequately assess the breadth and depth of the NGSS.
CONCLUSION 6-1 A coherently designed multilevel assessment system is necessary to assess science learning as envisioned in the framework and the Next Generation Science Standards and provide useful and usable information to multiple audiences. An assessment system intended to serve accountability purposes and also support learning will need to include multiple components: (1) assessments designed for use in the classroom as part of day-to-day instruction, (2) assessments designed for monitoring purposes that include both on-demand and classroom-embedded components, and (3) a set of indicators designed to monitor the quality of instruction to ensure that students have the opportunity to learn science as envisioned in the framework. The design of the system and its individual components will depend on multiple decisions, such as choice of content and practices to be assessed, locus of control over administration and scoring decisions, specification of local assessment requirements, and the level and types of auditing and monitoring. These components and choices can lead to the design of multiple types of assessment systems.
We also note that designing reports of assessment results that are clear, understandable, and useful for the intended purpose is an essential aspect of the system design.
CONCLUSION 6-2 Assessment reporting is a critical element of a coherent system. How and to whom results will be reported are questions that need to be considered during the first stages of designing an assessment system because those answers will guide almost all subsequent decisions about the design of each of the system’s assessment components and their relationship to each other.
Given the widespread concerns expressed above about adequate representation and coverage of the NGSS performance expectations, we make three recommendations related to the monitoring of student learning and the opportunity-to-learn functions that a state assessment system should be designed to support. Recommendations about the classroom assessment function are in Chapter 4; this function is one of the three pillars of any coherent state system even though it is not the primary focus of the recommendations in this chapter.
RECOMMENDATION 6-1 To adequately address the breadth and depth of the performance expectations contained in the Next Generation Science Standards, state and local policy makers should design their assessment systems so information used for monitoring purposes is obtained from both on-demand assessments developed by the state and a complementary set of classroom-embedded assessments developed either by the state or by districts, with state approval. To signify and make visible their importance, the monitoring assessment should include multiple performance-based tasks of three-dimensional science learning. When appropriate, computer-based technology should be used in monitoring assessments to broaden and deepen the range of performances demanded on the tasks in both the classroom-embedded and on-demand components.
The system design approach contained in Recommendation 6-1 will be necessary to fully cover the NGSS performance expectations for a given grade. Including a classroom-embedded component as part of the monitoring of student learning will demonstrate the importance of three-dimensional science learning and assessment to local educators while simultaneously providing them with examples and data to support ongoing improvements in instruction and learning.
RECOMMENDATION 6-2 States should routinely collect information to monitor the quality of classroom instruction in science, including the extent to which students have the opportunity to learn science in the ways called for in the framework, and the extent to which schools have the resources needed to support student learning. This information should be collected through inspections of school science programs, surveys of students and teachers, monitoring of teacher professional development programs, and documentation of curriculum assignments and student work.
For some monitoring purposes, individual student scores are not needed; only group-level scores are. Whenever individual-level scores are not needed, the use of matrix-sampling procedures should be considered. Matrix sampling provides an efficient way to cover the domain more completely and can make it possible to use a wider array of performance-based tasks, as well as equating techniques. In addition, hybrid models, which include some items or tasks common to all students and others that are distributed across students using matrix sampling, could also be used for monitoring functions (as described above for Example 1).
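The hybrid matrix-sampling idea can be made concrete with a small sketch. The item names, block sizes, and assignment rule below are hypothetical, chosen only to show the key property: every student takes a common core, no student sees the whole pool, yet the group as a whole covers it.

```python
# Illustrative sketch (not from the report): hybrid matrix sampling in which
# every student takes a common core form plus one of several rotated task
# blocks, so the full domain is covered at the group level.
import random

CORE_ITEMS = ["core-1", "core-2", "core-3"]   # common to all students
MATRIX_BLOCKS = [                              # distributed across students
    ["perf-A1", "perf-A2"],
    ["perf-B1", "perf-B2"],
    ["perf-C1", "perf-C2"],
]

def assign_forms(student_ids, seed=0):
    """Return {student_id: item list}; blocks rotate in a shuffled
    order so coverage is balanced across the tested group."""
    rng = random.Random(seed)
    order = list(range(len(MATRIX_BLOCKS)))
    rng.shuffle(order)
    forms = {}
    for i, sid in enumerate(student_ids):
        block = MATRIX_BLOCKS[order[i % len(order)]]
        forms[sid] = CORE_ITEMS + block
    return forms

forms = assign_forms([f"s{i}" for i in range(9)])
pool = set(CORE_ITEMS) | {it for b in MATRIX_BLOCKS for it in b}
# Every student answers the common core, no student sees the whole pool,
# and the group collectively covers every item.
assert all(set(CORE_ITEMS) <= set(items) for items in forms.values())
assert all(set(items) < pool for items in forms.values())
assert set().union(*forms.values()) == pool
```

The common core supports individual score reports, while the matrixed blocks support group-level domain coverage, which is the trade-off the hybrid design exploits.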
RECOMMENDATION 6-3 In planning the monitoring elements of their system, state and local policy makers should design the on-demand and classroom-embedded assessment components so that they incorporate the use of matrix-sampling designs whenever appropriate (rather than requiring that every student take every item), especially for systems monitoring purposes. Variation in matrix-sampling designs—such as some that include a few questions or tasks common to all students and others that are distributed across students—should be considered for optimizing the monitoring process.
We caution against systems that place a primary focus on the monitoring assessment; rather, we encourage policy makers to take a balanced approach in allocating resources to each component of an assessment system. To ensure that resources for developing assessments are not devoted entirely to the monitoring component, we encourage policy makers to carefully consider the frequency with which the monitoring assessment is administered.
RECOMMENDATION 6-4 State and local policy makers should design the monitoring assessments in their systems so that they are administered at least once, but no more than twice, in each grade span (K-5, 6-8, 9-12), rather than in every grade every year.
Designing the links among the components of an assessment system, particularly between the on-demand components and the classroom-embedded assessment information, will be a key challenge in the development of an assessment system. Such links will be especially important if the information is to be used for accountability purposes. As noted throughout this report, if significant consequences are attached only to the on-demand assessments, instructional activities are likely to be focused on preparation for those assessments (teaching to the test). The kinds of learning objectives that can only be assessed using classroom-embedded assessments, such as student-designed investigations, are too important to exclude from the purview of the assessment monitoring and accountability system. Since the kinds of linkages that are needed have not yet been implemented in the United States, education decision makers face a challenge in trying to meet the goals of the Next Generation Science Standards.
RECOMMENDATION 6-5 Policy makers and funding agencies should support research on strategies for effectively using and integrating information from on-demand and classroom-embedded assessments for purposes of monitoring and accountability.