
2
Improving Assessments—Questions and Possibilities

There is no shortage of ideas about how states might improve their assessment and accountability systems. The shared goal for any such change is to improve instruction and student learning, and there are numerous possible points of attack. But, as discussed in Chapter 1, many observers argue that it is important to consider the system as a whole in order to develop a coherent, integrated approach to assessment and accountability. Although any one change—higher quality assessments, better developed standards, or more focused curricula—might have benefits, it would not lead to the desired improvement on its own. Nevertheless, it is worth looking closely at possibilities for each of the two primary elements of an accountability system: standards and assessments.

STANDARDS FOR BETTER INSTRUCTION AND LEARNING: AN EXAMPLE

Much is expected of education standards as a critical system component because they define both broad goals and specific expectations for students. They are intended to guide classroom instruction, the development of curricula and supporting materials, assessments, and professional development. Yet evaluations of many sets of standards have found them wanting, and they have rarely had the effects that were hoped and expected for them (National Research Council, 2008).

Shawn Stevens used the example of science to describe the most important attributes of excellent standards. She began by enumerating the most prevalent criticisms of existing national, state, and local science standards: that they include too much material, do not establish clear priorities among the material included, and provide insufficient interpretation of how the ideas included should be applied. Efforts to reform science education have been driven by standards—and have yielded improvements in achievement—but existing standards do not generally provide a guide for the development of coherent curricula. They do not support students in developing an integrated understanding of key scientific ideas, she said, which has been identified as a key reason that U.S. students do not perform as well as many of their international peers (Schmidt, Wang, and McKnight, 2005). Stevens described a model she and two colleagues developed that is based on current understanding of science learning, as well as a proposed process for developing such standards and using them to develop assessments (Krajcik, Stevens, and Shin, 2009).

Stevens and her colleagues began with the recommendations in a National Research Council (2005) report on designing science assessment systems: standards need to describe performance expectations and proficiency levels in the context of a clear conceptual framework and be built on sound models of student learning. Standards should be clear, detailed, and complete; reasonable in scope; and both rigorous and scientifically accurate. She briefly summarized the findings from research on learning in science that are particularly relevant to the development of such standards.

Integrated understanding of science is built on a relatively small number of foundational ideas that are central across the scientific disciplines—referred to as “big ideas”—such as the principle that the natural world is composed of a number of interrelated systems (see also National Research Council, 1996, 2005; Stevens, Sutherland, and Krajcik, 2009). These kinds of ideas allow both scientists and students to explain many sorts of observations and to identify connections among facts, concepts, models, and principles (National Research Council, 2005; Smith et al., 2006). Understanding of the big ideas helps learners develop detailed conceptual frameworks that, in turn, make it possible to undertake scientific tasks, such as solving problems, making predictions, observing patterns, and organizing and structuring new information. Learning complex ideas takes time and often happens as students work on tasks that force them to synthesize new observations with what they already knew. Students draw on a foundation of existing understanding and experiences as they gradually assemble bodies of factual knowledge and organize it according to their growing conceptual understanding.

The most important implication of these findings for standards is that they must be elaborated so that educators can connect them with instruction, instructional materials, and assessments, Stevens explained. That is, not only should a standard describe the subject matter it is critical for students to know, it should also describe how students should be able to use and apply that knowledge. For example, standards should not only describe the big ideas in declarative sentences, but also elaborate on the underlying concepts that are critical to developing a solid understanding of each big idea. Moreover, these concepts and ideas should be revisited throughout K-12 schooling so that knowledge and reasoning become progressively more refined and elaborated. Standards need to reflect these stages of learning. And because prior knowledge is so important to developing understanding, it is important that standards are specific about the knowledge students will need at particular stages to support each new level of understanding. Stevens also noted that standards need to address common misunderstandings and difficulties students have learning particular content so that instruction can explicitly target them; research has documented some of these.

The approach of Stevens and her colleagues to designing rigorous science standards is based on previous work in the design of curricula and assessments, which they call construct-centered design (McTighe and Wiggins, 1998; Mislevy and Riconscente, 2005; Krajcik, McNeill, and Reiser, 2008; Shin, Stevens, and Krajcik, in press). The name reflects the goal of making the ideas and skills (constructs) that students are expected to learn, and that teachers and researchers want to measure, the focus for aligning curriculum, instruction, and assessment. Stevens described the six elements in the construct-centered design process that function in an interactive, iterative way, so that information gained from any element can be used to clarify or modify the product of another.

The first step is to identify the construct. The construct might be a concept (evolution or plate tectonics), a theme (e.g., size and scale or consistency and change), or a scientific practice (learning about the natural world in a scientific way). Stevens used the example of forces and interactions on the molecular and nano scales—that is, the idea that all interactions can be described by multiple types of forces, but that the relative impact of each type of force changes with scale—to illustrate the process.

The second step is to articulate the construct, on the basis of expert knowledge of the discipline and related learning research. This, Stevens explained, means explicitly identifying the concepts that are critical for developing understanding of a particular construct and defining the successive targets for students would reach in the course of their schooling, as they progress toward full understanding of the construct. This step is important for guiding the instruction at various levels.

Box 2-1 shows an example of the articulation of one critical concept that is important to understanding the sample construct (regarding forces and interactions on the molecular and nano scale). This articulation, which describes the upper level of K-12 understanding, is drawn from research on this topic; such research has not been conducted for many areas of science knowledge.

The articulation of the standard for this construct would also address the kinds of misconceptions students are likely to bring to this topic, which are also drawn from research on learning of this subject matter.

BOX 2-1

Articulation of the Idea That Electrical Forces Govern Interactions Between Atoms and Molecules

  • Electrical forces depend on charge. There are two types of charge—positive and negative. Opposite charges attract; like charges repel.

  • The outer shell of electrons is important in inter-atomic interactions. The electron configuration in the outermost shell/orbital can be predicted from the Periodic Table.

  • Properties, such as polarizability, electron affinity, and electronegativity, affect how a certain type of atom or molecule will interact with another atom or molecule. These properties can be predicted from the Periodic Table.

  • Electrical forces generally dominate interactions on the nano-, molecular, and atomic scales.

  • The structure of matter depends on electrical attractions and repulsions between atoms and molecules.

  • An ion is created when an atom (or group of atoms) has a net surplus or deficit of electrons.

  • Certain atoms (or groups of atoms) have a greater tendency to be ionized than others.

  • A continuum of electrical forces governs the interactions between atoms, molecules, and nanoscale objects.

  • The attractions and repulsions between atoms and molecules can be due to charges of integer value, or partial charges. The partial charges may be due to permanent or momentary dipoles.

  • When a molecule has a permanent electric dipole moment, it is a polar molecule.

  • Instantaneous induced-dipole moments occur when the focus of the distribution shifts momentarily, thus creating a partial charge. Induced-dipole interactions result from the attraction between the instantaneous electric dipole moments of neighboring atoms or molecules.

  • Induced-dipole interactions occur between all types of atoms and molecules, but increase in strength with an increasing number of electrons.

  • Polarizability is a measure of the potential distortion of the electron distribution. Polarizable atoms and ions exhibit a propensity toward undergoing distortions in their electron distribution.

  • In order to predict and explain the interaction between two entities, the environment must also be considered.

SOURCE: Krajcik, Stevens, and Shin (2009, p. 11).


For example, students might believe that hydrogen bonds occur between two hydrogen atoms or might not understand the forces responsible for holding particles together in the liquid or solid state (Stevens, Sutherland, and Krajcik, 2009). This sort of information can help teachers make decisions about when and how to introduce and present particular material and help curriculum designers plan instructional sequences.

The third step is to specify the way students will be expected to use the understanding that has been identified and articulated. Stevens and her colleagues call this step developing “claims” about the construct. Claims identify the reasoning or cognitive actions students would do to demonstrate their understanding of the construct. (For this step, too, developers would need to rely on research on learning for the particular subject.) Students might be expected to be able to provide examples of particular phenomena, explain patterns in data, or develop and test hypotheses. For example, a claim related to the example in Box 2-1 might be: “Students should be able to explain the attraction between two objects in terms of the generation of opposite charges caused by an imbalance of electrons.”

The fourth step is to specify what sorts of evidence will constitute proof that students have gained the knowledge and skills described. A claim might be used at more than one level because understanding is expected to develop sequentially across grades, Stevens stressed. Thus, it is the specification of the evidence that makes clear the degree and depth of students’ understanding that are expected at each level. Table 2-1 shows the claim regarding opposite charges in the context of the cognitive activity and critical idea under which it nests, as well as the evidence of understanding of this claim that might be expected of senior high school students. The evidence appropriate for, say, middle school students would be less sophisticated.

The fifth step is to specify the learning and assessment tasks through which students will demonstrate the knowledge and skills they need, based on the elaborated description developed in the earlier steps. The “task” column in Table 2-1 shows examples.

The sixth step is to review and revise all the products to ensure that they are well aligned with one another. Such a review might include internal quality checks conducted by the standards developers, as well as feedback from teachers or from content or assessment experts. Stevens said that pilot tests and field trials provide essential information, and review is critical to success.

Stevens and her colleagues were also asked to examine the draft versions of the common core standards for 12th grade English and mathematics that were developed by the Council of Chief State School Officers and the National Governors Association to assess how closely they conform to the construct-centered design approach. Their analysis noted that both sets of standards do describe how the knowledge they call for would be used by students, but that the English standards do not describe what sorts of evidence would be necessary to prove that a student had met the standards. The mathematics standards appeared to provide more elaboration.


TABLE 2-1 Putting a Claim About Student Learning in Context

Critical Idea: Intermolecular Forces.

Cognitive Activity: Construct an explanation.

Claim: Students will be able to explain attraction between two objects in terms of the production of opposite charges caused by an imbalance of electrons.

Evidence: Student work product will include

  • Students explain the production of charge by noting that only electrons move from one object to another object.

  • Students note that neutral matter normally contains the same number of electrons and protons.

  • Students note that electrons are negative charge carriers and that the destination object of the electrons will become negative, as it will have more electrons than protons.

  • Students recognize that protons are positive charge carriers and that the removal of electrons causes the remaining material to have an imbalance in positive charge.

  • Students cite the opposite charges of the two surfaces as producing an attractive force that holds the two objects together.

Tasks:

  Learning Task: Students are asked to predict how pieces of tape will be attracted or repulsed by each other.

  Assessment Task: Students are asked to explain why the rubbing of fur against a balloon causes the fur to stick to the balloon.

SOURCE: Krajcik, Stevens, and Shin (2009, p. 14).


ASSESSMENTS FOR BETTER INSTRUCTION AND LEARNING: AN EXAMPLE

Assessments, as many have observed, are the vehicle for implementing standards, and, as such, have been blamed for virtually every shortcoming in education. There may be no such thing as an assessment task to which no one will object, Mark Wilson said, but it is possible to define what makes an assessment task good—or rather how it can lead to better instruction and learning. He provided an overview of current thinking about innovative assessment and described one example of an effort to apply that thinking, in the BEAR (Berkeley Evaluation and Assessment Research) System (Wilson, 2005).

An assessment task may affect instruction and learning in three ways, Wilson said. First, the inclusion of particular content or skills signifies to teachers, parents, and policy makers what should be taught. Second, the content or structure of an item conveys information about the sort of learning that is valued in the system the test represents. “Do we want [kids] to know how to select from four options? Do we want them to know how they can develop a project and work on it over several weeks and come up with an interesting result and present it in a proper format? These are the sorts of things we learn from the contents of the item,” Wilson explained. And, third, the results for an item, together with the information from other items in a test, provide information that can spur particular actions.

These three kinds of influences often align with three different perspectives—(1) policy makers perhaps most interested in signaling what is most important in the curriculum, (2) teachers and content experts most interested in the messages about implications for learning and instruction, and (3) assessment experts most interested in the data generated. All three perspectives need to be considered in a discussion of what makes an assessment “good.”

For Wilson, the most important characteristic of a good assessment system is coherence. A coherent system is one in which each element (of both summative and formative assessments) measures consistent constructs and contributes distinct but related information that educators can use. Annual, systemwide, summative tests receive the most attention, he pointed out, but the great majority of assessments that students deal with are those that teachers use to measure daily, weekly, and monthly progress. Therefore, classroom assessments are the key to effective instruction and learning, he emphasized. Building classroom assessments is more challenging than building large-scale summative ones—yet most resources go to the latter.

In some sense, Wilson pointed out, most state assessment systems are coherent. But he described the current situation for many of them as “threat coherence,” in which “large-scale summative assessment is used as a driving and constraining force, strait-jacketing classroom instructions and curriculum.” He maintained that in many cases the quality of the tests and the decisions about what they should cover are not seen as particularly important—what matters is that they provide robust data and clear guidance for teachers. This sort of situation presents teachers with a dilemma: the classroom tests they use may either parallel the large-scale assessment or be irrelevant for accountability purposes. Thus, they can either focus on the tested material despite possible misgivings about what they are neglecting, or they can view preparing for the state test and teaching as two separate endeavors.

More broadly, Wilson said, systems that are driven by large-scale assessments risk overlooking important aspects of the curricula that cannot be adequately assessed using multiple-choice tests (just as some content cannot be easily assessed using projects or portfolios). Moreover, if the results of one systemwide assessment are used as the sole or principal indicator of performance on a set of standards that may describe a hundred or more constructs, it is very unlikely that student achievement on any one standard can be assessed in a way that is useful for educational planning. Results from such tests would support a very general conclusion about how students are doing in science, for example, but not more specific conclusions about how much they have learned in particular content areas, such as plate tectonics, or how well they have developed particular skills, such as making predictions and testing hypotheses.

Another way in which systems might be coherent is through common items, Wilson noted. For example, items used in a large-scale assessment might be used for practice in the classroom, or slightly altered versions of the test items might be used in interim assessments, to monitor students’ likely performance on the annual assessment. The difficulty he sees with this approach to system coherence is that a focus on what it takes to succeed with specific items may distract teachers and students from the actual intent behind content standards.

When the conceptual basis—the model of student learning—underlying all assessments (whether formative or summative) is consistent, then the system is coherent in a more valuable way. It is even possible to go a step beyond this sort of system coherence, to what Wilson called “information coherence.” In such a system, one would make sure not only that assessments are all developed from the same model of student learning, but also that they are explicitly linked in other ways. For example, a standardized task could be administered to students in many schools and jurisdictions, but delivered in the classroom. The task would be designed to provide both immediate formative information that teachers can use and consistent information about how well students meet a particular standard. Teachers would be trained in a process that ensured a degree of standardization in both test administration and evaluation of the results, and statistical controls could be used to monitor and verify the results. Similarly, classroom work samples based on standardized assignments could be centrally scored. The benefit of this approach is that it derives maximum advantage from each activity. The assessment task generates information and is also an important instructional activity; the nature of the assessment also communicates very directly with teachers about what sorts of learning and instruction are valued by the system (see Wilson, 2004).[1]

FIGURE 2-1 The BEAR System.

SOURCE: Wilson (2009, slide #20).

The BEAR System (Wilson, 2005) is an example of an assessment system that is designed to have information coherence. It is based on four principles, each of which has a corresponding “building block” (see Figure 2-1). These elements function in a cycle, so that information gained from each phase of the process can be used to improve other elements. Wilson noted that current accountability systems rarely allow for this sort of continuous feedback and refinement, but that it is critical (as in any engineering system) to respond to results and developments that could not be anticipated.

The construct map defines what is to be assessed, and Wilson described it as a visual metaphor for the ways that students’ understanding develops and, correspondingly, how their responses to items might change. Table 2-2 is an example of a construct map for an aspect of statistics, the capacity to consider certain statistics (such as a mean or a variance) as a measure of the qualities of a sample distribution.

[1] Another approach to using assessments conducted throughout the school year to provide accountability data is the Cognitively-Based Assessment of, for and as Learning (CBAL), a program currently under development at the Educational Testing Service (for more information, see http://eric.ed.gov/PDFS/ED507810.pdf [accessed August 2010]).


TABLE 2-2 Sample Construct Map: Conception of Statistics

Objective: CoS3. Consider statistics as measures of qualities of a sample distribution.

Student tasks specific to CoS3, each illustrated by a sample student/teacher response:

  CoS3(f) Choose statistics by considering qualities of a particular sample.
  – “It is better to calculate median because this data set has an extreme outlier. The outlier increases the mean a lot.”

  CoS3(e) Attribute magnitude or location of a statistic to processes generating the sample.
  – A student attributes a reduction in median deviation to a change in the tool used to measure an attribute.

  CoS3(d) Investigate the qualities of a statistic.
  – “Nick’s spreadness method is good because it increases when a data set is more spread-out.”

  CoS3(c) Generalize the use of a statistic beyond its original context of application or invention.
  – Students summarize different data sets by applying invented measures.
  – Students use average deviation from the median to explore the spreadness of the data.

  CoS3(b) Invent a sharable measurement process to quantify a quality of the sample.
  – “In order to find the best guess, I count from the lowest to the highest and from the highest to the lowest at the same time. If I have an odd total number of data, the point where the two counting methods meet will be my best guess. If I have an even total number, the average of the two last numbers of my two counting methods will be the best guess.”

  CoS3(a) Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that others may not share.
  – “In order to find the best guess, I first looked at which number has more than others and I got 152 and 158 both repeated twice. I picked 158 because it looks more reasonable to me.”

SOURCE: Wilson (2009, slide #23).

The item design specifies how students will be stimulated to respond and is the means by which the match between curriculum and assessment is established. Wilson described it as a set of principles that allow one to observe students under a set of standard conditions. Most critical is that the design specifications make it possible to observe each of the levels and sublevels described in the construct map. Box 2-2 shows a sample item that assesses one of the statistical concepts in the construct map in Table 2-2.


BOX 2-2

Sample Item Assessing Conceptions of Statistics

Kayla’s Project

Kayla completes four projects for her social studies class. Each is worth 20 points.

  Project 1: 16 points
  Project 2: 18 points
  Project 3: 15 points
  Project 4: ???

The mean score Kayla received for all four projects was 17.

Use this information to find the number of points Kayla received on Project 4. Show your work.

SOURCE: Wilson (2009, slide #25).
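For orientation, the item reduces to a single application of the definition of the mean; a worked solution (an illustration added here, not part of Wilson’s slide) is

\[ \frac{16 + 18 + 15 + x}{4} = 17 \quad\Longrightarrow\quad x = 4(17) - (16 + 18 + 15) = 68 - 49 = 19. \]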

The “outcome space” on the lower right portion of Figure 2-1 is a general guide to the way students’ responses to items developed in relation to a particular construct map will be valued. The more specific guidance developed for a particular item is used as the actual scoring guide, Wilson explained, which is designed to ensure that all of the information elicited by the task is easy for teachers to interpret. Figure 2-2 is the scoring guide for the “Kayla” item, with sample student work to illustrate the levels of performance.

The final element of the process is to collect the data and link it back to the goals for the assessment and the construct maps. The system relies on a multidimensional way of organizing statistical evidence of the quality of the assessment, such as its reliability, validity, and fairness. Item response models show students’ performance on particular elements of the construct map across time and also allow for comparison within a cohort of students or across cohorts.
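As a point of reference for readers unfamiliar with this class of models (an illustration added here, not drawn from Wilson’s presentation), the simplest member of the item response family is the Rasch model, in which the probability that a person answers an item correctly depends only on the difference between the person’s proficiency and the item’s difficulty:

\[ P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}, \]

where θ_p denotes the proficiency of person p on the dimension defined by a construct map and b_i denotes the difficulty of item i. Multidimensional extensions of models in this family support the kind of reporting described above: performance on each element of the construct map, tracked across time and compared within or across cohorts.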

FIGURE 2-2 Scoring guide for sample item.

SOURCE: Wilson (2009, slide #27).

Wilson closed by noting that a goal for almost any large-scale test is to provide information that teachers can use in the classroom, but that this goal requires that large-scale and classroom assessments are constructed to provide information in a coherent way. He acknowledged, however, that implementing a system such as BEAR is not a small challenge. Doing so requires a deep analysis of the relationship between student learning, the curriculum, and instructional practices—a level of analysis not generally undertaken as part of test development. Doing so also requires a readiness to revise both curricula and standards in response to the empirical evidence that assessments provide.

INNOVATIONS AND TECHNICAL CHALLENGES

Stephen Lazer reflected on the technical and economic challenges of pursuing innovative assessments on a large scale from the point of view of test developers. He began with a summary of current goals for improving assessments:

  • increase use of performance tasks to measure a growing array of skills and obtain a more nuanced picture of students;

  • rely much less on multiple-choice formats because of limits on what they can measure and their perceived impact on instruction;

  • use technology to measure content and skills not easily measured using paper-and-pencil formats and to tailor assessments to individuals; and

  • incorporate assessment tasks that are authentic—that is, that ask students to do tasks that might be done outside of testing and are worthwhile learning activities in themselves.

If this is the agenda for improving assessments, he pointed out, “it must be 1990.” He acknowledged that this was a slight overstatement: progress has been made since 1990, and many of these ideas were not applied to K-12 testing until well after 1990. Nevertheless, many of the same goals were the focus of reform two decades ago, and an honest review of what has and has not worked well during 20 years of work on innovative assessments can help increase the likelihood of success in the future.

A review of these lessons should begin with clarity about what, exactly, an innovative assessment is, he said. For some, Lazer suggested, it might be any item format other than multiple choice, yet many constructed-response items are not particularly innovative because they only elicit factual recall. Some assessments incorporate tasks with innovative presentation features, but they may not actually measure new sorts of constructs or produce richer information about what students know and can do. Some computer-based assessments fall into this category. Some argue that, because they are colorful and interesting, they are more engaging to students, but they may not differ in more substantive ways from the assessments they are replacing. If students click on a choice, rather than filling in a bubble, “we turn the computer into an expensive page-turner,” he pointed out. Moreover, there is no evidence that engaging assessments are more valid or useful than traditional ones, and the flashiness may disguise the wasting of a valuable opportunity.

What makes an assessment innovative, Lazer argued, is that it expands measurement beyond the constructs that can be measured easily with multiple-choice items. Both open-ended and performance-based assessments offer possibilities for doing this, as does technology. Performance assessments offer a number of possibilities: the opportunity to assess in a way that is more directly relevant to the real-world application of the skills in question, the opportunity to obtain more relevant instructional feedback, a broadening of what can be measured, and the opportunity to better integrate assessment and instruction.


Use of Computers and Technology

Computers make it possible to present students with a task that could not otherwise be done—for example, by allowing students to demonstrate geography skills using an online atlas when distributing printed atlases would have been prohibitively expensive. Equally important, though, is that students will increasingly be expected to master technological skills, particularly in science, and those kinds of skills can only be assessed using technology. Even basic skills, such as writing, for which most students now use computers almost exclusively whether at home or at school, may need to be assessed by computer to ensure valid results. Computer-based testing and electronic scoring also make it possible to tailor the difficulty of individual items to a test taker’s level and skills. Furthermore, electronic scoring can provide results quickly and may make it easier to derive and connect formative and summative information from items.
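To make the idea of tailoring concrete, the following is a minimal sketch of an adaptive item-selection loop under a simple one-parameter (Rasch) model; the item bank, the one-step ability update, and the simulated examinee are hypothetical illustrations, not a description of any operational testing program.

    import math
    import random

    def rasch_prob(theta, b):
        """Probability of a correct response under a one-parameter (Rasch) model."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def item_information(theta, b):
        """Fisher information of a Rasch item at the current ability estimate."""
        p = rasch_prob(theta, b)
        return p * (1.0 - p)

    def next_item(theta, bank, used):
        """Choose the unused item that is most informative at the current estimate."""
        candidates = [i for i in range(len(bank)) if i not in used]
        return max(candidates, key=lambda i: item_information(theta, bank[i]))

    def administer(bank, answer, n_items=5):
        """Run a short adaptive test; answer(b) returns True/False for an item of difficulty b."""
        theta, used = 0.0, set()
        for _ in range(n_items):
            i = next_item(theta, bank, used)
            used.add(i)
            correct = answer(bank[i])
            # Crude one-step update (a stand-in for a maximum-likelihood or Bayesian update).
            theta += 0.7 * ((1.0 if correct else 0.0) - rasch_prob(theta, bank[i]))
        return theta

    # Hypothetical item bank (difficulties) and a simulated examinee with ability 0.8.
    bank = [-1.5, -0.8, -0.2, 0.0, 0.4, 0.9, 1.6]
    estimate = administer(bank, lambda b: random.random() < rasch_prob(0.8, b))
    print(round(estimate, 2))

The design point is simply that each item is chosen in light of the responses given so far, which is what allows a shorter test to yield usable estimates for examinees at very different levels.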


Cost

Perhaps the most significant challenge to using these sorts of innovative approaches is cost, Lazer said. Items of this sort can be time consuming and expensive to develop, particularly when there are few established procedures for this work. Although some items can be scored by machine, many require human scoring, which is significantly more expensive and also adds to the time required to report results. Automated scoring of open-ended items holds promise for reducing the expense and turnaround time, Lazer suggested, but this technology is still being developed. The use of computers may also have hidden costs. For example, very few state systems have enough computers in classrooms to test large numbers of students simultaneously. When students cannot be tested simultaneously, the testing window must be longer, and security concerns may mean that it is necessary to have wide pools of items and an extended reporting window. Many of these issues need further research.


Test Development

Test developers know how to produce multiple-choice items with fairly consistent performance characteristics on a large scale, and there is a knowledge base to support some kinds of constructed-response items. But for other item types, Lazer pointed out, “there is really very little in the way of operational knowledge or templates for development.” For example, simulations have been cited as a promising example of innovative assessment, and there are many interesting examples, but most have been designed as learning experiments, not assessments. Thus, in Lazer’s view, the development of ongoing assessments using simulations is in its infancy. Standard techniques for analyzing the way items perform statistically do not work as well for many constructed-response items as for multiple-choice ones—and not at all for some kinds of performance tasks. For many emerging item types, there is as yet no clear model for getting the maximum information out of them, so some complex items might yield little data.


The Role of a Theoretical Model

The need goes deeper than operational skills and procedures, however, Lazer said. Multiple-choice assessments allow test developers to collect data that support inferences about specific correlations—for example, between exposure to a particular curriculum and the capacity to answer a certain percentage of a fairly large number of items correctly—without requiring the support of a strong theoretical model. For other kinds of assessments, particularly performance assessments that may include a much smaller number of items or observations, a much stronger cognitive model of the construct being measured is needed. Without such a model, Lazer noted, one can write open-ended or computer-based items that are not very high in quality, something he suggested happened too frequently in the early days of performance assessment. Moreover, even a well-developed model is no guarantee of the quality of the items. The key challenge is not to mistake authenticity for validity. Because validity depends on the claim one wants to make, it is very important that the construct be defined accurately and that the item truly measures the skills and knowledge it is intended to measure.

It can also be much more difficult to generalize from assessments that rely on a relatively small number of tasks. Each individual task may measure a broader construct than do items on conventional tests, but at the cost of yielding a weaker measure of the total domain, of which the construct is one element. And since the items are likely to be more time consuming for students, they will complete fewer of them. There is likely to be a fairly strong person-task interaction, particularly if the task is heavily contextualized.

It is also important to be clear about precisely what sorts of claims the data can support. For example, it may not be possible to draw broad conclusions about scientific inquiry skills from students’ performance in a laboratory simulation related to a pond ecosystem. With complex tasks, such as simulations, there may also be high interdependence among the observations that are collected, which undermines the reliability of each one. These are not reasons to avoid this kind of item, Lazer said, but he cautioned that it is important to be aware that if generalizability is low enough, the validity of the assessment is jeopardized. Assessing students periodically over a time span and restricting item length are possible ways of minimizing these disadvantages, but each of these options presents other potential costs or disadvantages.


Scoring

Lazer underlined that human scoring introduces another source of possible variation and limits the possibility of providing rapid results. In general, the complexity of scoring for some innovative assessments is an important factor to consider in a high-stakes, “adequate yearly progress” environment, in which high confidence in reliability and interrater reliability are very important. A certain degree of control over statistical quality is important not only for comparisons among students, but also for monitoring trends. Value-added modeling and other procedures for examining a student’s growth over time and the effects of various inputs, such as teacher quality, also depend on a degree of statistical precision that can be difficult to achieve with some emerging item types.
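To make the interrater concern concrete, a minimal sketch of two common agreement statistics follows; the rubric scale and the rater data are hypothetical.

    from collections import Counter

    def exact_agreement(rater_a, rater_b):
        """Proportion of responses to which both raters assigned the same score."""
        return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
        n = len(rater_a)
        observed = exact_agreement(rater_a, rater_b)
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
        return (observed - expected) / (1.0 - expected)

    # Hypothetical scores from two raters on ten constructed responses (0-3 rubric).
    a = [3, 2, 2, 1, 0, 3, 2, 1, 1, 2]
    b = [3, 2, 1, 1, 0, 3, 2, 2, 1, 2]
    print(exact_agreement(a, b), round(cohens_kappa(a, b), 2))  # 0.8 0.71

Statistics of this kind are what a scoring operation would monitor to judge whether human raters are consistent enough for accountability uses, and maintaining them at acceptable levels is part of what makes human scoring slow and expensive.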


Equating

Lazer noted that a related issue is equating, which is normally accomplished through the reuse of a subset of a test’s items. Many performance items cannot be reused because they are memorable. And even when they can be reused, it can be difficult to ensure that they are scored in exactly the same way across administrations. Although it might be possible to equate by using other items in a test, if they are of a different type (e.g., multiple choice), the two parts may actually measure quite different constructs, so the equating could actually yield erroneous results. For similar reasons, it can be very difficult to field test these items, and though this is a mundane aspect of testing, it is very important for maintaining the quality of the information collected.
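As one concrete illustration of what common-item equating involves, here is a simplified mean-sigma linking of two forms through shared anchor items; the difficulty estimates are hypothetical, and operational equating designs are considerably more elaborate.

    import statistics

    def mean_sigma_link(anchor_old, anchor_new):
        """Linear transformation that places new-form anchor-item difficulty
        estimates on the old form's scale (mean-sigma method)."""
        slope = statistics.stdev(anchor_old) / statistics.stdev(anchor_new)
        intercept = statistics.mean(anchor_old) - slope * statistics.mean(anchor_new)
        return lambda x: slope * x + intercept

    # Hypothetical difficulties for four anchor items, estimated separately on each form.
    old_form = [-0.9, -0.2, 0.3, 1.1]
    new_form = [-1.1, -0.4, 0.1, 0.9]
    to_old_scale = mean_sigma_link(old_form, new_form)
    print([round(to_old_scale(b), 2) for b in new_form])  # [-0.9, -0.2, 0.3, 1.1]

The linking rests entirely on the assumption that the anchor items behave the same way on both administrations, which is exactly the assumption that memorability and scoring drift can undermine for performance items.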


Challenges for Students

Items that make use of complex technology can pose a challenge to students taking the tests, Lazer said. It may take time for students to learn to use the interface and perform the activities the test requires, and the complexity may affect the results in undesired ways. For example, some students may score higher because they have greater experience and facility with the technology, even if their skill with the construct being tested is not better than that of other students.


Conflicting Goals

For Lazer, the key concern with innovative assessment is the need to balance possibly conflicting goals. He repeated what others have noted—that the list of goals for new approaches is long: assessments should be shorter and cheaper and provide results quickly; they should include performance assessment; they should be adaptive; and they should support teacher and principal evaluation. For him, this list highlights that some of the challenges are “know-how” ones that can presumably be surmounted with additional effort and resources. Others are facts of life. Psychometricians may be working from outdated cognitive models, and this can be corrected. But it is unlikely that further research and development will make it possible to overcome the constraints imposed when both reliability and validity, for example, are important to the results.

“This doesn’t mean we should give up and keep doing what we’re doing,” Lazer concluded. These are not insurmountable conflicts, but each presents a tradeoff. He said the biggest challenge is acknowledging the choices. To develop an optimal system will require careful thinking about the ramifications of each feature. Above all, Lazer suggested, “we need to be conscious of the limits of what any single test can do.” He enthusiastically endorsed the systems approaches described earlier, in which different assessment components are designed to meet different needs, but in a coherent way.

LOOKING FORWARD

A few themes emerged in the general discussion. One was that the time and energy required for the innovative approaches described—specifying the domain, elaborating the standards, validating the model of learning—are formidable. Taking this path would seem to require a combination of time, energy, and expertise that is not typically devoted to test development. However, the BEAR example seems to marry the expertise of content learning, assessment design, and measurement in a way that could be implemented relatively efficiently. The discussion of technical challenges illustrated the many good reasons that the current testing enterprise seems to be stuck in what test developers already know how to do well.

This situation raised two major questions for workshop participants. First: Is the $350 million total spending planned for the Race to the Top Initiative enough? Several participants expressed the view that assessment is an integral aspect of education, whether done well or poorly, but that its value could be multiplied exponentially if resources were committed to develop a coherent system. Second: What personnel should be involved in the development of new assessment systems? The innovation that is required may be more than can be expected from test publishers. A participant joked that the way forward might lie in putting cognitive psychologists and psychometricians in a locked room until they resolved their differences—the challenge of balancing the competing imperatives each group raises is not trivial.

A final word from another participant, that “it’s the accountability, stupid,” reinforced a number of other comments. That is, the need to reduce student achievement to a single number derives from the punitive nature of the current accountability system. It is this pressure that is responsible for many of the constraints on the nature of assessments. Changing that context might make it easier to view assessments in a new light and pursue more creative forms and uses.
