2
Improving Assessments—Questions and Possibilities

There is no shortage of ideas about how states might improve their assessment and accountability systems. The shared goal for any such change is to improve instruction and student learning, and there are numerous possible points of attack. But, as discussed in Chapter 1, many observers argue that it is important to consider the system as a whole in order to develop a coherent, integrated approach to assessment and accountability. Although any one change—higher quality assessments, better developed standards, or more focused curricula—might have benefits, it would not lead to the desired improvement on its own. Nevertheless, it is worth looking closely at possibilities for each of the two primary elements of an accountability system: standards and assessments.

STANDARDS FOR BETTER INSTRUCTION AND LEARNING: AN EXAMPLE

Much is expected of education standards as a critical system component because they define both broad goals and specific expectations for students. They are intended to guide classroom instruction, the development of curricula and supporting materials, assessments, and professional development. Yet evaluations of many sets of standards have found them wanting, and they have rarely had the effects that were hoped and expected for them (National Research Council, 2008).

Shawn Stevens used the example of science to describe the most important attributes of excellent standards.



She began by enumerating the most prevalent criticisms of existing national, state, and local science standards: that they include too much material, do not establish clear priorities among the material included, and provide insufficient interpretation of how the ideas included should be applied. Efforts to reform science education have been driven by standards—and have yielded improvements in achievement—but existing standards do not generally provide a guide for the development of coherent curricula. They do not support students in developing an integrated understanding of key scientific ideas, she said, which has been identified as a key reason that U.S. students do not perform as well as many of their international peers (Schmidt, Wang, and McKnight, 2005). Stevens described a model she and two colleagues developed that is based on current understanding of science learning, as well as a proposed process for developing such standards and using them to develop assessments (Krajcik, Stevens, and Shin, 2009).

Stevens and her colleagues began with the recommendations in a National Research Council (2005) report on designing science assessment systems: standards need to describe performance expectations and proficiency levels in the context of a clear conceptual framework and be built on sound models of student learning. Standards should be clear, detailed, and complete; reasonable in scope; and both rigorous and scientifically accurate. She briefly summarized the findings from research on learning in science that are particularly relevant to the development of such standards.

Integrated understanding of science is built on a relatively small number of foundational ideas that are central across the scientific disciplines—referred to as "big ideas"—such as the principle that the natural world is composed of a number of interrelated systems (see also National Research Council, 1996, 2005; Stevens, Sutherland, and Krajcik, 2009). These kinds of ideas allow both scientists and students to explain many sorts of observations and to identify connections among facts, concepts, models, and principles (National Research Council, 2005; Smith et al., 2006). Understanding of the big ideas helps learners develop detailed conceptual frameworks that, in turn, make it possible to undertake scientific tasks, such as solving problems, making predictions, observing patterns, and organizing and structuring new information. Learning complex ideas takes time and often happens as students work on tasks that force them to synthesize new observations with what they already know. Students draw on a foundation of existing understanding and experiences as they gradually assemble bodies of factual knowledge and organize it according to their growing conceptual understanding.

The most important implication of these findings for standards is that they must be elaborated so that educators can connect them with instruction, instructional materials, and assessments, Stevens explained. That is, not only should a standard describe the subject matter it is critical for students to know, it should also describe how students should be able to use and apply that knowledge. For example, standards should not only describe the big ideas in declarative sentences, but also elaborate on the underlying concepts that are critical to developing a solid understanding of each big idea.

Moreover, these concepts and ideas should be revisited throughout K-12 schooling so that knowledge and reasoning become progressively more refined and elaborated. Standards need to reflect these stages of learning. And because prior knowledge is so important to developing understanding, it is important that standards are specific about the knowledge students will need at particular stages to support each new level of understanding. Stevens also noted that standards need to address common misunderstandings and difficulties students have learning particular content so that instruction can explicitly target them; research has documented some of these.

The approach of Stevens and her colleagues to designing rigorous science standards is based on previous work in the design of curricula and assessments, which they call construct-centered design (McTighe and Wiggins, 1998; Mislevy and Riconscente, 2005; Krajcik, McNeill, and Reiser, 2008; Shin, Stevens, and Krajcik, in press). The name reflects the goal of making the ideas and skills (constructs) that students are expected to learn, and that teachers and researchers want to measure, the focus for aligning curriculum, instruction, and assessment. Stevens described the six elements in the construct-centered design process, which function in an interactive, iterative way, so that information gained from any element can be used to clarify or modify the product of another.

The first step is to identify the construct. The construct might be a concept (e.g., evolution or plate tectonics), a theme (e.g., size and scale or consistency and change), or a scientific practice (learning about the natural world in a scientific way). Stevens used the example of forces and interactions on the molecular and nano scales—that is, the idea that all interactions can be described by multiple types of forces, but that the relative impact of each type of force changes with scale—to illustrate the process.

The second step is to articulate the construct, on the basis of expert knowledge of the discipline and related learning research. This, Stevens explained, means explicitly identifying the concepts that are critical for developing understanding of a particular construct and defining the successive targets students would reach in the course of their schooling as they progress toward full understanding of the construct. This step is important for guiding instruction at various levels.

Box 2-1 shows an example of the articulation of one critical concept that is important to understanding the sample construct (regarding forces and interactions on the molecular and nano scales). This articulation, which describes the upper level of K-12 understanding, is drawn from research on this topic; such research has not been conducted for many areas of science knowledge. The articulation of the standard for this construct would also address the kinds of misconceptions students are likely to bring to this topic, which are also drawn from research on learning of this subject matter.

BOX 2-1
Articulation of the Idea That Electrical Forces Govern Interactions Between Atoms and Molecules

• Electrical forces depend on charge. There are two types of charge—positive and negative. Opposite charges attract; like charges repel.
• The outer shell of electrons is important in inter-atomic interactions. The electron configuration in the outermost shell/orbital can be predicted from the Periodic Table.
• Properties, such as polarizability, electron affinity, and electronegativity, affect how a certain type of atom or molecule will interact with another atom or molecule. These properties can be predicted from the Periodic Table.
• Electrical forces generally dominate interactions on the nano-, molecular, and atomic scales.
• The structure of matter depends on electrical attractions and repulsions between atoms and molecules.
• An ion is created when an atom (or group of atoms) has a net surplus or deficit of electrons.
• Certain atoms (or groups of atoms) have a greater tendency to be ionized than others.
• A continuum of electrical forces governs the interactions between atoms, molecules, and nanoscale objects.
• The attractions and repulsions between atoms and molecules can be due to charges of integer value, or partial charges. The partial charges may be due to permanent or momentary dipoles.
• When a molecule has a permanent electric dipole moment, it is a polar molecule.
• Instantaneous induced-dipole moments occur when the focus of the distribution shifts momentarily, thus creating a partial charge. Induced-dipole interactions result from the attraction between the instantaneous electric dipole moments of neighboring atoms or molecules.
• Induced-dipole interactions occur between all types of atoms and molecules, but increase in strength with an increasing number of electrons.
• Polarizability is a measure of the potential distortion of the electron distribution. Polarizable atoms and ions exhibit a propensity toward undergoing distortions in their electron distribution.
• In order to predict and explain the interaction between two entities, the environment must also be considered.

SOURCE: Krajcik, Stevens, and Shin (2009, p. 11).

For example, students might believe that hydrogen bonds occur between two hydrogen atoms or not understand the forces responsible for holding particles together in the liquid or solid state (Stevens, Sutherland, and Krajcik, 2009). This sort of information can help teachers make decisions about when and how to introduce and present particular material and help curriculum designers plan instructional sequences.

The third step is to specify the way students will be expected to use the understanding that has been identified and articulated. Stevens and her colleagues call this step developing "claims" about the construct. Claims identify the reasoning or cognitive actions students would do to demonstrate their understanding of the construct. (For this step, too, developers would need to rely on research on learning for the particular subject.) Students might be expected to be able to provide examples of particular phenomena, explain patterns in data, or develop and test hypotheses. For example, a claim related to the example in Box 2-1 might be: "Students should be able to explain the attraction between two objects in terms of the generation of opposite charges caused by an imbalance of electrons."

The fourth step is to specify what sorts of evidence will constitute proof that students have gained the knowledge and skills described. A claim might be used at more than one level because understanding is expected to develop sequentially across grades, Stevens stressed. Thus, it is the specification of the evidence that makes clear the degree and depth of students' understanding that are expected at each level. Table 2-1 shows the claim regarding opposite charges in the context of the cognitive activity and critical idea under which it nests, as well as the evidence of understanding of this claim that might be expected of senior high school students. The evidence appropriate for, say, middle school students would be less sophisticated.

The fifth step is to specify the learning and assessment tasks that students need to demonstrate, based on the elaborated description of the knowledge and skills students need. The "task" column in Table 2-1 shows examples.

The sixth step is to review and revise all the products to ensure that they are well aligned with one another. Such a review might include internal quality checks conducted by the standards developers, as well as feedback from teachers or from content or assessment experts. Stevens said that pilot tests and field trials provide essential information, and review is critical to success.

Stevens and her colleagues were also asked to examine the draft versions of the common core standards for 12th-grade English and mathematics that were developed by the Council of Chief State School Officers and the National Governors Association to assess how closely they conform to the construct-centered design approach. Their analysis noted that both sets of standards do describe how the knowledge they call for would be used by students, but that the English standards do not describe what sorts of evidence would be necessary to prove that a student had met the standards. The mathematics standards appeared to provide more elaboration.

TABLE 2-1 Putting a Claim About Student Learning in Context

Critical Idea: Intermolecular forces

Cognitive Activity: Construct an explanation.

Claim: Students will be able to explain the attraction between two objects in terms of the production of opposite charges caused by an imbalance of electrons.

Evidence: Student work product will include:
– Students explain the production of charge by noting that only electrons move from one object to another object.
– Students note that neutral matter normally contains the same number of electrons and protons.
– Students note that electrons are negative charge carriers and that the destination object of the electrons will become negative, as it will have more electrons than protons.
– Students recognize that protons are the positive charge carriers and that the removal of electrons causes the remaining material to have an imbalance in positive charge.
– Students cite the opposite charges of the two surfaces as producing an attractive force that holds the two objects together.

Task:
Learning Task: Students are asked to predict how pieces of tape will be attracted or repulsed by each other.
Assessment Task: Students are asked to explain why the rubbing of fur against a balloon causes the fur to stick to the balloon.

SOURCE: Krajcik, Stevens, and Shin (2009, p. 14).
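The relationships among the products of the six steps can be summarized compactly in code. The Python sketch below models those products as simple data structures and populates them, in abbreviated form, with the Table 2-1 example. It is an illustration only: the class names, field names, and the toy review check are hypothetical and are not part of the approach described by Stevens and her colleagues.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    """Step 3: what students should be able to do to demonstrate understanding."""
    statement: str
    evidence: List[str]           # step 4: what counts as proof at each level
    learning_tasks: List[str]     # step 5: instructional tasks
    assessment_tasks: List[str]   # step 5: tasks that elicit the evidence

@dataclass
class Construct:
    """Steps 1 and 2: the identified construct and its articulation."""
    name: str
    articulation: List[str]       # critical underlying concepts (cf. Box 2-1)
    claims: List[Claim] = field(default_factory=list)

# Populated (abbreviated) from Table 2-1.
intermolecular_forces = Construct(
    name="Forces and interactions on the molecular and nano scales",
    articulation=[
        "Electrical forces depend on charge; opposite charges attract, like charges repel.",
        "Electrical forces generally dominate interactions on the nano-, molecular, and atomic scales.",
    ],
    claims=[
        Claim(
            statement=("Students will be able to explain the attraction between two objects in terms "
                       "of the production of opposite charges caused by an imbalance of electrons."),
            evidence=[
                "Students note that only electrons move from one object to another.",
                "Students cite the opposite charges of the two surfaces as producing an attractive force.",
            ],
            learning_tasks=["Predict how pieces of tape will be attracted or repulsed by each other."],
            assessment_tasks=["Explain why rubbing fur against a balloon causes the fur to stick to it."],
        )
    ],
)

def review(construct: Construct) -> List[str]:
    """Step 6 (toy check): flag claims that still lack evidence or tasks, so developers can revise."""
    problems = []
    for claim in construct.claims:
        if not claim.evidence:
            problems.append(f"No evidence specified for claim: {claim.statement[:50]}...")
        if not (claim.learning_tasks or claim.assessment_tasks):
            problems.append(f"No tasks specified for claim: {claim.statement[:50]}...")
    return problems

print(review(intermolecular_forces))   # -> [] for the complete example above

Representing the products this way also makes the iterative sixth step concrete: a review pass can mechanically flag claims that still lack evidence or tasks before any human alignment review begins.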

ASSESSMENTS FOR BETTER INSTRUCTION AND LEARNING: AN EXAMPLE

Assessments, as many have observed, are the vehicle for implementing standards and, as such, have been blamed for virtually every shortcoming in education. There may be no such thing as an assessment task to which no one will object, Mark Wilson said, but it is possible to define what makes an assessment task good—or rather, how it can lead to better instruction and learning. He provided an overview of current thinking about innovative assessment and described one example of an effort to apply that thinking, the BEAR (Berkeley Evaluation and Assessment Research) System (Wilson, 2005).

An assessment task may affect instruction and learning in three ways, Wilson said. First, the inclusion of particular content or skills signals to teachers, parents, and policy makers what should be taught. Second, the content or structure of an item conveys information about the sort of learning that is valued in the system the test represents. "Do we want [kids] to know how to select from four options? Do we want them to know how they can develop a project and work on it over several weeks and come up with an interesting result and present it in a proper format? These are the sorts of things we learn from the contents of the item," Wilson explained. And, third, the results for an item, together with the information from other items in a test, provide information that can spur particular actions. These three kinds of influences often align with three different perspectives: (1) policy makers, perhaps most interested in signaling what is most important in the curriculum; (2) teachers and content experts, most interested in the messages about implications for learning and instruction; and (3) assessment experts, most interested in the data generated. All three perspectives need to be considered in a discussion of what makes an assessment "good."

For Wilson, the most important characteristic of a good assessment system is coherence. A coherent system is one in which each element (of both summative and formative assessments) measures consistent constructs and contributes distinct but related information that educators can use. Annual, systemwide, summative tests receive the most attention, he pointed out, but the great majority of assessments that students deal with are those that teachers use to measure daily, weekly, and monthly progress. Therefore, classroom assessments are the key to effective instruction and learning, he emphasized. Building classroom assessments is more challenging than building large-scale summative ones—yet most resources go to the latter.

In some sense, Wilson pointed out, most state assessment systems are coherent. But he described the current situation for many of them as "threat coherence," in which "large-scale summative assessment is used as a driving and constraining force, strait-jacketing classroom instruction and curriculum." He maintained that in many cases the quality of the tests and the decisions about what they should cover are not seen as particularly important—what matters is that they provide robust data and clear guidance for teachers.

This sort of situation presents teachers with a dilemma: the classroom tests they use may either parallel the large-scale assessment or be irrelevant for accountability purposes. Thus, they can either focus on the tested material despite possible misgivings about what they are neglecting, or they can view preparing for the state test and teaching as two separate endeavors.

More broadly, Wilson said, systems that are driven by large-scale assessments risk overlooking important aspects of the curricula that cannot be adequately assessed using multiple-choice tests (just as some content cannot be easily assessed using projects or portfolios). Moreover, if the results of one systemwide assessment are used as the sole or principal indicator of performance on a set of standards that may describe a hundred or more constructs, it is very unlikely that student achievement on any one standard can be assessed in a way that is useful for educational planning. Results from such tests would support a very general conclusion about how students are doing in science, for example, but not more specific conclusions about how much they have learned in particular content areas, such as plate tectonics, or how well they have developed particular skills, such as making predictions and testing hypotheses.

Another way in which systems might be coherent is through common items, Wilson noted. For example, items used in a large-scale assessment might be used for practice in the classroom, or slightly altered versions of the test items might be used in interim assessments to monitor students' likely performance on the annual assessment. The difficulty he sees with this approach to system coherence is that a focus on what it takes to succeed with specific items may distract teachers and students from the actual intent behind content standards.

When the conceptual basis—the model of student learning—underlying all assessments (whether formative or summative) is consistent, then the system is coherent in a more valuable way. It is even possible to go a step beyond this sort of system coherence, to what Wilson called "information coherence." In such a system, one would make sure not only that assessments are all developed from the same model of student learning, but also that they are explicitly linked in other ways. For example, a standardized task could be administered to students in many schools and jurisdictions, but delivered in the classroom. The task would be designed to provide both immediate formative information that teachers can use and consistent information about how well students meet a particular standard. Teachers would be trained in a process that ensured a degree of standardization in both test administration and evaluation of the results, and statistical controls could be used to monitor and verify the results. Similarly, classroom work samples based on standardized assignments could be centrally scored. The advantage of this approach is that it derives maximum advantage from each activity. The assessment task generates information and is also an important instructional activity; the nature of the assessment also communicates very directly with teachers about what sorts of learning and instruction are valued by the system (see Wilson, 2004).[1]

FIGURE 2-1 The BEAR System. SOURCE: Wilson (2009, slide #20).

The BEAR System (Wilson, 2005) is an example of an assessment system that is designed to have information coherence. It is based on four principles, each of which has a corresponding "building block" (see Figure 2-1). These elements function in a cycle, so that information gained from each phase of the process can be used to improve other elements. Wilson noted that current accountability systems rarely allow for this sort of continuous feedback and refinement, but that it is critical (as in any engineering system) to respond to results and developments that could not be anticipated.

The construct map defines what is to be assessed, and Wilson described it as a visual metaphor for the ways that students' understanding develops and, correspondingly, how their responses to items might change. Table 2-2 is an example of a construct map for an aspect of statistics: the capacity to consider certain statistics (such as a mean or a variance) as a measure of the qualities of a sample distribution.

[1] Another approach to using assessments conducted throughout the school year to provide accountability data is the Cognitively-Based Assessment of, for and as Learning (CBAL), a program currently under development at the Educational Testing Service (for more information, see http://eric.ed.gov/PDFS/ED507810.pdf [accessed August 2010]).

TABLE 2-2 Sample Construct Map: Conception of Statistics

Objective: CoS3. Consider statistics as measure of qualities of a sample distribution.

Student tasks specific to CoS3, with sample student/teacher responses:

CoS3(f) Choose statistics by considering qualities of a particular sample.
– "It is better to calculate median because this data set has an extreme outlier. The outlier increases the mean a lot."

CoS3(e) Attribute magnitude or location of a statistic to processes generating the sample.
– A student attributes a reduction in median deviation to a change in the tool used to measure an attribute.

CoS3(d) Investigate the qualities of a statistic.
– "Nick's spreadness method is good because it increases when a data set is more spread out."

CoS3(c) Generalize the use of a statistic beyond its original context of application or invention.
– Students summarize different data sets by applying invented measures.
– Students use average deviation from the median to explore the spreadness of the data.

CoS3(b) Invent a sharable measurement process to quantify a quality of the sample.
– "In order to find the best guess, I count from the lowest to the highest and from the highest to the lowest at the same time. If I have an odd total number of data, the point where the two counting methods meet will be my best guess. If I have an even total number, the average of the two last numbers of my two counting methods will be the best guess."

CoS3(a) Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that the other may not share.
– "In order to find the best guess, I first looked at which number has more than others and I got 152 and 158 both repeated twice. I picked 158 because it looks more reasonable to me."

SOURCE: Wilson (2009, slide #23).

The item design specifies how students will be stimulated to respond and is the means by which the match between curriculum and assessment is established. Wilson described it as a set of principles that allow one to observe students under a set of standard conditions. Most critical is that the design specifications make it possible to observe each of the levels and sublevels described in the construct map. Box 2-2 shows a sample item that assesses one of the statistical concepts in the construct map in Table 2-2.
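One way to see what a construct map buys an assessment system is to treat its levels as an ordered data structure onto which scored responses can be mapped. The Python sketch below does this for Table 2-2, reading the levels as a progression from CoS3(a) to CoS3(f); the class and function names are hypothetical and the descriptions are abbreviated, so this is an illustration, not part of the BEAR System software.

from dataclasses import dataclass
from typing import List

@dataclass
class ConstructLevel:
    code: str          # e.g., "CoS3(f)"
    description: str   # what a response at this level demonstrates

# Abbreviated from Table 2-2, ordered here from lowest to highest level.
COS3_MAP: List[ConstructLevel] = [
    ConstructLevel("CoS3(a)", "Invent an idiosyncratic measurement process based on tacit knowledge."),
    ConstructLevel("CoS3(b)", "Invent a sharable measurement process to quantify a quality of the sample."),
    ConstructLevel("CoS3(c)", "Generalize the use of a statistic beyond its original context."),
    ConstructLevel("CoS3(d)", "Investigate the qualities of a statistic."),
    ConstructLevel("CoS3(e)", "Attribute magnitude or location of a statistic to processes generating the sample."),
    ConstructLevel("CoS3(f)", "Choose statistics by considering qualities of a particular sample."),
]

def describe_score(score: int, construct_map: List[ConstructLevel]) -> str:
    """Translate an item score (0 = lowest level) into the construct-map description,
    so a teacher sees what the response demonstrates rather than a bare number."""
    level = construct_map[min(max(score, 0), len(construct_map) - 1)]
    return f"{level.code}: {level.description}"

# Example: a response scored at level 2 maps to CoS3(c).
print(describe_score(2, COS3_MAP))

The intent mirrors the construct-map idea described above: a response is interpreted as a location on a developmental progression rather than as a bare number.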

BOX 2-2
Sample Item Assessing Conceptions of Statistics

Kayla's Project

Kayla completes four projects for her social studies class. Each is worth 20 points.

Kayla's Projects     Points Earned
Project 1            16 points
Project 2            18 points
Project 3            15 points
Project 4            ???

The mean score Kayla received for all four projects was 17. Use this information to find the number of points Kayla received on Project 4. Show your work.

SOURCE: Wilson (2009, slide #25).

The "outcome space," shown in the lower right portion of Figure 2-1, is a general guide to the way students' responses to items developed in relation to a particular construct map will be valued. The more specific guidance developed for a particular item is used as the actual scoring guide, Wilson explained, which is designed to ensure that all of the information elicited by the task is easy for teachers to interpret. Figure 2-2 is the scoring guide for the "Kayla" item, with sample student work to illustrate the levels of performance.

The final element of the process is to collect the data and link them back to the goals for the assessment and the construct maps. The system relies on a multidimensional way of organizing statistical evidence of the quality of the assessment, such as its reliability, validity, and fairness. Item response models show students' performance on particular elements of the construct map across time and also allow for comparison within a cohort of students or across cohorts.
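As general background on the item response models mentioned here (this formula is standard psychometrics, not a description of the specific models used in the BEAR System), the simplest such model, the Rasch model, gives the probability that a student with proficiency \(\theta\) answers an item of difficulty \(b\) correctly:

\[ P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}} \]

Fitting a model of this form places students and items on a common scale, which is what makes it possible to compare performance on elements of a construct map across time or across cohorts.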

FIGURE 2-2 Scoring guide for sample item. SOURCE: Wilson (2009, slide #27).
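For convenience alongside the scoring guide, here is the arithmetic the Kayla item in Box 2-2 calls for (a worked solution added for this summary; it is not part of the original figure):

\[ \frac{16 + 18 + 15 + x}{4} = 17 \;\Rightarrow\; x = 4(17) - (16 + 18 + 15) = 68 - 49 = 19, \]

so Kayla received 19 points on Project 4.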

Wilson closed by noting that a goal for almost any large-scale test is to provide information that teachers can use in the classroom, but that this goal requires that large-scale and classroom assessments be constructed to provide information in a coherent way. He acknowledged, however, that implementing a system such as BEAR is not a small challenge. Doing so requires a deep analysis of the relationship between student learning, the curriculum, and instructional practices—a level of analysis not generally undertaken as part of test development. Doing so also requires a readiness to revise both curricula and standards in response to the empirical evidence that assessments provide.

INNOVATIONS AND TECHNICAL CHALLENGES

Stephen Lazer reflected on the technical and economic challenges of pursuing innovative assessments on a large scale from the point of view of test developers. He began with a summary of current goals for improving assessments:

• increase use of performance tasks to measure a growing array of skills and obtain a more nuanced picture of students;
• rely much less on multiple-choice formats because of limits on what they can measure and their perceived impact on instruction;
• use technology to measure content and skills not easily measured using paper-and-pencil formats and to tailor assessments to individuals; and
• incorporate assessment tasks that are authentic—that is, that ask students to do tasks that might be done outside of testing and are worthwhile learning activities in themselves.

If this is the agenda for improving assessments, he pointed out, "it must be 1990." He acknowledged that this was a slight overstatement: progress has been made since 1990, and many of these ideas were not applied to K-12 testing until well after 1990. Nevertheless, many of the same goals were the focus of reform two decades ago, and an honest review of what has and has not worked well during 20 years of work on innovative assessments can help increase the likelihood of success in the future.

A review of these lessons should begin with clarity about what, exactly, an innovative assessment is, he said. For some, Lazer suggested, it might be any item format other than multiple choice, yet many constructed-response items are not particularly innovative because they elicit only factual recall. Some assessments incorporate tasks with innovative presentation features, but they may not actually measure new sorts of constructs or produce richer information about what students know and can do. Some computer-based assessments fall into this category: because they are colorful and interesting, some argue that they are more engaging to students, but they may not differ in more substantive ways from the assessments they are replacing. If students click on a choice rather than filling in a bubble, "we turn the computer into an expensive page-turner," he pointed out. Moreover, there is no evidence that engaging assessments are more valid or useful than traditional ones, and the flashiness may disguise the wasting of a valuable opportunity.

What makes an assessment innovative, Lazer argued, is that it expands measurement beyond the constructs that can be measured easily with multiple-choice items. Both open-ended and performance-based assessments offer possibilities for doing this, as does technology.

Performance assessments offer a number of possibilities: the opportunity to assess in a way that is more directly relevant to the real-world application of the skills in question, the opportunity to obtain more relevant instructional feedback, a broadening of what can be measured, and the opportunity to better integrate assessment and instruction.

Use of Computers and Technology

Computers make it possible to present students with a task that could not otherwise be done—for example, by allowing students to demonstrate geography skills using an online atlas when distributing printed atlases would have been prohibitively expensive. Equally important, though, is that students will increasingly be expected to master technological skills, particularly in science, and those kinds of skills can only be assessed using technology. Even basic skills, such as writing, for which most students now use computers almost exclusively whether at home or at school, may need to be assessed by computer to ensure valid results. Computer-based testing and electronic scoring also make it possible to tailor the difficulty of individual items to a test taker's level and skills. Furthermore, electronic scoring can provide results quickly and may make it easier to derive and connect formative and summative information from items.

Cost

Perhaps the most significant challenge to using these sorts of innovative approaches is cost, Lazer said. Items of this sort can be time consuming and expensive to develop, particularly when there are few established procedures for this work. Although some items can be scored by machine, many require human scoring, which is significantly more expensive and also adds to the time required to report results. Automated scoring of open-ended items holds promise for reducing the expense and turnaround time, Lazer suggested, but this technology is still being developed. The use of computers may also have hidden costs. For example, very few state systems have enough computers in classrooms to test large numbers of students simultaneously. When students cannot be tested simultaneously, the testing window must be longer, and security concerns may mean that it is necessary to have wide pools of items and an extended reporting window. Many of these issues need further research.

Test Development

Test developers know how to produce multiple-choice items with fairly consistent performance characteristics on a large scale, and there is a knowledge base to support some kinds of constructed-response items. But for other item types, Lazer pointed out, "there is really very little in the way of operational knowledge or templates for development." For example, simulations have been cited as a promising example of innovative assessment, and there are many interesting examples, but most have been designed as learning experiments, not assessments. Thus, in Lazer's view, the development of ongoing assessments using simulations is in its infancy. Standard techniques for analyzing the way items perform statistically do not work as well for many constructed-response items as for multiple-choice ones—and not at all for some kinds of performance tasks.
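For readers unfamiliar with what those standard techniques involve, classical item analysis typically reports, for each item, a difficulty (the proportion of students answering correctly) and a discrimination (how strongly success on the item correlates with performance on the rest of the test). The Python sketch below is a generic, hypothetical illustration of those two calculations for dichotomously scored items; it does not describe any particular testing program's procedures.

from statistics import mean, pstdev

def item_statistics(responses):
    """Classical item analysis for 0/1-scored items.
    responses[s][i] is 1 if student s answered item i correctly, else 0.
    Returns (difficulty, discrimination) per item, where difficulty is the proportion
    correct and discrimination is the correlation between the item score and the
    rest-of-test score (the total with that item removed)."""
    n_items = len(responses[0])
    stats = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]   # total score excluding this item
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            discrimination = 0.0                    # no variation: correlation undefined
        else:
            covariance = mean(x * y for x, y in zip(item, rest)) - mean(item) * mean(rest)
            discrimination = covariance / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats

# Example: five students, three items (1 = correct).
data = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
]
for idx, (p, disc) in enumerate(item_statistics(data), start=1):
    print(f"Item {idx}: difficulty={p:.2f}, discrimination={disc:.2f}")

Lazer's point is that comparably settled procedures do not yet exist for simulations and other complex performance tasks.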

For many emerging item types, there is as yet no clear model for getting the maximum information out of them, so some complex items might yield little data.

The Role of a Theoretical Model

The need goes deeper than operational skills and procedures, however, Lazer said. Multiple-choice assessments allow test developers to collect data that support inferences about specific correlations—for example, between exposure to a particular curriculum and the capacity to answer a certain percentage of a fairly large number of items correctly—without requiring the support of a strong theoretical model. For other kinds of assessments, particularly performance assessments that may include a much smaller number of items or observations, a much stronger cognitive model of the construct being measured is needed. Without such a model, Lazer noted, one can write open-ended or computer-based items that are not very high in quality, something he suggested happened too frequently in the early days of performance assessment. Moreover, even a well-developed model is no guarantee of the quality of the items. The key challenge is not to mistake authenticity for validity. Because validity depends on the claim one wants to make, it is very important that the construct be defined accurately and that the item truly measure the skills and knowledge it is intended to measure.

It can also be much more difficult to generalize from assessments that rely on a relatively small number of tasks. Each individual task may measure a broader construct than do items on conventional tests, but at the cost of yielding a weaker measure of the total domain, of which the construct is one element. And since the items are likely to be more time consuming for students, they will complete fewer of them. There is also likely to be a fairly strong person-task interaction, particularly if the task is heavily contextualized.

It is also important to be clear about precisely what sorts of claims the data can support. For example, it may not be possible to draw broad conclusions about scientific inquiry skills from students' performance in a laboratory simulation related to a pond ecosystem. With complex tasks, such as simulations, there may also be high interdependence among the observations that are collected, which also undermines the reliability of each one. These are not reasons to avoid this kind of item, Lazer said, but he cautioned that it is important to be aware that if generalizability is low enough, the validity of the assessment is jeopardized. Assessing students periodically over a time span, or restricting item length, are possible ways of minimizing these disadvantages, but each of these options presents other potential costs or disadvantages.
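To see why fewer, longer tasks can weaken measurement of the total domain, it helps to recall a standard result from classical test theory (offered here as general background, not as analysis from the workshop): the Spearman-Brown formula for the reliability of a score built from k comparable tasks, each with reliability \(\rho_1\),

\[ \rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}. \]

Under the idealized assumption of parallel tasks, a single task with reliability 0.3 gives a four-task score a reliability of about 4(0.3)/(1 + 3(0.3)) = 0.63, whereas forty short items of the same individual reliability would reach roughly 0.94. Real performance tasks are not parallel, and person-task interactions lower these figures further, which is one way to see the generalizability problem Lazer describes.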

Scoring

Lazer underlined that human scoring introduces another source of possible variation and limits the possibility of providing rapid results. In general, the complexity of scoring for some innovative assessments is an important factor to consider in a high-stakes, "adequate yearly progress" environment, in which high confidence in reliability and interrater reliability is very important. A certain degree of control over statistical quality is important not only for comparisons among students, but also for monitoring trends. Value-added modeling and other procedures for examining a student's growth over time and the effects of various inputs, such as teacher quality, also depend on a degree of statistical precision that can be difficult to achieve with some emerging item types.

Equating

Lazer noted that a related issue is equating, which is normally accomplished through the reuse of a subset of a test's items. Many performance items cannot be reused because they are memorable. And even when they can be reused, it can be difficult to ensure that they are scored in exactly the same way across administrations. Although it might be possible to equate by using other items in a test, if they are of a different type (e.g., multiple choice), the two parts may actually measure quite different constructs, so the equating could yield erroneous results. For similar reasons, it can be very difficult to field test these items, and though this is a mundane aspect of testing, it is very important for maintaining the quality of the information collected.

Challenges for Students

Items that make use of complex technology can pose a challenge to students taking the tests, Lazer said. It may take time for students to learn to use the interface and perform the activities the test requires, and the complexity may affect the results in undesired ways. For example, some students may score higher because they have greater experience and facility with the technology, even if their skill with the construct being tested is no better than that of other students.

Conflicting Goals

For Lazer, the key concern with innovative assessment is the need to balance possibly conflicting goals. He repeated what others have noted: the list of goals for new approaches is long. Assessments should be shorter and cheaper and provide results quickly; they should include performance assessment; they should be adaptive; and they should support teacher and principal evaluation. For him, this list highlights that some of the challenges are "know-how" ones that can presumably be surmounted with additional effort and resources. Others are facts of life. Psychometricians may be working from outdated cognitive models, and this can be corrected. But it is unlikely that further research and development will make it possible to overcome the constraints imposed when both reliability and validity, for example, are important to the results.
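Returning briefly to the scoring discussion above: one common, simple index of interrater agreement is Cohen's kappa, which corrects raw agreement between two raters for the agreement they would reach by chance. The sketch below is a generic illustration (the function name and toy ratings are hypothetical), not a description of how any state program monitors its scorers.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same responses."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    if expected == 1.0:
        return 1.0   # degenerate case: both raters always use the same single category
    return (observed - expected) / (1 - expected)

# Example: two raters scoring ten responses on a 0-4 rubric.
scores_a = [0, 1, 2, 2, 3, 4, 1, 0, 2, 3]
scores_b = [0, 1, 2, 3, 3, 4, 1, 1, 2, 3]
print(round(cohens_kappa(scores_a, scores_b), 2))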

"This doesn't mean we should give up and keep doing what we're doing," Lazer concluded. These are not insurmountable conflicts, but each presents a tradeoff. He said the biggest challenge is acknowledging the choices. To develop an optimal system will require careful thinking about the ramifications of each feature. Above all, Lazer suggested, "we need to be conscious of the limits of what any single test can do." He enthusiastically endorsed the systems approaches described earlier, in which different assessment components are designed to meet different needs, but in a coherent way.

LOOKING FORWARD

A few themes emerged in the general discussion. One was that the time and energy required for the innovative approaches described—specifying the domain, elaborating the standards, validating the model of learning—are formidable. Taking this path would seem to require a combination of time, energy, and expertise that is not typically devoted to test development. However, the BEAR example seems to marry expertise in content learning, assessment design, and measurement in a way that could be implemented relatively efficiently. The discussion of technical challenges illustrated the many good reasons that the current testing enterprise seems to be stuck in what test developers already know how to do well.

This situation raised two major questions for workshop participants. First: Is the $350 million total spending planned for the Race to the Top Initiative enough? Several participants expressed the view that assessment is an integral aspect of education, whether done well or poorly, but that its value could be multiplied exponentially if resources were committed to developing a coherent system. Second: What personnel should be involved in the development of new assessment systems? The innovation that is required may be more than can be expected from test publishers. A participant joked that the way forward might lie in putting cognitive psychologists and psychometricians in a locked room until they resolved their differences—the challenge of balancing the competing imperatives each group raises is not trivial.

A final word from another participant, that "it's the accountability, stupid," reinforced a number of other comments. That is, the need to reduce student achievement to a single number derives from the punitive nature of the current accountability system. It is this pressure that is responsible for many of the constraints on the nature of assessments. Changing that context might make it easier to view assessments in a new light and pursue more creative forms and uses.
