Best Practices for State Assessment Systems, Part I: Summary of a Workshop

2 Improving Assessments—Possibilities and Questions

There is no shortage of ideas about how states might improve their assessment and accountability systems. The shared goal for any such change is to improve instruction and student learning, and there are numerous possible points of attack. But, as discussed in Chapter 1, many observers argue that it is important to consider the system as a whole in order to develop a coherent, integrated approach to assessment and accountability. Any one sort of improvement—higher quality assessments, better developed standards, or more focused curricula—might have benefits but would not lead to the desired improvement on its own. Nevertheless, it is worth looking closely at possibilities for each of the two primary elements of an accountability system: standards and assessments.

DEVELOPING STANDARDS THAT LEAD TO BETTER INSTRUCTION AND LEARNING: AN EXAMPLE

Much is expected of education standards. They are widely viewed as a critical tool in public education because they define both broad goals and specific expectations for students. They are intended to guide classroom instruction, the development of curricula and supporting materials, assessments, and professional development. Yet evaluations of many sets of standards have found them wanting, and they have rarely had the effects that were hoped for (National Research Council, 2008). Shawn Stevens used the example of science to describe the most important attributes of excellent standards. She began by enumerating the most prevalent
criticisms of existing national, state, and local science standards—that they include too much material, do not establish clear priorities among the material included, and provide insufficient interpretation of how the ideas included should be applied. Efforts to reform science education have been driven by standards—and have yielded improvements in achievement—but existing standards do not generally provide a guide for the development of coherent curricula. They do not support students in developing an integrated understanding of key scientific ideas, she said, which has been identified as a key reason that U.S. students do not perform as well as many of their international peers (Schmidt, Wang, and McKnight, 2005). Stevens and her colleagues have developed a model for science standards that is based on current understanding of science learning in order to address some of these weaknesses. She described the model as well as a proposed process for developing such standards and a process for using such standards to develop assessments (Krajcik, Stevens, and Shin, 2009). Stevens and her colleagues began with the recommendations in a National Research Council (2005) report on designing science assessment systems: standards need to describe performance expectations and proficiency levels in the context of a clear conceptual framework, and be built on sound models of student learning. They should be clear, detailed, and complete; reasonable in scope; and both rigorous and scientifically accurate. She briefly summarized the findings from research on learning in science that are particularly relevant to the development of standards.
Integrated understanding of science is built on a relatively small number of foundational ideas that are central across the scientific disciplines—such as the principle that the natural world is composed of a number of interrelated systems—referred to as “big ideas” (see also National Research Council, 1996, 2005; Stevens, Sutherland, and Krajcik, 2009). These are the sorts of ideas that allow both scientists and students to explain many sorts of observations and to identify connections among facts, concepts, models, and principles (National Research Council, 2005; Smith et al., 2006). Understanding of the big ideas helps learners develop more detailed conceptual frameworks that, in turn, make it possible to undertake scientific tasks, such as solving problems, making predictions, observing patterns, and organizing and structuring new information. Learning complex ideas takes time and often happens as students work on tasks that force them to synthesize new observations with what they already know. Students draw on a foundation of existing understanding and experiences as they gradually assemble bodies of factual knowledge and organize them according to their growing conceptual understanding. The most important implication of these findings for standards is that they must be elaborated so that educators can connect them with instruction, instructional materials, and assessments, Stevens explained. That is, not only
should a standard describe the subject matter it is critical for students to know, it should also describe how students should be able to use and apply that knowledge. For example, standards not only should describe the big ideas using declarative sentences, but they should also elaborate on the underlying concepts that are critical to developing a solid understanding of each big idea. Moreover, these concepts and ideas should be revisited throughout K-12 schooling so that knowledge and reasoning become progressively more refined and elaborated. Standards need to reflect these stages of learning. And because prior knowledge is so important to developing understanding, it is important that standards are specific about the knowledge students will need at particular stages to support each new level of understanding. Standards should also address common misunderstandings and difficulties students have learning particular content so that instruction can explicitly target them, Stevens noted. For example, standards can identify some of the most common misconceptions students have about particular material—and perhaps point teachers to more detailed documentation of the “non-normative” ideas researchers have identified. The approach Stevens and her colleagues have developed for designing rigorous science standards reflects this model and is also based on previous work in the design of curricula and assessments, which they call construct-centered design (Wiggins and McTighe, 1998; Mislevy and Riconscente, 2005; Krajcik, McNeill, and Reiser, 2008; Shin, Stevens, and Krajcik, in press). The name reflects the goal of making the ideas and skills (constructs) that students are expected to learn, and that teachers and researchers want to measure, the focus for aligning curriculum, instruction, and assessment.
The construct-centered design process has six elements that function in an interactive, iterative way, so that information gained from any element can be used to clarify or modify the product of another. The first step is to identify the construct. The construct might be a concept (e.g., evolution or plate tectonics), a theme (e.g., size and scale or constancy and change), or a scientific practice (learning about the natural world in a scientific way). Stevens used the example of forces and interactions on the molecular and nano scales—that is, the idea that all interactions can be described by multiple types of forces, but that the relative impact of each type of force changes with scale—to illustrate the process.

The second step is to articulate the construct, based on expert knowledge of the discipline and related learning research. This, Stevens explained, means explicitly identifying the concepts that are critical for developing understanding of a particular construct and defining the successive targets students would reach in the course of their schooling, as they progress toward full understanding of the construct. This step is important for guiding instruction at various levels. Box 2-1 shows an example of the articulation of one critical concept that is important to understanding the sample construct (regarding forces and interactions on the molecular and nano scale).

BOX 2-1 Articulation of the Idea That Electrical Forces Govern Interactions Between Atoms and Molecules

– Electrical forces depend on charge. There are two types of charge—positive and negative. Opposite charges attract; like charges repel.
– The outer shell of electrons is important in inter-atomic interactions. The electron configuration in the outermost shell/orbital can be predicted from the Periodic Table.
– Properties such as polarizability, electron affinity, and electronegativity affect how a certain type of atom or molecule will interact with another atom or molecule. These properties can be predicted from the Periodic Table.
– Electrical forces generally dominate interactions on the nano-, molecular, and atomic scales. The structure of matter depends on electrical attractions and repulsions between atoms and molecules.
– An ion is created when an atom (or group of atoms) has a net surplus or deficit of electrons. Certain atoms (or groups of atoms) have a greater tendency to be ionized than others.
– A continuum of electrical forces governs the interactions between atoms, molecules, and nanoscale objects. The attractions and repulsions between atoms and molecules can be due to charges of integer value, or partial charges. The partial charges may be due to permanent or momentary dipoles. When a molecule has a permanent electric dipole moment, it is a polar molecule.
– Instantaneous induced-dipole moments occur when the electron distribution shifts momentarily, thus creating a partial charge. Induced-dipole interactions result from the attraction between the instantaneous electric dipole moments of neighboring atoms or molecules. Induced-dipole interactions occur between all types of atoms and molecules, but increase in strength with an increasing number of electrons.
– Polarizability is a measure of the potential distortion of the electron distribution. Polarizable atoms and ions exhibit a propensity toward undergoing distortions in their electron distribution.
– In order to predict and explain the interaction between two entities, the environment must also be considered.

SOURCE: Krajcik, Stevens, and Shin (2009, p. 11).

This articulation is drawn from research on this particular topic (such research has not been conducted for many areas of science knowledge), and it describes the upper level of K-12 understanding. The articulation of the standard for this construct would also address the kinds of misconceptions students are likely to bring to this topic, which are also
drawn from research on learning of this subject matter. For example, students might believe that hydrogen bonds occur between two hydrogen atoms or not understand the forces responsible for holding particles together in the liquid or solid state (Stevens, Sutherland, and Krajcik, 2009). This sort of information can help teachers make decisions about when and how to introduce and present particular material, and help curriculum designers plan instructional sequences. The third step is to specify the way students will be expected to use the understanding that has been identified and articulated, a step that Stevens and her colleagues call developing “claims” about the construct. Claims identify the reasoning or cognitive actions students would carry out to demonstrate their understanding of the construct. (Here also, developers would hope to be able to rely on research on learning in the area in question.) Students might be expected to be able to provide examples of particular phenomena, explain patterns in data, or develop and test hypotheses. For example, a claim related to the example in Box 2-1 might be: “Students should be able to explain the attraction between two objects in terms of the generation of opposite charges caused by an imbalance of electrons.” The fourth step is to specify what sorts of evidence will constitute proof that students have gained the knowledge and skills described. A claim might be used at more than one level because understanding is expected to develop sequentially across grades, Stevens stressed. Thus, it is the specification of the evidence that makes clear the degree and depth of understanding that are expected at each level. Table 2-1 shows the claim regarding opposite charges in the context of the cognitive activity and critical idea under which it nests, as well as the evidence of understanding of this claim that might be expected of senior high school students.
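The nesting that Table 2-1 illustrates, a claim situated under a critical idea and cognitive activity and paired with evidence and tasks, can be sketched as a simple data structure. The class and field names below are illustrative assumptions rather than part of the construct-centered design specification; the content is drawn from the opposite-charges example.

```python
from dataclasses import dataclass


@dataclass
class ClaimInContext:
    """One claim from a construct articulation, with its evidence and tasks.

    Field names are hypothetical; they loosely mirror the columns of Table 2-1.
    """
    critical_idea: str
    cognitive_activity: str
    claim: str
    evidence: list   # observable features expected in student work
    tasks: dict      # e.g., {"learning": ..., "assessment": ...}


# Populated with the static-electricity example discussed in the workshop.
opposite_charges = ClaimInContext(
    critical_idea="Intermolecular forces",
    cognitive_activity="Construct an explanation",
    claim=("Students should be able to explain the attraction between two "
           "objects in terms of the generation of opposite charges caused "
           "by an imbalance of electrons."),
    evidence=[
        "Explains the production of charge by noting that only electrons move",
        "Notes that neutral matter contains equal numbers of electrons and protons",
        "Cites the opposite charges of the two surfaces as producing the attraction",
    ],
    tasks={
        "learning": "Predict how pieces of tape will attract or repel each other",
        "assessment": "Explain why fur rubbed against a balloon sticks to it",
    },
)
```

Because evidence is specified separately from the claim, the same claim object could be paired with less sophisticated evidence at, say, the middle school level, which is exactly the flexibility Stevens described.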
The evidence appropriate at a less advanced level, say for middle school students, would be less sophisticated. The fifth step is to specify the learning and assessment tasks through which students can demonstrate what they know, based on the elaborated description of the knowledge and skills students need. The “task” column in Table 2-1 shows examples. The sixth step of the process is to review and revise each product to ensure that all are well aligned with one another. Such a review might include internal quality checks conducted by the developers, as well as feedback from teachers or from content or assessment experts. Pilot tests and field trials provide essential information, and review is critical to success, Stevens explained.

TABLE 2-1 A Claim in Context

Critical Idea: Intermolecular Forces
Cognitive Activity: Construct an explanation
Claim: Students will be able to explain attraction between two objects in terms of the production of opposite charges caused by an imbalance of electrons.
Evidence: Student work product will include:
– Students explain the production of charge by noting that only electrons move from one object to another object.
– Students note that neutral matter normally contains the same number of electrons and protons.
– Students note that electrons are negative charge carriers and that the destination object of the electrons will become negative, as it will have more electrons than protons.
– Students recognize that protons are positive charge carriers and that the removal of electrons causes the remaining material to have an imbalance in positive charge.
– Students cite the opposite charges of the two surfaces as producing an attractive force that holds the two objects together.
Tasks:
– Learning Task: Students are asked to predict how pieces of tape will be attracted or repulsed by each other.
– Assessment Task: Students are asked to explain why the rubbing of fur against a balloon causes the fur to stick to the balloon.

SOURCE: Krajcik, Stevens, and Shin (2009, p. 14).

Stevens noted that she and her colleagues were also asked to examine the draft versions of the common core standards for 12th grade English and mathematics developed by the Council of Chief State School Officers and the National Governors Association, to assess how closely they conform to the construct-centered design approach. They noted that both sets of standards do describe how the knowledge they call for would be used by students, but that the English standards do not describe what sorts of evidence would be necessary to prove that a student had met the standards. The mathematics standards appeared to provide more elaboration.

DEVELOPING ASSESSMENTS THAT LEAD TO BETTER INSTRUCTION AND LEARNING: AN EXAMPLE

Assessments, as many have observed, are the vehicle for implementing standards, and, as such, have been blamed for virtually every shortcoming in education. There may be no such thing as an assessment task to which no one will object, Mark Wilson said, but it is possible to define what makes an assessment task good—or rather how it can lead to better instruction and learning. He provided an overview of current thinking about innovative assessment and described one example of an effort to apply that thinking, in the BEAR (Berkeley Evaluation and Assessment Research) System (Wilson, 2005). In Wilson’s view, an assessment task may affect instruction and learning in three ways. First, the inclusion of particular content or skills signifies to teachers, parents, and policy makers what should be taught. Second, the content or structure of an item conveys information about the sort of learning that is valued in the system the test represents. “Do we want [kids] to know how to select from four options? Do we want them to know how they can develop a project and work on it over several weeks and come up with an interesting result and present it in a proper format? These are the sorts of things we learn from the contents of the item,” Wilson explained. And, third, the results for an item, together with the information from other items in a test, provide information that can spur particular actions.
These three kinds of influences often align with three different perspectives—with policy makers perhaps most interested in signaling what is most important in the curriculum, teachers and content experts most interested in the messages about implications for learning and instruction, and assessment experts most interested in the data generated. All three perspectives need to be considered in a discussion of what makes an assessment “good.” For Wilson, the most important characteristic of good assessment is coherence. A coherent system is one in which each element (including the summative and formative assessments) measures consistent constructs and contributes distinct but related information that educators can use. Annual, systemwide, summative tests receive the most attention, he pointed out, but the great majority of assessments that students deal with are those that teachers use to measure daily, weekly, and monthly progress. Therefore, classroom assessments are the key to effective instruction and learning, he emphasized. Building classroom assessments is more challenging than building large-scale summative ones—yet most of the resources go to the latter. In some sense, Wilson pointed out, most state systems are coherent. But he described the current situation in many as “threat coherence,” in which
“large-scale summative assessment is used as a driving and constraining force, strait-jacketing classroom instruction and curriculum.” He maintained that in many cases the quality of the tests and the decisions about what they should cover are not seen as particularly important—what matters is that they provide robust data and clear guidance for teachers. This sort of situation presents teachers with a dilemma—the classroom tests they use may either parallel the large-scale assessment or be irrelevant for accountability purposes. Thus, they can either focus on the tested material despite possible misgivings about what they are neglecting, or they can view preparing for the state test and teaching as two separate endeavors. More broadly, Wilson said, systems that are driven by their large-scale assessments risk overlooking important aspects of the curricula that cannot be adequately assessed using multiple-choice tests (just as some content cannot be easily assessed using projects or portfolios). Moreover, if the results of one system-wide assessment are used as the sole or principal indicator of performance on a set of standards that may describe a hundred or more constructs, it is very unlikely that student achievement on any one standard can be assessed in a way that is useful for educational planning. Results from such tests would support a very general conclusion about how students are doing in science, for example, but not more specific conclusions about how much they have learned in particular content areas, such as plate tectonics, or how well they have developed particular skills, such as making predictions and testing hypotheses. Another way in which systems might be coherent is through common items, Wilson noted.
For example, items used in a large-scale assessment might be used for practice in the classroom, or slightly altered versions of the test items might be used in interim assessments, to monitor students’ likely performance on the annual assessment. The difficulty he sees with this approach to system coherence is that a focus on what it takes to succeed with specific items may distract teachers and students from the actual intent behind content standards. When the conceptual basis—the model of student learning—underlying all assessments (whether formative or summative) is consistent, then the system is coherent in a more valuable way. It is even possible to go a step beyond this sort of system coherence, to what Wilson called “information coherence.” If this is the goal, one would make sure not only that assessments are all developed from the same model of student learning, but also that they are explicitly linked in other ways. For example, a standardized task could be administered to students in many schools and jurisdictions, but delivered in the classroom. The task would be designed to provide both immediate formative information that teachers can use and consistent information about how well students meet a particular standard. Teachers would be trained in a process that ensured a degree of standardization in both administration of the task and evaluation of the results, and statistical controls could be used to monitor and verify the
results. Similarly, classroom work samples based on standardized assignments could be centrally scored. The advantage of this approach is that it derives the maximum benefit from each activity. The assessment task generates information and is also an important instructional activity; the nature of the assessment also communicates very directly with teachers about what sorts of learning and instruction are valued by the system (see Wilson, 2004).1 The BEAR System (Wilson, 2005) is an example of a system designed to have information coherence. It is based on four principles, each of which has a corresponding “building block” (see Figure 2-1). These elements function in a cycle, so that information gained from each phase of the process can be used to improve other elements. Wilson noted that current accountability systems rarely allow for this sort of continuous feedback and refinement, but that it is critical (as in any engineering system) to respond to results and developments that could not be anticipated. The construct map defines what is to be assessed, and Wilson described it as a visual metaphor for the ways students’ understanding develops, and, correspondingly, how their responses to items might change. Table 2-2 is an example of a construct map for an aspect of statistics, the capacity to consider certain statistics (such as a mean or a variance) as a measure of the qualities of a sample distribution. The item design specifies how students will be stimulated to respond and is the means by which the match between the curriculum and the assessment is established. Wilson described it as a set of principles that allow one to observe students under a set of standard conditions. Most critical is that the design specifications make it possible to observe each of the levels and sublevels described in the construct map.
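A rough sketch can make the idea of a construct map concrete: an ordered set of levels, plus a function that maps the features observed in a student response to the highest level those features support. The level descriptions and feature names below are invented for illustration; they are not the BEAR system's actual construct map or scoring guide.

```python
# Hypothetical construct map for reasoning about statistics, as an ordered
# list of (level, description) pairs. Illustrative only.
CONSTRUCT_MAP = [
    (0, "No response or off-task response"),
    (1, "Restates the data without applying any statistic"),
    (2, "Applies a statistic but cannot justify the choice"),
    (3, "Chooses a statistic by considering qualities of the sample"),
]


def score_response(features):
    """Return the highest construct-map level supported by observed features.

    `features` is a dict of boolean observations about the student's work;
    the keys are illustrative assumptions, not a real coding scheme.
    """
    level = 0
    if features.get("restates_data"):
        level = 1
    if features.get("applies_statistic"):
        level = 2
    if features.get("justifies_choice"):
        level = 3
    return level


# A response that applies the median and justifies it by pointing to an
# outlier would sit at the top level of this sketch.
top_response = {"restates_data": True, "applies_statistic": True,
                "justifies_choice": True}
```

The point of the structure is that the same ordered levels serve both the item designer (every level must be observable) and the scorer (every response maps to a level), which is how the construct map ties item design and outcome space together.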
Box 2-2 shows a sample item that assesses one of the statistical concepts in the construct map above. The “outcome space” is a general guide to the way students’ responses to items developed in relation to a particular construct map will be valued. The more specific guidance developed for a particular item is used as the actual scoring guide, Wilson explained, and is designed to ensure that all of the information elicited by the task is easy for teachers to interpret. Figure 2-2 is the scoring guide for the “Kayla” item, with sample student work to illustrate the levels of performance. The final element of the process is to collect the data and link it back to the goals for the assessment and the construct maps. The system relies on a multidimensional way of organizing statistical evidence of the quality of the assessment, such as its reliability, validity, and fairness.

1. Another approach to using assessments conducted throughout the school year to provide accountability data is the Cognitively-Based Assessment of, for, and as Learning (CBAL), a program currently under development at the Educational Testing Service: see http://www.uk.toeic.eu/toeic/uk/news/?news=900&view=detail [April 2010].

Item response models show students’
performance on particular elements of the construct map across time and also allow for comparison within a cohort of students or across cohorts.

FIGURE 2-1 The BEAR System. SOURCE: Wilson (2009).

Wilson closed by noting that a goal for almost any large-scale test is to provide information that teachers can use in the classroom, but that this vision cannot be a reality unless the large-scale and classroom assessments are constructed to provide information in a coherent way. However, he acknowledged that implementing a system such as BEAR is not a small challenge. Doing so requires a deep analysis of the relationship between student learning, the curriculum, and instructional practices—a level of analysis not generally undertaken as part of test development. Doing so also requires a readiness to revise both curricula and standards in response to the empirical evidence that assessments provide.

TECHNICAL CHALLENGES

Stephen Lazer reflected on the technical challenges of pursuing innovative assessments on a large scale from the point of view of test developers. He began with a summary of current goals for improving assessments: increase use of performance tasks to measure a growing array of skills and obtain a more nuanced picture of students; rely much less on multiple-choice formats because of limits to what they can measure and their perceived impact on instruction;
TABLE 2-2 Sample Construct Map

CoS3 Objective: Consider statistics as measures of qualities of a sample distribution.

CoS3(f) Choose statistics by considering qualities of a particular sample.
– “It is better to calculate median because this data set has an extreme outlier. The outlier increases the mean a lot.”
CoS3(e) Attribute magnitude or location of a statistic to processes generating the sample.
– A student attributes a reduction in median deviation to a change in the tool used to measure an attribute.
CoS3(d) Investigate the qualities of a statistic.
– “Nick’s spreadness method is good because it increases when a data set is more spread-out.”
CoS3(c) Generalize the use of a statistic beyond its original context of application or invention.
– Students summarize different data sets by applying invented measures.
– Students use average deviation from the median to explore the spreadness of the data.
CoS3(b) Invent a sharable measurement process to quantify a quality of the sample.
– “In order to find the best guess, I count from the lowest to the highest and from the highest to the lowest at the same time. If I have an odd total number of data, the point where the two counting methods meet will be my best guess. If I have an even total number, the average of the two last numbers of my two counting methods will be the best guess.”
CoS3(a) Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that others may not share.
– “In order to find the best guess, I first looked at which number has more than others and I got 152 and 158 both repeated twice. I picked 158 because it looks more reasonable to me.”

SOURCE: Wilson (2009).
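The statistical ideas in this construct map are easy to make concrete. The short computation below uses invented sample data to show how an extreme outlier inflates the mean but not the median, and how the students' "average deviation from the median" spreadness measure is calculated.

```python
from statistics import mean, median

# Invented sample with one extreme outlier (illustrative data only).
sample = [10, 11, 12, 13, 60]

m = mean(sample)      # inflated by the outlier: 106 / 5 = 21.2
med = median(sample)  # resistant to the outlier: 12

# "Average deviation from the median," the invented spreadness measure
# described at level CoS3(c): mean of absolute distances from the median.
spreadness = mean(abs(x - med) for x in sample)  # (2+1+0+1+48) / 5 = 10.4
```

This is exactly the contrast a student at level CoS3(f) articulates: the outlier "increases the mean a lot," so the median is the better summary for this sample.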
use technology to measure content and skills not easily measured using paper-and-pencil formats and to tailor assessments to individuals; and incorporate assessment tasks that are authentic—that is, that ask students to do tasks that might be done outside of testing and are worthwhile learning activities in themselves. If this is the agenda for improving assessments, he joked “it must be 1990.” He acknowledged that this was a slight overstatement. Progress has been made
BOX 2-2 Sample Item Assessing Conceptions of Statistics

Kayla’s Project

Kayla completes four projects for her social studies class. Each is worth 20 points.

Kayla’s Projects    Points Earned
Project 1           16 points
Project 2           18 points
Project 3           15 points
Project 4           ???

The mean score Kayla received for all four projects was 17. Use this information to find the number of points Kayla received on Project 4. Show your work.

SOURCE: Wilson (2009).

FIGURE 2-2 Scoring guide for sample item. SOURCE: Wilson (2009).
since 1990, and many of these ideas were not applied to K-12 testing until well after 1990. Nevertheless, many of the same goals were the focus of reform two decades ago, and an honest review of what has and has not worked well during the past 20 years of work on innovative assessments can help increase the likelihood of success in the future. A review of these lessons should begin with clarity about what, exactly, an innovative assessment is. For some, Lazer suggested, it might be any item format other than multiple choice, yet many constructed-response items are not particularly innovative because they only elicit factual recall. Some assessments incorporate tasks with innovative presentation features, but they may not actually measure new sorts of constructs or produce richer information about what students know and can do. Some computer-based assessments fall into this category. Because they are colorful and interesting, some argue that they are more engaging to students, but they may not differ in more substantive ways from the assessments they are replacing. If students click on a choice, rather than filling in a bubble, “we turn the computer into an expensive page-turner,” he pointed out. Moreover, there is no evidence that engaging assessments are more valid or useful than traditional ones, and the flashiness may disguise the wasting of a valuable opportunity. What makes an assessment innovative, in Lazer’s view, is that it expands measurement beyond the constructs that can be measured easily with multiple-choice items. Both open-ended and performance-based assessments offer possibilities for doing this, as does technology.
Performance assessments offer a number of possibilities: the opportunity to assess in a way that is more directly relevant to the real-world application of the skills in question, the opportunity to obtain more relevant instructional feedback, a broadening of what can be measured, and the opportunity to better integrate assessment and instruction.

Use of Computers and Technology

Computers make it possible to present students with a task that could not otherwise be done—for example, by allowing students to demonstrate geography skills using an online atlas, when distributing printed atlases would have been prohibitively expensive. Equally important, though, is that students will increasingly be expected to master technological skills, particularly in science, and those kinds of skills can only be assessed using such technology. Even basic skills, such as writing, for which most students now use computers almost exclusively whether at home or at school, may need to be assessed by computer to ensure valid results. Computer-based testing and electronic scoring also make it possible to tailor the difficulty of individual items to a test taker’s level and skills. Furthermore, electronic scoring can provide results quickly and may make it easier to derive and connect formative and summative information from items. What are the challenges to using these sorts of innovative approaches? Perhaps most significant is cost. Items of this sort can be time consuming and
expensive to develop, particularly when there are few established procedures for this work. Although some items can be scored by machine, many require human scoring, which is significantly more expensive and also adds to the time required to report results. Automated scoring of open-ended items holds promise for reducing the expense and turnaround time, Lazer suggested, but this technology is still being developed. The use of computers may have hidden costs as well. For example, very few state systems have enough computers in classrooms to test large numbers of students simultaneously. When students cannot be tested simultaneously, the testing window must be longer, and security concerns may mean that it is necessary to have wide pools of items and an extended reporting window. Further research is needed to address many of these questions.

Test Development

At present, professional test developers know how to produce multiple-choice items with fairly consistent performance characteristics on a large scale, and there is a knowledge base to support some kinds of constructed-response items. But for other item types, Lazer pointed out, “there is really very little in the way of operational knowledge or templates for development.” For example, simulations have been cited as a promising example of innovative assessment, and there are many interesting examples, but most have been designed as learning experiments, not assessments. Thus, in Lazer’s view, the development of ongoing assessments using simulations is in its infancy.

Standard techniques for analyzing the way items perform statistically do not work as well for many constructed-response items as they do for multiple-choice items—and not at all for some kinds of performance tasks. For many emerging item types, there is as yet no clear model for getting the maximum information out of them, so some complex items might yield little data.
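To make concrete what the standard techniques for analyzing item performance typically involve, the following is a minimal sketch of classical item analysis: per-item difficulty (the proportion of examinees answering correctly) and point-biserial discrimination (the correlation between an item score and the total score). The response matrix is invented for illustration; these are the kinds of statistics that become unstable for complex performance tasks with few, interdependent observations.

```python
# Illustrative classical item analysis on a small matrix of 0/1 scores.
# All data are invented; real analyses use much larger samples.

def item_statistics(responses):
    """responses: one row per examinee, each a list of 0/1 item scores.
    Returns (difficulty, discrimination) lists, one entry per item."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n_people
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_people
    difficulty, discrimination = [], []
    for j in range(n_items):
        item = [row[j] for row in responses]
        p = sum(item) / n_people          # proportion correct
        difficulty.append(p)
        # Point-biserial: Pearson correlation of item score with total score.
        cov = sum((i - p) * (t - mean_t) for i, t in zip(item, totals)) / n_people
        var_i = p * (1 - p)
        r = cov / (var_i * var_t) ** 0.5 if var_i > 0 and var_t > 0 else 0.0
        discrimination.append(r)
    return difficulty, discrimination

# Five examinees, four items (invented).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
p_values, rpb = item_statistics(responses)
```

A well-behaved multiple-choice item shows a moderate p-value and a clearly positive point-biserial; for a heavily contextualized performance task, a handful of correlated observations gives these indices little to work with.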
The Role of a Theoretical Model

The need goes deeper than operational skills and procedures, however. Multiple-choice assessments allow test developers to collect data that support inferences about specific correlations—for example, between exposure to a particular curriculum and the capacity to answer a certain percentage of a fairly large number of items correctly—without requiring the support of a strong theoretical model. For other kinds of assessments, particularly performance assessments that may include a much smaller number of items, or observations, a much stronger cognitive model of the construct being measured is needed. With or without such a model, Lazer noted, it is quite possible to write low-quality open-ended or computer-based items, something he suggested happened too frequently in the early days of performance assessment. The key challenge is not to mistake authenticity for validity; because validity depends on the claim one wants to make, it is very important that the construct be defined
accurately and that the item truly measures the skills and knowledge it is intended to measure.

It can also be much more difficult to generalize from assessments that rely on a smaller number of tasks. Each individual task may measure a broader construct than items on conventional tests do, but at the cost of yielding a weaker measure of the total domain, of which the construct is one element. And because the items are likely to be more time consuming for students, they will complete fewer of them. There is likely to be a fairly strong person-task interaction, particularly if the task is heavily contextualized. It is also important to be clear about precisely what sorts of claims the data can support. For example, it may not be possible to draw broad conclusions about scientific inquiry skills from students’ performance in a laboratory simulation related to a pond ecosystem. With complex tasks, such as simulations, there may also be high interdependence among the observations that are collected, which undermines the reliability of each observation. These are not reasons to avoid this kind of item, Lazer said, but he cautioned that if generalizability is low enough, the validity of the assessment is jeopardized. Assessing periodically over a time span or restricting item length are possible ways of minimizing these drawbacks, but each option presents costs of its own.

Scoring

Human scoring introduces another source of possible variation and limits the possibility of providing rapid results. In general, the complexity of scoring for some innovative assessments is an important factor to consider in a high-stakes, “adequate yearly progress” environment, in which high confidence in reliability, including interrater reliability, is very important.
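A common index for the interrater-reliability question is Cohen’s kappa, which corrects raw percent agreement for the agreement two raters would reach by chance alone. The sketch below uses invented ratings purely for illustration:

```python
# Cohen's kappa for two raters scoring the same responses on a common
# rubric. Ratings are invented; operational programs monitor indices like
# this continuously during human scoring.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters assigned categories independently,
    # each with their own marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters scoring ten essays on a 0-3 rubric (invented).
a = [3, 2, 2, 1, 0, 3, 2, 1, 1, 0]
b = [3, 2, 1, 1, 0, 3, 2, 2, 1, 0]
kappa = cohens_kappa(a, b)
```

Here the raters agree on 8 of 10 essays, but because some of that agreement would occur by chance, kappa is lower than 0.8; scoring programs typically set a minimum kappa (or similar index) before a rater’s scores count.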
A certain degree of control over statistical quality is important not only for comparisons among students but also for monitoring trends. Value-added modeling and other procedures for examining a student’s growth over time and the effects of various inputs, such as teacher quality, also depend on a degree of statistical precision that can be difficult to achieve with some emerging item types.

Equating

A related problem is equating, which is normally accomplished through the reuse of a subset of a test’s items. Many performance items cannot be reused because they are memorable. Even when they can be reused, it can be difficult to ensure that they are scored in exactly the same way across administrations. It would be possible to equate using other items in the test, but if those items are of a different type (e.g., multiple choice), the two parts may measure quite different constructs, so the equating could yield erroneous results. For similar reasons, it can be very difficult to field-test these items; though field testing is a mundane aspect of testing, it is very important for maintaining the quality of the information collected.
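The anchor-item logic can be made concrete with a deliberately simplified mean-equating sketch: the two groups’ performance on the shared anchor items is used to separate group-ability differences from form-difficulty differences. All numbers are invented, and operational programs use more sophisticated linear or equipercentile methods; this shows only the underlying idea.

```python
# Simplified mean equating through a common anchor block (invented data).

def mean_equate(new_form_scores, anchor_new_group, anchor_old_group):
    """Place new-form scores on the old form's scale using the mean
    difference between the two groups on the shared anchor items."""
    mean_anchor_new = sum(anchor_new_group) / len(anchor_new_group)
    mean_anchor_old = sum(anchor_old_group) / len(anchor_old_group)
    # If the new group outscored the old group on the anchor, part of any
    # new-form advantage reflects group ability, not an easier form, so
    # the adjustment removes that group difference.
    shift = mean_anchor_old - mean_anchor_new
    return [s + shift for s in new_form_scores]

anchor_old = [12, 14, 13, 11, 15]  # anchor scores, old-form group
anchor_new = [13, 15, 14, 12, 16]  # anchor scores, new-form group
new_scores = [40, 45, 50]
equated = mean_equate(new_scores, anchor_new, anchor_old)
```

The sketch also shows why memorable performance items break the method: if examinees have seen the anchor before, its scores no longer reflect ability alone, and the computed shift is biased.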
Challenges for Students

Items that make use of complex technology can pose a challenge to the students taking the tests as well. Lazer identified two cautions: first, it may take time for students to learn to use the interface and perform the activities the test requires, and, second, the complexity may affect the results in undesired ways. For example, some students may score higher because they have greater experience and facility with the technology, even if their command of the construct being tested is no better than that of other students.

Conflicting Goals

For Lazer, the key concern with innovative assessment is the need to balance possibly conflicting imperatives. He suggested that the list of goals for new approaches is long: assessments should be shorter and cheaper and provide results quickly; they should include performance assessment; they should be adaptive; and they should support teacher and principal evaluation. What this list highlighted for him was that some of the challenges are problems of know-how that can presumably be surmounted with additional effort and resources. Others are facts of life. Psychometricians may be working from outdated cognitive models, and this can be corrected. But it is unlikely that further research and development will make it possible to overcome the constraints imposed when reliability and generalizability, for example, are important to the results. “This doesn’t mean we should give up and keep doing what we’re doing,” he concluded. These are not insurmountable conflicts, but each presents a tradeoff. For him the greatest risk lies in failing to consider the choices. To develop an optimal system will require careful thinking about the ramifications of each feature.
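One way to see why the reliability constraint is a fact of life rather than a know-how problem is the Spearman-Brown prophecy formula: for a test of parallel tasks with reliability r, changing its length by a factor k yields predicted reliability kr / (1 + (k - 1)r). The numbers below are illustrative only, and the parallel-tasks assumption is generous to performance tasks, which are rarely interchangeable.

```python
# Spearman-Brown prophecy formula: how reliability changes with test
# length, assuming parallel tasks. All figures are illustrative.

def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor."""
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# Replacing a 40-item test with reliability 0.90 by 4 extended tasks
# (length factor 4/40 = 0.1 in task-count terms) sharply cuts reliability:
short_form = spearman_brown(0.90, 4 / 40)

# Doubling the 4-task assessment to 8 tasks recovers only part of the loss:
doubled = spearman_brown(short_form, 2)
```

The arithmetic captures the tradeoff Lazer described: fewer, richer tasks buy broader constructs at a steep cost in reliability, and adding tasks to compensate costs testing time and money.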
Above all, Lazer suggested, “we need to be conscious of the limits of what any single test can do.” He enthusiastically endorsed the systems approaches described earlier in the day, in which different assessment components are designed to meet different needs, but in a coherent way.

DISCUSSION SUMMARY

A few themes emerged in discussion. First, the time and energy required for the innovative approaches described—specifying the domain, elaborating the standards, validating the model of learning—is formidable. Taking this path would seem to require a combination of time, energy, and expertise that is not typically devoted to test development. However, the BEAR example seemed to marry expertise in content learning, assessment design, and measurement in a way that could be implemented relatively efficiently. The discussion of technical challenges illustrated the many good reasons that the current testing enterprise seems to be stuck in what test developers already know how to do well. This situation raised several questions for presenters and participants.

First: Is the $350 million total spending planned for the Race to the Top Initiative
enough? Several participants expressed the view that assessment is an integral aspect of education, whether done well or poorly, but that its value could be multiplied many times over if the resources were committed to developing a coherent system.

Second: What personnel should be involved in the development of new assessment systems? The innovation required may be more than existing test publishers could be expected to produce on their own. A participant joked that the way forward might lie in locking cognitive psychologists and psychometricians in a room together until they resolved their differences—the challenge of balancing the competing imperatives each group raises is not trivial.

A final word from another participant, that “it’s the accountability, stupid,” reinforced a number of other comments. That is, the need to reduce student achievement to a single number derives from the punitive nature of the current accountability system. It is this pressure that is responsible for many of the constraints on the nature of assessments. Changing that context might encourage states to view assessments in a new light and to make use of creative thinking.