Evaluation and Monitoring
The No Child Left Behind Act (NCLB) requires that states provide evidence of how well their assessment systems respond to mandates for establishing a single, statewide system comprising:
challenging content standards,
challenging academic achievement standards, and
a single, statewide system of annual assessments that are of high technical quality, are aligned with academic content and achievement standards, are inclusive, and are effectively reported.
However, NCLB does not define particular standards of quality with regard to any of these dimensions, nor does it require that the state respond to any evidence of inadequacies with efforts to improve the quality of its assessment systems. Moreover, while NCLB requires states to evaluate various dimensions of their assessment systems, it does not explicitly ask them to evaluate the entire system, or to use accumulated evidence to determine whether and where improvement may be needed. Thus, in attending to the details of the legal requirements, a state may miss the broader question of whether and how well its policies and resources—and specifically its assessment system—are supporting progress in science achievement. NCLB makes clear that evaluation and monitoring are important. In this chapter the committee outlines the role these important functions play in a systems approach: ensuring that the system is well aligned, clearly communicates valued standards for teaching and learning, and provides accurate data for decision making.
Evaluation is an important feedback mechanism for the education system and must be an integral element in that system. The state must continually monitor and periodically evaluate the effectiveness of the education system as a whole, as well as the effects and effectiveness of each of its components—including the assessment system. The state will need to make sure not only that each component is functioning well independently, but also that the education system as a whole is operating as intended.
The chapter begins with an overview of professional and other standards for assessment quality, goes on to discuss the consequences and uses of assessment systems, and then looks at ways to incorporate evaluation throughout the assessment system.
EVALUATING THE TECHNICAL QUALITY OF ASSESSMENT INFORMATION
Any assessment system must, above all, provide accurate information. Users expect the information to be trustworthy and to provide a sound basis for actions. Validity, the term measurement experts use to express this essential quality of an assessment or an assessment system, refers to the extent to which an assessment’s results support meaningful inferences for intended purposes. The validity of such inferences rests on evidence that the assessment measures the constructs it was intended to measure and that the scores provide the information they were intended to provide. Thus, particular assessments cannot be classified as either valid or invalid in any absolute sense; it is the uses to which assessment results are put that are valid to a greater or lesser degree. An assessment that is valid for one purpose, such as providing a general indicator of tested students’ understanding of equilibrium, may be invalid for another purpose, such as providing details of students’ alternate conceptions about equilibrium that could be used to guide instruction. The issues that apply to the use of multiple measures for assessing individuals also apply to the evaluation of assessment systems that produce a variety of information from multiple measures, although the available methodologies must be adapted for that purpose.
As discussed earlier, available evidence suggests that the science standards in many states are vague and not sufficiently specific to represent a clear target for assessment development or for curriculum and instruction (Cross, Rebarber, and Torres, 2004). However, the federal requirements do not ask states to revisit or refine their standards. The NCLB Peer Review Guidance (U.S. Department of Education, 2004) asks for evidence that states are improving the alignment of their assessments and standards over time and that they are filling gaps in their coverage of content domains. Yet if a state’s standards are insufficiently clear for the purpose of determining with any degree of precision whether elements of the system are adequately aligned to them, or for the purpose of establishing priorities for curriculum, instruction, and assessment, then the required evaluations of alignment cannot serve their purpose.
We begin with a look at professional standards for assessment quality.
AERA, APA, and NCME Standards
The most recent edition of Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) articulates professional standards regarding assessment validity and quality. This document describes specific standards for test construction, evaluation, and documentation; fairness in testing; and test applications. It makes clear that the sponsors of any assessment have the responsibility to ensure that adequate evidence supports the uses intended for the assessment. The Standards emphasizes validity as the most fundamental consideration in test development and use, and it identifies the sources of evidence supporting validity. The Standards explains that evidence based on analysis of test content, response processes, internal structure, and relations to other variables, as well as evidence based on the consequences of testing, are all important.
The Standards addresses other issues as well and provides specific guidance regarding reliability; measurement error; scaling, norms, and score comparability; the process of test development and revision; test administration, scoring, and reporting; and the need for supporting documentation for tests. Separate sections address the rights and responsibilities of test takers and specifically the standards that apply to the testing of those with limited English proficiency and those with disabilities.
The National Science Education Standards
The assessment standards defined in the National Science Education Standards (National Research Council, 1996), which reflect the views of professional scientists across the country, address many of the same concerns. They highlight four points:
Assessments must be consistent with the decisions they are designed to inform.
Achievement and opportunity to learn must be assessed.
The technical quality of the data collected should be well matched to the decisions and actions taken based on interpretations of those data.
Assessment practices must be fair.
Moreover, the standards explicitly include classroom assessments within their purview. The document was innovative in detailing the role of teachers as assessors and the classroom functions of assessment, including improving classroom practice, planning curricula, developing self-directed learners, reporting student progress, and researching teaching practices (National Research Council, 1996).
CRESST Accountability Standards
Responding to the escalating use of assessment for accountability purposes and concerned about the validity of the systems being created, researchers at the Center for Research on Evaluation, Standards, and Student Testing (CRESST), a consortium of university-based experts in educational measurement, have advanced the idea of standards for accountability systems (Baker, Linn, Herman, and Koretz, 2002), specifically advocating that attention be paid to the system as a whole and not just to individual assessments. Drawing on the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999), as well as their own knowledge, experience, and ethical considerations, the developers of the CRESST standards stress that accountability systems should be evaluated on the basis of multiple forms of evidence. Specifically, systems should be supported by rich and varied evidence of the validity of inferences based on assessment results, evidence that all elements of the system are aligned, and evidence that assessment is sensitive to instruction (that is, that good instruction yields higher performance on the assessment than does poor instruction). Standards are presented in five areas—system components, testing standards, stakes, public reporting, and evaluation—and dimensions against which accountability systems could be evaluated are provided for each. With regard to evaluation, the CRESST standards propose that (Baker et al., 2002, p. 4):
Longitudinal studies should be planned, implemented, and reported evaluating effects of the accountability program. Minimally, questions should determine the degree to which the system:
builds capacity of staff;
affects resource allocation;
supports high-quality instruction;
promotes student-equitable access to education;
affects teacher quality, recruitment, and retention; and
produces unanticipated outcomes.
The validity of test-based inferences should be subject to ongoing evaluation. In particular, evaluation should address:
aggregate gains in performance over time; and
impact on identifiable student and personnel groups.
Deeper Conceptions of Quality
Other dimensions also need to be considered in the evaluation of assessment systems. We have discussed the developmental nature of science learning and its implications for science assessment. In the committee’s view, assessments must be based on solid conceptions of the ways in which science learning develops over time. This grounding in learning deepens the conceptions of quality that can be applied to assessment. Assessment tasks and scoring rubrics that are designed to elicit the knowledge and cognitive processes that are consistent with the nature of learning provide a framework not only for the development of assessments but also for evaluation of the validity of interpretations based on the assessment’s results.
For science assessment to support learning, the ways in which learning develops must be considered as assessment systems are evaluated and monitored. Once means of designing assessments based on models of how student learning develops are put in place, the degree to which these models are reflected in the assessments must be evaluated. Because this deeper conception of what assessment is for has not yet been widely adopted by states, the means of evaluating how effectively an assessment system reflects models of learning are not as well established as other evaluation and monitoring practices. Nevertheless, the committee argues that if a system attempts to incorporate this approach, the effectiveness with which it does so needs to be monitored.
EVALUATING THE CONSEQUENCES AND USES OF ASSESSMENT SYSTEMS
Although states have few sources of guidance as they consider monitoring and evaluating the system as a whole, considerable efforts have been made in several areas that will be helpful.
Validity of Gains
Focusing on an individual test rather than the system as a whole can cause a state to miss unintended consequences of testing. If instruction becomes overly focused on the material tested on a single test and on the formats used on that test, improved test results may not represent gains in learning or progress toward meeting standards; rather, they may reflect students’ improved ability to respond to items on a particular kind of test. In such a case, the meaningfulness of test score gains is in question. Research shows that test scores in the first years after a new test is introduced are likely to show substantial increases, particularly if high stakes are involved, but that improvements tend to level off after that initial stage (Linn, 2003). If students have indeed improved, gains should be evident on other indicators. If not, the gains are suspect. An analysis of test results from the Kentucky Instructional Results System program and those from the National Assessment of Educational Progress and the American College Testing Program (ACT) provides one example of a case in which a state’s assessment was not consistent with other indicators (Koretz and Barron, 1998). The Kentucky results showed dramatic upward trends, while the two national assessments showed modest improvement or level performance. Such contrasts raise questions as to whether the state test results reflect real learning or just the effects of test preparation or teaching to the test, and whether the national tests were adequate measures of the Kentucky curriculum.
In another example, California’s strong accountability system in reading and mathematics resulted in impressive initial improvement in test scores, with the majority of elementary schools meeting their target goals. Years 2 and 3 of the program saw diminishing returns, with substantially fewer schools reaching their goals, and results from 2004 showed no consistent trends (Herman and Perry, 2002). Some observers believe that patterns such as these illustrate the limits of what can be achieved primarily through test preparation, and that continuing improvement over the long term will require meaningful changes in the teaching and learning process. These findings suggest the need for states to continuously validate their gains and the meaning of their science scores over time.
Reliability of Scores from Year to Year
Ensuring the reliability of scores is another challenge facing those who monitor school performance from year to year. All test scores are fallible. Individual test scores reflect actual student capability but are also subject to errors introduced by students’ motivation and state of health on the day of the test, by how attentive they are to the cues and questions in the tests, by how well prepared they are for a particular test format, and by other factors. Test scores at the school level similarly reflect an amalgam of students’ actual knowledge and skills and error. Error can be introduced by unpredictable events, such as loud construction near the school or waves of contagious illness, and by other factors that affect which students are actually tested. In addition, there is inevitably substantial volatility in scores from year to year that has nothing to do with student learning but rather with variations in the population of students assessed, particularly for smaller schools and schools with high transiency rates. This volatility makes it difficult to interpret changes in test scores from year to year, as they must be interpreted in light of these potential sources of measurement error. For example, in an analysis of Colorado’s reading and mathematics assessments, Linn and Haug (2002) found that less than 5 percent of the state’s schools showed consistent growth on the Colorado Student Assessment Program of at least 1 percentage point per year from 1997 to 2000, even though schools on average showed nearly a 5 percent increase over the three-year period in the number of students deemed proficient. Combining two years of results for individual schools, as permitted by NCLB, reduces the volatility but does not eliminate the problem.
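The cohort-to-cohort volatility described above can be illustrated with a simple simulation (a hypothetical sketch, not an analysis of any real assessment data): each year a school's mean score is computed from a fresh sample of students drawn from an unchanging population, so smaller schools show larger year-to-year swings even though true performance never changes.

```python
import random
import statistics

def simulate_school_means(cohort_size, n_years, mu=250.0, sigma=40.0, seed=1):
    """Mean scale score for each of n_years cohorts drawn from an
    unchanging student population (no real growth or decline).
    All parameter values here are illustrative, not empirical."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.gauss(mu, sigma) for _ in range(cohort_size))
        for _ in range(n_years)
    ]

for size in (25, 100, 400):  # small, medium, and large grade cohorts
    means = simulate_school_means(size, n_years=10)
    swing = max(means) - min(means)
    print(f"cohort of {size:>3}: 10-year mean scores span {swing:.1f} points")
```

Because the simulated population never improves, every point of year-to-year movement is sampling noise; the span shrinks as cohort size grows, which is why small schools are especially hard to monitor with annual score comparisons.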
Alignment

We have noted the importance of evaluating not only the alignment of the elements of the assessment system but also the elements of the larger system through which science education is delivered. This report has described an assessment system that includes assessments operating at multiple levels and aligned with the broad goals of the overarching system. Thus, when alignment is evaluated, it must be considered both among the multiple assessment measures and other components of the assessment system, and between the assessment system and the goals of the larger education system. Neither methods for conducting this second kind of analysis nor indexes of overall alignment are readily available. This is an area in which further research would be extremely beneficial.
Even in the absence of a complete research base, however, many aspects of alignment can be investigated. Curriculum and instructional materials, classroom assessments, and other available resources can be coordinated to ensure that they support science learning. Teacher preparation, professional development, and other supports also need to be aligned with standards, as well as students’ needs, to ensure that teachers have essential skills and knowledge. The selection of materials, personnel evaluations, preservice requirements, and other essential features of the education system can be coordinated with the defined learning goals and strategies.
Alignment between assessments and standards is a specific type of horizontal coherence. When they are well aligned, they can reinforce the education system’s goals for science learning; when poorly aligned, they can distort the standards and the instruction that is used to communicate the standards to students.
While the concept of alignment is straightforward, establishing whether standards and assessments are actually aligned is not. It may be relatively easy to determine whether a test provides an adequate measure of a simple construct, such as two-digit multiplication, but it is exceedingly difficult to measure the degree of alignment between an assessment and standards that include higher order scientific principles. Alignment is difficult to measure because both the theoretical basis for alignment and the operational procedures for aligning tests and standards are still being developed.
On the theoretical side, evaluating alignment entails establishing the equivalence of the cognitive demands of assessment tasks (often multiple-choice test items) and the cognitive demands of state standards (usually prose statements about student knowledge and skills). On what basis might one decide that the assessment task and the standard are comparable? At present, there is no widely accepted framework for classifying or describing scientific understanding that could serve as a yardstick for comparing assessments and standards.
On the practical side, all existing alignment procedures are based on judgment. Educators look at assessments and standards and try to decide whether a given task (or set of tasks) seems to demand the knowledge and skills described in a given standard or set of standards. Such judgments are difficult and time-consuming to make, and different approaches to the process yield different results. Thus, while alignment is widely regarded as essential in a standards-based system, few are satisfied with current means of measuring it.
A number of researchers have developed practical procedures for judging alignment. Although they differ in specifics, each of the procedures restricts the scope of the comparison by focusing on a small number of key dimensions, and each provides operational definitions and training to improve the reliability of judgments by raters. Overall, many researchers who study alignment have concluded that the state tests they studied were less challenging and narrower in content than their standards. In a paper prepared for the committee, Rothman (2003) summarizes and analyzes six recent alignment studies: Norman L. Webb’s studies of alignment between standards and tests in mathematics and science (Webb, 1997, 1999, 2001); Karen K. Wixson’s studies of alignment between standards and tests in elementary reading (Wixson, Fisk, Dutro, and McDaniel, 2002); Andrew C. Porter’s tools for measuring the content of standards, tests, and instructional materials (Porter, 2002); Achieve’s studies of alignment of standards and tests (www.achieve.org); the Buros Center for Testing’s study of alignment between commercially available tests and state standards (Impara, 2001; Plake, Buckendahl, and Impara, 2004); and the American Association for the Advancement of Science’s Project 2061 studies of standards, textbooks, and textbook tests (2002).
Rothman concludes (p. 16), “Although the six methods differ widely in their criteria for alignment and the procedures used to gauge alignment, they share the conclusion that, with a few exceptions, standards and tests are generally not well aligned. This conclusion contrasts with the results from studies by states and publishers, which typically show a higher degree of alignment.”
Further research on alignment is clearly needed. Determining the key dimensions that characterize alignment and examining the validity of methods that are used to set standards for alignment are two issues that should be given high priority by states. Practical procedures need to be developed to improve the reliability of ratings and to reduce the time burden associated with alignment studies. However, these shortcomings should not deter states from making immediate and concerted efforts to bring assessments in line with standards.
We note here that the creation of an assessment system may create additional challenges for alignment studies, although a systems approach could improve the overall alignment between standards and assessments. The designers of a science assessment system select the tests and tasks that constitute the system to align collectively with the breadth and depth of state science content standards, to address program monitoring and evaluation needs, and to provide evidence of student competence. In an assessment system, therefore, alignment should be looked at across components. A single assessment may not be well aligned with the standards or the curriculum because it is narrowly focused, but it may be part of a more comprehensive collection of assessments that, as a set, are fully aligned with both.
The goals regarding the development of content and performance standards presented in Chapter 4 address important measures that states can take to improve the alignment between their assessments and standards, but we offer here several further points for states to consider.
First, it is clear that alignment is best addressed not as an afterthought in the development of a standards-based system but as a key goal from the beginning. It is far more effective to build in alignment as the elements of the system are put into place (or modified in response to NCLB requirements) than to try to engineer the elements into alignment after they have been developed (Webb, 1997). Second, the responsibility for ensuring that assessments are aligned to the standards and other aspects of the education system cannot be left to testing contractors and should rest with the states. By updating alignment studies whenever the standards or the tests change, states can monitor a contractor’s efforts to ensure alignment.
It is also important to note that improving alignment does not necessarily mean changing tests. Alignment is a characteristic of the relationship between standards and tests and thus can be adjusted by means of changes in either the standards or the tests. Moreover, as a number of alignment studies have shown, standards can be the cause of the problem (Rothman, 2003). If standards are too general, for example, many types of test items could be viewed as fitting under their very wide umbrellas. In such a case, individual items might seem to match standards, but the test as a whole might not measure the full range of knowledge and skills intended in the standards. It is important to consider the effects that any change to tests or standards might have on the comparability of information across years that is based on test results.
INCORPORATING EVALUATION THROUGHOUT THE ASSESSMENT SYSTEM
We turn now to some of the specific challenges of evaluating and monitoring each element of the assessment system.
Earlier in the report we described the characteristics that assessment systems should have. The first step in developing such an assessment system is to translate standards into assessment frameworks and specifications. These documents
should be extensively reviewed to ensure that they are well aligned with the standards. A part of this review should be a determination that assessment of all the standards has been provided for by means of large-scale or classroom-based assessments. The quality of these documents should be reviewed by both teachers and content experts who are not part of the system. Similarly, as assessment tasks are developed, they should be reviewed to be sure they are aligned with the specifications and to monitor the quality of their content, potential bias, and clarity.
Field testing of assessment tasks and tests provides the next step in evaluation, and a variety of types of evidence are needed to show the extent to which the assessments will provide reliable and accurate data and can support valid inferences for their intended purposes. Among the types of evidence that states should look for are the following (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999):
item analyses (to reveal relative difficulty and discrimination);
evidence of score and interrater reliability (to ensure that scoring standards are consistently applied);
evidence of fairness (e.g., through differential item functioning and content reviews);
evidence of the quality of scaling; and
evidence of the validity of scores (e.g., through qualitative analyses of the ways in which students respond to the assessment items, analyses of the internal structure of the assessment, and analyses of the relationship between assessment performance and other indicators or variables).
An example of this last kind of analysis might be a finding that students’ science scores on the statewide test correspond closely to their scores on a classroom measure of science understanding but do not correspond to their scores on a measure of reading. Such a finding would constitute one source of evidence that the state test did in fact measure science skills and knowledge, as opposed to another academic ability. Similarly, if students who had completed a physics course scored higher on a physics test than did students who were otherwise similar but had not taken the course, as one would expect, this would provide evidence for the validity of the physics test. This latter example is particularly important, in that it could be used to document the instructional sensitivity of the assessment, a critical, but too often overlooked, dimension of validity in the context of NCLB. That is, the legislation is premised on the assumption that teachers and schools can improve their teaching and instruction, and that such improvement will show up in higher test scores; however, if tests are not sensitive to the effects of good teaching, they cannot provide evidence of improvement or lack thereof. Such sensitivity cannot be assumed (Baker, Herman, and Linn, 2004).
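The item analyses named in the first bullet above can be computed directly from a scored response matrix. The sketch below is a minimal illustration, not an operational procedure: the data are invented, and large-scale programs would typically use far larger samples and item response theory indices rather than these classical statistics.

```python
def item_statistics(responses):
    """responses: list of per-student lists of 0/1 item scores.
    Returns (difficulty, discrimination) per item, where difficulty is
    the proportion correct (p-value) and discrimination is the
    point-biserial correlation between the item score and the total
    score (item included in the total; operational analyses often use
    a corrected item-total instead)."""
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n_students
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n_students) ** 0.5
    stats = []
    for j in range(n_items):
        col = [row[j] for row in responses]
        p = sum(col) / n_students                # item difficulty (p-value)
        if p in (0.0, 1.0) or sd_t == 0:
            stats.append((p, 0.0))               # degenerate item: no variance
            continue
        mean_correct = sum(t for t, x in zip(totals, col) if x) / sum(col)
        r_pb = (mean_correct - mean_t) / sd_t * (p / (1 - p)) ** 0.5
        stats.append((p, r_pb))
    return stats

# Five hypothetical students, four items (illustrative data only)
scored = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for i, (p, r) in enumerate(item_statistics(scored), start=1):
    print(f"item {i}: difficulty p = {p:.2f}, point-biserial = {r:.2f}")
```

Items with very high or very low p-values, or with near-zero discrimination, are the candidates flagged for review or revision during field testing.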
Before assessments become operational, it is important to ensure that detailed specifications have been met and that the resulting tests are indeed aligned with standards. Various methods for determining alignment have been developed, and all share a similar set of procedures. An independent group of experts, composed of teachers and subject matter experts, is convened and asked to examine each item or task, rate the content focus and level of cognitive demand of the items, and note any extraneous issues—such as language difficulty—that could affect a student’s ability to respond. Taking into account the number of items needed to meet minimum measurement criteria, results are then summarized to show the extent of coverage of the standards in question and the balance of coverage (Porter, 2002; Webb, 1997a, 1997b, 1999, 2001, 2002).
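A rater-judgment procedure of this kind yields, for each item, a matched standard and a cognitive-demand rating; summarizing those judgments produces the coverage and balance figures that alignment studies report. The sketch below illustrates only that bookkeeping, not any specific published method; the standard codes, demand levels, and the two-item minimum are all invented for the example.

```python
from collections import Counter

# Each rater judgment: (item_id, matched standard, cognitive demand 1-4).
# All codes and values below are hypothetical.
judgments = [
    (1, "PS.1", 2), (2, "PS.1", 1), (3, "PS.2", 3),
    (4, "LS.1", 1), (5, "LS.1", 2), (6, "PS.1", 1),
]
all_standards = ["PS.1", "PS.2", "LS.1", "LS.2"]
MIN_ITEMS_PER_STANDARD = 2   # assumed minimum for defensible coverage

items_per_standard = Counter(std for _, std, _ in judgments)
covered = [s for s in all_standards
           if items_per_standard[s] >= MIN_ITEMS_PER_STANDARD]
print(f"coverage: {len(covered)}/{len(all_standards)} standards "
      f"meet the {MIN_ITEMS_PER_STANDARD}-item minimum")

demand = Counter(d for _, _, d in judgments)   # balance of cognitive demand
total = len(judgments)
for level in sorted(demand):
    print(f"demand level {level}: {demand[level]/total:.0%} of items")

uncovered = [s for s in all_standards if s not in covered]
print("standards lacking adequate coverage:", ", ".join(uncovered) or "none")
```

In this toy example the summary would show two of the four standards untested or thinly tested, and most items concentrated at the lowest demand levels, which is exactly the pattern the alignment studies cited above report for many state tests.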
Because it is difficult to assess alignment if standards are not clearly articulated and focused, and because alignment studies make clear the limits of what can be assessed in a finite assessment, the results of alignment studies may indicate a need to modify the standards or to take other steps to improve the alignment between standards and assessments. Furthermore, because the committee advocates a system of assessments that supports student learning and development over time, alignment studies will need to address all of the assessments and sources of data that are intended to be part of the system, as well as addressing the alignment of assessments with learning expectations across grades. Methodologies will be needed to judge the alignment of a multilevel system.
Moreover, as states change their actual assessments, or portions of them, from year to year or within years, evidence must be collected to show the extent to which the different test forms are comparable and that the equating from one form to the next has been done correctly. Without this evidence, scores cannot be compared from one administration to the next, because any differences may be caused by differences in the difficulty levels of the two tests or the constructs measured, rather than changes in performance.
Like other aspects of test development, the plan for the reporting of test results requires monitoring, and methods of reporting should be field-tested with each intended audience—parents, administrators, and teachers—to ensure that reports are clear and comprehensible to users, that users are likely to interpret the information appropriately, and that the information is useful. Similarly, standard-setting processes should be monitored to ensure that appropriate stakeholders were included in the process, that the process took into account both empirical data on test performance and qualitative judgments of what kinds and levels of performance can be expected of minimally proficient students, and that there is evidence of the validity and accuracy of proficiency classifications based on the standards. Moreover, methodologies will be needed to ensure that performance standards take into account the results of a system of assessments, some of which are derived from statewide assessments, others from classroom assessments.
As noted above, the CRESST accountability standards (Baker et al., 2002) highlight the need for longitudinal studies to examine the effects of any accountability system. If the primary purpose of NCLB science assessments is to improve student achievement overall and to close the achievement gap between high- and low-achieving students, then studies should examine the extent to which the intended benefits are realized. The CRESST researchers suggest that among the intended benefits that should be investigated are the extent to which the system does the following:
builds the capacity of staff to enable students to reach standards;
builds teacher assessment capacity;
influences the way resources are allocated to ensure that students will achieve standards;
supports high-quality instruction aligned with standards; and
supports equity in students’ access to quality education.
The accountability standards also note potential unintended consequences that should be investigated. These include the possibility of corruption of test scores; adverse effects on teacher quality, recruitment, or retention; and increases in dropout rates. All these unanticipated outcomes have been associated with high-stakes assessments (Klein, Hamilton, McCaffrey, and Stecher, 2000; Madaus, 1998).
The feasibility of the assessment system also merits inquiry. For example, an assessment program may place new burdens on teachers, principals, and districts. It may raise questions about opportunity costs, cost-effectiveness, and the feasibility of performance targets. Thus, evaluation of the feasibility of any targets set for school performance and progress must be part of the process. For example, Linn (2003) uses historical data to suggest that current goals for adequate yearly progress in reading and mathematics represent a level of improvement that is well beyond what the most successful schools have actually achieved.
As noted earlier, when new high-stakes state assessments are put into place, scores typically increase over the first several years. But as Koretz (2005) has noted, such gains may be spurious. One way to examine the extent to which gains in test scores represent real improvements in learning—rather than effective test preparation—is to compare the gains shown on the high-stakes test with those shown on other, independent measures of the same or a similar construct.
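The comparison described above can be sketched in a few lines of code. The example below uses entirely hypothetical score data for a high-stakes test and an independent audit measure; the data, variable names, and the divergence summary are illustrative assumptions, not results from any actual state system.

```python
# Sketch: comparing score gains on a high-stakes test with gains on an
# independent "audit" measure of the same construct (hypothetical data).
# A large divergence suggests gains may reflect test preparation rather
# than real improvements in learning (cf. Koretz, 2005).

high_stakes = {2003: 48.0, 2004: 55.0, 2005: 61.0}   # mean scale scores
audit_test  = {2003: 47.5, 2004: 48.5, 2005: 49.0}   # independent measure

def annual_gains(series):
    """Year-over-year changes in mean score."""
    years = sorted(series)
    return {y: series[y] - series[prev]
            for prev, y in zip(years, years[1:])}

hs_gains = annual_gains(high_stakes)
au_gains = annual_gains(audit_test)

for year in hs_gains:
    divergence = hs_gains[year] - au_gains[year]
    print(f"{year}: high-stakes gain {hs_gains[year]:+.1f}, "
          f"audit gain {au_gains[year]:+.1f}, divergence {divergence:+.1f}")
```

In this illustrative pattern the high-stakes test shows large annual gains while the audit measure is nearly flat, the kind of divergence that would warrant closer scrutiny of score inflation.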
Another study shows the importance of one of the CRESST evaluation recommendations—that the impact of accountability and assessment on subgroups of the student population be monitored (Klein et al., 2000). Reducing the achievement gap in science between historically underachieving minorities and their more privileged peers is an explicit purpose of NCLB. Just as the law requires that
results be disaggregated by subgroup, so, too, should studies of the effects of testing look for differential effects on population subgroups. Such effects may suggest different conclusions than those reached by looking only at overall aggregate performance. It is thus important not only to look at multiyear trends in performance overall and by subgroup but also to examine students’ longitudinal growth using advanced statistical models and individual-level data. For example, Choi, Seltzer, Herman, and Yamashiro (2004) found that schools with similar overall growth patterns could be differentially effective with students of differing initial ability. In some schools the gap between high-ability and low-ability students could be increasing, while in others with similar overall growth the pattern could be reversed.
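A minimal sketch of this kind of disaggregated growth analysis follows. The student records, group labels, and score values are hypothetical, and the simple per-group least-squares slope stands in for the far more sophisticated multilevel growth models used in the research cited above.

```python
# Sketch: disaggregating longitudinal growth by subgroup (hypothetical data).
# Schools with the same overall growth can show opposite gap trends
# (cf. Choi et al., 2004); per-subgroup slopes make such patterns visible
# where aggregate trends hide them.

# (subgroup, year, score) records for one hypothetical school
records = [
    ("higher_initial", 0, 60), ("higher_initial", 1, 66), ("higher_initial", 2, 72),
    ("lower_initial",  0, 40), ("lower_initial",  1, 43), ("lower_initial",  2, 46),
]

def slope(points):
    """Ordinary least-squares slope for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

groups = {}
for group, year, score in records:
    groups.setdefault(group, []).append((year, score))

for group, pts in groups.items():
    print(f"{group}: growth {slope(pts):+.1f} points/year")

# The gap widens when the higher-initial group's slope exceeds the lower's.
gap_trend = slope(groups["higher_initial"]) - slope(groups["lower_initial"])
print(f"gap trend: {gap_trend:+.1f} points/year")
```

Here the overall school trend is upward, yet the positive gap trend shows the higher-initial group pulling further ahead, exactly the pattern an aggregate analysis would miss.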
An additional concern is the utility and use of assessment results. A primary purpose of state assessment systems is to provide evidence that will improve decision making and enable states, districts, and schools to better understand and improve science learning. Stakeholders at each level of the educational hierarchy—state departments of education, school districts, schools, and classrooms—need to monitor student performance and take appropriate action to improve it. For example, a district or state may observe trends in student performance in biology, discover that students are performing relatively poorly with particular science concepts, and use these data to institute a new professional development program for teachers that builds their capacity to teach and assess understanding of key biology concepts. At the classroom level, a teacher can use a classroom assessment to obtain detailed knowledge of students’ understanding of a particular concept, such as buoyancy, and use that information to provide immediate feedback—for example, by engaging students in additional laboratory work to overcome their misconceptions. Thus, the consequences and uses of the assessment system at each level need to be evaluated. This analysis should include questions about whether and how the data are actually used, with an eye to both intended and unintended consequences. Surveys, focus groups, observations, and the collection of artifacts are all means of acquiring this kind of information.
This report outlines ambitious goals for assessment systems that go beyond current practice in supporting both accountability and student learning, although we recognize that experience with the design requirements of effective standards-based systems is still accumulating. For example, the committee has stressed, and NCLB requires, that the elements of an assessment and accountability system be both coherent and aligned with standards. However, the methodology for developing and ensuring such alignment is still evolving, and there is only a limited amount of research to guide states and districts in their efforts to achieve alignment. Similarly, the research base that can support the development of assessments grounded in current theories of learning is still emerging. Thus, while NCLB is based on the premise that continuous cycles of assessment and improvement are key to helping all students reach high standards, the means of making that premise a reality are not yet fully evident.
Continual monitoring and periodic evaluation are particularly important in this high-stakes context. If states are able to keep track of the effects and effectiveness of their systems, not only can they avoid unanticipated consequences but they can also make ongoing improvement a genuine element of their systems.
These are the specific challenges facing states:
The time and resources to conduct evaluations are limited at a time of constrained state budgets. Evaluation associated with the development and field-testing of assessments should be considered part of the development cost. Evaluation of assessment consequences is typically quite costly and thus has drawn on external funding sources. However, states may be able to examine some important aspects of assessment impact through routine survey collection of data on students’ opportunity to learn.
Funding mechanisms are constrained. States may not be in a position to issue separate contracts for evaluation. Some states have solved this problem by including a requirement for independent evaluation in their requests for proposals for general assessment contracts; the winning contractor then engages, on behalf of the state, independent technical advisers and others to conduct the evaluations.
Sophisticated evaluation skills are required. Evaluators need expertise in both assessment and evaluation, and they must be knowledgeable about both qualitative and quantitative procedures.
Appropriate methodologies are still evolving. The most effective ways to assess alignment developmentally, over time, are not yet clear. Similar concerns apply to assessing the alignment of system components beyond tests and standards; judging the alignment and integration of information across levels (school, district, and state); evaluating instructional sensitivity; and finding optimal ways to identify and address year-to-year fluctuations in scores that are unrelated to student learning.
QUESTIONS FOR STATES
States can use the following questions to consider whether their methods for evaluating and monitoring their assessment systems are sufficient, and to think about ways to move their assessment systems in the directions the committee has described.
Question 8-1: Does the state make use of multiple sources of information to continually monitor the effects of the science assessment system on science learning and teaching in the state?
Question 8-2: Does the state formally evaluate all aspects of its science assessment system, including development, administration, implementation, reporting, use, and both short- and long-term intended and unintended effects? Do the evaluations address the integration of the components of the system and address the major purposes the assessments are intended to serve? Do they include appropriate procedures and incentives? Do they include multiple indicators, such as technical quality, utility, and impact?
Question 8-3: Does the state monitor and evaluate the interactions between its science assessment system and the assessment systems for other disciplines? Does the evaluation address both the intended and unintended effects of the science assessment system on the state’s overall goals for K–12 education? Are the content standards, achievement standards, and assessments evaluated together to ensure they work together as a coherent system?