As Chapter 2 discusses, the performance measures that are used with incentives are critically important in determining how incentives operate. Specifically, performance measures need to be aligned with the desired outcomes so that behavior that increases the measures also increases the desired outcomes. In this chapter we look at the use of tests as performance measures for incentive systems in education.
We have noted above that tests fall short as a complete measure of desired educational outcomes. Most obviously, the typical tests of academic subjects that are used in test-based accountability provide direct measures of performance only in the tested subjects and grade levels. In addition, less tangible characteristics—such as curiosity, persistence, collaboration, or socialization—are not tested. Nor are subsequent achievements, such as success in work, civic, or personal life, which are examples of the long-term outcomes that education aims to improve.
In this chapter, we turn to some important limitations of tests that are not obvious—specifically, the ways they fall short in providing a direct measure of performance even in the tested subjects and grades. We begin by looking at an essential characteristic of tests themselves and then review the ways that test results can be turned into performance measures that can be used with incentives. Finally, we look at the use of multiple measures in incentive systems, in which there is an attempt to overcome the limitations of any single measure by using a set of complementary measures.
Although large-scale tests can provide a relatively objective and efficient way to gauge the most valued aspects of student achievement, they are neither perfect nor comprehensive measures. Many policy makers in education are familiar with the concept of test reliability and understand that the test score for an individual is measured with uncertainty. Test scores will typically differ from one occasion to another even when there has been no change in a test taker’s proficiency because of chance differences in the interaction of the test questions, the test taker, and the testing context. Researchers think of these fluctuations as measurement error and so treat test results as estimates of test takers’ “true scores” and not as “the truth” in an absolute sense.
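The treatment of observed scores as estimates of a “true score” can be illustrated with a toy simulation; the student’s true proficiency (70) and the size of the measurement error are invented for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def observed_score(true_score, error_sd=3.0):
    """One test administration: true proficiency plus chance measurement error."""
    return true_score + random.gauss(0, error_sd)

# The same student tested on five occasions gets five different scores,
# even though true proficiency (70) never changes between occasions.
scores = [round(observed_score(70), 1) for _ in range(5)]
average = sum(scores) / len(scores)  # clusters near the true score
```

Averaging over repeated administrations moves the estimate toward the true score, which is why single-occasion scores are best read as estimates with uncertainty rather than as “the truth.”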
In addition, tests are estimates in another way that has important implications for the way they function when used as performance measures with incentives: they cover only a subset of the content domain that is being tested. There are four key stages of selection and sampling that occur when a large-scale testing program is created to test a particular subject area. Each stage narrows the range of material that the test covers (Koretz, 2002; Popham, 2000). First, the domain to be tested, when specifically defined, is typically only part of what might be reasonable to assess. For example, there needs to be a decision about whether the material to be tested in each grade and subject should include only material currently taught in most schools in the state or whether it should include material that people think should be taught in each grade and subject.
Second, the test maker crafts a framework that lists the content and skills to be tested. For example, if history questions are to be part of the eighth grade test, they might ask about names and the sequence of events or they might ask students to relate such facts to abstractions, such as rights and democracy. These decisions are partly influenced by practical constraints. Some aspects of learning are more difficult or costly to assess using standardized measures than others. In reading, for example, students’ general understanding of the main topic of a text is typically more straightforward to assess than the extent to which a student has formed connections among parts of the text or applied the text to other texts or to real-world situations.
Third, the test maker develops specifications that dictate how many test questions of certain types will constitute a test form. Such a document describes the mix of item formats (such as multiple choice or short answer), the distribution of test questions across different content and skill areas (such as the number of test questions that will assess decimal numbers or percentages), and whether additional tools will be allowed (such as calculators or computers).
Fourth, specific test items (questions) are created to meet the test
specifications. After a set of test items of the correct types are created, the items are pilot-tested with students to see whether they are at the appropriate level of difficulty and are technically sound in other ways. On the basis of the results of the pilot test and expert reviews, the best test items are selected to be used on the final test. It is generally more difficult to design items at higher levels of cognitive complexity and to have such items survive pilot testing.
As a result of these necessary decisions about how to focus the content and the types of questions, the resulting test will measure only a subset of the domain being tested. Some material in the domain will be reflected in the test and other material in the domain will not. If one imagines the full range of material that might be appropriate to test for a particular subject—such as eighth grade mathematics as it is taught in a particular state—then the resulting test might include questions that reflect, for example, only three-quarters of that material. The rest of the material—in this hypothetical example, the remaining quarter of the subject that is excluded—would simply not be measured by the test, and this missing segment would typically be the portion of the curriculum that deals with higher levels of cognitive functioning and application of knowledge and skills.
Although the example of a test covering only three-quarters of a domain is hypothetical, it provides a useful way to think about what can happen if instruction shifts to focus on test preparation in response to test-based incentives. If teachers move from covering the full range of material in eighth grade mathematics to focusing specifically on the portion of the content standards included on the test, it is possible for test scores to increase while learning in the untested portions of the subject stays the same or even declines. That is, test preparation may improve learning of the three-quarters of the domain that is included on the test by increasing instruction time on that material, but that increase will occur by reducing instruction time on the remaining one-quarter of the material. The likely outcome is that performance on the untested material will show less improvement or decrease, but this difference will be invisible because the material is not covered by the test.
To this point, we have discussed problems with tests as accountability measures even when best practices are followed. In addition, now that tests are being widely used for high-stakes accountability, inappropriate forms of test preparation are becoming more widespread and problematic (Hamilton et al., 2007). Test results may become increasingly misleading as measures of achievement in a domain when instruction is focused too narrowly on the specific knowledge, skills, and test question formats that
are likely to appear on the test. Overly narrow instruction might include such practices as drilling students on practice questions that were released from prior years’ tests, focusing on the limited subset of skills, knowledge, and question formats that are most likely to be tested, teaching test-taking tricks (such as the process of elimination for multiple-choice items or memorizing the two “common Pythagorean ratios” rather than learning the Pythagorean theorem), or teaching students to answer open-ended questions according to the test’s scoring rubric. When scores increase on a test for which students have been “prepared” in these ways, it indicates only that students have learned to correctly answer the specific kinds of questions that are included on that particular test. It does not indicate that students have also attained greater mastery of the broader domain that the test is intended to represent (Koretz, 2002).
Changing teaching in at least some classrooms is one goal of test-based incentives. Good test preparation is instruction that leads to students’ mastery of the full domain of knowledge and skills that a test is intended to measure. This mastery will incidentally improve large-scale test scores, but it will also be reflected elsewhere, for example, on other tests and in the application of knowledge outside school.
It is an essential goal of education reform that instruction be tied to the full set of intended learning goals, not just the tested sample of knowledge, skills, and question formats. Bad or inappropriate test preparation is instruction that leads to test score gains without increasing students’ mastery of the broader, intended domain, which can result from engaging in the types of inappropriate strategies discussed above. These practices are technically permissible and can even be appropriate to a limited degree, but they will not necessarily help students understand the material in a way that generalizes beyond the particular problems they have practiced. Mastering content taught in test-like formats has been shown not to generalize to mastery of the same content taught or tested in even slightly different ways (Koretz et al., 1991). In this kind of situation, test scores are likely to give an inflated picture of students’ understanding of the broader domain.
If test score gains are meaningful, they must generalize to the intended domain, and if they do, they should also generalize to a considerable extent to other tests and nontest indicators of the same domain. For that reason, trends in performance on the National Assessment of Educational Progress (NAEP)—a broad assessment designed to reflect a national consensus about important elements of the tested domains—are frequently compared with trends on the tests that states use for accountability.
One study examined the extent to which the large performance gains shown on the Kentucky Instructional Results Information System (KIRIS), the state’s high-stakes test, represented real improvements in student
learning rather than inflation of scores (Koretz and Barron, 1998). The study found evidence of score inflation. Even though KIRIS was designed partially to reflect the frameworks of NAEP, very large and rapid KIRIS gains in fourth grade reading from 1992 to 1994 were not matched by gains in NAEP scores. Although large KIRIS gains in mathematics from 1992 to 1994 in the fourth and eighth grades were accompanied by gains in NAEP scores, Kentucky’s NAEP gains were roughly one-fourth as large as the KIRIS gains and were typical of gains shown in other states. At the high school level, the large gains that students showed on KIRIS in mathematics and reading were not reflected in their scores on the American College Testing (ACT) college admissions tests.1 A Texas study found similar evidence of score inflation (Klein et al., 2000).
In a recent comparison of state test and NAEP results between 2003 and 2007, the Center on Education Policy (2008) found that trends in reading and mathematics achievement on NAEP generally moved in the same positive direction as trends on state tests, although gains on NAEP tended to be smaller than those on state tests. The exception to the broad trend of rising scores on both assessments occurred in eighth grade reading, in which fewer states showed gains on NAEP than on state tests.
The average scores on state accountability tests tend to rise, sometimes dramatically, every year for the first 3 or 4 years of use and then level off (Linn, 2000). When an existing test is then replaced with a new test or test form, the scores on the new test rise while the scores on the old test fall. Linn surmised that these initial gains reflect growing familiarity with the specific format and content of the new test. This explanation was supported by a study in which students were retested with an old test 4 years after a new test had been introduced in a large district (Koretz et al., 1991): while students’ performance on the new test had increased, their performance had dropped on the test no longer routinely used. This result showed that the initial gains on the new test were specific to that test and did not support a conclusion of improved learning in the subject matter domain. A number of other studies provide persuasive evidence that gains on high-stakes accountability tests do not always generalize to other assessments given at approximately the same time in the same subjects (Fuller et al., 2006; Ho and Haertel, 2006; Jacob, 2005, 2007; Klein et al., 2000; Koretz and Barron, 1998; Lee 2006; Linn and Dunbar, 1990).
There is also evidence that teachers themselves lack confidence in the meaningfulness of the score gains in their own schools. A survey of educators in Kentucky asked respondents how much each of seven factors had contributed to score gains in their schools (Koretz et al., 1996a).
1The two tests measure somewhat different constructs, but the overlap was sufficient that one would expect a substantial echo of the KIRIS trends in ACT data.
Over half of the teachers said that “increased familiarity with KIRIS [the state test]” and “work with practice tests and preparation materials” had contributed a great deal. In contrast, only 16 percent reported that “broad improvements in knowledge and skill” had contributed a great deal. Very similar results were found in Maryland (Koretz et al., 1996b).
Fundamentally, the score inflation that results from teaching to the test is a problem with attaching incentives to performance measures that do not fully reflect desired outcomes in a domain that is broader than the test. It is unreasonable to implement incentives with narrow tests and then criticize teachers for narrowing their instruction to match the tests. When incentives are used, the performance measures need to be broad enough to adequately align with desired outcomes. One route to doing this is to use multiple measures, which we discuss later on in the chapter. However, another important route to broadening the performance measures is to improve the tests themselves. Finally, given the inherent limits in the breadth that can be achieved on tests, it is important to evaluate test results for possible score inflation.2
Broadening Tests to Reflect the Domain of Interest
In an accountability context, when incentives have been attached to the results, a test will not provide good information about students’ learning unless it samples well, in both breadth and depth, from the content that students have studied and asks questions in a variety of ways, so that performance on the test represents performance in the full domain. The essential question is whether a test’s results can be generalized beyond that test.
In current practice, this concern is addressed in part by examining the alignment of tests with content and performance standards. However, it is not enough to have the limited alignment obtained when test publishers show that all of their multiple-choice items can be matched somewhere within the categories of a state’s content standards (Shepard, 2003). Rather, what is needed is a more complete and substantive type of alignment “that occurs when the tasks, problems, and projects in which students are engaged represent the range and depth of what we say we want students to understand and be able to do. Perhaps a better word than alignment would be embodiment” (Shepard, 2003, p. 123). Shepard goes on to warn that “when the conception of curriculum represented by a state’s large-scale assessment is at odds with content standards and curricular goals, then the ill effects of teaching to the external, high-stakes
2Such monitoring can be done by looking at low-stakes tests that are not attached to the incentives. In addition, see the work by Koretz and Béguin (2010) on possibilities for designing tests that include a component to self-monitor for score inflation.
test, especially curriculum distortion and nongeneralizable test score gains, will be exaggerated” (p. 124). To the extent feasible, it is important to broaden the range of material included on tests to better reflect the full range of what we expect students to know and be able to do.
In addition to broadening the range of material included on tests to better reflect the content standards they are intended to measure, it is also important to broaden the questions that are used to assess performance. Currently, one can find many unnecessary recurrences in the characteristics of many tests—unneeded similarities in content, format, other aspects of presentation, and aspects of the responses demanded (Koretz, 2008a). In some cases, one can find items that are near clones of items used in previous years, with only minor details changed. These unnecessary recurrences provide opportunities for coaching, and, indeed, test preparation materials often focus on them. Reducing these recurrences would make it harder to focus instructional time on tested details and thereby reduce score inflation when incentives are attached to the tests.
Incentives are rarely attached directly to individual test scores; rather, they are usually attached to an indicator that summarizes those scores in some way. The indicators that are constructed from test scores have a crucial role in determining how the incentives operate. Different indicators created from the same test can produce dramatically different incentives.
A choice of indicator is fundamentally a choice about what a policy maker values and what pressures the policy maker wants to create through the incentives of test-based accountability. Is the goal to affect particular students, such as those who are high achievers, low achievers, or English learners? Is the only goal to ensure that everyone reaches some minimum performance level, or should progress that falls short of the minimum, as well as progress beyond it, also be encouraged? It can be difficult to talk about the trade-offs that these questions imply, but the indicators used in test-based accountability implicitly include decisions about how such trade-offs have been made.
For example, two commonly used ways of constructing indicators from test scores—mean scores and minimum performance levels—result in dramatically different incentives. A mean score places value on scores at all levels of achievement: every student who improves raises the mean and every student who declines lowers the mean. An incentive attached to a mean score will focus efforts on the scores of students at any achievement level whose scores can be increased. In contrast, a performance standard for a specific minimum level of achievement focuses attention on the scores near the cut score that represents the standard. When a standard
that defines a minimum performance level is set, efforts are focused on raising the performance of students below the standard up to that level, while keeping students just above it from falling below it. An incentive attached to an indicator based on a minimum performance level will focus instruction on students believed to be near the standard; there is no incentive to improve the performance of students who are already well above the standard or who are far below it.
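The contrast between the two indicators can be sketched with a handful of invented scores and a hypothetical cut score of 70:

```python
# Two indicators computed from the same set of scores. The scores and
# the proficiency cut score (70) are invented for illustration.

scores = [45, 55, 68, 69, 71, 72, 85, 95]
CUT = 70

def pct_proficient(xs):
    """Percentage of scores at or above the cut score."""
    return 100 * sum(s >= CUT for s in xs) / len(xs)

mean_score = sum(scores) / len(scores)
baseline = pct_proficient(scores)

# Raise one score by 5 points in two different places:
raised_low  = [50, 55, 68, 69, 71, 72, 85, 95]   # 45 -> 50, far below cut
raised_near = [45, 55, 73, 69, 71, 72, 85, 95]   # 68 -> 73, crosses the cut

# The mean indicator rewards both changes identically (both raise the
# mean by 0.625 points); the proficiency indicator registers only the
# gain that crosses the cut score.
```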
Research has demonstrated the effect that incentives based on performance standards can have in focusing attention on students who are near the standard. In a study that analyzed test scores before and after the introduction of Chicago’s own accountability program in 1996, and before and after the introduction of the No Child Left Behind (NCLB) Act requirements in 2002 (Neal and Schanzenbach, 2010), the greatest gains were shown by students in the middle deciles, particularly the third and fourth. Little or no gain was shown in the top decile, and the bottom two deciles showed no improvement or even a decline. A similar pattern was found in Texas during the 1990s (Reback, 2008): “marginal” students, meaning those on the cusp of passing or failing a state exam used to judge the quality of schools, showed the greatest improvements because the accountability system provided strong incentives for teachers to focus on them. Two other studies (Booher-Jennings, 2005; Hamilton et al., 2007) also found that teachers focused their efforts on students near the proficiency cut score; teachers even reported being concerned about the consequences of doing so for the instruction of high- and low-achieving students.
Indicators based on performance standards were adopted to give more interpretable summaries to policy makers and the public of how groups of students are performing. There is some question whether the use of performance standards actually accomplishes this goal of greater interpretability. The simple performance labels that are shared across many tests—basic, proficient, and advanced—mask substantial variation within the categories. The reason for this variation is that standard-setting is a judgmental process that can be affected by the particular process used, the panelists who implement the process, and the political pressures that may lead to adjustments for the final levels. Different standard-setting methods often produce dramatically different results, as do different groups of panelists (Buckendahl et al., 2002; Jaeger et al., 1980; Linn, 2003; Shepard, 1993). Despite improvements in standard-setting methods over time, performance standards vary greatly in rigor across the states (McLaughlin et al., 2008). One prominent researcher concluded that this variability is so great as to render characterizations such as “proficient” meaningless—despite the rhetorical power of such terms (Linn, 2003). In any case, it is important to realize that the use of performance standards
has additional implications when incentives are attached to indicators that are based on those performance standards.
Another basic difference in types of indicators is the contrast between indicators that look at the levels of test scores and indicators that look at changes in those levels. There are several different ways of constructing indicators that look at test score changes: cohort-to-cohort changes, growth models, and value-added models.
Cohort-to-Cohort Changes

Some indicators of change look at the test scores of successive cohorts of students in a particular grade to see if the performance of the students in that grade is improving over time. NCLB includes an indicator based on this kind of cohort-to-cohort change in its “safe harbor” provision, which gives credit to schools that have sufficiently improved the percentage of students meeting the proficient performance standard in successive years, even if the percentage does not yet meet the state’s target for that year (Center on Education Policy, 2003).
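As a sketch, the safe harbor check can be written as a simple rule: the percentage of non-proficient students must fall by at least 10 percent from the prior year’s cohort, which is how the provision is commonly described; the school percentages below are hypothetical:

```python
# Sketch of an NCLB-style "safe harbor" check on cohort-to-cohort change.
# The 10% required reduction follows the provision as commonly described;
# the example percentages are invented.

def meets_safe_harbor(pct_proficient_last_year, pct_proficient_this_year,
                      required_reduction=0.10):
    """True if the non-proficient share fell by at least the required fraction."""
    non_prof_last = 100 - pct_proficient_last_year
    non_prof_this = 100 - pct_proficient_this_year
    return non_prof_this <= (1 - required_reduction) * non_prof_last

# A school at 40% proficient last year (60% non-proficient) meets safe
# harbor at 46% proficient this year (54% non-proficient, a 10% cut),
# even if 46% is still below the state's proficiency target.
```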
Growth Models

Some indicators of change look at the growth paths of individual students using longitudinal data that include multiple test scores for each student over time (see, e.g., Raudenbush, 2004). An indicator based on growth can adjust for the point at which students start in each grade and focus on how much they progress in that year. Growth models are technically challenging for two reasons: scores must be linked from year to year for each student, which is difficult when students change schools (although many states are making substantial progress), and the models may require tests that are linked from grade to grade, which is also difficult to do (Doran and Cohen, 2005; Michaelides and Haertel, 2004). Researchers have proposed an approach to modeling growth that would structure both instruction and tests around “learning progressions” that describe learning in terms of conceptual milestones in each subject (National Research Council, 2006b), but such an approach is not yet common.
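A minimal growth indicator can be sketched as each student’s year-over-year gain compared with a target annual gain; the students, scale scores, and the 30-point target are all invented:

```python
# Minimal growth indicator: each student's gain over the prior year,
# compared with a target annual gain. All values are invented.

students = {
    "A": (420, 455),   # (last year's score, this year's score)
    "B": (480, 492),
    "C": (390, 442),
}
TARGET_GAIN = 30

gains = {name: this - last for name, (last, this) in students.items()}
pct_on_track = 100 * sum(g >= TARGET_GAIN for g in gains.values()) / len(gains)
# Each student is judged against progress from his or her own starting
# point, rather than against a single fixed performance level.
```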
Value-Added Models

There has been widespread interest in a special type of growth model that attempts to control statistically for differences across students to make it possible to quantify the portions of student growth that are due to schools or teachers. The appeal of indicators based on value-added models is the promise that they could be used to fairly compare the effectiveness of different schools and teachers, despite the substantial differences in the types of students at different schools and the factors that determine how students are assigned to teachers and schools. This is an active area of research, but the extent to which value-added
models can realize their promise has not yet been determined (see, e.g., Braun, 2005; National Research Council, 2010).
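One simple form such a model can take, predicting current scores from prior-year scores and averaging each teacher’s residuals, is sketched below; operational value-added models control for many more factors, and the data here are invented:

```python
# Bare-bones value-added sketch: regress current scores on prior scores,
# then average each teacher's residuals. Real models add many more
# controls; the data are invented for illustration.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# (prior-year score, current score, teacher)
data = [(400, 430, "T1"), (450, 470, "T1"),
        (410, 425, "T2"), (460, 465, "T2")]

xs = [d[0] for d in data]
ys = [d[1] for d in data]
slope, intercept = fit_line(xs, ys)

residuals = {}
for prior, current, teacher in data:
    residuals.setdefault(teacher, []).append(
        current - (slope * prior + intercept))

# A teacher's estimate is how far his or her students score, on average,
# above or below what their prior scores predicted.
value_added = {t: sum(r) / len(r) for t, r in residuals.items()}
```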
These different ways of deriving indicators from changes in test scores focus on different questions and so can be used to provide different incentives when consequences are attached to the indicators. Cohort-to-cohort change indicators look at the change in successive cohorts and may be especially useful during periods of reform when schools and teachers are making substantial changes over a short period of time. In periods when the education system is relatively stable, there is no reason to expect cohort-to-cohort indicators to show any change. Growth indicators look at changes for individual students and provide a way of isolating the learning that occurs in a given year. Because one always expects students to be learning—whether there is education reform or not—growth models need to provide some sort of target to indicate what level of annual change is appropriate. Indicators of growth based on learning progressions offer a way to do this that is tied to the curriculum in a meaningful way, but the necessary curricula and tests for this approach have not yet been developed. Finally, indicators based on value added expand the focus beyond student learning to the contributions of their teachers or schools to learning, with the attempt to identify the portion of learning that can be attributed to a teacher or school. As with growth models in general, value-added models have no natural metric that defines how much value added is appropriate or to be expected. These models have been used to look at the distribution of results for different teachers and schools to identify those that are apparently more effective or less effective in raising test scores, as well as possible mechanisms for increasing effectiveness.
We also note the use of subgroup indicators, which have been an important part of the test-based accountability structure of NCLB. If there is concern that group measures may systematically mask the performance of different subgroups, then it is possible to calculate an indicator using the test scores of different subgroups of students rather than the test scores for the entire student population. Attaching incentives to indicators of test results for different subgroups focuses attention on how each of those subgroups is doing.
In summary, different indicators constructed from the same test can provide very different types of information and very different pressures for change when incentives are attached to them. When choosing an indicator, it is necessary for policy makers to think carefully about the changes they want to bring about, the actions that would bring about those changes, and the people who could perform those actions. The answers to these questions must guide the aggregation of students’ scores
into indicators so that the indicators highlight useful information that can help bring about the desired changes.
Each type of indicator also brings its own technical challenges, which may limit its ability to provide information that is fair, reliable, and valid. It is important to address these technical issues, and we have mentioned some of them briefly in our discussion. However, the message from our review of the research—an assessment of the big picture about the use of test-based incentives—is that different indicators result in very different incentives. Consequently, it is important for policy makers to fully consider possible indicators when they are designing a system of test-based accountability.
The tests that are typically used to measure performance in education fall short of providing a complete measure of desired educational outcomes in many ways. In addition, the indicators constructed from tests highlight particular types of information. Given the broad outcomes people want and expect for education, the necessarily limited coverage of tests, and the ways that indicators constructed from tests focus on particular types of information, it is prudent to consider designing an incentives system that uses multiple performance measures.
One of the basic research findings detailed in Chapter 2 is the importance of aligning performance measures with desired outcomes. As we note in that chapter, incentive systems in other sectors tend to evolve toward using increasing numbers of performance measures as experience with the limitations of particular performance measures accumulates. This evolution can be viewed as a search for a set of performance measures that better covers the full range of desired outcomes and also monitors behavior that merely inflates the measured performance without actually improving outcomes. In this section we discuss the use of multiple performance measures in education.
Professional standards for educational testing and guidelines for using tests emphasize that important decisions should not be made on the basis of a single test score and that other relevant information should be taken into account (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; National Research Council, 1999). Adding information about student performance from other sources can enhance the validity and reliability of decisions. This standard was originally conceived with individual students in mind, cognizant of the fact that tests are only samples of what students know and can do. For example, when a student fails a high school exit exam, taking into account other test scores or samples
of the student’s work can guard against denying a diploma to someone who really has mastered the requisite material.
As the consequences of testing have become more serious for entire schools, education stakeholders are increasingly advocating the use of multiple measures for school accountability to help guard against wrongly identifying schools as failing or needing intervention. Adopting appropriate multiple measures is a design choice that satisfies professional standards and can offer a better representation of the full range of educational goals. Given the context of our focus on incentives, we are particularly interested in the possibility that a set of multiple measures may better reflect education goals and so can provide better incentives when consequences are attached to those measures.
“Multiple measures” is often used loosely and can refer to many different things. Sometimes it is used to mean multiple opportunities with the same measure: for example, in many states, students are allowed to retake the high school exit exam until they pass. In our discussion here, we exclude that interpretation, multiple opportunities to take the same test, because it does not provide a way to broaden the performance measure to better reflect educational goals. Rather, we focus on two other meanings of the term. One is the use of more than one indicator of a student’s performance in one subject area, such as using both standardized test scores and teachers’ judgments to determine a student’s level of mathematics achievement. The second is assessing student achievement in multiple subjects, such as reading, writing, mathematics, and science, and combining indicators across domains. In both kinds of multiple measures, indicators can be combined in a conjunctive or compensatory fashion, each of which has implications when consequences are attached, as discussed below.
Conjunctive models combine indicators but do not allow high performance on one measure to compensate for low performance on another. For example, NCLB uses a conjunctive or multiple-hurdle model. In order to make adequate yearly progress, a school must meet each of a number of conditions. The first is that 95 percent of students in each numerically significant subgroup must be tested. Then, all students, as well as all subgroups, must meet targets for percentage proficient. In addition, there are targets for attendance and graduation rates. This combination of measures is used to determine whether schools are making adequate yearly progress, with consequences if they are not. The consequences attached to this conjunctive system of measures send the message that each indicator is important and schools are expected to meet each target. The result is that
there is only one way to pass—to meet all of the requirements—and many ways to fail. For example, with NCLB, a school may have excellent test scores, but a shortfall in attendance would still cause the school to fail to make adequate yearly progress (Linn, 2007). With multiple ways to fail, the consequences in this system focus attention on the areas that are in danger of not meeting their targets.
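The multiple-hurdle logic described above can be sketched in code. This is a minimal illustration, not the actual NCLB determination: the subgroup names, targets, and data below are hypothetical assumptions chosen only to show how a single shortfall causes failure under a conjunctive rule.

```python
# Hypothetical sketch of a conjunctive ("multiple-hurdle") rule in the
# spirit of NCLB's adequate-yearly-progress determination.  All names,
# targets, and data are illustrative assumptions, not NCLB's parameters.

def makes_ayp(school, pct_tested_target=0.95, proficiency_target=0.60,
              attendance_target=0.90):
    """Return True only if every hurdle is cleared; no measure can
    compensate for another."""
    for subgroup in school["subgroups"].values():
        if subgroup["pct_tested"] < pct_tested_target:
            return False      # too few students tested in a subgroup
        if subgroup["pct_proficient"] < proficiency_target:
            return False      # a subgroup missed the proficiency target
    if school["attendance_rate"] < attendance_target:
        return False          # attendance hurdle missed
    return True               # the one way to pass: meet every requirement

school = {
    "subgroups": {
        "all": {"pct_tested": 0.98, "pct_proficient": 0.85},
        "ell": {"pct_tested": 0.97, "pct_proficient": 0.62},
    },
    "attendance_rate": 0.88,  # excellent scores, but attendance falls short
}
print(makes_ayp(school))      # prints False: one shortfall fails the school
```

Note that there is exactly one path through the function that returns True, while each `if` statement is a separate way to fail, mirroring the "one way to pass, many ways to fail" structure of the conjunctive model.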
In contrast to conjunctive models, compensatory models combine multiple indicators so that a low score in one area can be offset by a high score in another. This produces an overall picture of whether performance targets are being achieved, across the range of areas, but it does not require that each of the individual targets is achieved. Attaching consequences to a system of multiple measures using a compensatory model provides incentives to improve overall performance; the consequences in this system focus attention on the areas where there are the most opportunities for improvement, not areas that are most in danger of failing to meet their individual targets, because there are no individual targets. Compensatory incentives are appropriate in cases where policy makers want to ensure overall performance levels across a number of areas but not where they have individual targets for each of those areas that they view as critical.
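A compensatory model can be sketched just as briefly. The sketch below uses a weighted composite with a single overall cut score; the indicator names, weights, and cut score are hypothetical assumptions for illustration, not any state's actual formula.

```python
# Minimal sketch of a compensatory model: indicators are combined into a
# weighted composite, so strength in one area can offset weakness in
# another.  Indicator names, weights, and the cut score are assumptions.

def composite_index(indicators, weights):
    """Weighted average of indicator scores (each on a 0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(indicators[k] * w for k, w in weights.items()) / total_weight

weights = {"reading": 0.4, "mathematics": 0.4, "attendance": 0.2}
school = {"reading": 55, "mathematics": 90, "attendance": 95}  # weak reading

index = composite_index(school, weights)
meets_target = index >= 70    # one overall target, no per-indicator targets
print(round(index, 1), meets_target)   # prints: 77.0 True
```

Here the low reading score is offset by strong mathematics and attendance results, so the school meets the overall target even though it would have failed a conjunctive rule with a per-indicator target of, say, 60 in each area.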
In Ohio, for example, the system involves four indicators that are combined in a compensatory way to classify its districts and schools into five categories of performance. The four indicators are (1) the performance indicators for each grade and subject area (reading, writing, mathematics, science, and social studies); (2) a performance index that is a composite score based on all tested grades and subjects, weighted so that scores above proficient count more than those below proficient; (3) a growth calculation; and (4) adequate yearly progress under NCLB. Each indicator uses scores from the statewide testing program, and two of the indicators also consider attendance and graduation rates. The way the four different indicators complement each other to produce an aggregate measure has been described by one expert (Chester, 2005) as better than any single measure in capturing the varied outcomes that the state wants to monitor and encourage. For example, rather than viewing NCLB’s measure of adequate yearly progress as a substitute for the state’s entire system, Ohio understood that that measure provides crucial monitoring of subgroup performance that had previously been lacking in its system. Thus the adequate yearly progress indicator provides important additional information on the overall performance of the schools in the state, even though it fails to capture crucial information—about other
subject areas, different levels of performance, and growth—that the other indicators in the system provide.
In cases where compensatory systems bring together different independent measures, they can have greater reliability than conjunctive systems in a statistical sense because information about the overall performance accumulates across indicators, and the random fluctuations that affect any single indicator tend to offset each other; a chance positive on one indicator can be offset by an equally chance negative on another, but information about performance is present in all indicators (Chester, 2005; Linn, 2007).
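The statistical point above can be demonstrated with a short simulation: when independent indicators each measure the same underlying performance with random error, averaging them shrinks the error roughly by the square root of the number of indicators. The error model below (independent Gaussian noise around a common true score) is an assumption made only for illustration.

```python
# Simulation of why a compensatory composite of independent indicators is
# more reliable than any single indicator: random errors partially cancel
# when indicators are averaged.  The independent-Gaussian-noise error
# model and all numeric values are illustrative assumptions.

import random
import statistics

random.seed(0)
true_score = 70.0     # the school's "true" performance on a 0-100 scale
error_sd = 10.0       # measurement error of each individual indicator
n_indicators = 4
n_trials = 10_000

single, composite = [], []
for _ in range(n_trials):
    noisy = [true_score + random.gauss(0, error_sd)
             for _ in range(n_indicators)]
    single.append(noisy[0])                   # one indicator alone
    composite.append(statistics.mean(noisy))  # compensatory average

print(round(statistics.stdev(single), 1))     # close to 10
print(round(statistics.stdev(composite), 1))  # close to 5 (10 / sqrt(4))
```

The composite's error is about half the single indicator's, which is the sense in which a compensatory system of independent measures can have greater reliability than any of its components. A conjunctive system, by contrast, inherits the error of its least reliable hurdle, since chance fluctuation on any single indicator can flip the overall decision.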
Compensatory systems can combine indicators either in a single subject area or across subject areas. Each version can be appropriate for some objectives and inappropriate for others.
The structure of high school exit exams in many states provides an example of the use of compensatory measures within a single subject area. Although people commonly think of high school exit exam requirements as requiring students to pass a single test, the actual requirement in many states involves additional routes to meeting the target. These multiple routes effectively create a compensatory system of multiple measures. In 2006, 16 of the 25 states with exit exams had policies in place for an alternate route to a diploma for students who could not pass the exams, yet had adequate attendance records and grades (Center on Education Policy, 2006b). For example, in a number of states students can use course grades, a collection of classroom work, or the results from a different test in the subject—such as an AP test—to make up for a failure to pass a subject on the state’s high school exam. In states that allow these multiple routes, the high school exit exam requirement provides an overall incentive to meet the requirement but not to pass the test itself.
Similarly, there are examples of incentive systems that use compensatory models across subject areas. For example, at the individual level, Maryland’s high school exit exam uses an overall score that combines results for different subjects (Center on Education Policy, 2005). At the state level, California’s accountability program uses an academic performance index that combines indicators from four different tests: the state’s standards-referenced test, a norm-referenced test, an alternate test, and the state’s high school exit exam. The tested subjects are English, mathematics, history/social science, and science. The indicators are weighted on a scale that was determined by the state board of education and combined to give a final metaindicator of school performance (California Department of Education, 2011).
The essential principle in using a compensatory system of multiple measures is that attaching consequences to an overall compensatory index focuses the incentives at an overall level that uses a broader performance
measure than any one measure alone. If the compensatory system is used for multiple indicators within a single subject area, then incentives will focus attention more broadly across the full range of the subject than a single test would. If the compensatory system is used for multiple indicators across subject areas, then incentives will focus attention across the full range of subject areas. In both cases, there are no targets for the individual measures—which means no targets on the individual tests when compensatory measures are used within a single subject area and no targets on the individual subjects when compensatory measures are used across subject areas. Attaching incentives to a compensatory system of multiple measures within a subject area may be appropriate for a critical subject area in which there is concern about the necessarily limited coverage of each of the available measures. Attaching incentives to a compensatory system of multiple measures across subject areas may be appropriate where there is more concern about tracking overall performance and less concern about the relative performance in particular subject areas.
An Alternative Approach to Multiple Measures: Using Test Scores as a Trigger
Another possible approach is to use large-scale test scores as a trigger for a more in-depth evaluation, as proposed by Linn (2008). Under such a system, teachers or schools with low scores on standardized tests would not be subject to automatic sanctions. Instead, the results of standardized tests would be used as descriptive information in order to identify schools that may need a review of their organizational and instructional practices. With such identification, the appropriate authority would begin an intensive investigation to determine whether the poor performance was reflected in other measures, possibly including subjective measures.
One way of thinking about the trigger approach is that it effectively institutes a system of multiple measures in stages, incorporating additional measures of school performance only when the test score measures indicate a likelihood that there is a problem. The approach trades the greater reliability and validity of applying a full system of multiple measures to all schools for a more detailed inspection of only those schools identified as possibly in trouble. In addition, the approach combines the step of obtaining additional information with the opportunity to provide initial recommendations for improvement, if they seem to be warranted.
Variations of this approach are already being used in some places (see Archer, 2006; McDonnell, 2008). For example, in Britain, teams of inspectors visit schools periodically to judge the quality of their leadership and ability to make improvements. The inspectors draw on test scores, school self-evaluations, and input from parents, teachers, and students and then issue a report on various aspects of the school’s performance.