Uses and Consequences of Value-Added Models
This chapter provides an overview of how value-added models are currently being used for research, school and teacher improvement, program evaluation, and school and teacher accountability. These purposes can overlap to some extent, and often an evaluation system will be used for more than one purpose. The use of these models for educational purposes is growing fast. For example, the Teacher Incentive Fund program of the U.S. Department of Education, created in 2006, has distributed funds to over 30 jurisdictions to experiment with alternate compensation systems for teachers and principals—particularly systems that reward educators (at least in part) for increases in student achievement as measured by state tests.1 Some districts, such as the Dallas Independent School District (Texas), Guilford County Schools (North Carolina), and Memphis City Schools (Tennessee), are using value-added models to evaluate teacher performance (Center for Educator Compensation Reform, no date; Isenberg, 2008).
If the use of value-added modeling becomes widespread, what are the likely consequences? These models, particularly when used in a high-stakes accountability setting, may create strong incentives for teachers and administrators to change their behavior. The avowed intention is for educators to respond by working harder or by incorporating different teaching strategies to improve student achievement. However, perverse incentives may also be
created, resulting in unintended negative consequences. On one hand, for example, since a value-added system compares the performance of teachers relative to one another, it could reduce teacher cooperation within schools, depending on how the incentives are structured. On the other hand, if school-level value-added is rewarded, it can create a “free rider” problem whereby some shirkers benefit from the good work of their colleagues, without putting forth more effort themselves. Because the implementation of value-added models in education has so far been limited, there is not much evidence about their consequences. At the workshop, some clues as to how educators might respond were provided by the case of a program instituted in New York that used an adjusted status model to monitor the effectiveness of heart surgeons in the state’s hospitals. We provide below some examples of how value-added models have recently been used in education for various purposes.
SOME RECENT USES
Exploratory Research
Value-added models can be useful for conducting exploratory research on educational interventions because they aim to identify the contributions of certain programs, teachers, or schools when a true experimental design is not feasible.
Workshop presenter John Easton has been studying school reform in Chicago for about 20 years. He and his colleagues used surveys of educators to identify essential supports for school success (inclusive leadership, parents’ community ties, professional capacity, student-centered learning climate, and ambitious instruction). The team then used a value-added analysis to provide empirical evidence that these fundamentals were indeed strongly associated with school effectiveness. As a result of this research, the Chicago Public School system has adopted these essential supports as its “five fundamentals for school success” (Easton, 2008).
Value-added models have also been used by researchers to gauge the relationship of various teacher qualifications (such as licensure, certification, years of experience, advanced degrees) to student progress. Workshop discussant Helen Ladd described her research, which applied a value-added model to data from North Carolina to explore the relationship between teacher credentials and students’ performance on end-of-course exams at the high school level (Clotfelter, Ladd, and Vigdor, 2007). The researchers found that teacher credentials are positively correlated with student achievement. One problem Ladd’s studies identified is that teachers with weaker credentials were concentrated in higher poverty schools, and the apparent effects of having low-credentialed teachers in
high school were large, particularly for African American students: “We conclude that if the teachers assigned to black students had the same credentials on average as those assigned to white students, the achievement difference between black and white students would be reduced by about one third” (Clotfelter, Ladd, and Vigdor, 2007, p. 38).
Easton argued that more research studies are needed using value-added models, as an essential first step in exploring their possible uses for accountability or other high-stakes purposes. “The more widely circulated research using value-added metrics as outcomes there is, the more understanding there will be about [how] they can be used most successfully and what their limits are” (Easton, 2008, p. 9).
School or Teacher Improvement
Value-added models are intended to help identify schools or teachers as more effective or less effective, as well as the areas in which they are differentially effective. Ideally, that can lead to further investigation and, ultimately, the adoption of improved instructional strategies. Value-added results might be used by teachers for self-improvement or target setting. At the school level, they might be used along with other measures to help identify the subjects, grades, and groups of students for which the school is adding most value and where improvement is needed. Value-added analyses of the relationships between school inputs and school performance could suggest which strategies are most productive, leading to ongoing policy adjustments and reallocation of resources. The models might also be used to create projections of school performance that can assist in planning, resource allocation, and decision making. In these ways, value-added results could be used by teachers and schools as an early warning signal.
Perhaps the best-known value-added model used for teacher evaluation and improvement is the Education Value Added Assessment System (EVAAS), which has been used in Tennessee since 1993. “The primary purpose … is to provide information about how effective a school, system, or teacher has been in leading students to achieve normal academic gain over a three year period” (Sanders and Horn, 1998, p. 250). The system was created by William Sanders and his colleagues, and this model (or variations of it) has been tried in a number of different school districts. EVAAS-derived reports on teacher effectiveness are made available to teachers and administrators but are not made public. State legislation requires that EVAAS results be part of the evaluation of those teachers for whom such data are available (those who teach courses tested by the statewide assessment program). How large a role the estimates of effectiveness are to play in teacher evaluation is left up to the district,
although EVAAS reports cannot be the sole source of information in a teacher’s evaluation. They are used to create individualized professional development plans for teachers, and subsequent EVAAS reports can be used to judge the extent to which improved teacher performance has resulted from these plans (Sanders and Horn, 1998).
Program Evaluation
When used for program evaluation, value-added models can provide information about which types of local or national school programs or policy initiatives are adding the most value and which are not, in terms of student achievement. These might include initiatives as diverse as a new curriculum, decreased class size, and approaches to teacher certification.
The Teach For America (TFA) Program recruits graduates of four-year colleges and universities to teach in public schools (K-12) in high-poverty districts. It receives funding from both private sources and the federal government. In recent years, the program has placed between 2,000 and 4,000 teachers annually. Recruits agree to teach for two years at pay comparable to that of other newly hired teachers. After an intensive summer-long training session, they are placed in the classroom, with mentoring and evaluation provided throughout the year. The program has been criticized because many believe that this alternate route to teaching is associated with lower-quality teaching. There is also the concern that, because the majority of participants leave their positions upon completing their two-year commitment, students in participating districts are being taught by less experienced (and therefore less effective) teachers. Xu, Hannaway, and Taylor (2007) used an adjusted status model (similar to a value-added model, but one that does not use prior test scores) to investigate these criticisms. Using data on secondary school students and teachers from North Carolina,2 the researchers found that TFA teachers were more effective in raising exam scores than other teachers, even those with more experience: “TFA teachers are more effective than the teachers who would otherwise be in the classroom in their stead” (p. 23). This finding may depend on the poor quality of the experienced teachers in the types of high-poverty urban districts served by the program.
School or Teacher Accountability
In an accountability context, consequences are attached to value-added results in order to provide incentives to teachers and school administrators to improve student performance. They might be used for such decisions as whether the students in a school are making appropriate progress for the school to avoid sanctions or receive rewards, or whether a teacher should get a salary increase. School accountability systems that use value-added models would provide this information to the public: taxpayers might be informed as to whether tax money is being used efficiently, and parents might be able to choose schools on a more informed basis. At this time, many policy makers are seriously considering using value-added results for accountability, and there is much discussion about these possible uses. But the design of a model might differ depending on whether the goal is to create incentives to improve the performance of certain students, to weed out weak teachers, or to inform parents about the most effective schools for their children.
In August 2008, Ohio began implementing a program that incorporates a value-added model. The program chosen by the state is based on the EVAAS model William Sanders developed for Tennessee. Ohio’s accountability system employs multiple measures, whereby schools are assigned ratings on the basis of a set of indicators. Until recently, the measures were (1) the percentage of students reaching the proficient level on state tests, as well as graduation and attendance rates; (2) whether the school made adequate yearly progress under No Child Left Behind; (3) a performance index that combines state test results; and (4) a measure of improvement in the performance index. Ohio replaced the last component with a value-added indicator. Instead of simply comparing a student’s gain with the average gain, the model develops a customized prediction of each student’s progress on the basis of his or her own academic record, as well as that of other students over multiple years, with statewide test performance serving as an anchor. The value-added gain is thus the difference between a student’s score in a given subject and the score predicted by the model. The school-level indicator is based on the averages of the value-added gains of its students. Consequently, Ohio will now rate schools using estimated value-added as one component among others. The model will be used only at the school level, not the teacher level, and only in the elementary and middle grades. Because tests are given only once in high school, in tenth grade, growth in student test scores cannot be determined directly (Public Impact, 2008).
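The arithmetic of the Ohio indicator can be sketched in a few lines. The scores below are hypothetical, and the predicted values stand in for the output of the EVAAS-style statistical model, which this sketch does not attempt to reproduce:

```python
def value_added_gain(observed: float, predicted: float) -> float:
    """A student's value-added gain: observed score minus model-predicted score."""
    return observed - predicted

def school_indicator(students: list[tuple[float, float]]) -> float:
    """School-level indicator: the average of the students' value-added gains."""
    gains = [value_added_gain(obs, pred) for obs, pred in students]
    return sum(gains) / len(gains)

# Hypothetical (observed, predicted) scale scores for three students in one school:
school = [(412.0, 405.0), (398.0, 401.0), (430.0, 422.0)]
print(school_indicator(school))  # mean of 7.0, -3.0, 8.0 -> 4.0
```

A positive indicator means the school’s students, on average, scored above what the model predicted for them; a negative one means they fell short of prediction.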
There are examples of using value-added modeling to determine teacher performance pay at the district level. The national Teacher Advancement Program (TAP) is a merit pay program for teachers that uses a value-added model of student test score growth as a factor in determining teacher pay. About 6,000 teachers in 50 school districts nationwide participate in this program, which was established by the Milken Family Foundation in 1999. Participating districts essentially create an alternate pay and training system for teachers, based on multiple career paths, ongoing professional development, accountability for student performance, and performance pay. TAP uses a value-added model to determine contributions to student achievement gains at both the classroom and school levels. Teachers are awarded bonuses based on their scores in a weighted performance evaluation that measures mastery of effective classroom practices (50 percent), student achievement gains for their classrooms (30 percent), and school-wide achievement gains (20 percent) (http://www.talentedteachers.org/index.taf).
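The TAP evaluation is a simple weighted composite. A minimal sketch, using the 50/30/20 weights from the text with hypothetical component scores on a 0-100 scale:

```python
# Weights are from the TAP description above; component names and scores are
# hypothetical stand-ins for the program's actual rubric.
TAP_WEIGHTS = {
    "classroom_practice": 0.50,  # mastery of effective classroom practices
    "classroom_gains": 0.30,     # student achievement gains for the teacher's classroom
    "schoolwide_gains": 0.20,    # school-wide achievement gains
}

def tap_score(components: dict[str, float]) -> float:
    """Weighted composite of the three TAP evaluation components."""
    return sum(TAP_WEIGHTS[name] * score for name, score in components.items())

example = {"classroom_practice": 80.0, "classroom_gains": 70.0, "schoolwide_gains": 60.0}
print(tap_score(example))  # 0.5*80 + 0.3*70 + 0.2*60 = 73.0
```

Note that 70 percent of the composite depends directly on test score gains, split between the teacher’s own classroom and the school as a whole.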
It should be noted that a number of other states have had performance pay programs for teachers, including Alaska, Arizona, Florida,3 and Minnesota, where growth in test scores is a factor, usually a rather small one, in determining teacher pay. However, these systems are based on growth models, not value-added models. Unlike value-added models, the growth models used do not control for background factors, other than students’ achievement in the previous year.
Low Stakes Versus High Stakes
A frequent theme throughout the workshop was that when test-based indicators are used to make important decisions, especially ones that affect individual teachers, administrators, or students, the results must be held to higher standards of reliability and validity than when the stakes are lower. However, drawing the line between high and low stakes is not always straightforward. As Henry Braun noted, what is “high stakes for somebody may be low stakes for someone else.” For example, simply reporting school test results through the media or sharing teacher-level results among staff—even in the absence of more concrete rewards or sanctions—can be experienced as high stakes for some schools or teachers. Furthermore, in a particular evaluation, stakes are often different for various stakeholders, such as students, teachers, and principals.
Participants generally referred to exploratory research as a low-stakes use and school or teacher accountability as a high-stakes use. Using value-added results for school or teacher improvement, or program evaluation, fell somewhere in between, depending on the particular circumstances. For example, as Derek Briggs pointed out, using a value-added model for program evaluation could be high stakes if the studies were part of the What Works Clearinghouse, sponsored by the U.S. Department of Education.
In any case, it is important for designers of an evaluation system to first set out the standards for the properties they desire of the evaluation model and then ask if value-added approaches satisfy them. For example, if one wants transparency to enable personnel actions to be fully defensible, a very complex value-added model may well fail to meet the requirement. If one wants all schools in a state to be assessed using the same tests and with adjustments for background factors, value-added approaches do meet the requirement.
POSSIBLE INCENTIVES AND CONSEQUENCES
To date, there is little relevant research in education on the incentives created by value-added evaluation systems and the effects on school culture, teacher practice, and student outcomes. The workshop therefore addressed the issue of the possible consequences of using value-added models for high-stakes purposes by looking at high-quality studies about their use in other contexts. Ashish Jha presented a paper on the use of an adjusted status model (see footnote 4, Chapter 1) in New York State for the purpose of improving health care. The Cardiac Surgery Reporting System (CSRS) was introduced in 1990 to monitor the performance of surgeons performing coronary bypass surgeries. The New York Department of Health began to publicly report the performance of both hospitals and individual surgeons. Assessment of the performance of about 31 hospitals and 100 surgeons, as measured by risk-adjusted mortality rates, was freely available to New York citizens. In this application, the statistical model adjusted for patient risk, in a manner similar to the way models in education adjust for student characteristics. The model tried to address the question: How successful was the treatment by a certain doctor or hospital, given the severity of a patient’s symptoms? The risk-adjustment model drew on the patients’ clinical data (adequacy of heart function prior to surgery, condition of the kidneys, other factors associated with recovery, etc.).
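The report does not specify the CSRS adjustment formula, but a common construction of a risk-adjusted mortality rate, analogous to the adjustment for student characteristics in education models, scales the statewide rate by a provider’s ratio of observed to expected deaths, where the expected count comes from the clinical risk model. A sketch with hypothetical numbers:

```python
# Hedged illustration only: the actual CSRS methodology is not given in the
# text, and all figures below are hypothetical.
def risk_adjusted_rate(observed_deaths: int,
                       expected_deaths: float,
                       statewide_rate: float) -> float:
    """Risk-adjusted mortality rate: (observed / expected) * statewide rate.

    expected_deaths is the number of deaths the risk model predicts for this
    provider's particular mix of patients.
    """
    return (observed_deaths / expected_deaths) * statewide_rate

# A hypothetical surgeon with 6 deaths where the risk model expected 4,
# against a statewide rate of 2.1 percent:
print(round(risk_adjusted_rate(6, 4.0, 2.1), 2))  # (6/4) * 2.1 = 3.15
```

Under this construction, a provider who treats sicker-than-average patients has a larger expected count, so the same number of observed deaths yields a lower adjusted rate, which is precisely what creates the documentation incentives discussed below.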
In 1989, prior to the introduction of CSRS, the risk-adjusted in-hospital mortality rate for patients undergoing heart surgery was 4.2 percent; eight years after the introduction of CSRS, this rate was cut in half to 2.1 percent, the lowest in the nation. Empirical evaluations of CSRS, as well as anecdotal evidence, indicate that a number of surgeons with high adjusted mortality rates stopped practicing in New York after public reporting began. Poor-performing surgeons were four times more likely
to stop practicing in New York within two years of the release of a negative report. (However, many simply moved to neighboring states.) Several of the hospitals with the worst mortality rates revamped their cardiac surgery programs. This was precisely what was hoped for by the state and, from this point of view, the CSRS program was a success.
However, there were reports of unintended consequences of this intervention. Some studies indicated that surgeons were less likely to operate on sicker patients, although others contradicted this claim. There was also some evidence that documentation of patients’ previous conditions changed in such a way as to make them appear sicker, thereby reducing a provider’s risk-adjusted mortality rate. Finally, one study conducted by Jha and colleagues (2008) found that the introduction of CSRS had a significant deleterious effect on access to surgery for African American patients. The proportion of African American patients dropped, presumably because surgeons perceived them as high risk and therefore were less willing to perform surgery on them. It took almost a decade before the racial composition of patients reverted to pre-CSRS proportions.
This health care example illustrates that, if value-added models are to be used in an education accountability context, with the intention of changing the behavior of teachers and administrators, one can expect both intended and unintended consequences. The adjustment process should be clearly explained, and an incentive structure should be put into place that minimizes perverse incentives. Discussant Helen Ladd emphasized transparency: “Teachers need to understand what goes into the outcome measures, what they can do to change the outcome, and to have confidence that the measure is consistently and fairly calculated…. The system is likely to be most effective if teachers believe the measure treats them fairly in the sense of holding them accountable for things that are under their control.”
Workshop participants noted a few ways that test-based accountability systems have had unintended consequences in the education context. For example, Ladd (2008) described South Carolina, which experimented in the 1980s with a growth model (not a value-added model). It was hoped that the growth model would be more appropriate and useful than the status model that had been used previously. The status model was regarded as faulty because the results largely reflected socioeconomic status (SES). It was found, however, that the growth model results still favored schools serving more advantaged students, which were then more likely to be eligible for rewards than schools serving low-income students and minority students. State and school officials were concerned. In response, they created a school classification system based mainly on the average SES of the students in the schools. Schools were then compared only with other schools in the same category, with rewards equitably distributed across categories. This was widely regarded as fair. However, one result was that schools at the boundaries had an incentive to try to get into a lower SES classification in order to increase their chances of receiving a reward.
Sean Reardon pointed out a similar situation based on the use of a value-added model in San Diego (Koedel and Betts, 2009). Test scores from fourth grade students (along with their matched test scores from third and second grade) indicated that teachers were showing the greatest gains among low-performing students. Possible explanations were that the best teachers were concentrated in the classes with students with the lowest initial skills (which was unlikely), or that there was a ceiling effect or some other consequence of test scaling, such that low-performing students were able to show much greater gains than higher-performing students. It was difficult to determine the exact cause, but had the model been implemented for teacher pay or accountability purposes, the teachers would have had an incentive to move to those schools serving students with low SES, where they could achieve the greatest score gains. Reardon observed, “That could be a good thing. If I think I am a really good teacher with this population of students, then the league [tables] make me want to move to a school where I teach that population of students, so that I rank relatively high in that league.” The disadvantage of using indicators based on students’ status is that one can no longer reasonably compare the effectiveness of a teacher who teaches low-skilled students with that of a teacher who teaches high-skilled students or compare schools with very different populations.
Adam Gamoran suggested that the jury has not reached a verdict on whether a performance-based incentive system that was intended to motivate teachers to improve would be better than the current system, which rewards teachers on the basis of experience and professional qualifications. However, he noted that the current system also has problematic incentives: it provides incentives for all teachers, regardless of their effectiveness, to stay in teaching, because the longer they stay, the more their salary increases. After several years of teaching, teachers reach the point at which there are huge benefits for persisting and substantial costs to leaving.
An alternative is a system that rewards more effective teachers and encourages less effective ones to leave. A value-added model that evaluates teachers has the potential to become part of such a system. At the moment, such a system is problematic, in part because of the imprecision of value-added teacher estimates. Gamoran speculated that a pay-for-performance system for teachers based on current value-added models would probably produce short-term improvements, because teachers would work harder for a bonus. He judged that the long-term effects are less clear, however, because of the imprecision of the models under some conditions.
Given this imprecision, a teacher’s bonus might be largely a matter of luck rather than a matter of doing something better. “Teachers will figure that out pretty quickly. The system will lose its incentive power. Why bother to try hard? Why bother to seek out new strategies? Just trust to luck to get the bonus one year if not another.” These potential problems might be reduced by combining a teacher’s results across several (e.g., three) years, thus improving the precision of teachers’ value-added estimates.
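The precision gain from pooling years follows from the standard result that averaging n independent estimates shrinks the standard error by a factor of the square root of n. The single-year standard error below is hypothetical, and real yearly estimates are not fully independent, so the actual improvement would be somewhat smaller:

```python
import math

def pooled_standard_error(se_single_year: float, n_years: int) -> float:
    """Standard error of the average of n independent single-year estimates."""
    return se_single_year / math.sqrt(n_years)

se = 6.0  # hypothetical standard error of one year's value-added estimate
print(round(pooled_standard_error(se, 3), 2))  # 6 / sqrt(3), roughly 3.46
```

Combining three years thus cuts the noise by roughly 40 percent under these idealized assumptions, which is why multi-year averaging is the most commonly suggested remedy for imprecise teacher estimates.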
Several workshop participants made the point that, even without strong, tangible rewards or sanctions for teachers or administrators, an accountability system will still induce incentives. Ben Jensen commented that when value-added scores are made publicly available, they create both career and prestige incentives: “If I am a school principal, particularly at a school serving a poor community, [and] I have a high value-added score, I am going to put that on my CV and therefore, there is a real incentive effect.” Brian Stecher also noted that for school principals in Dallas, which has a performance pay system, it is not always necessary to give a principal a monetary reward to change his or her behavior. There is the effect of competition: if a principal saw other principals receiving rewards and he or she did not get one, that tended to be enough to change behavior. The incentives created a dramatic shift in internal norms and cultures in the workplace and achieved the desired result.
NOT FOR ALL POLICY PURPOSES
Value-added models are not necessarily the best choice for all policy purposes; indeed, no single evaluation model is. For example, there is concern that adjusting for students’ family characteristics and school contextual variables might reinforce existing disadvantages in schools with a high proportion of students with lower SES, by effectively setting lower expectations for those students. Another issue is that value-added results are usually normative: Schools or teachers are characterized as performing either above or below average compared with other units in the analysis, such as teachers in the same school, district, or perhaps state. In other words, estimates of value-added have meaning only in comparison to average estimated effectiveness. This is different from current systems of state accountability that are criterion-referenced, in which performance is described in relation to a standard set by the state (such as the proficient level). Dan McCaffrey explained that if the policy goal is for all students to reach a certain acceptable level of achievement, then it may not be appropriate to reward schools that are adding great value but
still are not making enough progress.4 From the perspective of students and their families, school value-added measures might be important, but they may also want to know the extent to which schools and students have met state standards.
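McCaffrey’s point, that a school can look strong normatively while falling short of a fixed standard, can be made concrete with a small sketch; the thresholds and scores are hypothetical:

```python
# Hypothetical illustration of the normative vs. criterion-referenced contrast.
def normative_label(school_va: float, mean_va: float) -> str:
    """Normative: position relative to average estimated effectiveness."""
    return "above average" if school_va > mean_va else "at or below average"

def criterion_label(pct_proficient: float, target: float) -> str:
    """Criterion-referenced: performance against a fixed state standard."""
    return "meets standard" if pct_proficient >= target else "below standard"

# A high-value-added school serving a low-performing population can earn
# both labels at once:
print(normative_label(school_va=5.0, mean_va=0.0))        # above average
print(criterion_label(pct_proficient=48.0, target=75.0))  # below standard
```

The two labels answer different policy questions, which is why the choice between them depends on whether the goal is to compare units or to certify that students have met the state’s standard.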
Value-added models clearly have many potential uses in education. At the workshop, there was little concern about using them for exploratory research or to identify teachers who might benefit most from professional development. In fact, one participant argued that these types of low-stakes uses were needed to increase understanding about the strengths and limitations of different value-added approaches and to set the stage for their possible use for higher stakes purposes in the future.
There was a great deal of concern expressed, however, about using these models alone for high-stakes decisions—such as whether a school is in need of improvement or whether a teacher deserves a bonus, tenure, or promotion—given the current state of knowledge about the accuracy of value-added estimates. Most participants acknowledged that they would be uncomfortable basing almost any high-stakes decision on a single measure or indicator, such as a status determination. The rationales for participants’ concerns are explained in the next two chapters.