Considerations for Policy Makers
The purpose of the workshop was to scan the research on value-added methods to identify their strengths and limitations, which can inform policy makers’ decisions about whether and how to proceed with the implementation of these methods in different settings. This chapter summarizes the key policy-relevant messages that emerged from the workshop.
Many participants emphasized that value-added models have the potential to provide useful information for educational decision making, beyond that provided by the test-based indicators that are widely used today. These models are unique in that they are intended to provide credible measures of the contributions of specific teachers, schools, or programs to student test performance. At the same time, participants recognized that there are still many technical and practical issues that need to be resolved in order for researchers to feel confident in supporting certain policy uses of value-added results.
Workshop participants expressed a range of views about the most critical challenges, and the intensity of their concerns varied. Robert Linn expressed concern about overselling by proponents of value-added modeling: “Some think it’s the can opener that can open any can…. More modest claims are ones that I would endorse.” Adam Gamoran and Robert Gordon, among others, focused on what they saw as the advantages of these models over indicators based on student status. Gordon observed that although many important technical issues still need to be resolved, it is not realistic to think that policy makers will wait 20 years until all of the
difficulties are worked out before making use of such methods. Decisions about schools and teachers are being made, and, as Jane Hannaway noted, there is enormous demand from the policy side, to which the testing and research communities need to respond as quickly as possible. Some of the technical problems may never be resolved, as is the case with current status models, but many participants asserted that value-added methods can still be used, albeit with caution.
At the same time, throughout the workshop, participants raised a number of questions that they thought were important for policy makers to be asking if they are considering using value-added indicators for evaluation and other purposes.
A RANGE OF VIEWS
Compared to What?
Kevin Lang suggested that, when deciding whether to use value-added methods, one question for decision makers to ask is “Compared to what?” If these models are intended to replace other indicators, will they provide information that is more useful, accurate, or fair than what is currently available? If they are being considered as an additional indicator (in conjunction with others), will the incremental gain in information be substantively meaningful?
Dale Ballou reminded the group that every method for evaluating effectiveness with respect to student achievement (e.g., status, growth, value-added) has risks and rewards. So the question “Compared to what?” is also important to ask about the risk-reward trade-off associated with different test-based evaluation strategies. Many of the concerns about value-added models—including concerns about the models themselves (e.g., transparency and robustness to violations of assumptions), concerns about the test data that feed into the models (e.g., reliability, validity, scaling), and concerns about statistical characteristics of the results (e.g., precision, bias)—also apply to some extent to the assessment models that are currently used by the states. Value-added models do raise some unique issues, which were addressed at the workshop.
Regardless of which evaluation method is chosen, risk is unavoidable. That is, in the context of school accountability, whether decision makers choose to stay with what they do now or to do something different, they are going to incur risks of two kinds: (1) identifying as failing (i.e., truly ineffective) some schools that really are not and (2) neglecting to identify some schools that really are failing. One question is whether value-added models used in place of, or in addition to, other methods will help reduce those risks.
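These two risks can be made concrete with a small simulation sketch (all quantities here are invented for illustration, not drawn from any system discussed at the workshop): generate true school effects, add estimation noise, flag the bottom 20 percent by the noisy estimate, and tabulate the two kinds of misclassification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools = 1000

# True (unobservable) school effectiveness and a noisy estimate of it.
true_effect = rng.normal(0.0, 1.0, n_schools)
estimate = true_effect + rng.normal(0.0, 1.0, n_schools)  # substantial noise

# Label the bottom 20% "failing" -- once by the truth, once by the estimate.
truly_failing = true_effect < np.quantile(true_effect, 0.20)
flagged = estimate < np.quantile(estimate, 0.20)

# Risk (1): schools flagged as failing that really are not.
false_positive_rate = (flagged & ~truly_failing).sum() / flagged.sum()
# Risk (2): truly failing schools that escape identification.
false_negative_rate = (~flagged & truly_failing).sum() / truly_failing.sum()

print(f"flagged but not truly failing: {false_positive_rate:.0%}")
print(f"truly failing but not flagged: {false_negative_rate:.0%}")
```

Even this crude sketch shows that with noisy estimates a sizable share of flagged schools are misidentified; the question for decision makers is whether a given indicator shrinks these rates relative to the alternatives.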
A good strategy when considering the use of a value-added approach is to try not to judge the benefits and drawbacks of various value-added models in isolation. Rather, it would be more appropriate to take a systemic view and think about how a value-added indicator would fit into a particular evaluation system, given the political context, the values of different stakeholders, the types of tests and other data available, and the indicators that will be constructed, as well as the sanctions or rewards to be attached to schools in the different classifications determined by those indicators. Of course, the availability of adequate funding and appropriate expertise also needs to be taken into account. As Sean Reardon and others suggested, this is a design problem: If the overarching challenge is to improve the quality of education for all children, what are the most powerful strategic levers that policy makers can use, given the current situation, and what can be done in the context of measurement to make the most progress in a cost-effective way? Given the time and expense necessary to carry out a value-added evaluation, is the resulting information more useful for the purpose of educational improvement than the information and indicators currently used—or that provided by other, nonquantitative means?
Dan McCaffrey has conducted some research suggesting that recent implementations of value-added models to improve schooling outcomes have fallen short of expectations. As an example, he cited a pilot program in Pennsylvania in which the information derived from the model was not found by administrators to be very useful—or, at best, of limited additional value compared with the information provided by existing indicators. (However, he noted that these findings were obtained very early in the implementation process.) John Easton described a similar phenomenon in Chicago: essentially the “lack of internal capacity to use [information] profitably,” but he nonetheless believes that value-added models can be used for research and evaluation and eventually to identify good school-wide and classroom-based teaching practices.
Is There a Best Value-Added Method?
There are many different types of value-added models, and, to date, no single method dominates. No value-added approach (or any test-based indicator, for that matter) addresses all the challenges to identifying effective or ineffective schools or teachers. As explained in Chapter 4, most workshop participants thought that fixed-effects models generally worked well to minimize the bias that results from selection on fixed (time-invariant) student characteristics, whereas models employing student characteristics and teacher random effects worked well to minimize variance. More needs to be learned about how important properties, such as mean-squared error and stability, vary across different value-added approaches applied in various contexts, as well as the implications of these choices for accountability system design.
To date, econometricians have generally favored fixed-effects approaches, and statisticians have used random-effects or mixed-effects approaches.[1] One message from the workshop is that disciplinary traditions should not dictate model choices. Neither approach is best in all situations; one ought to ask which model makes the most sense for the particular research or policy problem being faced, given the data available, and so on.
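The bias-variance contrast between the two traditions can be illustrated with a toy simulation (this is a sketch under invented numbers, not any model presented at the workshop): raw per-teacher means play the role of fixed-effects estimates, while empirical-Bayes shrinkage toward the grand mean plays the role of random-effects estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_students = 50, 10

# Simulated gain scores: each teacher's true effect plus student-level noise.
true_effects = rng.normal(0.0, 1.0, n_teachers)
gains = true_effects[:, None] + rng.normal(0.0, 3.0, (n_teachers, n_students))

# "Fixed-effects" style estimate: the raw per-teacher mean (unbiased, noisy).
fixed = gains.mean(axis=1)

# "Random-effects" style estimate: shrink each mean toward the grand mean,
# weighted by the usual empirical-Bayes reliability ratio.
var_between = 1.0                # assumed variance of true teacher effects
var_within = 9.0 / n_students    # sampling variance of a teacher mean
shrinkage = var_between / (var_between + var_within)
random_eff = shrinkage * (fixed - fixed.mean()) + fixed.mean()

mse = lambda est: np.mean((est - true_effects) ** 2)
print(f"MSE, fixed-effects style:  {mse(fixed):.3f}")
print(f"MSE, random-effects style: {mse(random_eff):.3f}")
```

In this setup the shrunken estimates trade a little bias for a large reduction in variance and so achieve lower mean-squared error; with different class sizes or variance ratios the comparison can come out differently, which is exactly why the model choice should follow the problem rather than the discipline.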
What Is Needed to Implement a Value-Added Model?
Workshop participants talked about the different capacities that a statewide or district system would need to have in order to properly implement, and to derive meaningful benefits from, a value-added analysis, for example:
- a longitudinal database that tracks individual students over time and accurately links them to their teachers, or at least to schools (if the system will be used only for school and not for teacher accountability);
- confidence that missing data are missing for legitimate reasons (such as student mobility), not because of problems with the data collection system;[2] and
- expert staff to run or monitor the value-added analyses, either in-house or through a contractor.
To maximize the utility of the value-added analysis, some workshop presenters suggested that the system would also need to have
- a vertically coherent set of standards, curriculum and pedagogical strategies that are linked to the standards, and a sequence of tests that is well aligned to that set of standards (with respect to both content coverage and cognitive complexity);
- a reporting system that effectively presents results and provides sufficient support so that users are likely to make appropriate inferences from the analysis;
- an ongoing training program for teachers and administrators, so that they can understand and use the results constructively; and
- a mechanism to monitor and evaluate the model's effects on teachers and students, so the program can be adapted if unintended consequences arise.

[1] Statisticians use the term "mixed-effects" to denote regression models that incorporate as predictors a set of student characteristics (whose corresponding regression coefficients are treated as fixed) and a set of coefficients representing schools or teachers (which are thought of as being randomly drawn from some distribution). It is unfortunate that the two disciplines sometimes attach different meanings to the same term, thereby adding confusion to discussions involving adherents of both traditions.

[2] As discussed in Chapter 4, data missing because of student mobility can introduce bias and increase variability in value-added estimates.
It is important to bear in mind that the above are necessary conditions for the optimal use of a value-added analysis. Even if all these capacities were in place, however, technical problems noted in this report, such as those related to bias and precision, would need to be examined prior to implementation for high-stakes purposes.
How High Are the Stakes?
A recurring message throughout the workshop was that value-added models could be useful for low-stakes purposes that do not have serious consequences for individual teachers or schools (such as to help make decisions about professional development needs), but that persistent concerns about precision and bias militate against employing value-added indicators as the principal basis for high-stakes decisions.
One complication is determining exactly what constitutes low versus high stakes. What are low stakes for one person might be high stakes for another. For example, a state official might consider simply reporting school test results to the media, without any sanctions attached to the results, to be low stakes; but a teacher or a principal may feel that such public reporting amounts to high stakes, because it affects her professional reputation and negative results can cause her embarrassment. When there is uncertainty about how different stakeholders will perceive the stakes associated with the results of a value-added system, decision makers should err on the side of assuming that the stakes are high and take the necessary precautions.
The consequential validity of an indicator system refers to the appropriateness of actions or uses derived from the test score inferences. Judgments regarding consequential validity can rest on technical analyses, as well as on the examination of both short-term and long-term outcomes. Of course, there may be disagreement among observers about whether the consequences are on the whole positive or negative.
Gordon argued that evidence of the validity of the value-added estimates should be commensurate with the stakes attached. Himself a lawyer, Gordon made a legal analogy.
In the law, we make rules based on our judgments on the relative importance of different interests. We don’t like to put innocent people in jail, so we tip the scales against wrongful convictions. That’s why we apply the beyond reasonable doubt standard in criminal cases. It’s why criminal courts typically refuse to admit polygraph tests. And as a result, we let guilty people get out of jail. It’s not a bug, it’s a feature. When the stakes do not involve the stigma and loss of liberty from a criminal conviction, we act differently. In civil cases, we just want to get the right answer, so we apply a preponderance of the evidence standard. That’s 50 percent plus one, and actually courts are more likely … to admit polygraphs, because the goal is just to make the right judgment, even if it’s just barely.
In other words, the greater the consequences, the greater the burden on the evidence that is brought to bear. This suggests that, for high-stakes purposes, there needs to be solid evidence of the reliability and validity of value-added results—evidence that, in the view of many workshop participants, is currently not to be had.[3] As discussed below, this view prompted the idea that value-added models be used in combination with other accepted indicators of teacher or school performance when making high-stakes decisions.
Is This a Fair Way to Evaluate Teachers?
Of the various uses to which value-added models could be put, workshop participants expressed a number of concerns regarding their use for high-stakes decisions affecting individual teachers, such as promotions or pay. The first problem is that value-added estimates for teachers are usually based on small numbers of students. As discussed in Chapter 3, measurement error tends to be greater when aggregate test scores are based on a smaller number of students' test scores than when based on a larger number. Because longitudinal student data are needed, missing data can further reduce the sample size. Many teachers simply do not teach a large enough sample of students to be credibly evaluated by a value-added model. Furthermore, as Lorrie Shepard noted, if high-stakes decisions are to be made about individual teachers, one would need to provide safeguards, such as data on multiple cohorts of students to determine whether the teacher was, for example, low one year or low three years in a row. Such an approach imposes substantial data requirements.[4]
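A back-of-the-envelope calculation shows why small classes are a problem: the standard error of a class-average score shrinks only with the square root of the number of students. (The standard deviation of 20 points used here is an assumed, purely illustrative value.)

```python
import math

# Standard error of a class-average gain score for several class sizes.
student_sd = 20.0  # assumed SD of individual student gain scores

for n_students in (10, 25, 100):
    se = student_sd / math.sqrt(n_students)
    print(f"n = {n_students:3d}: standard error of the class mean = {se:.1f}")
```

Quadrupling the number of students only halves the standard error, so a teacher with a single class of 10 or 25 students is evaluated with far more noise than a school aggregating hundreds of scores.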
Second, while any value-added model is almost certainly biased in favor of some groups of teachers and against others, it is usually difficult to determine which ones are which. With status models, which are widely recognized to favor teachers of higher achieving and more advantaged pupils, people frequently make ad hoc adjustments in their interpretation to reflect the known direction of bias. In contrast, with value-added results, not enough is generally known to ascertain the appropriate direction of the correction. In part, this is due to the different sources of confounding that can result in biased estimates.
That said, value-added methods could be useful for lower stakes purposes, such as identifying (apparently) high-performing or low-performing teachers to inform teacher improvement strategies. Brian Stecher suggested that a value-added analysis could provide a preliminary, quantitative indicator to identify certain teachers who might employ pedagogical strategies or exhibit certain behaviors to be emulated, as well as teachers who might need to change their strategies or behaviors. However, statistical analysis alone cannot reveal the specific changes to be made—that requires both direct observation and expertise in pedagogy and professional development. One should certainly be open to the possibility that the evidence gathered in this manner may lead to evaluations that are at odds with those derived from statistical analysis.
How Might Value-Added Modeling Fit into a System of Multiple Measures?
Many workshop presenters favored using value-added models in combination with other measures, particularly when high stakes are attached to the results. As Henry Braun stated, "Even if we are attracted to value-added, and certainly value-added has many advantages over status systems, we are not ready to give up on status."[5] An ideal approach would be to find ways of combining value-added, status, and other types of indicators about teacher, school, or program effectiveness. Doug Willms suggested that additional measures include information about school
contextual variables, school process variables, school discipline, and so on. At a minimum, such measures would provide information that could assist in the interpretation of differences among schools or teachers in their value-added results. Mark Wilson pointed out that, in addition to value-added and status models, there is a third alternative: growth (or change) models that do not include value-added adjustments.
There are several ways to combine and report on multiple measures. A school profile report comprises a variety of indicators, usually displayed side by side. Presenting value-added as just one of many indicators could reduce the chance of readers placing too much emphasis on it—or on any other indicator, for that matter. Of course, different observers would focus on different indicators, but the more comprehensive picture would be available and educators would feel that, at least in principle, stakeholders could consider the full range of school outcomes.
In many states, a single index is required and so a rule for combining the indicators must be developed. A simple rule involves standardizing the indicator values and then calculating a weighted average. In a more complex rule, the value of each indicator must exceed a predetermined threshold for a school to avoid sanctions or to be awarded a commendation. For example, the state of Ohio has developed a school rating system that incorporates four measures: (1) graduation and attendance rates, (2) adequate yearly progress under No Child Left Behind, (3) a performance index that combines all test results on a single scale, and (4) a value-added estimate.[6] Schools are placed into one of five categories depending on the values of these indicators in relation to the thresholds.[7]
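The two combining rules can be sketched in a few lines (the school names, indicator values, weights, and thresholds below are all hypothetical, not Ohio's actual figures):

```python
import statistics

# Hypothetical indicator values for three schools.
schools = {
    "A": {"attendance": 0.95, "status": 72.0, "value_added": 1.2},
    "B": {"attendance": 0.90, "status": 60.0, "value_added": -0.4},
    "C": {"attendance": 0.99, "status": 85.0, "value_added": 0.1},
}
weights = {"attendance": 0.2, "status": 0.5, "value_added": 0.3}

# Rule 1: standardize each indicator across schools, then take a weighted mean.
def zscores(values):
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

names = list(schools)
z = {ind: dict(zip(names, zscores([schools[s][ind] for s in names])))
     for ind in weights}
index = {s: sum(weights[ind] * z[ind][s] for ind in weights) for s in names}

# Rule 2: conjunctive thresholds -- every indicator must clear its bar.
thresholds = {"attendance": 0.92, "status": 65.0, "value_added": 0.0}
passes = {s: all(schools[s][ind] >= thresholds[ind] for ind in thresholds)
          for s in names}

print(index)   # continuous composite scores
print(passes)  # pass/fail under the conjunctive rule
```

Note how the two rules can disagree: a school with one weak indicator can still earn a respectable weighted composite, yet fail the conjunctive rule outright, which is why the choice of combining rule is itself a policy decision.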
Scott Marion and Lorrie Shepard described Damian Betebenner's work on the reporting system for Colorado as a good illustration of how status and value-added models might be combined, although this system includes a growth model, not a value-added one. The Colorado Growth Model offers a way for educators to understand how much growth a student made from one year to the next in comparison to his or her academic peers. In many ways, it is similar to a value-added model.[8] The Colorado Growth Model compares each student's performance with students in the same grade throughout the state who had the same sequence of test scores in previous years. The model then produces a (conditional) growth percentile for each student, much like children's height and weight growth charts. A student who grew more than 60 percent of his or her academic peers would have a (conditional) growth percentile of 60. Using this model it is possible to determine, in terms of growth percentiles, how much progress a student needs to make to reach proficiency within one, two, or three years.[9]

[6] For grades 9-12, a growth measure rather than a value-added estimate will be employed.

[7] For more details, see http://www.achieve.org/files/World_Class_Edu_Ohio_FINAL.pdf (pp. 58-59).

[8] Colorado's growth model conditions on all possible prior scores, uses a regression-based estimation procedure, and produces results based on a measure of central tendency.
In addition to calculating and reporting growth results for each student, school, and district, the Colorado Department of Education produces school and district reports depicting both growth and status (percentage proficient and above) results in what has been termed a "four quadrant" report. This report is basically a 2 × 2 figure with growth depicted on the x-axis and divided into those schools (or districts) producing above-average and below-average amounts of student growth. Similarly, the y-axis represents status, in this case in terms of percentage proficient, divided (arbitrarily) between higher and lower than average status results. This report allows stakeholders to easily see that schools in the lower left quadrant, for example, have both lower than average percentages of students achieving proficiency and lower than average student growth. Such schools might well be considered the highest priority schools for intervention. (To view sample reports, go to http://www.cde.state.co.us/cdeedserv/GrowthCharts-2008.htm.)
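The idea of a conditional growth percentile can be sketched in a few lines (the peer scores below are invented; an operational system such as Colorado's uses a regression-based procedure over the full prior-score history rather than the exact matching shown here):

```python
from bisect import bisect_left

def growth_percentile(current_score, peer_scores):
    """Share of academic peers (same prior-score sequence) the student outgrew."""
    peers = sorted(peer_scores)
    below = bisect_left(peers, current_score)  # peers scoring strictly lower
    return round(100 * below / len(peers))

# Current-year scores of statewide peers who shared this student's
# sequence of prior-year scores (hypothetical data).
peers = [410, 415, 423, 430, 431, 440, 447, 455, 460, 472]
print(growth_percentile(443, peers))  # grew more than 6 of 10 peers -> 60
```

A student with a growth percentile of 60 grew more than 60 percent of his or her academic peers, matching the interpretation given above; the same machinery can be run forward to ask what percentile of growth would be needed to reach proficiency in one, two, or three years.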
Stecher mentioned another approach, which is sometimes referred to as the “triage” strategy. Status and value-added indicators could be used to trigger visits by a team of “inspectors” to conduct a closer evaluation of a school that may be in need of improvement. The point is that test scores cannot tell why students’ achievement levels and test score trajectories are problematic; trained inspectors might uncover extenuating circumstances or identify specific educational practices that might help. Properly implemented, such a strategy could lead to improvements in school effectiveness. A variant of this approach, involving both quantitative and qualitative measures, is currently being used both in England and in New York City schools.
Hannaway pointed out that there are many levels of decision making in education, and different types of information might be most useful—or politically attractive—at different levels. For example, the federal government might have an accountability system focused primarily on status measures in reading and mathematics. But that does not preclude states and school districts from employing additional indicators that they bring to bear on the allocation of resources or decisions about school viability (as in Ohio, as described above).
[9] For a sample report see Colorado Department of Education, http://www.cde.state.co.us/FedPrograms/AYP/download/index_coaypgrowpro.pdf.
How Important Is Transparency?
As discussed in Chapter 4, designers of an evaluation system must consider the trade-off between complexity and transparency. Judith Singer observed that the term transparency was used during the workshop to refer to several different but related ideas. The first relates to fairness: people desire a system that is equitable, cannot be "gamed," and rewards the teachers and schools that truly deserve it. The second relates to methodology—that is, methodologists could "inspect" the model and the estimation machinery in order to evaluate them in relation to professional standards. The third relates to availability of information—that is, providing the public with understandable information about how the methods work. All three seem to be important.
Gordon raised the issue of whether many of the models are simply “too esoteric to be useful to teachers in the real world.” Another workshop attendee described his experience implementing value-added models in Chicago and New York. He found that teachers generally understood issues like sample size and random distribution, so the complexity of the models may not necessarily be an overwhelming issue. He felt that teachers would come down on the side of fairness over transparency. That is because they may see status models as unfair, particularly if they have a number of special education students in one year, or because they easily understand that they can have “good” or “bad” batches of students in a given year. “They will go with the more complex models, because the transparent ones they see through easily as being unfair.” It is probably most important for stakeholders to understand the logic of using value-added modeling rather than the actual estimation methods.
Another key consideration related to complexity is the resources required to implement the more complex models. Complex models require greater technical expertise on the part of staff. It is critical that the staff conducting sophisticated analyses have the expertise to run them correctly and interpret the results appropriately. These analyses will typically be contracted out, but in-house staff still need to have the expertise to understand and monitor the contractor. Complex value-added models also usually require more comprehensive data, the availability and quality of which places limits on the complexity of the models that can be considered.
How Will the Consequences of Using Value-Added Models Be Monitored?
Several participants emphasized the importance of monitoring the consequences of using a value-added model to determine its utility in helping states and districts achieve their education goals. When these models are
used for real-world purposes, they have consequences, intended and unintended. In the health care example described in Chapter 2, Ashish Jha described how an adjusted status model caused many apparently low-performing surgeons to quit the profession as intended, but also caused some surgeons to turn away high-risk patients, which was unintended. The education field needs to find ways to monitor the impact of value-added models not only on student achievement but also on school policies, instructional practices, teacher morale and mobility, and so on. For example, monitoring the impact on instruction may involve collecting longitudinal data about teacher behaviors, curriculum, and allocation of instructional time across subject areas. Ideally, data collection would begin before implementation of the new system and extend for some years afterward, a substantial undertaking. It is important to allow for flexibility and adaptation over time, as knowledge is accumulated about how the accountability system impacts the larger education system. As Henry Braun commented, “We are dealing with very, very complex systems; there’s no reason to believe that we will get it right the first time.”
Several workshop participants remarked that people should not hold value-added models to higher standards than other measures that are already being widely used for accountability and other high-stakes purposes. All test-based indicators have limitations, and measurement experts have long advised that no single indicator should be the sole basis for high-stakes decisions (National Research Council, 1999). There are well-known problems with the kinds of status models that are now used for accountability. In particular, teachers and schools with more advantaged students will tend to rank higher on status measures than will equally skilled teachers and schools with less advantaged students. It is natural, then, for policy makers and the education community to seek alternatives.
This report conveys some of the advantages that can accrue with the use of value-added models. However, it also presents information about why value-added results are not completely trustworthy. For example, the estimates produced by value-added models are biased (as explained in Chapter 4), and it is difficult to assess the direction and magnitude of the bias. When a school with many disadvantaged students performs at the state average with respect to a status model, most observers would be inclined to judge the school’s performance as laudable, although they probably would not do so if the student population were drawn from a very advantaged community. And if the above-average performance were the outcome of a value-added analysis, one would be unsure whether
the school’s performance was truly above average or whether it reflected bias in the model. To some extent, concerns about the bias in value-added estimates can be addressed by continuing research. However, since randomized experiments are rare in education—and those that are conducted take place in special circumstances so that generalizability can be a problem—it will be hard to ever be fully confident that the application of a particular statistical model in a specific setting produces essentially unbiased value-added estimates. Moreover, precision is an ongoing problem and, as Linn pointed out, there is a great deal of measurement error in the test results fed into these models, which in turn induces substantial uncertainty in the resulting estimates.
Any evaluation method leads to implicit causal interpretations. When a school does not make adequate yearly progress under the status model of No Child Left Behind, most people infer that this is an indication of the school’s lack of effectiveness. With a little reflection, however, many will come to understand that a school serving a disadvantaged community faces a greater challenge in making adequate yearly progress than does a school serving an advantaged community. That is, many people understand that there are limits to what can be inferred from status results. Because value-added models involve sophisticated statistical machinery and the results explicitly attribute components of achievement gains to certain schools or teachers, people are more likely to accept the causal interpretations. The sources of imprecision and bias are less transparent but still present. Linn talked about the “scientific aura” around these models and the danger that it may lead people to place more faith in the results than is warranted.
Although none of the workshop participants argued against the possible utility of value-added modeling, there was a range of perspectives about its appropriate uses at this time. The most conservative perspective expressed at the workshop was that the models have more problems than current status measures and are appropriate only for low-stakes purposes, such as research. Others felt that the models would provide additional relevant information about school, teacher, or program effectiveness and could be employed in combination with other indicators. For example, many suggested that they could be useful in conjunction with status models to identify high and low performers. Still others argued that while the models have flaws, they represent an improvement compared with current practices—namely, status models for determining school performance under No Child Left Behind or credential-based promotion and rewards for teachers.
In sum, most of the workshop participants were quite positive about the potential utility of value-added models for low-stakes purposes, but much more cautious about their use for high-stakes decisions. Most agreed
that value-added indicators might be tried out in high-stakes contexts, as long as the value-added information is one of multiple indicators used for decision making and the program is pilot-tested first, implemented with sufficient communication and training, includes well-developed evaluation plans, and provides an option to discontinue the program if it appears to be doing a disservice to educators or students.