Analysts use a variety of methods to estimate value-added effects. All value-added models (VAMs) adjust for students’ starting level of achievement using prior test scores, but they do so in different ways. Some also adjust for student characteristics and school context variables. The outcome of applying any model is that some schools, teachers, or programs are identified as being significantly better or worse than average. The models differ in the number of years of data they use, the kinds of assumptions they make, how they handle missing data, and so on. Not surprisingly, findings may differ depending on the model chosen and how it is specified.
This chapter begins with a review of some major challenges with these analytic methods, including nonrandom assignment of teachers and students, bias, precision, stability, data quality, the balance between complexity and transparency, and causal interpretations. That is followed by a brief overview of two broad approaches to value-added modeling and the strengths and limitations of each. It concludes with a discussion about areas in which further research is most needed, as well as a summary of the main messages that emerged from the workshop regarding analytic approaches.
ANALYTIC CHALLENGES FOR VALUE-ADDED MODELING
Nonrandom Assignment of Teachers and Students
A primary goal of value-added modeling is to make causal inferences by identifying the component of a student’s test score trajectory that can be credibly associated with a particular teacher, school, or program. In other words, the purpose is to determine how students’ achievement differs, having been in their assigned school, teacher’s classroom, or program, from what would have been observed had they been taught in another school, by another teacher, or in the absence of the program. This is often referred to as the estimation of counterfactual quantities—for example, the expected outcomes for students taught by teacher A had they been taught by teacher B and vice versa.
The ideal research design for obtaining evidence about effectiveness is one in which students are randomly assigned to schools, teachers, or programs. With random assignment and sufficiently large samples, differences in achievement among schools, teachers, or programs can be directly estimated and inferences drawn regarding their relative effectiveness. However, in the real world of education, random assignment is rarely possible or even desirable. There are many ways that classroom assignments depart from randomness, and some are quite purposeful (e.g., matching individual students to teachers’ instructional styles).1 Different schools and teachers often serve very different student populations, and programs are typically targeted at particular groups of students, so straightforward comparisons may be neither fair nor useful.
As workshop presenter Dale Ballou explained, to get around the problem of nonrandom assignment, value-added models adjust for preexisting differences among students using their starting levels of achievement. Sometimes a gain score model is used, so the outcome measure is students’ growth from their own starting point a year prior; sometimes prior achievement is included as a predictor or control variable in a regression or analysis of covariance; and some models use a more extensive history of student test scores as control variables, as in William Sanders’s work.
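The difference between the first two adjustment strategies can be illustrated with a minimal sketch. The toy records, the function names, and the two-step residual calculation below are illustrative assumptions, not the specification of any model discussed at the workshop; operational models typically estimate teacher effects and the prior-score coefficient jointly and use far more data.

```python
from statistics import mean

# Hypothetical toy records: (teacher, prior_score, current_score).
records = [
    ("A", 40.0, 52.0), ("A", 50.0, 60.0), ("A", 60.0, 68.0),
    ("B", 60.0, 66.0), ("B", 70.0, 74.0), ("B", 80.0, 82.0),
]


def gain_score_effects(rows):
    """Teacher effect = mean gain of the teacher's students minus the overall mean gain."""
    overall = mean(cur - pri for _, pri, cur in rows)
    teachers = sorted({t for t, _, _ in rows})
    return {
        t: mean(cur - pri for tt, pri, cur in rows if tt == t) - overall
        for t in teachers
    }


def covariate_adjusted_effects(rows):
    """Teacher effect = mean residual from a pooled regression of current on prior score.

    Simplification (an assumption of this sketch): the slope is estimated
    without teacher indicators, then residuals are averaged by teacher.
    """
    x = [pri for _, pri, _ in rows]
    y = [cur for _, _, cur in rows]
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    intercept = y_bar - slope * x_bar
    teachers = sorted({t for t, _, _ in rows})
    return {
        t: mean(cur - (intercept + slope * pri) for tt, pri, cur in rows if tt == t)
        for t in teachers
    }


gain = gain_score_effects(records)
adjusted = covariate_adjusted_effects(records)
print(gain)
print(adjusted)
```

On these invented data, the gain score approach places teacher A 3 points above the mean and teacher B 3 points below it, whereas the covariate-adjusted approach yields differences of about 0.4 points. The ordering is the same, but the size of the estimated effects depends on the adjustment chosen.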
Many researchers believe that controlling for students’ prior achievement is not enough and that more needs to be done to statistically adjust for differences between the groups of students assigned to different schools, teachers, or programs. That is, the question is whether the test score history incorporated into the model is sufficient to account for differences among students on observed and unobserved characteristics that are statistically associated with academic achievement (an example of the latter is a systematic difference in student motivation or parent support at home). Ballou explained that nontest-score characteristics can be associated with students’ rates of gain in achievement, but relatively few of them are typically measured and available in education data sets. Some variables associated with achievement are generally available, such as students’ socioeconomic status, gender, or race. Other contextual factors are more difficult to quantify, such as home environment and peer influences, as well as various school characteristics.
Another problem is that educational inputs are generally conflated, so a classroom of students might receive inputs from the school administration, the teacher, other teachers in the school, the community, and other students in the classroom, many of which are related and overlap to some extent. For example, although a value-added model may purport to be estimating the effect of an individual teacher, adjusting for differences in student backgrounds and prior achievement, this estimate may also be confounded with (i.e., “picking up”) unmeasured contextual variables, such as the contributions of the school’s leadership, the quality of a teacher’s colleagues, and other factors. The contributions of these factors, positive or negative, may end up being attributed to the teacher.
Dan McCaffrey noted that most statistical models that have been used in practice have tended not to include student- or context-level predictor variables, such as race or socioeconomic status measures. One argument for excluding such covariates is that including them might imply different expectations for students of different sociodemographic classes. Another concern is that if a certain racial group is exposed to poorer teachers, the model could inappropriately attribute lower performance to race rather than to teacher quality.2 However, there are also technical challenges to including such variables in the model. Ballou, Sanders, and Wright (2004) investigated the effects of including these types of student-level covariates in models that avoided the technical problems; the researchers found that their inclusion had no appreciable effect on estimates of classroom effects. However, attempts to expand the methods to include classroom-level variables resulted in unstable estimates (Ballou, 2005).
Bias
Bias refers to the inaccuracy of an estimate that is due to a shortcoming or incompleteness in a statistical model itself. For example, imagine a value-added model focused on isolating the effectiveness of schools using school-wide test results. Suppose that the fourth grade test is a gateway test. In schools with advantaged children, large numbers of parents enroll their children in private test preparation sessions in advance of the exam, while parents of children in other schools do not. Students in the first group would tend to perform better on the test than would be predicted on the basis of the third grade test. Even if all schools provided instruction of equal quality, value-added estimates would indicate that the schools serving the first group were more effective, even though they were not responsible for the higher performance of their students. In this case, the estimates would be biased because the contributions of the private test preparation sessions are confounded with true school effectiveness. One way to address this bias would be to augment the model in such a way as to include outside test preparation as a variable (Organisation for Economic Co-operation and Development, 2008). Addition of more student background and context variables to a value-added model can reduce bias but can also lead to more complications, such as missing data.
The prior example illustrated the problem of underadjustment in the model. There is also the potential for the reverse problem of overadjustment. To continue the previous example, suppose that the fifth grade test is not a gateway test, and therefore parents in schools with advantaged children do not use tutoring. Now, children in these schools do less well on the fifth grade test than predicted based on their (test preparation inflated) fourth grade scores. Similarly, if the children in the advantaged schools do well on both the third and fourth grade tests, in part because such schools are able to hire better teachers, then, depending on the approach used, the model may attribute too much of the high fourth grade scores to the “quality of the students” reflected in the third grade scores and too little to the quality of the fourth grade teachers.
Finally, Ballou and a few others raised the issue that current value-added models assume that there is a single teacher effect that is common for all students. Yet one can readily imagine that one teacher might work very effectively with struggling students but not really be able to stimulate students already performing at high levels, and the opposite might be true of another teacher. Value-added models usually attempt to summarize a teacher’s effectiveness in a single number. If teacher quality is multidimensional in this sense, then frequently it will not be possible to say that one teacher is better than another because of the scaling issues discussed in Chapter 3. The importance of this problem depends on the goal of the model. If the objective is to rank all teachers, the problem is likely to be very serious. If the goal is to create incentives to teach struggling students well, the problem may be less serious.
Precision and Stability
The precision of the estimated effects is also an important issue. The precision problem differs from the bias problem in that it stems, in large part, from small sample sizes. Small sample sizes are more of a challenge for value-added models that seek to measure teacher effects rather than school effects. This is because estimates of school effects tend to be derived from test score data of hundreds of students, whereas estimates of teacher effects are often derived from data for just a few classes. (Elementary teachers may teach just one class of students each year, whereas middle and high school teachers may have more than 100 students in a given year.) If the number of students per teacher is low, just a few poorly performing students can lower the estimate of a teacher’s effectiveness substantially. Research on the precision of value-added estimates consistently finds large sampling errors. As McCaffrey reported, based on his prior research (McCaffrey et al., 2005), standard errors are often so large that about two-thirds of estimated teacher effects are not statistically significantly different from the average.
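The dependence of precision on the number of students can be sketched with the textbook standard error of a mean. The sketch assumes independent students and a known student-level standard deviation, and the function names and the numbers in the example are hypothetical; actual value-added models produce standard errors from more elaborate calculations.

```python
import math


def standard_error(student_sd, n_students):
    """Standard error of a mean of n independent student residuals."""
    return student_sd / math.sqrt(n_students)


def distinguishable_from_average(effect, student_sd, n_students, z=1.96):
    """True if the effect lies outside an approximate 95 percent band around zero."""
    return abs(effect) > z * standard_error(student_sd, n_students)


# A hypothetical effect of 0.2 student-level standard deviations:
for n in (25, 100, 400):
    print(n, round(standard_error(1.0, n), 3),
          distinguishable_from_average(0.2, 1.0, n))
```

Under these assumptions, an effect of 0.2 standard deviations cannot be distinguished from the average with a single class of 25 students but can be with the 400 students that a school-level estimate might draw on.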
A related problem is the stability of estimates. All value-added models produce estimates of school or teacher effects that vary from year to year. This raises the question of the degree to which this instability reflects real variation in performance from year to year, rather than error in the estimates. McCaffrey discussed research findings (Aaronson, Barrows, and Sanders, 2007; Ballou, 2005) demonstrating that only about 30 to 35 percent of teachers ranked in either the top or bottom quintile in one year remain there in the next year. If estimates were completely random, 20 percent would remain in the same quintile from one year to the next. If the definition of a weak teacher is one in the bottom quintile, then this suggests that a significant proportion of teachers identified as weak in a single year would be falsely identified. In another study, McCaffrey, Sass, and Lockwood (2008) investigated the stability of teacher effect estimates from one year and cohort of students to the next (e.g., the teacher effect estimates in 2000-2001 compared to those in 2001-2002) for elementary and middle school teachers in four counties in Florida. They computed 12 correlations (4 counties by 3 pairs of years) for elementary school teachers and 16 correlations (4 counties by 4 pairs of years) for middle school teachers. For elementary school teachers, the 12 correlations between estimates in consecutive years ranged from .09 to .34 with a median of .25. For middle school teachers, the 16 correlations ranged from .05 to .35 with a median of .205. Thus, the year-to-year stability of estimated teacher effects can be characterized as quite low.
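The link between a year-to-year correlation and quintile persistence can be explored with a small simulation. This is a sketch under strong assumptions that are not taken from the studies cited: estimates in the two years are jointly normal, each is the sum of a stable teacher component and independent noise, and the function name and parameters are hypothetical.

```python
import random


def quintile_persistence(correlation, n_teachers=20000, seed=0):
    """Share of year-1 top-quintile teachers who are also top quintile in year 2.

    Each year's estimate is sqrt(r) * stable effect + sqrt(1 - r) * noise,
    which gives the two years a correlation of r (0 <= r <= 1).
    """
    rng = random.Random(seed)
    signal = correlation ** 0.5
    noise = (1.0 - correlation) ** 0.5
    year1, year2 = [], []
    for _ in range(n_teachers):
        stable = rng.gauss(0.0, 1.0)
        year1.append(signal * stable + noise * rng.gauss(0.0, 1.0))
        year2.append(signal * stable + noise * rng.gauss(0.0, 1.0))
    # The 80th percentile of each year defines that year's top quintile.
    cut1 = sorted(year1)[int(0.8 * n_teachers)]
    cut2 = sorted(year2)[int(0.8 * n_teachers)]
    top1 = [i for i in range(n_teachers) if year1[i] >= cut1]
    stayed = sum(1 for i in top1 if year2[i] >= cut2)
    return stayed / len(top1)


for r in (0.0, 0.25, 0.9):
    print(r, round(quintile_persistence(r), 3))
```

With a correlation of zero the simulated persistence is close to the 20 percent expected by chance, and with a correlation of .25 it is about 30 percent, which is in line with the persistence rates reported above.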
Instability in value-added estimates is not only a result of sampling error due to the small numbers of students in classes. McCaffrey and his colleagues (2008) found that the year-to-year variability in teacher effects exceeded what might be expected from simple sampling error. This year-to-year variability generally accounted for a much larger share of the variation in effects for elementary school teachers than for middle school teachers (perhaps because middle school teachers usually tend to teach many more students in a single year than elementary teachers). Further, year-to-year variability was only weakly related to teachers’ qualifications, such as their credentials, tenure status, and annual levels of professional development. Whether this variability reflects real changes in teachers’ performance or a source of error at the classroom level (such as peer effects that are usually omitted from the model) remains unknown.
Instability will tend to erode confidence in value-added results on the part of educators because most researchers and education practitioners will expect that true school, teacher, or even program performance will change only gradually over time rather than display large swings from year to year. Moreover, if estimates are unstable, they will not be as credible for motivating or justifying changes in future behavior or programs. One possible solution would be to consider several years of data when making important decisions, such as teacher tenure.
Data Quality
Missing or faulty data can have a negative impact on the precision and stability of value-added estimates and can contribute to bias. The procedures used to transform the raw test data into usable data files, as well as the completeness of the data, should be carefully evaluated when deciding whether to use a value-added model. Student records for two or more years are needed, and it is not uncommon in longitudinal data files for some scores to be missing because of imperfect record matching, student absences, and students transferring into or out of a school.
A key issue for implementing value-added methods is the capacity to link students to their teachers. As Helen Ladd noted, many state data systems do not currently provide direct information on which students are taught by which teachers. Ladd stated, “Until recently, for example, those of us using the North Carolina data have had to make inferences about a student’s teacher from the identity of the proctor of the relevant test and a wealth of other information from school activity reports. In my own work, I have been able to match between 60-80 percent of students to their teachers at the elementary and high school levels but far lower percentages at the middle school level” (Ladd, 2008, p. 9). She went on to say that even if states start providing more complete data of this type, a number of issues still complicate the situation: for example, how to deal with students who are pulled out of their regular classes for part of the day, team-taught courses, and students who transfer into or out of a class in the middle of the year. Attributing learning to a certain school or teacher is difficult in systems in which there is high student mobility. Moreover, if the reason that the data are missing is related to test score outcomes, the resulting value-added estimates can be seriously biased.
Generally, the greater the proportion of missing data, the weaker the credibility of the value-added results. Of course, missing data are a problem for any type of test score analysis, but some models depend on student- or context-level characteristics, which may be especially incomplete. The integrity and completeness of such data need to be evaluated before implementing a value-added system. When value-added models are used for research purposes or program evaluation, the standard for what constitutes sufficient data may be somewhat lower than when the purpose is for school or teacher improvement or for accountability. Ladd emphasized this point, noting that if these models are to be used as part of a teacher evaluation system, capturing only 60-80 percent of the student data probably will not be sufficient; it may not be possible to include all teachers in the analysis.
Finally, there is the problem that very large numbers of teachers would not have test score data for computing value-added scores. Many subjects and grades are not currently assessed using large-scale tests, so most K-2 and high school teachers, as well as teachers of such subjects as social studies, foreign languages, physical education, and arts, are not directly linked to state-level student test scores. This presents a major obstacle to implementing a value-added evaluation system for teachers at a district level. (This problem applies to using status test score data for teacher evaluation as well.)
Complexity Versus Transparency
Value-added models range from relatively simple regression models to extremely sophisticated models that require rich databases and state-of-the-art computational procedures. McCaffrey and Lockwood (2008) suggest that “complex methods are likely to be necessary for accurate estimation of teacher effects and that accountability or compensation systems based on performance measures with weak statistical properties will fail to provide educators with useful information to guide their practice and could eventually erode their confidence in such systems” (p. 10). However, there is always a limit beyond which adding complexity to the analysis results in little or no advantage. When used for purposes such as accountability, the choice of models needs to balance the goals of complexity and accuracy, on one hand, and transparency, on the other. At the same time, it is likely that the importance attached to transparency will depend on other features of the accountability system of which the value-added model is but one component, as well as the political context in which the accountability system is operating.
Transparency refers to the ability of educators and the public to understand how the estimates were generated and what they mean. A major goal of improvement and accountability systems is to provide educators with signals about what is considered effective performance and whether they have achieved it, as well as to motivate lower performing individuals to change their behavior to improve their effectiveness. There is general agreement that highly complex statistical procedures are difficult for educators to understand, which leads to a concern that the use of such procedures might limit the practical utility of value-added models. Workshop participant Robert Gordon raised the issue of whether many of the models are simply “too esoteric to be useful to teachers in the real world.” This is an important consideration when these models are used for accountability because a key aspect of their success is acceptance by teachers and administrators. In contrast, when the models are used for research or program evaluation, transparency may not be important.
Transparency also may not be an overriding concern for public uses, such as for accountability. Henry Braun recounted a discussion with policy makers who judged that transparency was important but not crucial. These policy makers indicated that they did not need to know the details of what went into the “black box” to produce value-added results. If the results were trustworthy and the rationale could be explained in an understandable way, they believed that school systems would be willing to forgo transparency for the sake of accuracy. For example, most current tests are scored using item response theory, which is also very complex. However, test users generally accept the reported test scores, even though they do not fully understand the mathematical intricacies through which they are derived (i.e., the process for producing raw scores, scale scores, and equating the results to maintain year-to-year comparability).
A key consideration in the trade-off between complexity and transparency is the resources required to implement the more complex models. Complex models require greater technical expertise on the part of staff. It is critical that the staff conducting sophisticated analyses have the expertise to run them correctly and interpret the results appropriately. Complex models also usually require more comprehensive data. Data availability and data quality, as described in the previous section, place limits on the complexity of the models that can be considered. Thus, a number of issues have to be weighed to achieve the optimal balance among complexity, accuracy, and transparency when choosing a value-added model.
Causal Interpretations
Although not always stated explicitly, the goal of value-added modeling is to make causal inferences. In practical terms, this means drawing conclusions, such as that certain teachers caused the higher (or lower) achievement in their students.
The two disciplines that focus on value-added modeling take different approaches to this problem. The statistics discipline generally handles it by characterizing its models as descriptive, not causal; however, it does recognize that using such models to evaluate schools, teachers, or programs implicitly treats the results as causal effects. Lockwood and McCaffrey (2007) identify conditions under which the estimates derived from statistical models approximate causal effects. The economics discipline generally makes certain assumptions that, if met, support causal interpretations of value-added results obtained from the models it favors. The critical assumption is that any differences among classes, schools, or programs that are not captured by the predictor variables used in the model are captured by the student fixed-effect components. In the end, despite their status as empirical descriptions, the results of the statistical models are used in ways similar to the econometric models—that is, to support causal interpretations.
Rothstein (2009) tested the assumptions of the economics models in the context of estimating teacher effects in North Carolina. His idea was to see if estimated teacher effects can predict the achievement gains of their students in the years prior to these students being in their classes. For example, does a fifth grade teacher effect predict her students’ achievement gains when those students were third and fourth graders? Indeed, he found that, for example, fifth grade teachers were nearly as strongly linked statistically to their students’ fourth grade scores as were the students’ fourth grade teachers. Rothstein also found that the relationship between current teachers and prior gains differs by time span: that is, the strength of the statistical association of the fifth grade teacher with fourth grade gains differs from that with third grade gains.
Since teachers cannot rewrite the past, the finding that teachers’ effects predict their students’ prior performance implies there is selection of students into teachers’ classrooms that is related to student prior achievement growth and other dynamic factors, not simply to time-invariant characteristics of the students. The implication is that, in such settings, the central assumption of the econometric model does not hold and value-added estimates are likely to be biased. The size of the bias and the prevalence of the conditions leading to the violations are unknown. Although Rothstein’s study was intended to test the specification of the econometric models, it has important implications for the interpretation of estimates from statistical models as well, because dynamic classroom assignment would also violate the assumptions that Lockwood and McCaffrey (2007) establish for allowing causal interpretation of statistical model estimates. Analysts in both paradigms have been taken aback by Rothstein’s (2009) results. Some researchers are currently conducting studies to see whether they will replicate Rothstein’s findings; if Rothstein’s findings are confirmed, then both camps may need to adapt their modeling approaches to address the problematic aspects of their current assumptions (McCaffrey and Lockwood, 2008).
TWO MAIN ANALYTIC APPROACHES
A full explication of value-added analytic methods is too complex to include in this report. Nontechnical readers may want to skip the relatively brief explanation of the two main analytic approaches that follows, because it assumes some statistical background and is not essential for understanding the rest of the report. Readers who are interested in more technical information are referred to the workshop transcript and background papers (available at http://www7.nationalacademies.org/bota/VAM_Workshop_Agenda.html), as well as Graham, Singer, and Willett (in press); Harris and Sass (2005); McCaffrey and Lockwood (2008); McCaffrey et al. (2003); Organisation for Economic Co-operation and Development (2008); and Willett and Singer (in preparation).
Simplifying somewhat, there are two general choices to be made in the design and estimation of value-added models. (To make matters concrete, we focus this discussion on obtaining value-added scores for teachers.) The first choice concerns how to adjust for differences among students taught by different teachers. The second choice concerns the estimation methodology.
One approach to adjusting for student differences is to incorporate into the model a parameter for each student (i.e., student fixed effects). The student fixed effects include, for a given student, all the unobservable characteristics of the student and family (including community context) that contribute to achievement and are stable across time (McCaffrey and Lockwood, 2008). Advocates of using student fixed effects argue that measured student covariates are unlikely to remove all the relevant differences among students of different teachers. For example, in a comparison of students with the same prior test scores, a student in the more advantaged school is likely to differ from a student in a less advantaged school on a number of other characteristics related to academic achievement. If they are both performing at the national 50th percentile, the student at the less advantaged school may exhibit more drive to overcome disadvantages. Using student fixed effects captures all unchanging (time-invariant) student characteristics and thus eliminates selection bias stemming from the student characteristics not included in the model, provided that the model is otherwise properly specified.
But elimination of this bias may come at a significant cost. Because it requires estimation of a coefficient for each student, it will generally make estimation of the other coefficients less reliable (have higher variance). Thus, there is a trade-off between bias and variance that may favor one choice or the other. In addition, when fixed effects are used, it is impossible to compare groups of teachers whose students do not commingle at some point. For example, if students at school A always start and end their school careers there, as do students at school B, by using fixed effects, one can never tell whether students do better at school A because they are more advantaged or because school A has better teachers. Even when the students do overlap, the estimates rely heavily on the outcomes for students changing schools, generally a small fraction of the total student population. This, too, reduces the reliability of estimates using fixed student effects. Because the students who change schools are not likely to be representative of the student population, biased estimates can result.3 Which approach produces lower mean-squared error depends on the specifics of the problem.
A similar set of issues arises when deciding whether to estimate teacher value-added as the coefficient on a teacher fixed effect or through the formulation of a random-effects model. Employing random-effects estimates can introduce bias because it may attribute to the student some characteristics that are common to teachers in the school. If advantaged children tend to have better teachers, with random effects one will attribute some of the benefit of having better teachers to being advantaged and will predict higher test scores for these children than they would actually achieve with average teachers. This, in turn, will make their teachers appear to have provided less value-added. In contrast, incorporating teacher fixed effects would eliminate this source of bias.4
Advocates of using random effects for teachers respond that this seeming advantage of the fixed-effects approach depends on the model being otherwise specified correctly; that is, all the other variables contributing to student outcomes are properly represented in the model. If the model is seriously misspecified, then fixed-effects estimates may well be more biased than random-effects estimates. Moreover, the fixed-effects estimates tend to be quite volatile, especially when the number of students linked to a teacher is small. In general, random-effects estimates will have lower variance but higher bias than fixed-effects estimates.5 Either could have lower mean-squared error. The smaller number of parameters estimated in the random-effects model also makes it easier to include more complexity. Thus, the appropriateness of a model will always depend in some measure on the particular context of use and, for this reason, there was little optimism that a particular approach to estimating value-added would always be preferred.
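The trade-off described above rests on the standard decomposition of mean-squared error, written here for a generic estimator of a teacher effect. The symbols are introduced only for illustration and do not appear in the workshop materials: \theta is the true teacher effect and \hat{\theta} is its estimate.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\!\left[(\hat{\theta} - \theta)^{2}\right]
  = \bigl(\mathbb{E}[\hat{\theta}] - \theta\bigr)^{2} + \mathrm{Var}(\hat{\theta})
  = \mathrm{Bias}(\hat{\theta})^{2} + \mathrm{Var}(\hat{\theta})
```

A random-effects estimate may carry a larger bias term and a smaller variance term, and a fixed-effects estimate the reverse, so either sum can turn out to be the smaller one in a given application.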
A final decision is whether to “shrink” the estimates. To some extent, this decision reflects whether one comes, like most econometricians, from a “frequentist” statistical tradition or, like most modern statisticians, a “Bayesian” statistical tradition. If one thinks that nothing is known about the distribution of teacher effects (the frequentist approach), then the estimate derived from the model (usually the fixed effect) is the best estimate of the teacher effect. However, if one thinks something is known about this distribution (the Bayesian approach), then a very large positive or negative (usually random effect) estimate of the teacher effect is unlikely and is probably the result of random errors. Therefore, the estimates should be shrunk toward the mean. The two approaches can be reconciled by using the estimated distribution of teacher effects to infer the actual distribution of teacher effects. This approach, known as “empirical Bayes,” is quite complex. If all teacher effects are estimated with the same precision, then shrinking does not change the ranking of teachers, only their score. If there is more information on some teachers, then those on whom there is less information will have less precisely estimated teacher effects, and these estimated effects will be shrunk more. Such teachers will rarely be found in the extreme tails of the distribution of value-added estimates.
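The shrinkage logic described above can be sketched in a few lines. This is a simplified illustration, not the procedure used in any operational system: it assumes each teacher has a raw estimate and a known standard error, estimates the variance of true effects by a simple method of moments, and uses hypothetical names and numbers throughout.

```python
from statistics import mean, pvariance


def shrink(estimates, std_errors):
    """Shrink raw teacher estimates toward their mean in proportion to their noise."""
    grand_mean = mean(estimates)
    # Method-of-moments estimate of the variance of true effects:
    # observed variance minus average sampling variance, floored at zero.
    tau2 = max(pvariance(estimates) - mean(se ** 2 for se in std_errors), 0.0)
    shrunk = []
    for est, se in zip(estimates, std_errors):
        # Reliability is the share of an estimate's variance that is signal.
        reliability = tau2 / (tau2 + se ** 2) if tau2 > 0 else 0.0
        shrunk.append(grand_mean + reliability * (est - grand_mean))
    return shrunk


# Equal precision: scores move toward the mean, ranks are unchanged.
equal = shrink([-0.4, 0.0, 0.4], [0.2, 0.2, 0.2])
# Unequal precision: the noisier estimates (standard error 0.5) shrink more.
unequal = shrink([0.6, 0.5, -0.6, -0.5], [0.5, 0.1, 0.5, 0.1])
print(equal)
print(unequal)
```

In the second example, the teacher with the highest raw estimate (0.6) is estimated imprecisely and ends up below the teacher whose raw estimate was 0.5, which illustrates why imprecisely estimated teachers rarely appear in the extreme tails after shrinkage.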
Key Research Areas
Workshop participants identified a number of areas in which more research on value-added models is needed in order for researchers, policy makers, and the public to have more confidence in their results. Some key research questions that were discussed at the workshop include the following:
How might the econometric and statistical models incorporate features from the other paradigm that are missing in their own approaches?
What are the effects of violations of model assumptions on the accuracy of value-added estimates? For example, what are the effects on accuracy of not meeting assumptions about the assignment of students to classrooms, the characteristics of the missing data, as well as needed sample sizes?
How do the models perform in simulation studies? One way of evaluating a model is to generate simulated data that have the same characteristics as operational data, but with known parameters, and test whether the model can accurately capture the relationships that were built into the simulated data.
How could the precision of value-added estimates be improved? Instability declines when multiple years of data are combined, but some research shows that there is true variability in teacher performance across years, suggesting that simply pooling data across years might introduce bias and not allow for true deviation in performance.
What are the implications of Rothstein’s results about causality/bias, for both the economics and the statistical approaches?
How might value-added estimates of effectiveness be validated? One approach would be to link estimates of school, teacher, or program effects derived from the models with other measures of effectiveness to examine the extent to which the various measures concur. Some past studies have looked at whether value-added modeling can distinguish certified and noncertified teachers, in an effort to validate the National Board for Professional Teaching Standards certification. In other words, value-added estimates are treated as the criterion. Another approach would be to turn that on its head and ask: How well do the value-added estimates agree with other approaches to evaluating the relative effectiveness of teachers?
How do policy makers, educators, and the public use value-added information? What is the appropriate balance between the complex methods necessary for accurate measures and the need for measures to be transparent?
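The simulation approach raised in the questions above can be sketched as follows: generate data with known teacher effects, apply a simple estimator, and check how well the known effects are recovered when an assignment assumption is violated. Everything in the sketch is a hypothetical illustration, including the `sorting` parameter, the variance values, and the use of mean gains as the estimator.

```python
import random
from statistics import mean


def simulate(n_teachers=200, class_size=25, sorting=0.0, seed=1):
    """Return (true_effects, estimated_effects) from simulated student gains.

    sorting > 0 means students with stronger unobserved growth are assigned
    to more effective teachers, a violation of random assignment.
    """
    rng = random.Random(seed)
    true_effects = [rng.gauss(0.0, 0.2) for _ in range(n_teachers)]
    estimates = []
    for effect in true_effects:
        gains = []
        for _ in range(class_size):
            # Unobserved student growth trait, shifted by teacher quality when sorting > 0.
            trait = rng.gauss(0.0, 0.5) + sorting * effect
            gains.append(effect + trait + rng.gauss(0.0, 0.5))
        estimates.append(mean(gains))
    grand_mean = mean(estimates)
    return true_effects, [e - grand_mean for e in estimates]


def recovery_slope(true_effects, estimates):
    """Slope from regressing estimates on true effects; 1.0 indicates unbiased scaling."""
    t_bar, e_bar = mean(true_effects), mean(estimates)
    num = sum((t - t_bar) * (e - e_bar) for t, e in zip(true_effects, estimates))
    den = sum((t - t_bar) ** 2 for t in true_effects)
    return num / den


for s in (0.0, 1.0):
    truth, est = simulate(sorting=s)
    print(s, round(recovery_slope(truth, est), 2))
```

With random assignment the slope is close to 1, so the estimates track the built-in effects. With sorting, the slope rises by the value of `sorting`, because the estimator credits teachers with the growth their students would have shown anyway.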
Henry Braun summed up the analytic discussion by stating: “To nobody’s surprise, there is not one dominant VAM.” Each major class of models has shortcomings, there is no consensus on the best approaches, and little work has been done on synthesizing the best aspects of each approach. There are questions about the accuracy and stability of value-added estimates of school, teacher, or program effects. More needs to be learned about how these properties differ across value-added techniques and under different conditions. Most of the workshop participants argued that steps need to be taken to improve accuracy if the estimates are to be used as a primary indicator for high-stakes decisions; instead, value-added estimates are best used in combination with other indicators. But most thought that the degree of precision and stability does seem sufficient to justify low-stakes uses of value-added results for research, evaluation, or improvement when there are no serious consequences for individual teachers, administrators, or students.