Analytic Issues

Analysts use a variety of methods to estimate value-added effects. All value-added models (VAMs) adjust for students’ starting level of achievement using prior test scores, but they do so in different ways. Some also adjust for student characteristics and school context variables. The outcome of applying any model is that some schools, teachers, or programs are identified as being significantly better or worse than average. The models differ in the number of years of data they use, the kinds of assumptions they make, how they handle missing data, and so on. Not surprisingly, findings may differ depending on the model chosen and how it is specified.

This chapter begins with a review of some major challenges with these analytic methods—including nonrandom assignment of teachers and students, bias, precision, stability, data quality, and the balance between complexity and transparency—and causal interpretations. That is followed by a brief overview of two broad approaches to value-added modeling and the strengths and limitations of each. It concludes with a discussion about areas in which further research is most needed, as well as a summary of the main messages that emerged from the workshop regarding analytic approaches.

OCR for page 41


GETTING VALUE OUT OF VALUE-ADDED
ANALYTIC CHALLENGES FOR VALUE-ADDED MODELING
Nonrandom Assignment of Teachers and Students
A primary goal of value-added modeling is to make causal inferences
by identifying the component of a student’s test score trajectory that can
be credibly associated with a particular teacher, school, or program. In
other words, the purpose is to determine how students’ achievement
differs, having been in their assigned school, teacher’s classroom, or
program, from what would have been observed had they been taught in
another school, by another teacher, or in the absence of the program.
This is often referred to as the estimation of counterfactual quantities—for
example, the expected outcomes for students taught by teacher A had
they been taught by teacher B and vice versa.
The ideal research design for obtaining evidence about effectiveness
is one in which students are randomly assigned to schools, teachers, or
programs. With random assignment and sufficiently large samples, differences
in achievement among schools, teachers, or programs can be
directly estimated and inferences drawn regarding their relative effec-
tiveness. However, in the real world of education, random assignment
is rarely possible or even desirable. There are many ways that classroom
assignments depart from randomness, and some are quite purposeful
(e.g., matching individual students to teachers’ instructional styles).¹ Different
schools and teachers often serve very different student populations,
and programs are typically targeted at particular groups of students, so
straightforward comparisons may be neither fair nor useful.
As workshop presenter Dale Ballou explained, to get around the
problem of nonrandom assignment, value-added models adjust for preex-
isting differences among students using their starting levels of achievement.
Sometimes a gain score model is used, so the outcome measure is students’
growth from their own starting point a year prior; sometimes prior achieve-
ment is included as a predictor or control variable in a regression or analysis
of covariance; and some models use a more extensive history of student test
scores as control variables, as in William Sanders’s work.
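The first two adjustment strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and teacher labels are hypothetical, and the second function is a simplified two-step version of covariate adjustment (a pooled regression on prior score followed by averaging residuals by teacher), not a full analysis of covariance and not the specification used by any presenter.

```python
from statistics import mean

def gain_scores(prior, current):
    """Gain-score outcome: each student's growth from his or her own starting point."""
    return [c - p for p, c in zip(prior, current)]

def covariate_adjusted_residuals(prior, current):
    """Regress current score on prior score (simple OLS) and return the residuals."""
    mp, mc = mean(prior), mean(current)
    sxx = sum((p - mp) ** 2 for p in prior)
    sxy = sum((p - mp) * (c - mc) for p, c in zip(prior, current))
    slope = sxy / sxx
    intercept = mc - slope * mp
    return [c - (intercept + slope * p) for p, c in zip(prior, current)]

def mean_by_teacher(values, teachers):
    """Average a per-student quantity within each teacher's group of students."""
    groups = {}
    for value, teacher in zip(values, teachers):
        groups.setdefault(teacher, []).append(value)
    return {teacher: mean(vals) for teacher, vals in groups.items()}

# Hypothetical data: six students, two teachers.
prior = [40, 50, 60, 45, 55, 65]
current = [50, 58, 66, 49, 57, 65]
teacher = ["A", "A", "A", "B", "B", "B"]

# Average gain per teacher (A: 8, B: 2).
gain_effects = mean_by_teacher(gain_scores(prior, current), teacher)
# Average residual per teacher after adjusting for prior score.
adj_effects = mean_by_teacher(covariate_adjusted_residuals(prior, current), teacher)
```

In this toy data set both approaches rank teacher A above teacher B, but the size of the gap differs, which is one small instance of how findings can depend on the model chosen.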
Many researchers believe that controlling for students’ prior achieve-
ment is not enough—that more needs to be done to statistically adjust for
differences between the groups of students assigned to different schools,
teachers, or programs. That is, the question is whether the test score history
incorporated into the model is sufficient to account for differences among
students on observed characteristics and on unobserved ones (e.g., systematic
differences in student motivation or parent support at home) that are
statistically associated with academic achievement. Ballou explained that
nontest-score characteristics can be associated with students’ rates of gain
in achievement, but relatively few of them are typically measured and available
in education data sets. Some variables associated with achievement
are generally available, such as students’ socioeconomic status, gender, or
race. Other contextual factors are more difficult to quantify, such as home
environment and peer influences, as well as various school characteristics.
¹ Random assignment provides information about the relative effectiveness of the teacher
with a randomly assigned set of students. There are many reasons that this might not reveal
the parameter of policy interest in a world in which students are not randomly assigned.
Another problem is that educational inputs are generally conflated,
so a classroom of students might receive inputs from the school administration,
the teacher, other teachers in the school, the community, and
other students in the classroom, many of which are related and overlap
to some extent. For example, although a value-added model may purport
to be estimating the effect of an individual teacher, adjusting for differ-
ences in student backgrounds and prior achievement, this estimate may
also be confounded with (i.e., “picking up”) unmeasured contextual variables,
such as the contributions of the school’s leadership, the quality of a
teacher’s colleagues, and other factors. The contributions of these factors,
positive or negative, may end up being attributed to the teacher.
Dan McCaffrey noted that most statistical models that have been used
in practice have tended not to include student- or context-level predictor
variables, such as race or socioeconomic status measures. One argument
for excluding such covariates is that including them might imply different
expectations for students of different sociodemographic classes. Another
concern is that if a certain racial group is exposed to poorer teachers, the
model could inappropriately attribute lower performance to race rather
than to teacher quality.² However, there are also technical challenges to
including such variables in the model. Ballou, Sanders, and Wright (2004)
investigated the effects of including these types of student-level covariates
in models that avoided the technical problems; the researchers found
that their inclusion had no appreciable effect on estimates of classroom
effects. However, attempts to expand the methods to include classroom-
level variables resulted in unstable estimates (Ballou, 2005).
² This is more of a potential problem with random-effects than fixed-effects models; see
page 50 for an explanation of these models.
Bias
Bias refers to the inaccuracy of an estimate that is due to a shortcoming
or incompleteness in a statistical model itself. For example,
imagine a value-added model focused on isolating the effectiveness of
schools using school-wide test results. Suppose that the fourth grade
test is a gateway test. In schools with advantaged children, large numbers
of parents enroll their children in private test preparation sessions
in advance of the exam, while parents of children in other schools do
not. Students in the first group would tend to perform better on the test
than would be predicted on the basis of the third grade test. Even if all
schools provided instruction of equal quality, value-added estimates
would indicate that the schools serving the first group were more effective,
even though they were not responsible for the higher performance
of their students. In this case, the estimates would be biased because the
contributions of the private test preparation sessions are confounded
with true school effectiveness. One way to address this bias would be to
augment the model in such a way as to include outside test preparation
as a variable (Organisation for Economic Co-operation and Development,
2008). Addition of more student background and context variables
to a value-added model can reduce bias but can also lead to more complications,
such as missing data.
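The test preparation example can be made concrete with a small Python sketch. Everything in it is hypothetical: the scores, the 5-point preparation boost, and the assumption that the boost is known exactly (in practice it would have to be estimated, and the variable is rarely measured at all). The function names are illustrative only.

```python
from statistics import mean

def naive_value_added(actual, predicted):
    """Mean gap between actual and predicted scores, read naively as a school effect."""
    return mean([a - p for a, p in zip(actual, predicted)])

def adjusted_value_added(actual, predicted, had_prep, prep_boost):
    """Same gap after subtracting an assumed test-preparation boost for prepared students."""
    gaps = [a - p - (prep_boost if prep else 0.0)
            for a, p, prep in zip(actual, predicted, had_prep)]
    return mean(gaps)

# Hypothetical numbers: both schools provide instruction of equal quality,
# but students at the advantaged school gain 5 points from private test preparation.
predicted = [60.0, 70.0, 80.0]          # fourth grade scores predicted from third grade
advantaged_actual = [65.0, 75.0, 85.0]  # prediction plus the preparation boost
other_actual = [60.0, 70.0, 80.0]       # prediction, no outside preparation

biased = naive_value_added(advantaged_actual, predicted)    # 5.0, wrongly credited to the school
baseline = naive_value_added(other_actual, predicted)       # 0.0
corrected = adjusted_value_added(advantaged_actual, predicted,
                                 [True, True, True], prep_boost=5.0)  # 0.0
```

Once outside preparation enters the model as a variable, the two schools look equally effective, which is what the example stipulates.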
The prior example illustrated the problem of underadjustment in the
model. There is also the potential for the reverse problem of overadjustment.
To continue the previous example, suppose that the fifth grade test
is not a gateway test, and therefore parents in schools with advantaged
children do not use tutoring. Now, children in these schools do less well
on the fifth grade test than predicted based on their (test preparation
inflated) fourth grade scores. Similarly, if the children in the advantaged
schools do well on both the third and fourth grade tests, in part because
such schools are able to hire better teachers, then, depending on the
approach used, the model may attribute too much of the high fourth
grade scores to the “quality of the students” reflected in the third grade
scores and too little to the quality of the fourth grade teachers.
Finally, Ballou and a few others raised the issue that current value-
added models assume that there is a single teacher effect that is common
for all students. Yet one can readily imagine that one teacher might work
very effectively with struggling students but not really be able to stimulate
students already performing at high levels, and the opposite might
be true of another teacher. Value-added models usually attempt to sum-
marize a teacher’s effectiveness in a single number. If teacher quality is
multidimensional in this sense, then frequently it will not be possible to
say that one teacher is better than another because of the scaling issues
discussed in Chapter 3. The importance of this problem depends on the
goal of the model. If the objective is to rank all teachers, the problem is
likely to be very serious. If the goal is to create incentives to teach strug-
gling students well, the problem may be less serious.

Precision
The precision of the estimated effects is also an important issue. The
precision problem differs from the bias problem in that it stems, in large
part, from small sample sizes. Small sample sizes are more of a chal-
lenge for value-added models that seek to measure teacher effects rather
than school effects. This is because estimates of school effects tend to be
derived from test score data of hundreds of students, whereas estimates
of teacher effects are often derived from data for just a few classes. (Elementary
teachers may teach just one class of students each year, whereas
middle and high school teachers may have more than 100 students in a
given year.) If the number of students per teacher is low, just a few poorly
performing students can lower the estimate of a teacher’s effectiveness
substantially. Research on the precision of value-added estimates consis-
tently finds large sampling errors. As McCaffrey reported, based on his
prior research (McCaffrey et al., 2005), standard errors are often so large
that about two-thirds of estimated teacher effects are not statistically sig-
nificantly different from the average.
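The link between sample size and precision can be illustrated with the textbook standard error of a mean. The sketch below is a simplification under assumed values: a student-level standard deviation of 10 points, a class of 25, a school of 400, and a 3-point estimated effect are all hypothetical, and actual value-added models compute standard errors in more elaborate ways.

```python
import math

def standard_error(student_sd, n_students):
    """Standard error of a mean effect estimated from n independent students."""
    return student_sd / math.sqrt(n_students)

def differs_from_average(estimate, se, z=1.96):
    """True if the estimate is distinguishable from zero (the average) at about the 5% level."""
    return abs(estimate) > z * se

student_sd = 10.0   # assumed spread of student-level results, in score points
estimate = 3.0      # assumed estimated effect relative to the average

teacher_se = standard_error(student_sd, 25)    # one class of 25 students
school_se = standard_error(student_sd, 400)    # a school of 400 students

teacher_flagged = differs_from_average(estimate, teacher_se)  # False: 3.0 < 1.96 * 2.0
school_flagged = differs_from_average(estimate, school_se)    # True: 3.0 > 1.96 * 0.5
```

The same 3-point effect is statistically indistinguishable from average for a single classroom but clearly detectable for a whole school.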
Stability
A related problem is the stability of estimates. All value-added mod-
els produce estimates of school or teacher effects that vary from year to
year. This raises the question of the degree to which this instability reflects
real variation in performance from year to year, rather than error in the
estimates. McCaffrey discussed research findings (Aaronson, Barrow,
and Sander, 2007; Ballou, 2005) demonstrating that only about 30 to
35 percent of teachers ranked in either the top or bottom quintile in one
year remain there in the next year. If estimates were completely random,
20 percent would remain in the same quintile from one year to the next.
If the definition of a weak teacher is one in the bottom quintile, then this
suggests that a significant proportion of teachers identified as weak in a
single year would be falsely identified. In another study, McCaffrey, Sass,
and Lockwood (2008) investigated the stability of teacher effect estimates
from one year and cohort of students to the next (e.g., the
teacher effect estimates in 2000-2001 compared to those in 2001-2002) for
elementary and middle school teachers in four counties in Florida. They
computed 12 correlations (4 counties by 3 pairs of years) for elementary
school teachers and 16 correlations (4 counties by 4 pairs of years) for
middle school teachers. For elementary school teachers, the 12 correlations
between estimates in consecutive years ranged from .09 to .34 with
a median of .25. For middle school teachers, the 16 correlations ranged
from .05 to .35 with a median of .205. Thus, the year-to-year stability of
estimated teacher effects can be characterized as quite low.
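To see how year-to-year correlations of this size relate to quintile persistence, one can simulate pairs of estimates. The following Python sketch assumes that the two years' estimates are bivariate normal with a given correlation; that distributional assumption, the number of simulated teachers, and the function name are illustrative choices, not part of the studies cited.

```python
import random

def top_quintile_persistence(correlation, n_teachers=20000, seed=0):
    """Share of teachers in the top quintile in year 1 who are also there in year 2."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        first = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        year1.append(first)
        # Construct a second-year estimate with the requested correlation to the first.
        year2.append(correlation * first + (1 - correlation ** 2) ** 0.5 * noise)
    cutoff1 = sorted(year1)[int(0.8 * n_teachers)]
    cutoff2 = sorted(year2)[int(0.8 * n_teachers)]
    top1 = [i for i in range(n_teachers) if year1[i] >= cutoff1]
    stayed = sum(1 for i in top1 if year2[i] >= cutoff2)
    return stayed / len(top1)

# With no correlation, about 20 percent should remain by chance alone.
random_share = top_quintile_persistence(0.0)
# A correlation of .25, near the elementary school median reported above.
low_corr_share = top_quintile_persistence(0.25)
```

Under these assumptions, a correlation near .25 produces persistence only modestly above the 20 percent chance level, in the same general range as the 30 to 35 percent figure reported above.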
Instability in value-added estimates is not only a result of sampling
error due to the small numbers of students in classes. McCaffrey and his
colleagues (2008) found that the year-to-year variability in teacher effects
exceeded what might be expected from simple sampling error. This year-
to-year variability generally accounted for a much larger share of the
variation in effects for elementary school teachers than for middle school
teachers (perhaps because middle school teachers usually teach
many more students in a single year than elementary teachers). Further,
year-to-year variability was only weakly related to teachers’ qualifications,
such as their credentials, tenure status, and annual levels of professional
development. Whether this variability reflects real changes in
teachers’ performance or a source of error at the classroom level (such as
peer effects that are usually omitted from the model) remains unknown.
Instability will tend to erode confidence in value-added results on the
part of educators because most researchers and education practitioners
will expect that true school, teacher, or even program performance will
change only gradually over time rather than display large swings from
year to year. Moreover, if estimates are unstable, they will not be as cred-
ible for motivating or justifying changes in future behavior or programs.
One possible solution would be to consider several years of data when
making important decisions, such as teacher tenure.
Data Quality
Missing or faulty data can have a negative impact on the precision
and stability of value-added estimates and can contribute to bias. The
procedures used to transform the raw test data into usable data files, as
well as the completeness of the data, should be carefully evaluated when
deciding whether to use a value-added model. Student records for two
or more years are needed, and it is not uncommon in longitudinal data
files for some scores to be missing because of imperfect record matching,
student absences, and students transferring into or out of a school.
A key issue for implementing value-added methods is the capacity
to link students to their teachers. As Helen Ladd noted, many state data
systems do not currently provide direct information on which students
are taught by which teachers. Ladd stated, “Until recently, for example,
those of us using the North Carolina data have had to make inferences
about a student’s teacher from the identity of the proctor of the relevant
test and a wealth of other information from school activity reports. In my
own work, I have been able to match between 60-80 percent of students
to their teachers at the elementary and high school levels but far lower
percentages at the middle school level” (Ladd, 2008, p. 9). She went on
to say that even if states start providing more complete data of this type,
a number of issues still complicate the situation—for example, how to
deal with students who are pulled out of their regular classes for part of
the day, team-taught courses, and students who transfer into or out of a
class in the middle of the year. Attributing learning to a certain school
or teacher is difficult in systems in which there is high student mobility.
Moreover, if the reason that the data are missing is related to test score
outcomes, the resulting value-added estimates can be seriously biased.
Generally, the greater the proportion of missing data, the weaker
the credibility of the value-added results. Of course, missing data are a
problem for any type of test score analysis, but some models depend on
student- or context-level characteristics, which may be especially incomplete.
The integrity and completeness of such data need to be evaluated
before implementing a value-added system. When value-added models
are used for research purposes or program evaluation, the standard for
what constitutes sufficient data may be somewhat lower than when the
purpose is for school or teacher improvement or for accountability. Ladd
emphasized this point, noting that if these models are to be used as part
of a teacher evaluation system, capturing only 60-80 percent of the student
data probably will not be sufficient; it may not be possible to include all
teachers in the analysis.
Finally, there is the problem that very large numbers of teachers
would not have test score data for computing value-added scores. Many
subjects and grades are not currently assessed using large-scale tests, so
most K-2 and high school teachers, as well as teachers of such subjects
as social studies, foreign languages, physical education, and arts are not
directly linked to state-level student test scores. This presents a major
obstacle to implementing a value-added evaluation system of teachers at
a district level. (This problem applies to using status test score data for
teacher evaluation as well.)
Complexity Versus Transparency
Value-added models range from relatively simple regression models
to extremely sophisticated models that require rich databases and state-
of-the-art computational procedures. McCaffrey and Lockwood (2008)
suggest that “complex methods are likely to be necessary for accurate
estimation of teacher effects and that accountability or compensation
systems based on performance measures with weak statistical properties
will fail to provide educators with useful information to guide their practice
and could eventually erode their confidence in such systems” (p. 10).
However, there is always a limit beyond which adding complexity to
the analysis yields little or no advantage. When used for purposes
such as accountability, the choice of models needs to balance the goals of
complexity and accuracy, on one hand, and transparency, on the other. At
the same time, it is likely that the importance attached to transparency
will depend on other features of the accountability system of which the
value-added model is but one component, as well as the political context
in which the accountability system is operating.
Transparency refers to the ability of educators and the public to under-
stand how the estimates were generated and what they mean. A major
goal of improvement and accountability systems is to provide educa-
tors with signals about what is considered effective performance and
whether they have achieved it, as well as to motivate lower performing
individuals to change their behavior to improve their effectiveness. There
is general agreement that highly complex statistical procedures are difficult
for educators to understand, which leads to a concern that the use of
such procedures might limit the practical utility of value-added models.
Workshop participant Robert Gordon raised the issue of whether many
of the models are simply “too esoteric to be useful to teachers in the real
world.” This is an important consideration when these models are used
for accountability because a key aspect of their success is acceptance by
teachers and administrators. In contrast, when the models are used for
research or program evaluation, transparency may not be important.
Transparency also may not be an overriding concern for public uses,
such as for accountability. Henry Braun recounted a discussion with pol-
icy makers who judged that transparency was important but not crucial.
These policy makers indicated that they did not need to know the details
of what went into the “black box” to produce value-added results. If
the results were trustworthy and the rationale could be explained in an
understandable way, they believed that school systems would be willing
to forgo transparency for the sake of accuracy. For example, most current
tests are scored using item response theory, which is also very complex.
However, test users generally accept the reported test scores, even though
they do not fully understand the mathematical intricacies through which
they are derived (i.e., the process by which raw scores are converted to
scale scores and then adjusted through equating to maintain year-to-year
comparability).
A key consideration in the trade-off between complexity and trans-
parency is the resources required to implement the more complex models.
Complex models require greater technical expertise on the part of staff. It
is critical that the staff conducting sophisticated analyses have the exper-
tise to run them correctly and interpret the results appropriately. Complex
models also usually require more comprehensive data. Data availability
and data quality, as described in the previous section, place limits on the
complexity of the models that can be considered. Thus, a number of issues
have to be weighed to achieve the optimal balance between complexity,
accuracy, and transparency when choosing a value-added model.
Causal Interpretations
Although not always stated explicitly, the goal of value-added model-
ing is to make causal inferences. In practical terms, this means drawing
conclusions, such as that certain teachers caused the higher (or lower)
achievement in their students.
The two disciplines that focus on value-added modeling take different
approaches to this problem. The statistics discipline generally handles
it by characterizing its models as descriptive, not causal; however, it
does recognize that using such models to evaluate schools, teachers, or
programs implicitly treats the results as causal effects. Lockwood and
McCaffrey (2007) identify conditions under which the estimates derived
from statistical models approximate causal effects. The economics discipline
generally makes certain assumptions that, if met, support causal
interpretations of value-added results obtained from the models it favors.
The critical assumption is that any differences among classes, schools,
or programs that are not captured by the predictor variables used in the
model are captured by the student fixed-effect components. In the end,
despite their status as empirical descriptions, the results of the statistical
models are used in ways similar to the econometric models—that is, to
support causal interpretations.
Rothstein (2009) tested the assumptions of the economics models in
the context of estimating teacher effects in North Carolina. His idea was to
see if estimated teacher effects can predict the achievement gains of their
students in the years prior to these students being in their classes. For
example, does a fifth grade teacher effect predict her students’ achievement
gains when those students were third and fourth graders? Indeed,
he found that, for example, fifth grade teachers were nearly as strongly
linked statistically to their students’ fourth grade scores as were the
students’ fourth grade teachers. Rothstein also found that the relationship
between current teachers and prior gains differs by time span: that is, the
strength of the statistical association of the fifth grade teacher with fourth
grade gains differs from that with third grade gains.
Since teachers cannot rewrite the past, the finding that teachers’ effects
predict their students’ prior performance implies there is selection of students
into teachers’ classrooms that is related to student prior achievement
growth and other dynamic factors, not simply to time-invariant
characteristics of the students. The implication is that, in such settings, the
central assumption of the econometric model does not hold and value-
added estimates are likely to be biased. The size of the bias and the preva-
lence of the conditions leading to the violations are unknown. Although
Rothstein’s study was intended to test the specification of the econometric
models, it has important implications for the interpretation of estimates
from statistical models as well, because dynamic classroom assignment
would also violate the assumptions that Lockwood and McCaffrey (2007)
establish for allowing causal interpretation of statistical model estimates.
Analysts in both paradigms have been taken aback by Rothstein’s (2009)
results. Some researchers are currently conducting studies to see whether
they will replicate Rothstein’s findings; if Rothstein’s findings are con-
firmed, then both camps may need to adapt their modeling approaches to
address the problematic aspects of their current assumptions (McCaffrey
and Lockwood, 2008).
TWO MAIN ANALYTIC APPROACHES
A full explication of value-added analytic methods is too complex to
include in this report. Nontechnical readers may want to skip the rela-
tively brief explanation of the two main analytic approaches that follows,
because it assumes some statistical background and is not essential for
understanding the rest of the report. Readers who are interested in more
technical information are referred to the workshop transcript and background
papers (available at http://www7.nationalacademies.org/bota/VAM_Workshop_Agenda.html),
as well as Graham, Singer, and Willett (in
press); Harris and Sass (2005); McCaffrey and Lockwood (2008); McCaffrey
et al. (2003); Organisation for Economic Co-operation and Development
(2008); and Willett and Singer (in preparation).
Simplifying somewhat, there are two general choices to be made
in the design and estimation of value-added models. (To make matters
concrete, we focus this discussion on obtaining value-added scores for
teachers.) The first choice concerns how to adjust for differences among
students taught by different teachers. The second choice concerns the
estimation methodology.
One approach to adjusting for student differences is to incorporate
into the model a parameter for each student (i.e., student fixed effects).
The student fixed effects include, for a given student, all the unobservable
characteristics of the student and family (including community context)
that contribute to achievement and are stable across time (McCaffrey and
Lockwood, 2008). Advocates of using student fixed effects argue that
measured student covariates are unlikely to remove all the relevant differ-
ences among students of different teachers. For example, in a comparison
of students with the same prior test scores, a student in the more advantaged
school is likely to differ from a student in a less advantaged school
on a number of other characteristics related to academic achievement. If
they are both performing at the national 50th percentile, the student at the
less advantaged school may exhibit more drive to overcome disadvan-
tages. Using student fixed effects captures all unchanging (time-invariant)
student characteristics and thus eliminates selection bias stemming from
the student characteristics not included in the model, provided that the
model is otherwise properly specified.
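The way student fixed effects absorb time-invariant characteristics can be illustrated by within-student demeaning, a standard way of computing fixed-effects estimates. The Python sketch below uses hypothetical student identifiers, gains, and "advantage" constants, and it shows only the demeaning step, not a complete value-added model.

```python
from statistics import mean

def demean_within_student(records):
    """records: list of (student_id, year, gain). Subtract each student's own mean gain."""
    by_student = {}
    for sid, _, gain in records:
        by_student.setdefault(sid, []).append(gain)
    means = {sid: mean(gains) for sid, gains in by_student.items()}
    return [(sid, year, gain - means[sid]) for sid, year, gain in records]

# Hypothetical gains for two students over two years.
base = [("s1", 1, 2.0), ("s1", 2, 4.0), ("s2", 1, 1.0), ("s2", 2, -1.0)]

# Add an unobserved, time-invariant component to each student (e.g., home support).
advantage = {"s1": 5.0, "s2": -3.0}
shifted = [(sid, year, gain + advantage[sid]) for sid, year, gain in base]

# After demeaning, the stable student component has dropped out entirely.
demeaned_base = demean_within_student(base)
demeaned_shifted = demean_within_student(shifted)
```

The two demeaned data sets are identical, so any stable student characteristic, measured or not, cannot influence estimates built from them. Characteristics that change over time are not removed by this step.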
But elimination of this bias may come at a significant cost. Because it
requires estimation of a coefficient for each student, it will generally make
estimation of the other coefficients less reliable (have higher variance).
Thus, there is a trade-off between bias and variance that may favor one
choice or the other. In addition, when fixed effects are used, it is impossible
to compare groups of teachers whose students do not commingle at
some point. For example, if students at school A always start and end their
school careers there, as do students at school B, by using fixed effects,
one can never tell whether students do better at school A because they
are more advantaged or because school A has better teachers. Even when
the students do overlap, the estimates rely heavily on the outcomes for
students changing schools, generally a small fraction of the total student
population. This, too, reduces the reliability of estimates using fixed student
effects. Because the students who change schools are not likely to
be representative of the student population, biased estimates can result.³
Which approach produces lower mean-squared error depends on the
specifics of the problem.
A similar set of issues arises when deciding whether to estimate
teacher value-added as the coefficient on a teacher fixed effect or through
the formulation of a random-effects model. Employing random-effects
estimates can introduce bias because it may attribute to the student some
characteristics that are common to teachers in the school. If advantaged
children tend to have better teachers, with random effects one will attri-
bute some of the benefit of having better teachers to being advantaged
and will predict higher test scores for these children than they would actu-
ally achieve with average teachers. This, in turn, will make their teachers
appear to have provided less value-added. In contrast, incorporating
teacher fixed effects would eliminate this source of bias.4
3 Note that differential student mobility across schools or teachers can lead to nonrandom
changes in the contexts of teaching and learning that are not captured by the model and thus
can introduce bias into the estimates of value-added.
4 From a technical perspective, a necessary condition for a model employing teacher
random effects to yield unbiased estimates is that teachers’ effectiveness is uncorrelated with
student characteristics. The example in the text offers a case in which this condition does
not hold.
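
The mechanism in this example can be illustrated with a small simulation. The
sketch below is not taken from the report: all data are synthetic, every name
and parameter value is hypothetical, and a pooled regression that ignores
teachers is used as a deliberately simplified stand-in for a random-effects
model whose key assumption fails (a real random-effects estimator would
typically show a milder version of the same distortion).

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 200, 25
beta_true = 1.0  # assumed true effect of student advantage on scores

# Synthetic data in which better teachers tend to get more advantaged students.
teacher_effect = rng.normal(0.0, 1.0, n_teachers)
advantage = (0.5 * teacher_effect[:, None]
             + rng.normal(0.0, 1.0, (n_teachers, class_size)))
score = (beta_true * advantage + teacher_effect[:, None]
         + rng.normal(0.0, 1.0, (n_teachers, class_size)))


def slope(x, y):
    """OLS slope of y on x (intercept included via centering)."""
    x = x - x.mean()
    return float(np.sum(x * (y - y.mean())) / np.sum(x * x))


# (a) Estimate the advantage coefficient ignoring teachers, then average
#     the residuals within each teacher's class.
beta_pooled = slope(advantage.ravel(), score.ravel())
resid = (score - score.mean()) - beta_pooled * (advantage - advantage.mean())
va_pooled = resid.mean(axis=1)

# (b) Teacher fixed effects: estimate the coefficient from within-teacher
#     deviations only, then adjust each class mean with that coefficient.
adv_within = advantage - advantage.mean(axis=1, keepdims=True)
score_within = score - score.mean(axis=1, keepdims=True)
beta_fe = float(np.sum(adv_within * score_within) / np.sum(adv_within ** 2))
va_fe = ((score.mean(axis=1) - score.mean())
         - beta_fe * (advantage.mean(axis=1) - advantage.mean()))

# How much of each unit of true teacher effect survives in the estimate.
recovery_pooled = slope(teacher_effect, va_pooled)
recovery_fe = slope(teacher_effect, va_fe)

print(f"advantage coefficient: pooled={beta_pooled:.2f}, fixed effects={beta_fe:.2f}")
print(f"recovered teacher effect per unit of truth: "
      f"pooled={recovery_pooled:.2f}, fixed effects={recovery_fe:.2f}")
```

In this setup the pooled coefficient on advantage absorbs part of the teacher
effect, so the teachers of advantaged students appear to add less value than
they do, while the within-teacher estimate stays close to the assumed
coefficient. The sketch assumes the rest of the model is correctly specified,
which is exactly the condition questioned in the paragraph that follows.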

Advocates of using random effects for teachers respond that this
seeming advantage of the fixed-effects approach depends on the model
being otherwise specified correctly; that is, all the other variables
contributing to student outcomes are properly represented in the model. If the
model is seriously misspecified, then fixed-effects estimates may well be
more biased than random-effects estimates. Moreover, the fixed-effects
estimates tend to be quite volatile, especially when the number of students
linked to a teacher is small. In general, random-effects estimates will have
lower variance but higher bias than fixed-effects estimates.5 Either could
have lower mean-squared error. The smaller number of parameters esti-
mated in the random-effects model also makes it easier to include more
complexity. Thus, the appropriateness of a model will always depend in
some measure on the particular context of use and, for this reason, there
was little optimism that a particular approach to estimating value-added
would always be preferred.
A final decision is whether to “shrink” the estimates. To some extent,
this decision reflects whether one comes, like most econometricians, from
a “frequentist” statistical tradition or, like most modern statisticians,
a “Bayesian” statistical tradition. If one thinks that nothing is known
about the distribution of teacher effects (the frequentist approach), then
the estimate derived from the model (usually the fixed effect) is the
best estimate of the teacher effect. However, if one thinks something
is known about this distribution (the Bayesian approach), then a very
large positive or negative (usually random effect) estimate of the teacher
effect is unlikely and is probably the result of random errors. Therefore,
the estimates should be shrunk toward the mean. The two approaches
can be reconciled by using the estimated distribution of teacher effects
to infer the actual distribution of teacher effects. This approach, known
as “empirical Bayes,” is quite complex. If all teacher effects are estimated
with the same precision, then shrinking does not change the ranking of
teachers, only their score. If there is more information on some teachers,
then those on whom there is less information will have less precisely esti-
mated teacher effects, and these estimated effects will be shrunk more.
Such teachers will rarely be found in the extreme tails of the distribution
of value-added estimates.
5 In the most common formulations of random-effects models, estimates of teacher value-
added are pulled toward the average (in contrast to estimates based on the data from each
teacher alone). For this reason they are often called “shrinkage estimates.” The shrinkage
reduces variance at the expense of introducing some bias.
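
A minimal sketch of shrinkage may make the preceding discussion concrete. It
is not the procedure of any particular value-added model: the function name,
the method-of-moments variance estimate, and the numbers are all assumptions
chosen for illustration. Each raw estimate is pulled toward the overall mean
by a weight equal to the estimated variance of true effects divided by that
variance plus the estimate's own sampling variance.

```python
import numpy as np


def shrink_toward_mean(raw_estimates, std_errors):
    """Shrink raw teacher estimates toward a precision-weighted mean.

    Returns the shrunken estimates and the weight given to each raw estimate.
    """
    raw = np.asarray(raw_estimates, dtype=float)
    sampling_var = np.asarray(std_errors, dtype=float) ** 2

    # Precision-weighted grand mean of the raw estimates.
    precision = 1.0 / sampling_var
    grand_mean = float(np.sum(precision * raw) / np.sum(precision))

    # Crude method-of-moments estimate of the variance of true effects:
    # spread of the raw estimates minus average sampling noise, floored at 0.
    true_var = max(float(np.var(raw, ddof=1) - sampling_var.mean()), 0.0)

    # Weight near 1 keeps the raw estimate; weight near 0 returns the mean.
    weight = true_var / (true_var + sampling_var)
    return grand_mean + weight * (raw - grand_mean), weight


raw = [-0.4, -0.2, 0.0, 0.2, 0.4]  # hypothetical raw teacher estimates

# Same precision for everyone: scores move toward the mean, ranking is kept.
shrunk_equal, weight_equal = shrink_toward_mean(raw, [0.1] * 5)

# The last teacher has far less data (larger standard error).
shrunk_unequal, weight_unequal = shrink_toward_mean(
    raw, [0.1, 0.1, 0.1, 0.1, 0.3])

print("equal precision:  ", np.round(shrunk_equal, 3))
print("unequal precision:", np.round(shrunk_unequal, 3))
```

With equal standard errors the ordering of teachers is unchanged. With unequal
standard errors, the teacher with the highest raw estimate but the least
information is shrunk the most and ends up below the teacher ranked just
behind, which is why imprecisely estimated teachers rarely appear in the tails.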

Key Research Areas
Workshop participants identified a number of areas in which more
research on value-added models is needed in order for researchers, policy
makers, and the public to have more confidence in their results. Some key
research questions that were discussed at the workshop include
• How might the econometric and statistical models incorporate
features from the other paradigm that are missing in their own
approaches?
• What are the effects of violations of model assumptions on the
accuracy of value-added estimates? For example, what are the
effects on accuracy of not meeting assumptions about the assign-
ment of students to classrooms, the characteristics of the missing
data, as well as needed sample sizes?
• How do the models perform in simulation studies? One way of
evaluating a model is to generate simulated data that have the
same characteristics as operational data, but with known param-
eters, and test whether the model can accurately capture the rela-
tionships that were built into the simulated data.
• How could the precision of value-added estimates be improved?
Instability declines when multiple years of data are combined,
but some research shows that there is true variability in teacher
performance across years, suggesting that simply pooling data
across years might introduce bias and not allow for true deviation
in performance.
• What are the implications of Rothstein’s results about causality and
bias for both the econometric and the statistical approaches?
• How might value-added estimates of effectiveness be validated?
One approach would be to link estimates of school, teacher, or
program effects derived from the models with other measures
of effectiveness to examine the extent to which the various measures
concur. Some past studies have looked at whether value-added
modeling can distinguish certified and noncertified teachers, in
an effort to validate the National Board for Professional Teaching
Standards certification. In other words, value-added estimates are
treated as the criterion. Another approach would be to turn that
on its head and ask: How well do the value-added estimates agree
with other approaches to evaluating the relative effectiveness of
teachers?
• How do policy makers, educators, and the public use value-
added information? What is the appropriate balance between the
complex methods necessary for accurate measures and the need
for measures to be transparent?
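
The simulation-study question in the list above can be sketched in a few lines.
The example below is a toy illustration under stated assumptions, not a study
design from the workshop: teacher effects are drawn with known values, the
"model" is simply the class-mean gain, and `class_advantage_sd` is a
hypothetical parameter that controls how strongly unobserved classroom
differences (a form of nonrandom assignment) contaminate the gains.

```python
import numpy as np


def run_simulation(class_advantage_sd, n_teachers=500, class_size=25, seed=0):
    """Simulate one year of gains with known teacher effects.

    Returns the correlation between true and estimated effects and the
    root-mean-squared error of the estimates.
    """
    rng = np.random.default_rng(seed)
    true_effect = rng.normal(0.0, 0.2, n_teachers)

    # Unobserved classroom-level differences in student growth. A value of
    # 0 corresponds to random assignment of students to teachers.
    class_advantage = rng.normal(0.0, class_advantage_sd, n_teachers)
    student_noise = rng.normal(0.0, 0.7, (n_teachers, class_size))
    gains = (true_effect + class_advantage)[:, None] + student_noise

    # Stand-in estimator: class-mean gain relative to the overall mean.
    estimate = gains.mean(axis=1) - gains.mean()
    truth = true_effect - true_effect.mean()

    corr = float(np.corrcoef(truth, estimate)[0, 1])
    rmse = float(np.sqrt(np.mean((estimate - truth) ** 2)))
    return corr, rmse


corr_random, rmse_random = run_simulation(class_advantage_sd=0.0)
corr_sorted, rmse_sorted = run_simulation(class_advantage_sd=0.3)

print(f"random assignment:    corr={corr_random:.2f}, rmse={rmse_random:.3f}")
print(f"nonrandom assignment: corr={corr_sorted:.2f}, rmse={rmse_sorted:.3f}")
```

Because the true effects are known by construction, the loss of accuracy when
the assignment assumption is violated can be measured directly. A realistic
study would replace the class-mean estimator with the value-added model under
evaluation and calibrate the simulated data to operational data.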

CONCLUSION
Henry Braun summed up the analytic discussion by stating: “To
nobody’s surprise, there is not one dominant VAM.” Each major class of
models has shortcomings, there is no consensus on the best approaches,
and little work has been done on synthesizing the best aspects of each
approach. There are questions about the accuracy and stability of value-
added estimates of schools, teachers, or program effects. More needs to
be learned about how these properties differ, using different value-added
techniques and under different conditions. Most of the workshop
participants argued that steps need to be taken to improve accuracy if the
estimates are to be used as a primary indicator for high-stakes decisions;
instead, value-added estimates are best used in combination with
other indicators. But most thought that the degree of precision and
stability does seem sufficient to justify low-stakes uses of value-added results
for research, evaluation, or improvement when there are no serious con-
sequences for individual teachers, administrators, or students.