3
Measurement Issues
Student test scores are at the heart of value-added analyses. All value-added models (VAMs) use patterns in test performance over time as the measure of student learning. Therefore, it is important to ensure that the test scores themselves can support the inferences made about the results from value-added analyses.

To date, most value-added research in education has been conducted by specialists in education statistics, as well as by economists who work in the area of education policy analysis. At the workshop, Dale Ballou, an economist, pointed out that “the question of what achievement tests measure and how they measure it is probably the [issue] most neglected by economists. . . . If tests do not cover enough of what teachers actually teach (a common complaint), the most sophisticated statistical analysis in the world still will not yield good estimates of value-added unless it is appropriate to attach zero weight to learning that is not covered by the test.” As Mark Reckase, an educational testing expert, noted, even the educational measurement literature on value-added models “makes little mention of the measurement requirements for using the models. For example, a summary of value-added research published by the American Educational Research Association (Zurawsky, 2004) only indicates that the tests need to be aligned to the state curriculum for them to be used for VAMs” (Reckase, 2008, p. 5).

Reckase further observed that, in the measurement literature, value-added methods have not made it to the point of being a “hot topic,” and most people in the measurement community do not know what they are. Several workshop participants suggested that, given the push from policy makers to start using these models for educational improvement and accountability, the measurement field needs to step up to the challenge and make it a priority to address the issues in test design that would enhance the credibility of value-added analysis. More collaborative, cross-disciplinary work among VAM researchers from the disciplines of economics, statistics, and educational measurement will also be needed to resolve some of the difficult technical challenges.

The papers on value-added measurement issues that were prepared for the workshop consistently raised issues related to what tests measure, error associated with test scores, complexities of measuring growth, and the score scales that are used to report the results from tests. This chapter explains those issues and draws heavily from the workshop papers by Dale Ballou, Michael Kane, Michael Kolen, Robert Linn, Mark Reckase, and Doug Willms. More details can be found in those papers as well as in the workshop transcripts, which are posted at http://www7.nationalacademies.org/bota/VAM_Workshop_Agenda.html.
THE CONCEPT OF VALUE AND THE MEASUREMENT OF VALUE-ADDED
To calculate value-added requires measurement of the value of both outputs and inputs. Imagine two factories that produce cars and trucks using only petroleum products (plastic, rubber) and steel as inputs. One factory produces 2,000 cars and 500 trucks per day, and the other produces 1,000 of each. Which produces more valuable outputs? The economists’ answer is to measure value by the price of the goods. If trucks sell for twice as much as cars, the value of the output produced by the two factories is identical. If trucks are relatively more expensive, the second factory will produce output of greater value, and if they are relatively less expensive, it will produce output of lower value. Of course, this shows only the relative value of the outputs. One also needs to calculate the relative value of the inputs and the value of the outputs relative to the inputs. The existence of a price system solves that problem. But it is important to recognize that even here, the concept of value-added is narrow. If one does not believe that prices fully capture the social value of extracting the raw materials and converting them to output, then the value-added measured by economists will not capture the social value-added of the factories.1
1 There is an analogous situation in education, as many argue that test scores do not capture other important aspects of student development and, as a consequence, value-added estimates do not reflect schools’ and teachers’ contributions to that development.

In some cases one can rank the productivity of the plants without a price system. If the two factories use the same raw materials, but one produces more cars and more trucks, then that factory has greater value-added (provided that both cars and trucks are good) regardless of the relative merit of cars and trucks. Similarly, if they produce the same output, but one uses less of each input, then it produces greater value-added.
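The arithmetic behind the factory comparison can be sketched in a few lines. The prices and the idea of valuing everything in “car units” are illustrative assumptions, not part of the workshop discussion.

```python
# Hypothetical sketch of the factory example: the value of each
# factory's output depends entirely on the relative price of the goods.
def output_value(cars, trucks, car_price, truck_price):
    """Total value of a day's output at the given prices."""
    return cars * car_price + trucks * truck_price

# Factory A makes 2,000 cars and 500 trucks; Factory B makes 1,000 of each.
# Prices are in "car units" (a car is worth 1.0).
for truck_price in (2.0, 3.0, 1.5):
    value_a = output_value(2000, 500, 1.0, truck_price)
    value_b = output_value(1000, 1000, 1.0, truck_price)
    print(f"truck price {truck_price}: A = {value_a:.0f}, B = {value_b:.0f}")
```

With trucks at exactly twice the price of cars, the two factories tie at 3,000 car units each; any higher relative truck price favors Factory B, and any lower one favors Factory A.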
In education, the calculation of value-added requires similar considerations of the value placed on different outcomes. Is producing two students with scores of 275 on the state test better or worse than producing one with a 250 and another with 300? And is it better or worse to take a student from 100 on last year’s test to 150 this year than to take a student from 200 to 300?

Any calculation of value-added is based only on those outputs and inputs that are measured. If the factories described above also produce pollution that is not measured, the economic value-added to society will be overestimated. In the same way, failing to measure important educational inputs or outputs because these are not easily captured by written tests will bias the measure of value-added in education.
It is not yet clear how important these concerns are in practice when using value-added modeling. If two schools have similar students initially, but one produces students with better test scores, it will have a higher measured value-added regardless of the scale chosen. Similarly, if they produce the same test scores, but one began with weaker students, the ranking of the schools will not depend on the scale. There are also issues of the weight the test accords to different content standards and the levels of difficulty of different questions. These and other measurement challenges that arise when using value-added methods are explained more fully in the sections that follow.
MEASUREMENT CHALLENGES
Tests Are Incomplete Measures of Achievement
It is not widely appreciated that all test results are estimates of student achievement that are incomplete in several respects (National Research Council, in press). This is an important issue that applies to all test-based evaluation models. A test covers only a small sample of knowledge and skills from the much larger subject domain that it is intended to represent (e.g., fourth grade reading, eighth grade mathematics), and the test questions are typically limited to a few formats (e.g., multiple choice or short answer). The measured domains themselves represent only a subset of the important goals of education; a state may test mathematics, reading, and science but not other domains that are taught, such as social studies, music, and computer skills. Furthermore, large-scale tests generally do not measure other important qualities that schools seek to foster in students but are more difficult to measure, such as intellectual curiosity, motivation, persistence in tackling difficult tasks, or the ability to collaborate well with others.
For these reasons, value-added estimates are based on a set of test scores that reflect a narrower set of educational goals than most parents and educators have for their students. If this narrowing is severe, and if the test does not cover the most important educational goals from state content standards in sufficient breadth or depth, then the value-added results will offer limited or even misleading information about the effectiveness of schools, teachers, or programs. For example, if a state’s science standards emphasize scientific inquiry as an important goal, but the state test primarily assesses recall of science facts, then the test results are not an appropriate basis for using value-added models to estimate the effectiveness of science teachers with respect to the most valued educational goals. A science teacher who focuses instruction on memorization of facts may achieve a high value-added (thus appearing to be very effective), whereas one who emphasizes scientific inquiry may obtain a low value-added (thus appearing to be ineffective).
Robert Linn and other workshop participants raised the related issue of instructional sensitivity. In the testing literature, Popham (2007) explains that “an instructionally sensitive test would be capable of distinguishing between strong and weak instruction by allowing us to validly conclude that a set of students’ high test scores are meaningfully, but not exclusively, attributable to effective instruction. . . . In contrast, an instructionally insensitive test would not allow us to distinguish accurately between strong and weak instruction” (pp. 146-147). This is relevant to value-added modeling because the models are meant to capture the component of learning attributable to the effort of the school, teacher, or program, separate from other factors. If the tests are not designed to fully capture the learning that is going on (or meant to go on) in the classroom, then educators cannot get “credit” for their work. For example, suppose that according to the state science standards, fourth grade science is more about facts, and inquiry is introduced in fifth grade, but both tests focus on facts. Then student learning with respect to inquiry will not be directly reflected in test performance, and the fifth grade teachers will not get adequate credit for their work. In such a case, it does not matter what other student or context factors are taken into account in the model, as the critical information about achievement is not there to begin with.
Lockwood and colleagues (2007) conducted research showing the impact of the choice of tests on teacher value-added estimates. They compared value-added results for a large school district using two different subtests of the Stanford mathematics assessment for grades 6, 7, and 8: the procedures subtest and the problem-solving subtest. They used a wide range of models, ranging from simple gain score models to those using a variety of control variables. The estimated teacher effects for the two different subtests had generally low correlations regardless of which model was used to calculate the estimated effects. Their results demonstrate that “caution is needed when interpreting estimated teacher effects because there is the potential for teacher performance to depend on the skills that are measured by the achievement tests” (Lockwood et al., 2007, p. 56).
Measurement Error
Despite all the efforts that test developers devote to creating tests that accurately measure a student’s knowledge and skills, all test scores are susceptible to measurement error. Measurement error results from the fact that the test items are a sample from a universe of relevant test items, which are administered to students at one time out of many possible times. An individual might perform slightly better or worse if a different set of questions had been chosen or the test had been given on a different day. For example, on a particular day there might be a disruption in the testing room, or a student may not physically feel well. Measurement error is also associated with item format. For multiple-choice items, student guessing is a source of error. For constructed-response items (short-answer or essay questions) that are scored by people rather than machines, there can be variation in the behavior of the people hired to score these questions.2

2 Individual scorers will differ from one another on both average stringency and variability. Scoring patterns of a particular individual will vary by time of day and over days. All these differences contribute to measurement error.

A student’s test score can thus be thought of as a composite of his or her true skill level in the tested area and the random factors that can affect both his or her performance and the evaluation of that performance. Reliability is a measure of the extent to which these random factors contribute to students’ observed scores. Another way of thinking of reliability is as a measure of the replicability of students’ scores: if the same set of students took a parallel test on another day, how similar would their rankings be? Since inferences about teacher, school, or program effects are based on student test scores, test score reliability is an important consideration in value-added modeling.

Some models measure learning with gain scores (or change scores). Gain scores are computed by subtracting, for each student, the previous year’s test score from the current year’s test score. A benefit of using gain scores in value-added modeling is that students can serve as their own controls for prior achievement. One potential problem with gain scores, however, relates to measurement error. When a gain score is computed by subtracting the score at time 1 from the score at time 2, the difference in scores includes the measurement error from both testing occasions. The variability of the measurement error of the gain score will tend to be larger than the variability of the measurement error of either of its components. Thus, gain scores can be less reliable than either of the scores that were used to compute them. However, some researchers have argued that this simple logic does not necessarily mean that one should abandon gain scores altogether (Rogosa and Willett, 1983).
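The loss of reliability when differencing two noisy scores can be illustrated with a small simulation. All the distributions and standard deviations below are made-up numbers chosen for illustration, not estimates from any real test.

```python
import random

random.seed(1)

# Illustrative simulation: observed score = true score + measurement error.
N = 20_000
true_year1 = [random.gauss(250, 30) for _ in range(N)]  # true ability, year 1
true_gain = [random.gauss(20, 8) for _ in range(N)]     # true growth over the year
ERR_SD = 15                                             # error SD per testing occasion

obs_year1 = [t + random.gauss(0, ERR_SD) for t in true_year1]
obs_year2 = [t + g + random.gauss(0, ERR_SD) for t, g in zip(true_year1, true_gain)]
obs_gain = [o2 - o1 for o1, o2 in zip(obs_year1, obs_year2)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = true-score variance / observed-score variance. The gain's
# error variance is the SUM of the two occasions' error variances.
rel_score = variance(true_year1) / variance(obs_year1)
rel_gain = variance(true_gain) / variance(obs_gain)
print(f"single-score reliability ~ {rel_score:.2f}, gain reliability ~ {rel_gain:.2f}")
```

Here each yearly score has a reliability of about 0.80 (true variance 900 against error variance 225), but the gain score carries the error of both occasions against a much smaller true-gain variance, so its reliability collapses to roughly 0.12.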
At the workshop, Linn emphasized that although it is important to recognize the uncertainty due to measurement error at the individual student level, value-added models focus on aggregate results: average results for a group of students linked to a certain teacher, school, or educational program. Consequently, the magnitude of the measurement error associated with a group mean, as well as the corresponding reliability, is most relevant to an evaluation of value-added results. Because errors of measurement at the individual student level may be correlated, the variability of the errors of measurement for group means is not simply the sum of the variances associated with individual student errors of measurement. More to the point, the reliability of group average scores may be higher or lower than the reliability of the individual scores that are used to compute that average3 (Zumbo and Forer, 2008). Brennan, Yin, and Kane (2003) examined this issue using data from the Iowa Tests of Basic Skills. They investigated the dependability of district-level differences in mean scores from one year to the next and found that the degree of uncertainty for the mean difference scores was substantial, suggesting that it is important to consider aggregate-level errors in interpreting the results of value-added analyses.

3 The explanation has to do with the fact that reliability is directly related to the ratio of the variance of the measurement error to the variance in the true scores. Ordinarily, taking averages reduces both variances, so that it is not clear a priori whether their ratio increases or decreases.

A further complication is that measurement error is not constant along a test score scale. One characteristic of many tests is that measurement error is much higher at the high and low ends of the scale than in the middle. Michael Kolen reported at the workshop that error ratios can be as large as 10 to 1. He speculated that the aggregate score for a school with a large proportion of low-scoring students may include a great deal of measurement error that, in turn, may have a substantial effect on the accuracy of its value-added estimates.
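The variance-ratio logic behind aggregate reliability can be checked with a short calculation. The variance components are invented for illustration; the point is only that averaging shrinks both the “true” and “error” parts of the ratio, so the class-mean reliability can come out on either side of the individual-score reliability.

```python
# Reliability expressed as a variance ratio: true / (true + error).
def reliability(true_var, error_var):
    return true_var / (true_var + error_var)

# Hypothetical variance components (all numbers invented).
CLASS_VAR = 100.0    # variance of true class/teacher-level effects
ABILITY_VAR = 900.0  # variance of student ability within the population
ERR_VAR = 225.0      # measurement-error variance of one test score

# An individual score as a measure of that student's true level:
rel_individual = reliability(CLASS_VAR + ABILITY_VAR, ERR_VAR)

# A class mean as a measure of the class's true level: student ability and
# test error both act as noise, and averaging n students divides their
# combined variance by n (assuming independent errors).
rel_mean_25 = reliability(CLASS_VAR, (ABILITY_VAR + ERR_VAR) / 25)
rel_mean_200 = reliability(CLASS_VAR, (ABILITY_VAR + ERR_VAR) / 200)

print(f"individual: {rel_individual:.2f}, "
      f"class mean (n=25): {rel_mean_25:.2f}, "
      f"group mean (n=200): {rel_mean_200:.2f}")
```

With these numbers a 25-student class mean is less reliable (about 0.69) than an individual score (about 0.82), while a 200-student aggregate is more reliable (about 0.95): the ratio can move either way.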

Measurement Error and the Stability of Teacher Effects
As long as some measurement error is specific to individuals, measurement error is greater when aggregate test scores are based on a smaller rather than a larger number of students’ test scores. Small sample sizes are a particular problem when trying to estimate teacher effects. For a given school, there are more students than teachers (although there are some very small schools in rural areas). Because longitudinal student data are needed, missing data can further shrink the sample size. For a classroom of 25 students, the effective sample size may dwindle to 10 because of missing data and student mobility.
Ballou (2005) studied the stability of teacher rankings derived from Tennessee’s value-added model in 1998 and 1999 for elementary and middle school teachers in a moderately large school district. He found that 40 percent of the mathematics teachers whose estimated teacher effects ranked in the bottom quartile in 1998 were also in the bottom quartile in 1999; however, 30 percent of those teachers ranked above the median in 1999. Although stability was somewhat better for teachers who ranked in the top quartile in 1998, “nearly a quarter of those who were in the top quartile in 1998 dropped below the median the following year” (Ballou, 2005, p. 288). Such fluctuations can be due to measurement error and other sources of imprecision, as well as changes in the context of teaching from year to year. A high level of instability is a problem for using the estimated teacher effects in a given year for high-stakes teacher accountability. Employing a “three year rolling average” of estimated value-added is a commonly used remedy.
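A simulation in the spirit of Ballou’s finding: if each year’s estimated effect is a stable true effect plus independent noise of comparable size, quartile membership churns at roughly the rates he observed. The effect and noise distributions are assumptions for illustration, not Tennessee data.

```python
import random

random.seed(3)

# Each year's estimate = stable true teacher effect + independent noise.
N = 4000
true_effect = [random.gauss(0, 1.0) for _ in range(N)]
est_1998 = [t + random.gauss(0, 1.0) for t in true_effect]
est_1999 = [t + random.gauss(0, 1.0) for t in true_effect]

def cutoff(xs, q):
    """Value at the q-th quantile of xs."""
    return sorted(xs)[int(q * len(xs))]

q1_98 = cutoff(est_1998, 0.25)
q1_99 = cutoff(est_1999, 0.25)
med_99 = cutoff(est_1999, 0.50)

bottom_98 = [i for i in range(N) if est_1998[i] <= q1_98]
stayed = sum(est_1999[i] <= q1_99 for i in bottom_98) / len(bottom_98)
rose = sum(est_1999[i] > med_99 for i in bottom_98) / len(bottom_98)
print(f"of the 1998 bottom quartile: {stayed:.0%} stayed in the bottom "
      f"quartile, {rose:.0%} rose above the median in 1999")
```

Even with nothing changing about the teachers themselves, roughly half of the simulated bottom quartile escapes it the next year, which is why single-year estimates are risky for high-stakes use and multi-year averages are a common remedy.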
Interval Scales
Many value-added models are elaborate regression models and, as such, the data must meet certain technical assumptions. One of the main assumptions is that the test scores in the analyses are represented on an equal-interval scale (Ballou, 2008; Reardon and Raudenbush, 2008). With an interval scale, equal-sized gains at all points on the scale represent the same increment of test performance. It is clear that a number of scales that are used to report test scores, such as percentile ranks or grade-equivalent scores, are not equal-interval scales. Floor and ceiling effects also militate against the equal-interval property.4
4 Floor and ceiling effects may prove to be problematic when measuring growth across grades.

Scales developed using item response theory (IRT, a psychometric theory currently used to score most standardized tests) are sometimes claimed to be equal interval, but the claim is controversial and cannot be easily verified. Furthermore, even if IRT produces such interval scales, it does so according to a particular way of measuring that does not necessarily correspond to the values society places on differences in the intervals. For example, temperature is an equal-interval scale, in the sense that it takes an equal amount of energy to increase the temperature of an object by one degree, regardless of its current temperature. However, it is not an interval scale for “comfortableness.” Raising the temperature from 60° Fahrenheit to 70° affects comfort differently than raising it from 90° to 100°. Similarly, even if the IRT scale has equal intervals based on some definition, it is unlikely to have equal intervals based on the value society places on improvements at different points on the scale.
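The non-equal-interval character of percentile ranks is easy to demonstrate. Assuming an underlying score distribution that is roughly normal (the mean and standard deviation below are hypothetical), the same ten-percentile gap corresponds to very different score gaps in the middle of the distribution and in the tail.

```python
from statistics import NormalDist

# Hypothetical, roughly normal score scale.
scale = NormalDist(mu=250, sigma=30)

def score_at(percentile):
    """Scale score at a given percentile rank."""
    return scale.inv_cdf(percentile / 100)

gap_middle = score_at(55) - score_at(45)  # ten percentiles near the median
gap_tail = score_at(95) - score_at(85)    # ten percentiles in the upper tail
print(f"45th to 55th percentile: {gap_middle:.1f} scale points; "
      f"85th to 95th percentile: {gap_tail:.1f} scale points")
```

A ten-percentile “gain” near the median corresponds to about 7.5 scale points here, while the same percentile gain in the upper tail corresponds to about 18: treating percentile gains as equal-interval would badly misstate growth.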
At the same time, it must be acknowledged that, in the social sciences, the strict requirement of an equal-interval scale is honored much more in the breach than in the observance. At a practical level, the issue comes down to the impact of departures from this assumption on the validity of the inferences based on the statistical results. This is particularly germane (and problematic) in the context of value-added analysis, which typically demands score scales that extend over several grades. Such scales are constructed through a procedure called “vertical linking.”
Vertical Scales
Reckase explained that when the left side of the model (the criterion) is a gain score rather than a test score for a single point in time, the measurement requirements are more stringent. Gain scores are supposed to provide a measure of growth from one testing occasion to the next. Computing gain scores makes sense only when the two measures are comparable, that is, when the two tests measure the same constructs (with approximately the same emphasis) and use the same units of measurement in such a way that results can reasonably be represented on the same interval scale. Of course, there are many reasons to want to use different measures: tests that are used at the end of one grade are generally not suitable for use at the end of the next grade, because students at the higher grade have been learning content appropriate for the higher grade and the test needs to reflect that content. But there must be coherence across the sets of knowledge and skills measured at each grade when test scores are to be used for value-added analysis, whether or not gain scores are used explicitly.
Most approaches to value-added analysis require a vertical score scale that spans a consecutive sequence of grades and allows the estimation of student growth along a continuum (Young, 2006). Under ideal conditions, vertical scales allow users to compare a student’s scale score in one grade with that student’s scale score in another grade, in order to quantify his or her progress. In the statistical process called vertical linking, the tests are “linked” by including some of the same questions on tests for different grades, so that a few of the same questions appear, for example, on both the third grade and fourth grade test forms, and a few of the same questions appear on both the fourth grade and fifth grade tests, and so on, through the span of grades. Data from the responses to the questions that are common from one grade to the next are then used to construct the vertical scale. However, as noted above, the validity of the inferences based on the analysis of test data represented on a vertical scale depends in part on how closely the vertical scale satisfies the equal-interval scale criterion. Although there was a range of opinions expressed at the workshop, many of the measurement experts on the panel expressed serious concerns on this point, particularly if the linking spans several grades.
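To make the common-item idea concrete, here is a deliberately simplified mean/sigma (linear) linking sketch. Operational vertical scales are built with full IRT calibration, and every number below is fabricated; the sketch only shows how the shared items’ difficulty estimates, obtained separately in each grade, determine the transformation onto a common scale.

```python
from statistics import mean, stdev

# Difficulty estimates of the SAME shared items from each grade's own
# (separately standardized) calibration: hard relative to grade 4
# students, easy relative to grade 5 students.
b_shared_g4 = [0.8, 1.1, 0.5, 1.4]
b_shared_g5 = [-0.4, -0.1, -0.7, 0.2]

# Mean/sigma transformation mapping the grade-5 scale onto the grade-4
# scale: theta_common = A * theta_g5 + B, chosen so the shared items'
# difficulty estimates line up across the two calibrations.
A = stdev(b_shared_g4) / stdev(b_shared_g5)
B = mean(b_shared_g4) - A * mean(b_shared_g5)

def to_vertical_scale(theta_g5):
    """Place a grade-5 ability estimate on the grade-4-based vertical scale."""
    return A * theta_g5 + B

# An average grade-5 student (theta = 0 on the grade-5 scale) lands well
# above the grade-4 mean, which is how the scale registers growth.
print(to_vertical_scale(0.0))
```

Because the transformation is anchored on item difficulties rather than on student score distributions, growth between grades is preserved on the common scale; the quality of the link, however, rests entirely on how the shared items behave in both grades, which is exactly the concern raised in the text.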
Tests that are constructed for use at different grade levels are not strictly equivalent, in the sense that two forms of the SAT might be considered to be. Thus, the linkage between tests designed for use at different grades is necessarily weaker than the equating that is done between test forms intended to be parallel, such as those used at the same grade or for tests like the SAT (Linn, 1993; Mislevy, 1992). The nature of the linkage affects the psychometric properties of the vertical scale and, consequently, can have a substantial impact on teacher and school effects that result from the value-added model. Again, it has proven difficult to judge the degree of distortion in a particular context.
The tests used at different grade levels obviously differ by design in both difficulty and content coverage, paralleling changes in the curriculum from grade to grade. Moreover, the relative emphasis on different construct dimensions changes across grade levels. For example, according to the Mathematics Content Standards for California Public Schools (California State Board of Education, 1997), by the end of fourth grade, students are expected to understand large numbers and addition, subtraction, multiplication, and division of whole numbers, as well as be able to compare simple fractions and decimals. By the end of fifth grade, students should increase their facility with the four basic arithmetic operations applied to fractions, decimals, and positive and negative numbers. The common questions that are used for the vertical linking may perform differently across grades. For example, a question that requires manipulation of complex fractions may be appropriate for a fifth grade test but may reflect content that has not been taught to most fourth graders. In one grade, the responses may reflect actual learning; in the other, they may represent guessing. That is, the mix of response styles to the common questions will generally be different in the two grades. It is not apparent what the effect of these differences is on the properties of the resulting vertical scale.
A related issue is how test design choices affect the vertical scales and, ultimately, the value-added estimates. Schmidt, Houang, and McKnight (2005) showed that constructing a vertically linked test battery may lead to more emphasis on knowledge and skills that are common across grades and less emphasis on relatively unique material specific to any given grade. Such a focus on certain parts of a subject domain while neglecting others can lead to bias in the estimation of school or teacher effectiveness and, perhaps more importantly, create incentives for teachers to target their instruction on particular subdomains, neglecting others that are equally important. Schmidt and colleagues also concluded that vertical scales make the tests relatively insensitive to instruction, because the common items used in these scales represent abilities that accrue over time, rather than the kinds of knowledge and skills that are most directly associated with a particular teaching experience. Martineau (2006) found that the changing definition of the construct across grades, accompanied by changes in the weights of the different components of the construct across the sequence of tests, can have serious implications for the validity of the score inferences derived from the vertical scales. Again, there is some difference of opinion on the seriousness of the problem in real-world situations.
Other researchers have focused on different approaches to constructing vertical scales and how they can result in different value-added estimates. Briggs, Weeks, and Wiley (2008) constructed eight different vertical scales for the same set of tests at consecutive grade levels. The approaches differed with respect to the IRT model used, the method used to estimate student scale scores, and the IRT calibration method used to place items from the different grades on the vertical scale. Although the estimated school effects from the value-added analyses were highly correlated across the eight vertical scales, the estimated school effects differed for the different scales. The researchers found that the number of schools that could be reliably classified as effective, average, or ineffective was somewhat sensitive to the choice of the underlying vertical scale. This is of some concern, as there is no “best” approach to vertical scaling. Indeed, the choice of vertical scaling methodology, unlike test content, is not specified by contract and is usually decided by the test vendor. Tong and Kolen (2007) found that the properties of vertical scales, including the amount of average year-to-year growth and within-grade variability, were quite sensitive to how the vertical scale was constructed. Thus, caution is needed when interpreting school, teacher, or program effects from value-added modeling, because estimated performance will depend on both the particular skills that are measured by the tests and the particular vertical scaling method used. Despite these problems, the use of a well-constructed vertical scale may yield results that provide a general sense of the amount of growth that has taken place from grade to grade.
If vertical scales are to be used, regular checks are important to make sure that scaling artifacts are not driving the results. For example, one should be suspicious of results that suggest that teachers serving low-ability students are generally obtaining the largest value-added estimates. If there is suspicion of a ceiling effect, then one can check whether teacher rankings change if only the lowest half of each class is used for the analysis.
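One version of the suggested check can be sketched directly: recompute each teacher’s mean gain using only the lower half of the class’s prior-score distribution and compare the two rankings. The classes and gains below are invented to show the pattern a ceiling effect would produce.

```python
# Hypothetical classes: (prior scores, gains) for three teachers. In
# teacher B's class, gains vanish for students near the top of the scale,
# the signature of a ceiling effect.
classes = {
    "teacher_a": ([210, 230, 250, 270], [12, 10, 9, 8]),
    "teacher_b": ([220, 240, 260, 280], [15, 14, 2, 1]),
    "teacher_c": ([200, 220, 240, 260], [6, 6, 6, 6]),
}

def mean_gain(gains):
    return sum(gains) / len(gains)

def bottom_half_mean_gain(priors, gains):
    """Mean gain among the students in the lower half by prior score."""
    paired = sorted(zip(priors, gains))
    half = paired[: len(paired) // 2]
    return sum(g for _, g in half) / len(half)

rank_full = sorted(classes, key=lambda t: mean_gain(classes[t][1]), reverse=True)
rank_bottom = sorted(classes, key=lambda t: bottom_half_mean_gain(*classes[t]),
                     reverse=True)
print("full-class ranking:  ", rank_full)
print("bottom-half ranking: ", rank_bottom)
```

If a ranking flips like this when the top half of each class is dropped, the full-class result may reflect the test’s ceiling rather than instruction.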
Model of Learning
In his presentation, Doug Willms stated that “added value is about
student learning. Therefore, any discussion of added value needs to begin
with some model of what learning entails, and its estimation requires an
explicit model of learning” (Willms, 2008, p. 1). He went on to explain
that there are critical transitions in learning. For example, all points on the
reading scale are not created equal. There is a critical transition from “learn-
ing to read” to “reading to learn,” which for most students occurs around
age 8, typically by the end of third grade. Willms explained that “if children
are not able to read with ease and understand what they have read when
they enter fourth grade, they are less able to take advantage of the learning
opportunities that lie ahead" (p. 5). For good reasons one may want to
acknowledge schools that are effective in moving children across that
transition. Value-added models might be used to identify schools, teachers, or
programs that are most successful in moving children across that transition
in a timely fashion and give credit for it (using an ordinal scale that
identifies key milestones). Indeed, some transitions can be accorded extra
credit because of their perceived importance.
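One way to picture such an ordinal scale is as a short list of milestones with credit attached to each transition, extra credit going to the transitions judged most important. The milestone names and weights below are purely hypothetical illustrations of the idea:

```python
# Ordered milestones; list position is the ordinal scale value.
MILESTONES = ["pre-reader", "learning to read", "reading to learn", "fluent"]

# Hypothetical credit for crossing *into* each milestone; the critical
# transition into "reading to learn" is weighted more heavily.
CREDIT = {"learning to read": 1.0, "reading to learn": 2.0, "fluent": 1.0}

def transition_credit(start, end):
    """Total credit for moving a student from milestone `start` to `end`."""
    i, j = MILESTONES.index(start), MILESTONES.index(end)
    return sum(CREDIT[m] for m in MILESTONES[i + 1 : j + 1])

print(transition_credit("learning to read", "reading to learn"))
print(transition_credit("reading to learn", "fluent"))
```

Under this sketch, moving a student across the weighted "reading to learn" transition earns more credit than an equal-sized move elsewhere on the scale.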
Kolen made a similar point regarding the development of vertically
scaled tests. If vertical scales are to become more widely used in the
future, he argued that content standards will need to be better articulated
within and across grades to lend themselves to measuring growth
and vertical scaling. Such articulation would make it clear which content
standards are assessed at each grade and which content standards overlap
across grades. Such well-articulated standards could then be used in test
design and the construction of a vertical scale that captures the "right"
intervals on an interval scale, that is, intervals that correspond to the values
society places on improvements at different points on the scale. In principle,
one could use this scale to design an incentive system that focuses on getting
students across critical transition points. But even this scale would only be
"right" with respect to this particular criterion. It would not be the right
measure of how hard it is to move a student from one level to another,
and the model derived from this scale would probably not do a good job
of measuring who the best teachers are in this respect. In general, of two
teachers beginning the year with otherwise similar students at level 1, one
would prefer the teacher who brought more of them to level 2, but one would not
know whether this teacher was better or worse than one who began and
ended the year with students at level 3.
This discussion suggests that in order to make value-added models
more useful, improved content standards are needed that lay out
developmental pathways of learning and highlight critical transitions; tests could
then be aligned to such developmental standards. This would improve
all models that use prior test scores to predict current performance and
would be particularly helpful for those that measure growth using gain
scores. Several reports by the National Research Council (2001, 2005,
2007a, 2007b) summarize recent developments in the areas of learning
progressions and trajectories.
Key Research Areas
A number of important test-related issues need to be resolved before
policy makers can have justifiable confidence in value-added results for
high-stakes decisions. Key research questions discussed at the workshop
include
• What are the effects of measurement error on the accuracy of the
estimates of teacher, school, or program effects? What is the
contribution of measurement error to the volatility in estimates over time
(e.g., a teacher’s value-added estimates over a number of years)?
• Since there are questions about the assumption that test score scales
are equal-interval, to what extent are inferences from value-added
modeling sensitive to monotonic transformations (meaning
transformations that preserve the original order) of test scores?
• Given the problems described above, how might value-added
analyses be given a thorough evaluation prior to operational
implementation? One way of evaluating a model is to generate
simulated data that have the same characteristics as operational
data and determine whether the model can accurately capture the
relationships that were built into the simulated data. If the model
does not estimate parameters with sufficient accuracy from data
that are generated to fit the model and match the characteristics of
the test data, then there is little likelihood that the model will work
well with actual test data. Note that doing well by this measure
is necessary but not sufficient to justify use of the value-added
model.
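The sensitivity question about monotonic transformations can be made concrete with a small sketch. Here a logarithm (one order-preserving transformation) is applied to hypothetical scores, and teacher "value-added" is again reduced to mean gains purely for illustration:

```python
import math

def mean_gains(rosters, transform=lambda x: x):
    """Mean of transform(post) - transform(prior) per teacher."""
    return {
        t: sum(transform(post) - transform(prior) for prior, post in pairs) / len(pairs)
        for t, pairs in rosters.items()
    }

def ordering(gains):
    """Teachers ordered from highest to lowest mean gain."""
    return sorted(gains, key=gains.get, reverse=True)

# Hypothetical rosters of (prior, post) scores.
rosters = {
    "X": [(10, 30), (12, 28)],  # large gains low on the scale
    "Y": [(70, 89), (75, 96)],  # somewhat larger gains high on the scale
}

raw = ordering(mean_gains(rosters))
logged = ordering(mean_gains(rosters, transform=math.log))
print(raw, logged)
```

For these constructed data the two orderings disagree, illustrating how inferences can hinge on the untestable equal-interval assumption about the score scale.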
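The simulation check in the last bullet can be sketched as follows. The "true" teacher effects, class size, and error level are arbitrary assumptions, and mean gains again stand in for a full value-added model; the point is only the recover-the-known-parameters logic:

```python
import random

random.seed(0)  # make the simulation reproducible

def simulate_and_recover(true_effects, n_students=200, error_sd=5.0):
    """Simulate gains as (true effect + measurement error), then
    re-estimate each teacher's effect as the mean simulated gain."""
    estimates = {}
    for teacher, effect in true_effects.items():
        gains = [effect + random.gauss(0, error_sd) for _ in range(n_students)]
        estimates[teacher] = sum(gains) / len(gains)
    return estimates

true_effects = {"T1": 2.0, "T2": 5.0, "T3": 8.0}
estimates = simulate_and_recover(true_effects)
print(estimates)
```

If the estimates fail to recover the built-in ordering even in this idealized setting, where the data were generated to fit the model, there is little reason to expect the model to work on real test data; recovering it, as noted above, is necessary but not sufficient.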
CONCLUSION
Different workshop participants tended to identify many of the
same measurement issues associated with value-added models. As Linn
observed, at this time these are “problems without much in the way of
solutions.”
Discussant Kevin Lang asked the measurement presenters, “How
many of these issues are unique to VAM; that is, how many of these are
also problems with current accountability systems?” Linn explained that
many of the problems are present now in test-based accountability systems
under NCLB, such as issues about how well the tests reflect the most
valued educational goals. However, the vertical scale issues and the
equal-interval assumption are more specific to VAM applications. As for
measurement error, Linn said, "I guess one difference is that the VAM has this
kind of scientific aura about it, and so it's taken to be more precise."
According to Kolen, there are several critical questions: Are estimated
teacher and school effects largely due to idiosyncrasies of statistical
methods, measurement error, the particular test examined, and the scales used?
Or are the estimated teacher and school effects due at least in part to
educationally relevant factors? He argued that these questions need to be
answered clearly before a value-added model is used as the sole indicator
to make important educational decisions.
