3
Measurement Issues
Student test scores are at the heart of value-added analyses. All value-added models (VAMs) use patterns in test performance over time as the measure of student learning. Therefore, it is important to ensure that the test scores themselves can support the inferences drawn from value-added analyses.
To date, most value-added research in education has been conducted by specialists in education statistics, as well as by economists who work in the area of education policy analysis. At the workshop, Dale Ballou, an economist, pointed out that “the question of what achievement tests measure and how they measure it is probably the [issue] most neglected by economists…. If tests do not cover enough of what teachers actually teach (a common complaint), the most sophisticated statistical analysis in the world still will not yield good estimates of value-added unless it is appropriate to attach zero weight to learning that is not covered by the test.” As Mark Reckase, an educational testing expert, noted, even the educational measurement literature on value-added models “makes little mention of the measurement requirements for using the models. For example, a summary of value-added research published by the American Educational Research Association (Zurawsky, 2004) only indicates that the tests need to be aligned to the state curriculum for them to be used for VAMs” (Reckase, 2008, p. 5).
Reckase further observed that, in the measurement literature, value-added methods have not made it to the point of being a “hot topic,” and most people in the measurement community do not know what they
are. Several workshop participants suggested that, given the push from policy makers to start using these models for educational improvement and accountability, the measurement field needs to step up to the challenge and make it a priority to address the issues in test design that would enhance the credibility of value-added analysis. More collaborative, cross-disciplinary work between VAM researchers from the disciplines of economics, statistics, and educational measurement will also be needed to resolve some of the difficult technical challenges.
The papers on value-added measurement issues that were prepared for the workshop consistently raised issues related to what tests measure, error associated with test scores, complexities of measuring growth, and the score scales that are used to report the results from tests. This chapter explains those issues and draws heavily from the workshop papers by Dale Ballou, Michael Kane, Michael Kolen, Robert Linn, Mark Reckase, and Doug Willms. More details can be found in those papers as well as in the workshop transcripts, which are posted at http://www7.nationalacademies.org/bota/VAM_Workshop_Agenda.html.
THE CONCEPT OF VALUE AND THE MEASUREMENT OF VALUE-ADDED
To calculate value-added requires measurement of the value of both outputs and inputs. Imagine two factories that produce cars and trucks using only petroleum products (plastic, rubber) and steel as inputs. One factory produces 2,000 cars and 500 trucks per day, and the other produces 1,000 of each. Which produces more valuable outputs? The economists’ answer is to measure value by the price of the goods. If trucks sell for twice as much as cars, the value of the output produced by the two factories is identical. If trucks are relatively more expensive, the second factory will produce output of greater value, and if they are relatively less expensive, it will produce output of lower value. Of course, this shows only the relative value of the outputs. One also needs to calculate the relative value of the inputs and the value of the outputs relative to the inputs. The existence of a price system solves that problem. But it is important to recognize that even here, the concept of value-added is narrow. If one does not believe that prices fully capture the social value of extracting the raw materials and converting them to output, then the value-added measured by economists will not capture the social value-added of the factories.^{1}
In some cases one can rank the productivity of the plants without
a price system. If the two factories use the same raw materials, but one produces more cars and more trucks, then that factory has greater value-added (provided that both cars and trucks are goods) regardless of the relative merit of cars and trucks. Similarly, if they produce the same output, but one uses less of each input, then it produces greater value-added.
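The price-weighting logic of the factory example can be made explicit in a few lines of code. This is only an illustration using the numbers from the text; the function name and price values are ours:

```python
# Daily output value of each factory, valued at market prices
# (numbers taken from the factory example in the text).
def output_value(cars, trucks, car_price, truck_price):
    """Total value of a factory's daily output at the given prices."""
    return cars * car_price + trucks * truck_price

# If trucks sell for twice as much as cars, the two factories tie.
factory_a = output_value(2000, 500, car_price=1, truck_price=2)
factory_b = output_value(1000, 1000, car_price=1, truck_price=2)

# If trucks are relatively more expensive, the second factory pulls ahead.
factory_b_high = output_value(1000, 1000, car_price=1, truck_price=3)
```

With equal relative prices both factories produce output worth 3,000 units; raising the truck price to three times the car price makes the second factory's output worth 4,000 against the first factory's 3,500, exactly as the text describes.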
In education, the calculation of value-added requires similar considerations of the value placed on different outcomes. Is producing two students with scores of 275 on the state test better or worse than producing one with a 250 and another with 300? And is it better or worse to move a student from a score of 100 on last year’s test to 150 this year than to move a student from 200 to 300?
Any calculation of value-added is based only on those outputs and inputs that are measured. If the factories described above also produce pollution that is not measured, the economic value-added to society will be overestimated. In the same way, failing to measure important educational inputs or outputs because these are not easily captured by written tests will bias the measure of value-added in education.
It is not yet clear how important these concerns are in practice when using value-added modeling. If two schools have similar students initially, but one produces students with better test scores, it will have a higher measured value-added regardless of the scale chosen. Similarly, if they produce the same test scores, but one began with weaker students, the ranking of the schools will not depend on the scale. There are also issues of the weight the test accords to different content standards and the levels of difficulty of different questions. These and other measurement challenges that arise when using value-added methods are explained more fully in the sections that follow.
MEASUREMENT CHALLENGES
Tests Are Incomplete Measures of Achievement
It is not widely appreciated that all test results are estimates of student achievement that are incomplete in several respects (National Research Council, in press). This is an important issue that applies to all test-based evaluation models. A test covers only a small sample of knowledge and skills from the much larger subject domain that it is intended to represent (e.g., fourth grade reading, eighth grade mathematics), and the test questions are typically limited to a few formats (e.g., multiple choice or short answer). The measured domains themselves represent only a subset of the important goals of education; a state may test mathematics, reading, and science but not other domains that are taught, such as social studies, music, and computer skills. Furthermore, large-scale tests generally
do not measure other important qualities that schools seek to foster in students but are more difficult to measure, such as intellectual curiosity, motivation, persistence in tackling difficult tasks, or the ability to collaborate well with others.
For these reasons, value-added estimates are based on a set of test scores that reflect a narrower set of educational goals than most parents and educators have for their students. If this narrowing is severe, and if the test does not cover the most important educational goals from state content standards in sufficient breadth or depth, then the value-added results will offer limited or even misleading information about the effectiveness of schools, teachers, or programs. For example, if a state’s science standards emphasize scientific inquiry as an important goal, but the state test primarily assesses recall of science facts, then the test results are not an appropriate basis for using value-added models to estimate the effectiveness of science teachers with respect to the most valued educational goals. A science teacher who focuses instruction on memorization of facts may achieve a high value-added (thus appearing to be very effective), whereas one who emphasizes scientific inquiry may obtain a low value-added (thus appearing to be ineffective).
Robert Linn and other workshop participants raised the related issue of instructional sensitivity. In the testing literature, Popham (2007) explains that “an instructionally sensitive test would be capable of distinguishing between strong and weak instruction by allowing us to validly conclude that a set of students’ high test scores are meaningfully, but not exclusively, attributable to effective instruction…. In contrast, an instructionally insensitive test would not allow us to distinguish accurately between strong and weak instruction” (pp. 146-147). This is relevant to value-added modeling because the models are meant to capture the component of learning attributable to the effort of the school, teacher, or program, separate from other factors. If the tests are not designed to fully capture the learning that is going on (or meant to go on) in the classroom, then educators cannot get “credit” for their work. For example, suppose that according to the state science standards, fourth grade science is more about facts, and inquiry is introduced in fifth grade, but both tests focus on facts. Then student learning with respect to inquiry will not be directly reflected in test performance, and the fifth grade teachers will not get adequate credit for their work. In such a case, it does not matter what other student or context factors are taken into account in the model, as the critical information about achievement is not there to begin with.
Lockwood and colleagues (2007) conducted research showing the impact of the choice of tests on teacher value-added estimates. They compared value-added results for a large school district using two different subtests of the Stanford mathematics assessment for grades
6, 7, and 8: the procedures subtest and the problem-solving subtest. They used a wide range of models, ranging from simple gain score models to those using a variety of control variables. The estimated teacher effects for the two different subtests had generally low correlations regardless of which model was used to calculate the estimated effects. Their results demonstrate that “caution is needed when interpreting estimated teacher effects because there is the potential for teacher performance to depend on the skills that are measured by the achievement tests” (Lockwood et al., 2007, p. 56).
Measurement Error
Despite all the efforts that test developers devote to creating tests that accurately measure a student’s knowledge and skills, all test scores are susceptible to measurement error. Measurement error results from the fact that the test items are a sample from a universe of relevant test items, which are administered to students at one time out of many possible times. An individual might perform slightly better or worse if a different set of questions had been chosen or the test had been given on a different day. For example, on a particular day there might be a disruption in the testing room, or a student may not physically feel well. Measurement error is also associated with item format. For multiple-choice items, student guessing is a source of error. For constructed-response items (short-answer or essay questions) that are scored by people rather than machines, there can be variation in the behavior of the people hired to score these questions.^{2}
A student’s test score can thus be thought of as a composite of his or her true skill level in the tested area and the random factors that can affect both his or her performance and the evaluation of that performance. Reliability is a measure of the extent to which students’ observed scores are free of these random factors. Another way of thinking of reliability is as a measure of the replicability of students’ scores—if the same set of students took a parallel test on another day, how similar would their rankings be? Since inferences about teacher, school, or program effects are based on student test scores, test score reliability is an important consideration in value-added modeling.
Some models measure learning with gain scores (or change scores). Gain scores are computed by subtracting, for each student, the previous year’s test score from the current year’s test score. A benefit of using gain
scores in value-added modeling is that students can serve as their own controls for prior achievement. One potential problem with gain scores, however, relates to measurement error. When a gain score is computed by subtracting the score at time 1 from the score at time 2, the difference in scores includes the measurement error from both testing occasions. The variability of the measurement error of the gain score will tend to be larger than the variability of the measurement error of either of its components. Thus, gain scores can be less reliable than either of the scores that were used to compute them. However, some researchers have argued that this simple logic does not necessarily mean that one should abandon gain scores altogether (Rogosa and Willett, 1983).
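The size of the reliability penalty for gain scores can be illustrated with the standard classical test theory formula for the reliability of a difference score. The numbers below are hypothetical, and the function name is ours:

```python
def gain_score_reliability(rel1, rel2, r12, sd1=1.0, sd2=1.0):
    """Classical-test-theory reliability of a difference (gain) score.

    rel1, rel2 -- reliabilities of the two tests
    r12        -- observed correlation between the two tests' scores
    sd1, sd2   -- observed standard deviations of the two tests
    """
    num = rel1 * sd1**2 + rel2 * sd2**2 - 2 * r12 * sd1 * sd2
    den = sd1**2 + sd2**2 - 2 * r12 * sd1 * sd2
    return num / den

# Two highly reliable tests (.90 each) whose scores correlate .80
# across occasions yield a gain score with reliability of only .50.
print(gain_score_reliability(0.90, 0.90, 0.80))
```

The stronger the correlation between the two occasions, the more the subtraction cancels true-score variance while the two errors accumulate, which is why the gain score can be far less reliable than either test alone.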
At the workshop, Linn emphasized that although it is important to recognize the uncertainty due to measurement error at the individual student level, value-added models focus on aggregate results—average results for a group of students linked to a certain teacher, school, or educational program. Consequently, the magnitude of the measurement error associated with a group mean, as well as the corresponding reliability, is most relevant to an evaluation of value-added results. Because errors of measurement at the individual student level may be correlated, the variability of the errors of measurement for group means is not simply the sum of the variances associated with individual student errors of measurement. More to the point, the reliability of group average scores may be higher or lower than the reliability of the individual scores that are used to compute that average^{3} (Zumbo and Forer, 2008). Brennan, Yin, and Kane (2003) examined this issue using data from the Iowa Tests of Basic Skills. They investigated the dependability of district-level differences in mean scores from one year to the next and found that the degree of uncertainty for the mean difference scores was substantial, suggesting that it is important to consider aggregate-level errors in interpreting the results of value-added analyses.
A further complication is that measurement error is not constant along a test score scale. One characteristic of many tests is that measurement error is much higher at the high and low ends of the scale than in the middle. Michael Kolen reported at the workshop that error ratios can be as large as 10 to 1. He speculated that the aggregate score for a school with a large proportion of low-scoring students may include a great deal of measurement error that, in turn, may have a substantial effect on the accuracy of its value-added estimates.
Measurement Error and the Stability of Teacher Effects
As long as some measurement error is specific to individuals, measurement error is greater when aggregate test scores are based on a smaller rather than a larger number of students’ test scores. Small sample sizes are particularly a problem when trying to estimate teacher effects: estimates for a school are based on many more students than estimates for any single teacher (although there are some very small schools in rural areas). Because longitudinal student data are needed, missing data can further shrink the sample size. For a classroom of 25 students, the effective sample size may dwindle to 10 because of missing data and student mobility.
Ballou (2005) studied the stability of teacher rankings derived from Tennessee’s value-added model in 1998 and 1999 for elementary and middle school teachers in a moderately large school district. He found that 40 percent of the mathematics teachers whose estimated teacher effects ranked in the bottom quartile in 1998 were also in the bottom quartile in 1999; however, 30 percent of those teachers ranked above the median in 1999. Although stability was somewhat better for teachers who ranked in the top quartile in 1998, “nearly a quarter of those who were in the top quartile in 1998 dropped below the median the following year” (Ballou, 2005, p. 288). Such fluctuations can be due to measurement error and other sources of imprecision, as well as changes in the context of teaching from year to year. A high level of instability is a problem for using the estimated teacher effects in a given year for high-stakes teacher accountability. Employing a “three-year rolling average” of estimated value-added is a commonly used remedy.
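The kind of instability Ballou documented arises naturally whenever noisy estimates are ranked. The small simulation below (all parameters hypothetical, not calibrated to any real district) gives each teacher a perfectly stable true effect and adds independent noise to each year's estimate; even so, a substantial share of "bottom-quartile" teachers escape the bottom quartile the following year:

```python
import random
import statistics

random.seed(1)

# Each teacher's true effect is fixed; each year's estimate adds noise
# (measurement error, small samples, cohort differences).
n_teachers = 1000
true_effect = [random.gauss(0, 1) for _ in range(n_teachers)]
year1 = [t + random.gauss(0, 1) for t in true_effect]
year2 = [t + random.gauss(0, 1) for t in true_effect]

# Bottom-quartile cutoffs for each year's estimates
cut1 = statistics.quantiles(year1, n=4)[0]
cut2 = statistics.quantiles(year2, n=4)[0]

bottom_y1 = [i for i in range(n_teachers) if year1[i] <= cut1]
stayed = sum(1 for i in bottom_y1 if year2[i] <= cut2)
retention = stayed / len(bottom_y1)
print(retention)  # well below 1.0: many bottom-quartile teachers move up
```

With noise variance equal to the true-effect variance, roughly half of the bottom-quartile teachers remain there the next year, which is close to the degree of churn Ballou reported.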
Interval Scales
Many value-added models are elaborate regression models and, as such, the data must meet certain technical assumptions. One of the main assumptions is that the test scores in the analyses are represented on an equal-interval scale (Ballou, 2008; Reardon and Raudenbush, 2008). With an interval scale, equal-sized gains at all points on the scale represent the same increment of test performance. It is clear that a number of scales that are used to report test scores, such as percentile ranks or grade-equivalent scores, are not equal-interval scales. Floor and ceiling effects also militate against the equal-interval property.^{4}
Scales developed using item response theory (IRT, a psychometric theory currently used to score most standardized tests) are sometimes
claimed to be equal interval, but the claim is controversial and cannot be easily verified. Furthermore, even if IRT produces such interval scales, it does so according to a particular way of measuring that does not necessarily correspond to the values society places on differences in the intervals. For example, temperature is an equal-interval scale, in the sense that it takes an equal amount of energy to increase the temperature of an object by one degree, regardless of its current temperature. However, it is not an interval scale for “comfortableness.” Raising the temperature from 60° Fahrenheit to 70° affects comfort differently than raising it from 90° to 100°. Similarly, even if the IRT scale has equal intervals based on some definition, it is unlikely to have equal intervals based on the value society places on improvements at different points on the scale.
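The practical force of the interval-scale assumption is that a monotonic (order-preserving) rescaling of the test scores can reverse conclusions about who gained more. A small hypothetical sketch (scores and the transformation are ours, for illustration):

```python
# Two students' scores on a hypothetical test. On the raw scale,
# student A gains more; after a monotonic transformation that
# stretches the top of the scale, student B gains more.
def gain(before, after, f=lambda x: x):
    """Gain score after applying a (monotonic) rescaling f."""
    return f(after) - f(before)

a_raw = gain(40, 60)                    # A: 40 -> 60, raw gain 20
b_raw = gain(70, 85)                    # B: 70 -> 85, raw gain 15
a_sq = gain(40, 60, f=lambda x: x**2)   # A's gain on the stretched scale
b_sq = gain(70, 85, f=lambda x: x**2)   # B's gain on the stretched scale

print(a_raw > b_raw)  # True: A gains more on the raw scale
print(a_sq > b_sq)    # False: B gains more after rescaling
```

Both scales rank the four individual scores identically, yet they disagree about which student (and hence which teacher) produced the larger gain; nothing in the data alone says which scale is "right."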
At the same time, it must be acknowledged that, in the social sciences, the strict requirement of an equal-interval scale is honored much more in the breach than in the observance. At a practical level, the issue comes down to the impact of departures from this assumption on the validity of the inferences based on the statistical results. This is particularly germane (and problematic) in the context of value-added analysis, which typically demands score scales that extend over several grades. Such scales are constructed through a procedure called “vertical linking.”
Vertical Scales
Reckase explained that when the left side of the model (the criterion) is a gain score rather than a test score for a single point in time, the measurement requirements are more stringent. Gain scores are supposed to provide a measure of growth from one testing occasion to the next. Computing gain scores makes sense only when the two measures are comparable—that is, when the two tests measure the same constructs (with approximately the same emphasis) and use the same units of measurement in such a way that results can reasonably be represented on the same interval scale. Of course, there are many reasons to want to use different measures—tests that are used at the end of one grade are generally not suitable for use at the end of the next grade, because students at the higher grade have been learning content appropriate for the higher grade and the test needs to reflect that content. But there must be coherence across the sets of knowledge and skills measured at each grade when test scores are to be used for value-added analysis, whether or not gain scores are used explicitly.
Most approaches to value-added analysis require a vertical score scale that spans a consecutive sequence of grades and allows the estimation of student growth along a continuum (Young, 2006). Under ideal conditions, vertical scales allow users to compare a student’s scale score
in one grade with that student’s scale score in another grade, in order to quantify his or her progress. In the statistical process called vertical linking, the tests are “linked” by including some of the same questions on tests for different grades, so that a few of the same questions appear, for example, on both the third grade and fourth grade test forms, and a few of the same questions appear on both the fourth grade and fifth grade tests, and so on, through the span of grades. Data from the responses to the questions that are common from one grade to the next are then used to construct the vertical scale. However, as noted above, the validity of the inferences based on the analysis of test data represented on a vertical scale depends in part on how closely the vertical scale satisfies the equal-interval scale criterion. Although there was a range of opinions expressed at the workshop, many of the measurement experts on the panel expressed serious concerns on this point—particularly if the linking spans several grades.
Tests that are constructed for use at different grade levels are not strictly equivalent, in the sense that two forms of the SAT might be considered to be. Thus, the linkage between tests designed for use at different grades is necessarily weaker than the equating that is done between test forms intended to be parallel, such as those used at the same grade or for tests like the SAT (Linn, 1993; Mislevy, 1992). The nature of the linkage affects the psychometric properties of the vertical scale and, consequently, can have a substantial impact on teacher and school effects that result from the value-added model. Again, it has proven difficult to judge the degree of distortion in a particular context.
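One simple way to see what linking through common items involves is a mean-sigma linear linking computed on common-item scores. This is only a schematic sketch with made-up numbers: operational vertical scales link IRT item parameters rather than raw scores, and the function below is ours:

```python
import statistics

def mean_sigma_link(common_lower, common_upper):
    """Linear transformation placing lower-grade scores on the upper-grade
    scale, using scores on the common (linking) items.

    Returns (slope, intercept) such that upper ~ slope * lower + intercept.
    A real vertical scale links IRT item parameters, not raw scores;
    this is only a schematic illustration.
    """
    a = statistics.stdev(common_upper) / statistics.stdev(common_lower)
    b = statistics.mean(common_upper) - a * statistics.mean(common_lower)
    return a, b

# Hypothetical common-item scores for fourth and fifth graders
grade4 = [10, 12, 14, 16, 18]
grade5 = [14, 17, 20, 23, 26]

slope, intercept = mean_sigma_link(grade4, grade5)
# A fourth grader's score of 15 expressed on the fifth grade scale:
print(slope * 15 + intercept)
```

The quality of the resulting scale depends entirely on how the common items behave in both grades; if, as discussed below, those items function differently for fourth and fifth graders, the linking constants (and hence every cross-grade growth comparison) are distorted.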
The tests used at different grade levels obviously differ by design in both difficulty and content coverage, paralleling changes in the curriculum from grade to grade. Moreover, the relative emphasis on different construct dimensions changes across grade levels. For example, according to the Mathematics Content Standards for California Public Schools (California State Board of Education, 1997), by the end of fourth grade, students are expected to understand large numbers and addition, subtraction, multiplication, and division of whole numbers, as well as be able to compare simple fractions and decimals. By the end of fifth grade, students should increase their facility with the four basic arithmetic operations applied to fractions, decimals, and positive and negative numbers. The common questions that are used for the vertical linking may perform differently across grades. For example, a question that requires manipulation of complex fractions may be appropriate for a fifth grade test but may reflect content that has not been taught to most fourth graders. In one grade, the responses may reflect actual learning; in the other, they may represent guessing. That is, the mix of response styles to the common questions will generally be different in the two grades. It is not apparent
what the effect of these differences is on the properties of the resulting vertical scale.
A related issue is how test design choices impact the vertical scales and, ultimately, the value-added estimates. Schmidt, Houang, and McKnight (2005) showed that constructing a vertically linked test battery may lead to more emphasis on knowledge and skills that are common across grades and less emphasis on relatively unique material specific to any given grade. Such a focus on certain parts of a subject domain while neglecting others can lead to bias in the estimation of school or teacher effectiveness and, perhaps more importantly, create incentives for teachers to target their instruction on particular subdomains, neglecting others that are equally important. Schmidt and colleagues also concluded that vertical scales make the tests relatively insensitive to instruction, because the common items used in these scales represent abilities that accrue over time, rather than the kinds of knowledge and skills that are most directly associated with a particular teaching experience. Martineau (2006) found that the changing definition of the construct across grades, accompanied by changes in the weights of the different components of the construct across the sequence of tests, can have serious implications for the validity of the score inferences derived from the vertical scales. Again, there is some difference of opinion on the seriousness of the problem in real-world situations.
Other researchers have focused on different approaches to constructing vertical scales and how they can result in different value-added estimates. Briggs, Weeks, and Wiley (2008) constructed eight different vertical scales for the same set of tests at consecutive grade levels. The approaches differed with respect to the IRT model used, the method used to estimate student scale scores, and the IRT calibration method used to place items from the different grades on the vertical scale. Although the estimated school effects from the value-added analyses were highly correlated for the eight vertical scales, the estimated school effects differed for the different scales. The researchers found that the number of schools that could be reliably classified as effective, average, or ineffective was somewhat sensitive to the choice of the underlying vertical scale. This is of some concern, as there is no “best” approach to vertical scaling. Indeed, the choice of vertical scaling methodology, unlike test content, is not specified by contract and is usually decided by the test vendor. Tong and Kolen (2007) found that the properties of vertical scales, including the amount of average year-to-year growth and within-grade variability, were quite sensitive to how the vertical scale was constructed. Thus, caution is needed when interpreting school, teacher, or program effects from value-added modeling because estimated performance will depend on both the particular skills that are measured by the tests and the particular vertical scaling
method used. Despite these problems, the use of a well-constructed vertical scale may yield results that provide a general sense of the amount of growth that has taken place from grade to grade.
If vertical scales are to be used, regular checks are important to make sure that scaling artifacts are not driving the results. For example, one should be suspicious of results that suggest that teachers serving low-ability students are generally obtaining the largest value-added estimates. If there is suspicion of a ceiling effect, then one can check whether teacher rankings change if only the lowest half of each class is used for the analysis.
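A rough diagnostic in this spirit can be sketched as follows: recompute each teacher's mean gain using only students in the lower half of the class's prior-year scores and check whether the teacher ranking changes. All data and names below are hypothetical:

```python
# classes maps a teacher to a list of (prior_score, gain) pairs.
def bottom_half_means(classes):
    """Mean gain per teacher, using only students below the class
    median prior-year score."""
    result = {}
    for teacher, students in classes.items():
        prior = sorted(p for p, _ in students)
        median = prior[len(prior) // 2]
        lower = [g for p, g in students if p < median]
        result[teacher] = sum(lower) / len(lower)
    return result

classes = {
    "T1": [(100, 10), (150, 8), (200, 1), (250, 0)],  # gains vanish near top
    "T2": [(100, 5), (150, 5), (200, 5), (250, 5)],   # uniform gains
}

full = {t: sum(g for _, g in s) / len(s) for t, s in classes.items()}
low = bottom_half_means(classes)
print(full)  # T2 ranks ahead of T1 on the full class
print(low)   # T1 ranks ahead of T2 on the bottom half
```

Here the full-class ranking favors T2, but the bottom-half ranking favors T1, the kind of reversal that would warrant a closer look at whether a ceiling effect is suppressing gains for T1's high-scoring students.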
Model of Learning
In his presentation, Doug Willms stated that “added value is about student learning. Therefore, any discussion of added value needs to begin with some model of what learning entails, and its estimation requires an explicit model of learning” (Willms, 2008, p. 1). He went on to explain that there are critical transitions in learning. For example, all points on the reading scale are not created equal. There is a critical transition from “learning to read” to “reading to learn,” which for most students occurs around age 8, typically by the end of third grade. Willms explained that “if children are not able to read with ease and understand what they have read when they enter fourth grade, they are less able to take advantage of the learning opportunities that lie ahead” (p. 5). For good reasons one may want to acknowledge schools that are effective in moving children across that transition. Value-added models might be used to identify schools, teachers, or programs that are most successful in moving children across that transition in a timely fashion and give credit for it (using an ordinal scale that identifies key milestones). Indeed, some transitions can be accorded extra credit because of their perceived importance.
Kolen made a similar point regarding the development of vertically scaled tests. If vertical scales are to become more widely used in the future, he argued that content standards will need to be better articulated within and across grades to lend themselves to measuring growth and vertical scaling. Such articulation would make it clear which content standards are assessed at each grade and which content standards overlap across grades. Such well-articulated standards could then be used in test design and the construction of a vertical scale that captures the “right” intervals, that is, intervals that correspond to the values society places on improvements at different points on the scale. In principle, one could use this scale to design an incentive system that focuses on getting students across critical transition points. But even this scale would only be “right” with respect to this particular criterion. It would not be the right
measure of how hard it is to move a student from one level to another, and the model derived from this scale would probably not do a good job of measuring who the best teachers are in this respect. In general, of two teachers beginning the year with otherwise similar students at level 1, one would prefer the teacher who brought more to level 2, but one would not know whether this teacher was better or worse than one who began and ended the year with students at level 3.
This discussion suggests that in order to make value-added models more useful, improved content standards are needed that lay out developmental pathways of learning and highlight critical transitions; tests could then be aligned to such developmental standards. This would improve all models that use prior test scores to predict current performance and would be particularly helpful for those that measure growth using gain scores. Several reports by the National Research Council (2001, 2005, 2007a, 2007b) summarize recent developments in the areas of learning progressions and trajectories.
Key Research Areas
A number of important test-related issues need to be resolved before policy makers can have justifiable confidence in value-added results for high-stakes decisions. Key research questions discussed at the workshop include:

•	What are the effects of measurement error on the accuracy of the estimates of teacher, school, or program effects? What is the contribution of measurement error to the volatility in estimates over time (e.g., a teacher’s value-added estimates over a number of years)?

•	Since there are questions about the assumption that test score scales are equal-interval, to what extent are inferences from value-added modeling sensitive to monotonic transformations (meaning transformations that preserve the original order) of test scores?

•	Given the problems described above, how might value-added analyses be given a thorough evaluation prior to operational implementation? One way of evaluating a model is to generate simulated data that have the same characteristics as operational data and determine whether the model can accurately capture the relationships that were built into the simulated data. If the model does not estimate parameters with sufficient accuracy from data that are generated to fit the model and match the characteristics of the test data, then there is little likelihood that the model will work well with actual test data. Note that doing well by this measure is necessary but not sufficient to justify use of the value-added model.
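A minimal version of the simulation check described above might look like the following sketch. The generating model, parameter values, and estimator are hypothetical: scores are simulated as prior score plus a fixed teacher effect plus student noise, the "model" is simply the mean gain per teacher, and the question is whether the known effects are recovered:

```python
import random

random.seed(7)

# Generate data from a simple, known value-added process.
n_teachers, class_size = 50, 25
teacher_effect = [random.gauss(0, 1) for _ in range(n_teachers)]

estimates = []
for t in range(n_teachers):
    # current score = prior score + teacher effect + student-level noise,
    # so each student's gain is the teacher effect plus noise
    gains = [teacher_effect[t] + random.gauss(0, 2) for _ in range(class_size)]
    estimates.append(sum(gains) / class_size)  # estimated effect = mean gain

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

print(corr(teacher_effect, estimates))  # high if the model recovers the truth
```

In this best-case scenario, where the data were generated to fit the model exactly, the true and estimated effects correlate strongly; a model that failed even this test would have little chance with real data, while passing it says nothing about how the model behaves when its assumptions are violated.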
CONCLUSION
Different workshop participants tended to identify many of the same measurement issues associated with value-added models. As Linn observed, at this time these are “problems without much in the way of solutions.”
Discussant Kevin Lang asked the measurement presenters, “How many of these issues are unique to VAM; that is, how many of these are also problems with current accountability systems?” Linn explained that many of the problems are present now in test-based accountability systems under NCLB, such as issues about how well the tests reflect the most valued educational goals. However, the vertical scale issues and the equal-interval assumption are more specific to VAM applications. As for measurement error, Linn said, “I guess one difference is that the VAM has this kind of scientific aura about it, and so it’s taken to be more precise.”
According to Kolen, there are several critical questions: Are estimated teacher and school effects largely due to idiosyncrasies of statistical methods, measurement error, the particular test examined, and the scales used? Or are the estimated teacher and school effects due at least in part to educationally relevant factors? He argued that these questions need to be answered clearly before a value-added model is used as the sole indicator to make important educational decisions.