quires both judgment and defined criteria on which to base the judgment. We refer to these criteria as a rubric. A rubric includes a description of the dimensions for judging student performance and a scale of values for rating those dimensions. Rubrics are often supplemented with examples of student work at each scale value to further assist in making the judgments. The performance descriptors that are part of the states’ achievement standards could be associated with the rubrics that are developed for individual tests or tasks. A discussion of achievement standards is included in Chapter 4.

Box 5-8 shows a progress guide, or rubric, used to evaluate students’ performance on an assessment of the concept of buoyancy. The guide could be useful to teachers and students because it describes both current performance and what students would need to do to progress.

Delaware has developed a system for gleaning instructionally relevant information from responses to multiple-choice items. The state uses a two-digit scoring rubric modeled after the scoring rubric used in the performance tasks of the Third International Mathematics and Science Study (TIMSS). The first digit of the score indicates whether the answer is correct, incorrect, or partially correct; the second digit of an incorrect or partially correct response score indicates the nature of the misconception that led to the wrong answer. Educators analyze these misconceptions to understand what is lacking in students’ understanding and to shed light on aspects of the curriculum that are not functioning as desired (Box 5-9).
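The logic of a two-digit score can be sketched in a few lines of Python. This is only an illustration: the source does not list the actual digit values, so the code book below (2 = correct, 1 = partially correct, 0 = incorrect) and the function names `decode` and `tally_misconceptions` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical code book for the first digit; the real values are not
# given in the source.
CORRECTNESS = {"2": "correct", "1": "partially correct", "0": "incorrect"}


def decode(score: str) -> Tuple[str, Optional[str]]:
    """Split a two-digit score into (correctness, misconception code).

    The misconception code is None for fully correct responses, because
    the second digit is interpreted only for incorrect or partially
    correct responses.
    """
    if len(score) != 2 or not score.isdigit():
        raise ValueError(f"expected a two-digit score, got {score!r}")
    first, second = score[0], score[1]
    if first not in CORRECTNESS:
        raise ValueError(f"unknown correctness digit {first!r}")
    correctness = CORRECTNESS[first]
    misconception = None if correctness == "correct" else second
    return correctness, misconception


def tally_misconceptions(scores: Iterable[str]) -> Counter:
    """Count how often each (correctness, misconception) pair occurs."""
    counts: Counter = Counter()
    for score in scores:
        correctness, misconception = decode(score)
        if misconception is not None:
            counts[(correctness, misconception)] += 1
    return counts
```

A tally of this kind is what would let educators see which misconceptions are most common across a class or a set of items.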

Determining the Measurement Model

Formal measurement models are statistical and psychometric tools that allow interpreters of assessment results to draw meaning from large data sets about student performance and to express the degree of uncertainty that surrounds their conclusions. A measurement model is a particular form of reasoning from evidence that includes formal rules for integrating the varied data that may be relevant to a particular inference. A variety of measurement models exist; each carries its own assumptions and supports particular inferences when those assumptions are met.

For most of the last century, the interpretation of test scores rested on the assumption that a person’s observed score (O) on a test is made up of two components, a true score (T) and error (E); that is, O = T + E. From that formulation were derived methods for determining how much error is present and, working backward, how much confidence one can have in the observed score. Reliability is the proportion of the variance in observed scores that is attributable to true scores rather than to error. Most of traditional psychometrics, including test interpretation and test construction, is built on this basis.
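The relationship O = T + E can be illustrated with a small simulation. The sketch below is not drawn from the source: it assumes normally distributed true scores and errors, and the means, standard deviations, and function names are all made up for illustration. With a true-score standard deviation of 10 and an error standard deviation of 5, the reliability should come out near 100 / (100 + 25) = 0.8.

```python
import random
import statistics


def simulate_scores(n, true_mean, true_sd, error_sd, seed=0):
    """Simulate n examinees under the assumption O = T + E.

    True scores and errors are drawn independently from normal
    distributions (an illustrative assumption, not part of the source).
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


def reliability(true_scores, observed):
    """Proportion of observed-score variance attributable to true scores."""
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


# Hypothetical test: true scores ~ N(50, 10), error ~ N(0, 5).
true_scores, observed = simulate_scores(10_000, 50.0, 10.0, 5.0)
rel = reliability(true_scores, observed)

# Standard error of measurement: the typical size of the error component,
# derived from the observed-score spread and the reliability.
sem = statistics.pstdev(observed) * (1.0 - rel) ** 0.5

print(f"reliability = {rel:.3f}, standard error of measurement = {sem:.2f}")
```

In practice true scores are never observed directly, so reliability has to be estimated from observed data; the simulation only makes the definition concrete.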

Another commonly used type of measurement model is item response theory (IRT), which, as originally conceived, is appropriate to use in situations where the
