[A] well-developed and empirically validated model of thinking and learning in an academic domain can be used to design and select assessment tasks that support the analysis of various kinds of student performance. Such a model can also serve as the basis for rubrics for evaluating and scoring pupils’ work, with discriminating features of expertise defining the specific targets of assessment.

Testing and Measurement

Basic Vocabulary

Like any other field of knowledge, assessment has a specialized vocabulary. The terms “test” and “instrument,” for instance, which are often used interchangeably, refer to a set of items, questions, or tasks presented to individuals under controlled conditions. “Testing” is the administration of a test, and “measurement” is the process of assigning numbers, attributes, or characteristics—according to established rules—to determine the test taker’s level of performance on an instrument. The current emphasis on accountability in public schools, which entails accurate measurements of student performance, has renewed interest in measurement theory, which became a formal discipline in the 1930s.

“Assessment,” derived from the Latin assidere (to sit beside), is defined as the process of collecting data to describe a level of functioning. Never an end in itself, an assessment provides information about what an individual knows or can do and a basis for decision making, for instance about a school curriculum. A related term, “evaluation,” implies a value judgment about the level of functioning.

“Reliability” is a critical aspect of an assessment. An instrument is considered reliable if it provides consistent information over multiple administrations. For example, on a reliable test, a person’s score should be the same regardless of when the assessment was completed, when the responses were scored, or who scored the responses (Moskal and Leydens, 2000). Reliability is necessary, but not sufficient, to ensure that a test serves the purpose for which it was designed. Statistically, indices of test reliability typically range from zero to one, with reliabilities of 0.85 and above signifying test scores that are likely to be consistent from one test administration to the next and thus highly reliable (Linn and Gronlund, 2000). Assuming other aspects of an assessment remain


Copyright © National Academy of Sciences. All rights reserved.