and what meaning can be drawn from the results—and whether the conclusions and inferences drawn from the test results are appropriate.

Fairness incorporates not just technical issues of reliability and validity but also social values of equity and justice—for example, whether a test systematically underestimates the knowledge or skill of members of a particular group.

Reliability of Measurement

Reliability is typically estimated in one of three ways. One is to estimate the consistency of a test's results on different occasions, as explained above. A second way is to examine consistency across parallel forms of a test, which are developed to be equivalent in content and technical characteristics. That is, to what extent does performance on one form of the test correlate with performance on a parallel form? A third way is to determine how consistently examinees perform across similar items or subsets of items, intended to measure the same knowledge or skill, within a single test form. This concept of reliability is called internal consistency.
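The three approaches above reduce to simple computations, sketched here in Python. This is an illustration under stated assumptions, not a procedure taken from the report: the function names and the sample scores are hypothetical, the Pearson correlation stands in for the consistency between occasions or parallel forms, and Cronbach's alpha is used as one common index of internal consistency.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Usable for test-retest data (same form, two occasions) or for
    parallel-forms data (two forms, same examinees).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per examinee, one column per item."""
    k = len(item_scores[0])  # number of items
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical scores for five examinees on two parallel forms.
form_a = [10, 12, 15, 18, 20]
form_b = [11, 13, 14, 19, 21]
parallel_forms_r = pearson_r(form_a, form_b)

# Hypothetical item-level scores: three examinees, three items.
items = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(items)
```

In the hypothetical item table every item ranks the examinees identically, so alpha reaches its maximum of 1; real item data would yield a lower value.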

For judgmentally scored tests, such as essays, another widely used index is the coefficient of scorer reliability, which addresses consistency across different observers, raters, or scorers. That is, do the scores assigned by one judge using a set of designated rating criteria agree with those given by another judge using the same criteria?
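Agreement between two judges can be quantified in several ways. The sketch below is an illustration only: it computes the proportion of exact agreement and Cohen's kappa, which corrects that proportion for agreement expected by chance. Kappa is one common choice and is not necessarily the coefficient the text has in mind; the function names and ratings are hypothetical.

```python
from collections import Counter


def exact_agreement(rater1, rater2):
    """Proportion of responses given the same score by both raters."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters using the same categories.

    Undefined (division by zero) when chance agreement equals 1, which
    happens if both raters assign a single identical score to every response.
    """
    n = len(rater1)
    p_observed = exact_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the two raters' marginal proportions, summed over scores.
    p_chance = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical essay scores on a 1-4 rubric from two judges.
judge_a = [1, 2, 3, 4, 2, 3]
judge_b = [1, 2, 3, 4, 3, 3]
agreement = exact_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
```

Kappa is always at or below the raw agreement rate, because some matching scores would occur even if the judges rated at random.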

How reliable must a test be? That depends on the nature of the construct (that is, the abstract skill, attribute, or domain of knowledge) being measured. For a very homogeneous, narrow construct, such as adding two-digit numbers, internal-consistency reliability should be extremely high. For a test of the same length, we would expect somewhat lower reliability for a more heterogeneous, broad construct, such as algebra. Measures of certain constructs, such as mood or anxiety (that is, states as opposed to traits), are generally less stable over time; thus high reliability across occasions would not be expected.

For most purposes, a more useful index than reliability is the standard error of measurement, which is related to the unreliability of a test. This index defines a range of likely variation, or uncertainty, around the test score, much as public opinion polls report a margin of error of plus or minus x points. The standard error thus quantifies and makes explicit the uncertainty involved in interpreting a student's level of performance;
