Although large-scale tests can provide a relatively objective and efficient way to gauge the most valued aspects of student achievement, they are neither perfect nor comprehensive measures. Many policy makers in education are familiar with the concept of test reliability and understand that the test score for an individual is measured with uncertainty. Even when a test taker’s proficiency has not changed, test scores will typically differ from one occasion to another because of chance differences in the interaction of the test questions, the test taker, and the testing context. Researchers think of these fluctuations as measurement error and so treat test results as estimates of test takers’ “true scores” and not as “the truth” in an absolute sense.
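The idea of a score as an estimate can be illustrated with a small simulation. The sketch below assumes one common formalization, in which an observed score is the sum of a fixed true score and a random error; the normally distributed error, the score scale, and every number used are hypothetical choices made for illustration and do not come from any particular testing program.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, occasions, seed=0):
    """Simulate repeated testing of one test taker whose proficiency never changes.

    Assumption (for illustration only): observed = true score + random error,
    with the error drawn from a normal distribution centered on zero.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(occasions)]


# Hypothetical values: a true score of 500 and an error spread of 30 points.
scores = simulate_observed_scores(true_score=500.0, error_sd=30.0, occasions=1000)

# Individual scores fluctuate from occasion to occasion...
first_five = [round(s, 1) for s in scores[:5]]

# ...but their average sits close to the unchanged true score,
# and their spread reflects the size of the measurement error.
mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)

print("first five occasions:", first_five)
print("mean over all occasions:", round(mean_score, 1))
print("standard deviation:", round(spread, 1))
```

Under these assumptions, any single score can land well above or below 500 even though nothing about the test taker has changed, which is why a single result is better read as an estimate than as an exact value.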

In addition, tests are estimates in a second sense, one with important implications for how they function when used as performance measures with incentives: they cover only a subset of the content domain that is being tested. There are four key stages of selection and sampling that occur when a large-scale testing program is created to test a particular subject area. Each stage narrows the range of material that the test covers (Koretz, 2002; Popham, 2000). First, the domain to be tested, when specifically defined, is typically only part of what might be reasonable to assess. For example, there needs to be a decision about whether the material to be tested in each grade and subject should include only material currently taught in most schools in the state or whether it should include material that people think should be taught in each grade and subject.

Second, the test maker crafts a framework that lists the content and skills to be tested. For example, if history questions are to be part of the eighth grade test, they might ask about names and the sequence of events or they might ask students to relate such facts to abstractions, such as rights and democracy. These decisions are partly influenced by practical constraints. Some aspects of learning are more difficult or costly to assess using standardized measures than others. In reading, for example, students’ general understanding of the main topic of a text is typically more straightforward to assess than the extent to which they have formed connections among parts of the text or applied the text to other texts or to real-world situations.

Third, the test maker develops specifications that dictate how many test questions of certain types will constitute a test form. Such a document describes the mix of item formats (such as multiple choice or short answer), the distribution of test questions across different content and skill areas (such as the number of test questions that will assess decimal numbers or percentages), and whether additional tools will be allowed (such as calculators or computers).
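A test specification of this kind can be pictured as a small data structure. The sketch below is only an illustration: the class name `FormSpecification`, the helper `uncovered_topics`, the list of topics in the domain, and all item counts are hypothetical and are not taken from any actual testing program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormSpecification:
    """Hypothetical specification for one test form."""
    item_formats: dict    # item format -> number of questions
    content_areas: dict   # content or skill area -> number of questions
    tools_allowed: tuple  # additional tools permitted during testing

    def total_items(self):
        """Number of questions on the form, counted by content area."""
        return sum(self.content_areas.values())

    def is_consistent(self):
        """Check that the format counts and content counts describe the same form."""
        return sum(self.item_formats.values()) == self.total_items()


def uncovered_topics(domain, spec):
    """Return the topics in the domain that receive no questions on the form."""
    covered = {area for area, count in spec.content_areas.items() if count > 0}
    return sorted(set(domain) - covered)


# Hypothetical mathematics domain and specification, for illustration only.
domain = ["decimal numbers", "percentages", "fractions", "geometry", "probability"]
spec = FormSpecification(
    item_formats={"multiple choice": 24, "short answer": 6},
    content_areas={"decimal numbers": 12, "percentages": 8, "fractions": 10},
    tools_allowed=("calculator",),
)

print("questions on the form:", spec.total_items())
print("specification is consistent:", spec.is_consistent())
print("topics left untested:", uncovered_topics(domain, spec))
```

In this made-up case the form has 30 questions, and two of the five topics in the domain receive none, which mirrors the narrowing that each stage of selection introduces.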

Fourth, specific test items (questions) are created to meet the test