Embedding Questions: The Pursuit of a Common Measure in Uncommon Tests

Suggested Citation:"Glossary." National Research Council. 1999. Embedding Questions: The Pursuit of a Common Measure in Uncommon Tests. Washington, DC: The National Academies Press. doi: 10.17226/9683.

Glossary

This glossary provides definitions of technical terms and concepts used in this report. Note that technical usage may differ from common usage. For many of the terms, multiple definitions can be found in the literature. Words set in italics are defined elsewhere in the glossary.

Accommodation

A change in the standard procedure for administering a test or in the mode of response required of examinees, used to lessen bias in the scores of individuals with a special need or disability. Examples of accommodations include allotting extra time and providing the test in large type.

Achievement levels

Descriptions of student or adult competency in a particular subject area, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," that constitute broad ranges for classifying performance. The National Assessment of Educational Progress (NAEP) defines three achievement levels for each subject and grade being assessed: basic, proficient, and advanced. The National Assessment Governing Board (NAGB), the governing body for NAEP, describes the knowledge and skills demonstrated by students at or above each of these three levels of achievement, and provides exemplars of performance for each. NAGB also reports the percentage of students who are in the four categories of achievement defined by the three levels: below basic, basic, proficient, or advanced. NAGB does not provide a description for the below basic category.

Assessment

Any systematic method of obtaining evidence from tests and collateral sources used to draw inferences about characteristics of people, objects, or programs for a specific purpose; often used interchangeably with test.

Bias

In a test, a systematic error in a test score. Bias usually favors one group of test takers over another.

Calibration

(1) With respect to scales, the process of setting a test score scale, including the mean, standard deviation, and possibly the shape of the score distribution, so that scores on the scale have the same relative meaning as scores on a related score scale. (2) With respect to items, the process of determining the relation of item responses to the underlying scale that the item is measuring, including indications of an item's difficulty, correlation to the scale, and susceptibility to guessing.
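Sense (2) can be illustrated with a three-parameter logistic (3PL) item response function, one common model in which each item has a difficulty, a discrimination (how strongly responses relate to the scale), and a guessing parameter. The glossary does not name a model, so the choice of 3PL and the names `theta`, `a`, `b`, and `c` below are assumptions made for illustration only.

```python
import math

def prob_correct(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model.

    theta: examinee proficiency on the underlying scale
    a: discrimination (relation of the item to the scale)
    b: difficulty
    c: lower asymptote (susceptibility to guessing)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: when proficiency equals difficulty, the probability
# sits halfway between the guessing floor c and 1.
p_at_difficulty = prob_correct(theta=0.5, a=1.2, b=0.5, c=0.2)
```

Item calibration, in this sketch, would amount to estimating `a`, `b`, and `c` from observed responses.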

Common measure

A scale of measurement that has a single meaning. Scores from tests that are calibrated (see calibration) to this scale support the same inferences about student performance from one locality to another and from one year to the next.

Constructed-response item

An exercise for which examinees must create their own responses or products rather than choose a response from an enumerated set. See selected-response item.

Content domain

The set of behaviors, knowledge, skills, abilities, attitudes, or other characteristics measured by a test, represented in a detailed specification, and often organized into categories by which items are classified.

Distribution

The number, or the percentage, of cases having each possible data value on a scale of data values. Distributions are often reported in terms of grouped ranges of data values. In testing, data values are usually test scores. A distribution can be characterized by its mean and standard deviation.
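As a minimal sketch of this definition, the following tabulates the count and percentage of cases at each score value and then summarizes the distribution by its mean and standard deviation. The score list and the function name `frequency_distribution` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def frequency_distribution(scores):
    """Map each observed score to (number of cases, percentage of cases)."""
    counts = Counter(scores)
    total = len(scores)
    return {value: (n, 100.0 * n / total) for value, n in sorted(counts.items())}

scores = [2, 3, 3, 4, 4, 4, 5, 5]  # hypothetical test scores
dist = frequency_distribution(scores)

# Two summary characteristics of the distribution.
score_mean = mean(scores)
score_sd = pstdev(scores)  # population standard deviation
```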

Domain

The full array of a particular subject matter being addressed by an assessment.

Effect size

A measure of the practical effect of a statistical difference, usually a difference of the means of two distributions. The mean difference between two distributions, or an equivalent difference, is expressed in units of the standard deviation of the dominant distribution or of some average of the two standard deviations. For example, if two distributions had means of 50 and 54, and both had standard deviations of 10, the effect size of their mean difference would be 4/10, or 0.4. The effect size is sometimes called the standardized mean difference. In other contexts, other ways are sometimes used to express the practical size of an observed statistical difference.
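The worked example in this definition can be reproduced in a few lines. The function names are hypothetical, and the root-mean-square rule in `averaged_sd` is only one possible reading of "some average of the two standard deviations."

```python
import math

def effect_size(mean_a, mean_b, sd):
    """Mean difference expressed in standard deviation units."""
    return (mean_b - mean_a) / sd

def averaged_sd(sd_a, sd_b):
    """One possible average of two standard deviations (root mean square); an assumption."""
    return math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)

# Means of 50 and 54, both standard deviations equal to 10.
d = effect_size(50, 54, averaged_sd(10, 10))
```

Here `d` comes out at 0.4, matching the figure in the definition.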

Embedding

In testing, including all or part of one test in another. The embedded part may be kept together as a unit or interspersed throughout the test.

Equating

The process of statistical adjustments by which the scores on two or more alternate forms are placed on a common scale. The process assumes that the test forms have been constructed to the same explicit content and statistical specifications and administered under identical procedures.
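One simple statistical adjustment of this kind is linear equating, which maps a score on one form to the score on another form that lies the same number of standard deviations from the mean. The sketch below assumes that method purely for illustration; the glossary does not prescribe a particular procedure, and the summary statistics are hypothetical.

```python
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Place a form X score on the form Y scale by matching means and standard deviations."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Hypothetical forms: X has mean 50 and SD 10; Y has mean 55 and SD 5.
# A score of 60 on X is one SD above the mean, so it maps to one SD above the mean on Y.
equated = linear_equate(60, mean_x=50, sd_x=10, mean_y=55, sd_y=5)
```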

Field test

A test administration used to check the adequacy of testing procedures, generally including test administration, test responding, test scoring, and test reporting.

Form

In testing, a particular test in a set of tests, all of which have the same test specifications and are mutually equated.

Framework

The detailed description of the test domain in the way that it will be represented by a test.

High stakes test

A test whose results have important, direct consequences for examinees, programs, or institutions tested.

Item

A generic term used to refer to a question or an exercise on a test or assessment. The test taker must respond to the item in some way. Since many test questions have the grammatical form of a statement, the neutral term item is preferred.

Item format

The form in which a question is posed on a test and the form in which the response is to be made. The formats include, among others, selected-response items (multiple-choice) and constructed-response items, which may be either short-answer or extended-response items.

Item pool

The aggregate of items from which a test's items are selected during test development or the total set of items from which a particular test is selected for a test taker during adaptive testing.

Limited English proficiency (LEP)

A term used to identify students whose performance on tests of achievement may be inappropriately low because of their poor proficiency in English.

Linking

Placing two or more tests on the same scale so that scores can be used interchangeably.

Matrix sampling

A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a subsample of test takers, thereby avoiding the need to administer all items to all examinees.
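A minimal sketch of the idea, assuming a simple design in which the item pool is split into equal-sized blocks and each test taker is randomly given one block. The function name, the seed, and the identifiers are hypothetical; operational designs are typically more elaborate.

```python
import random

def matrix_sample(item_ids, n_blocks, examinee_ids, seed=0):
    """Split items into n_blocks short sets and randomly assign one set per examinee."""
    rng = random.Random(seed)
    items = list(item_ids)
    rng.shuffle(items)
    # Deal the shuffled items into blocks so every item lands in exactly one block.
    blocks = [items[i::n_blocks] for i in range(n_blocks)]
    # Each examinee receives the index of one randomly chosen block.
    assignment = {e: rng.randrange(n_blocks) for e in examinee_ids}
    return blocks, assignment

blocks, assignment = matrix_sample(range(12), 3, ["s1", "s2", "s3", "s4"])
```

Each test taker answers only 4 of the 12 items, yet across the sample every item is eligible to be administered.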

Measurement error

The amount of variation in a measured value, such as a score, due to unknown, random factors. In testing, measurement error is viewed as the difference between an observed score and a corresponding theoretical true score or proficiency.

Metric

The units in which the values on a scale are expressed.

Norm-referenced

Interpreted by comparison with the performance of those in a specified population. A norm-referenced test score is interpreted on the basis of a comparison of a test taker's performance to the performance of other people in a specified reference population, or by a comparison of a group to other groups. See criterion-referenced.
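A percentile rank is one familiar norm-referenced interpretation. The sketch below assumes the convention that counts the percentage of reference-group scores strictly below the given score; other conventions exist, and the reference scores are hypothetical.

```python
def percentile_rank(score, reference_scores):
    """Percentage of the reference population scoring strictly below `score`."""
    below = sum(1 for r in reference_scores if r < score)
    return 100.0 * below / len(reference_scores)

reference = [40, 50, 50, 60, 70, 80, 90, 100]  # hypothetical reference population
rank = percentile_rank(70, reference)  # 4 of the 8 reference scores fall below 70
```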

Norms

Statistics or tabular data that summarize the distribution of test performance for one or more specified groups, such as test takers of various ages or grades. Norms are usually designed to represent some larger population, such as all test takers in the country.

Performance standard

An objective definition of a certain level of performance in some domain in terms of a cut-score or a range of scores on the score scale of a test measuring proficiency in that domain. Also, sometimes, a statement or description of a set of operational tasks exemplifying a level of performance associated with a more general content standard; the statement may be used to guide judgments about the location of a cut-score on a score scale.
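The first sense, cut-scores dividing a score scale into ranges, can be sketched as follows. The cut-score values are hypothetical and are not NAEP's; the four labels follow the achievement-level categories named elsewhere in this glossary.

```python
from bisect import bisect_right

CUT_SCORES = [200, 250, 300]  # hypothetical cuts for basic, proficient, advanced
LABELS = ["below basic", "basic", "proficient", "advanced"]

def classify(score):
    """Return the category whose cut-score the scale score meets or exceeds."""
    # bisect_right counts how many cut-scores are less than or equal to the score.
    return LABELS[bisect_right(CUT_SCORES, score)]

levels = [classify(s) for s in (199, 200, 250, 310)]
```

Three cut-scores yield four categories, because scores under the lowest cut fall into a residual "below basic" range.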

Proficiency levels

See achievement levels.

Reliability

The degree to which the scores are consistent over repeated applications of a measurement procedure and hence are dependable and repeatable; the degree to which scores are free of errors of measurement. Reliability is usually expressed by a unit-free index that either is, or resembles, a product-moment correlation. In classical test theory, the term represents the ratio of true score variance to observed score variance for a particular examinee population. The conditions under which the coefficient is estimated may involve variation in test forms, measurement occasions, raters, or scorers, and may entail multiple examinee products or performances. These and other variations in conditions give rise to qualifying adjectives, such as alternate-forms reliability, internal-consistency reliability, test-retest reliability, etc.
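The classical test theory ratio can be written out directly. The sketch assumes, as classical test theory does, that errors are uncorrelated with true scores, so observed variance is the sum of true and error variance. The variance figures are hypothetical, and the standard error of measurement is included only as a commonly derived companion quantity.

```python
import math

def reliability(true_variance, error_variance):
    """Ratio of true score variance to observed score variance."""
    observed_variance = true_variance + error_variance  # assumes uncorrelated errors
    return true_variance / observed_variance

def standard_error_of_measurement(observed_sd, rel):
    """Typical size of measurement error implied by a given reliability."""
    return observed_sd * math.sqrt(1.0 - rel)

rel = reliability(true_variance=80.0, error_variance=20.0)      # 80 / 100
sem = standard_error_of_measurement(observed_sd=10.0, rel=rel)  # square root of 20
```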

Scale score

A score on a test that is expressed on some defined scale of measurement.

Score

Any specific number resulting from the assessment of an individual; a generic term applied for convenience to such diverse measures as test scores, production counts, absence records, course grades, ratings, and so forth.

Selected-response item

A test item for which the test taker selects a response from a set of provided choices; also known as a multiple-choice item. See constructed-response item.

Standard deviation

An index of the degree to which a set of data values is concentrated about its mean. Sometimes referred to as "spread." The standard deviation measures the variability in a distribution of quantities. Distributions with relatively small standard deviations are relatively concentrated; larger standard deviations signify greater variability. In common distributions, like the mathematically defined "normal distribution," roughly 68 percent of the quantities are within (plus or minus) 1 standard deviation from the mean; about 95 percent are within (plus or minus) 2 standard deviations; nearly all are within (plus or minus) 3 standard deviations. See distribution, effect size, variance.
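The coverage figures for the normal distribution can be checked with the error function from Python's standard library. The function name `share_within` is hypothetical.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

# Rounded shares for 1, 2, and 3 standard deviations:
# {1: 0.6827, 2: 0.9545, 3: 0.9973}
coverage = {k: round(share_within(k), 4) for k in (1, 2, 3)}
```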

Standardization

In test administration, maintaining a constant testing environment and conducting the test according to detailed rules and specifications so that testing conditions are the same for all test takers. In statistical analysis, transforming a variable so that its standard deviation is 1.0 for some specified population or sample.
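The statistical sense can be sketched as a z-score transformation. Centering the variable on its mean, so that the result has mean 0 as well as standard deviation 1.0, is a common convention assumed here rather than something the definition requires; the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1.0 within this sample."""
    m = mean(values)
    sd = pstdev(values)  # standard deviation of the specified sample
    return [(v - m) / sd for v in values]

z = standardize([40, 50, 60, 70, 80])  # hypothetical raw scores
```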

Systematic error

A score component (often observed indirectly), not related to the characteristic being measured, that appears to be related to some salient variable or subgrouping of cases in an analysis. See bias.

Test

A set of items given under prescribed and standardized conditions for the purpose of measuring the knowledge, skill, or ability of a person. The person's responses to the items yield a score, which is a numerical evaluation of the person's performance on the test.

Test development

The process through which a test is planned, constructed, evaluated, and modified, including consideration of the content, format, administration, scoring, item properties, scaling, and technical quality for its intended purpose.

Test specifications

A framework that specifies the proportion of items that assess each content and process or skill area; the format of items, responses, and scoring protocols and procedures; and the desired psychometric properties of the items and test, such as the distribution of item difficulty and discrimination indices.

Validity

An overall evaluation of the degree to which accumulated evidence and theory support specific interpretations of test scores.
