Embedding Questions: The Pursuit of a Common Measure in Uncommon Tests

Glossary

This glossary provides definitions of technical terms and concepts used in this report. Note that technical usage may differ from common usage. For many of the terms, multiple definitions can be found in the literature. Words set in italics are defined elsewhere in the glossary.

Accommodation
A change in the standard procedure for administering a test or in the mode of response required of examinees, used to lessen bias in the scores of individuals with a special need or disability. Examples of accommodations include allotting extra time and providing the test in large type.

Achievement levels
Descriptions of student or adult competency in a particular subject area, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," that constitute broad ranges for classifying performance. The National Assessment of Educational Progress (NAEP) defines three achievement levels for each subject and grade being assessed: basic, proficient, and advanced. The National Assessment Governing Board (NAGB), the governing body for NAEP, describes the knowledge and skills demonstrated by students at or above each of these three levels of achievement, and provides exemplars of performance for each. NAGB also reports the percentage of students who are in the four categories of achievement defined by the three levels,
basic, proficient, or advanced. NAGB does not provide a description for the below basic category.

Assessment
Any systematic method of obtaining evidence from tests and collateral sources used to draw inferences about characteristics of people, objects, or programs for a specific purpose; often used interchangeably with test.

Bias
In a test, a systematic error in a test score. Bias usually favors one group of test takers over another.

Calibration
(1) With respect to scales, the process of setting a test score scale, including the mean, standard deviation, and possibly the shape of the score distribution, so that scores on the scale have the same relative meaning as scores on a related score scale. (2) With respect to items, the process of determining the relation of item responses to the underlying scale that the item is measuring, including indications of an item's difficulty, correlation to the scale, and susceptibility to guessing.

Common measure
A scale of measurement that has a single meaning. Scores from tests that are calibrated (see calibration) to this scale support the same inferences about student performance from one locality to another and from one year to the next.

Constructed-response item
An exercise for which examinees must create their own responses or products rather than choose a response from an enumerated set. See selected-response item.

Content domain
The set of behaviors, knowledge, skills, abilities, attitudes, or other characteristics measured by a test, represented in a detailed specification, and often organized into categories by which items are classified.

Distribution
The number, or the percentage, of cases having each possible data value on a scale of data values. Distributions are often reported in terms of grouped ranges of data values. In testing, data values are usually test scores. A distribution can be characterized by its mean and standard deviation.
Domain
The full array of a particular subject matter being addressed by an assessment.

Effect size
A measure of the practical effect of a statistical difference, usually a difference of the means of two distributions. The mean difference between two distributions, or an equivalent difference, is expressed in units of the standard deviation of the dominant distribution or of some average of the two standard deviations. For example, if two distributions had means of 50 and 54, and both had standard deviations of 10, the effect size of their mean difference would be 4/10, or 0.4. The effect size is sometimes called the standardized mean difference. In other contexts, other ways are sometimes used to express the practical size of an observed statistical difference.

Embedding
In testing, including all or part of one test in another. The embedded part may be kept together as a unit or interspersed throughout the test.

Equating
The process of statistical adjustments by which the scores on two or more alternate forms are placed on a common scale. The process assumes that the test forms have been constructed to the same explicit content and statistical specifications and administered under identical procedures.

Field test
A test administration used to check the adequacy of testing procedures, generally including test administration, test responding, test scoring, and test reporting.

Form
In testing, a particular test in a set of tests, all of which have the same test specifications and are mutually equated.

Framework
The detailed description of the test domain in the way that it will be represented by a test.

High-stakes test
A test whose results have important, direct consequences for examinees, programs, or institutions tested.
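The arithmetic in the Effect size entry can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `effect_size` and the simple averaging of the two standard deviations are choices made for the example, not prescribed by the report.

```python
def effect_size(mean_a, mean_b, sd_a, sd_b):
    """Standardized mean difference: the gap between two distribution
    means, expressed in standard-deviation units. Here the two
    standard deviations are simply averaged (one of several common
    conventions)."""
    average_sd = (sd_a + sd_b) / 2
    return (mean_b - mean_a) / average_sd

# The glossary's own example: means of 50 and 54, both SDs equal to 10.
print(effect_size(50, 54, 10, 10))  # 0.4
```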
Item
A generic term used to refer to a question or an exercise on a test or assessment. The test taker must respond to the item in some way. Since many test questions have the grammatical form of a statement, the neutral term item is preferred.

Item format
The form in which a question is posed on a test and the form in which the response is to be made. The formats include, among others, selected-response items (multiple-choice) and constructed-response items, which may be either short-answer or extended-response items.

Item pool
The aggregate of items from which a test's items are selected during test development, or the total set of items from which a particular test is selected for a test taker during adaptive testing.

Limited English proficiency (LEP)
A term used to identify students whose performance on tests of achievement may be inappropriately low because of their poor proficiency in English.

Linking
Placing two or more tests on the same scale so that scores can be used interchangeably.

Matrix sampling
A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a subsample of test takers, thereby avoiding the need to administer all items to all examinees.

Measurement error
The amount of variation in a measured value, such as a score, due to unknown, random factors. In testing, measurement error is viewed as the difference between an observed score and a corresponding theoretical true score or proficiency.

Metric
The units in which the values on a scale are expressed.

Norm-referenced
Interpreted by comparison with the performance of those in a specified population. A norm-referenced test score is interpreted on the basis of a comparison of a test taker's performance to the performance of other people in a specified reference population, or by a comparison of a group to other groups. See criterion-referenced.
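The matrix-sampling design described above can be sketched in a few lines. For simplicity this sketch assigns blocks in deterministic rotation ("spiraling," a common operational stand-in for random assignment); the pool size, block size, and all names are made up for the example.

```python
# A hypothetical 60-item pool split into 6 short blocks of 10 items;
# each simulated examinee answers only one block.
items = [f"item_{i:02d}" for i in range(60)]
blocks = [items[i:i + 10] for i in range(0, 60, 10)]

examinees = [f"student_{j}" for j in range(30)]
# Spiral the blocks across examinees in rotation.
assignment = {s: blocks[j % len(blocks)] for j, s in enumerate(examinees)}

# No one takes all 60 items, yet across the sample every item
# receives responses.
covered = {item for block in assignment.values() for item in block}
print(len(assignment["student_0"]), len(covered))  # 10 60
```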
Norms
Statistics or tabular data that summarize the distribution of test performance for one or more specified groups, such as test takers of various ages or grades. Norms are usually designed to represent some larger population, such as all test takers in the country.

Performance standard
An objective definition of a certain level of performance in some domain in terms of a cut-score or a range of scores on the score scale of a test measuring proficiency in that domain. Also, sometimes, a statement or description of a set of operational tasks exemplifying a level of performance associated with a more general content standard; the statement may be used to guide judgments about the location of a cut-score on a score scale.

Proficiency levels
See achievement levels.

Reliability
The degree to which scores are consistent over repeated applications of a measurement procedure and hence are dependable and repeatable; the degree to which scores are free of errors of measurement. Reliability is usually expressed by a unit-free index that either is, or resembles, a product-moment correlation. In classical test theory, the term represents the ratio of true score variance to observed score variance for a particular examinee population. The conditions under which the coefficient is estimated may involve variation in test forms, measurement occasions, raters, or scorers, and may entail multiple examinee products or performances. These and other variations in conditions give rise to qualifying adjectives, such as alternate-forms reliability, internal-consistency reliability, test-retest reliability, etc.

Scale score
A score on a test that is expressed on some defined scale of measurement.
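The classical-test-theory identity in the Reliability entry (true-score variance divided by observed-score variance) can be illustrated with a small simulation. The means, standard deviations, and sample size below are arbitrary values chosen for the sketch, not figures from the report.

```python
import random
import statistics

random.seed(0)

# Classical test theory: each observed score is a true score plus
# random measurement error. Values here are arbitrary.
true_scores = [random.gauss(50, 10) for _ in range(10_000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

# Reliability as the ratio of true-score variance to observed-score
# variance; with SDs of 10 (true) and 5 (error), the theoretical
# value is 100 / (100 + 25) = 0.8.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(round(reliability, 2))
```

The simulated ratio lands near the theoretical 0.8, with small fluctuation from sampling error.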
Score
Any specific number resulting from the assessment of an individual; a generic term applied for convenience to such diverse measures as test scores, production counts, absence records, course grades, ratings, and so forth.

Selected-response item
A test item for which the test taker selects a response from provided choices; also known as a multiple-choice item. See constructed-response item.
Standard deviation
An index of the degree to which a set of data values is concentrated about its mean. Sometimes referred to as "spread." The standard deviation measures the variability in a distribution of quantities. Distributions with relatively small standard deviations are relatively concentrated; larger standard deviations signify greater variability. In common distributions, like the mathematically defined "normal distribution," roughly 68 percent of the quantities are within (plus or minus) 1 standard deviation from the mean; about 95 percent are within (plus or minus) 2 standard deviations; nearly all are within (plus or minus) 3 standard deviations. See distribution, effect size, variance.

Standardization
In test administration, maintaining a constant testing environment and conducting the test according to detailed rules and specifications so that testing conditions are the same for all test takers. In statistical analysis, transforming a variable so that its standard deviation is 1.0 for some specified population or sample.

Systematic error
A score component (often observed indirectly), not related to the characteristic being measured, that appears to be related to some salient variable or subgrouping of cases in an analysis. See bias.

Test
A set of items given under prescribed and standardized conditions for the purpose of measuring the knowledge, skill, or ability of a person. The person's responses to the items yield a score, which is a numerical evaluation of the person's performance on the test.

Test development
The process through which a test is planned, constructed, evaluated, and modified, including consideration of the content, format, administration, scoring, item properties, scaling, and technical quality for its intended purpose.
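The Standard deviation and Standardization entries can be illustrated together: compute a set's mean and standard deviation, then transform the values so the standard deviation becomes 1.0. The score values below are made up for the example.

```python
import statistics

scores = [42, 50, 47, 55, 61, 45, 58, 52, 49, 51]  # made-up test scores

mean = statistics.mean(scores)   # 51
sd = statistics.pstdev(scores)   # population standard deviation

# Standardization in the statistical sense: subtract the mean and
# divide by the standard deviation, so the transformed variable has
# mean 0 and standard deviation 1.0.
z_scores = [(x - mean) / sd for x in scores]

print(round(statistics.pstdev(z_scores), 10))  # 1.0
```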
Test specifications
A framework that specifies the proportion of items that assess each content and process or skill area; the format of items, responses, and scoring protocols and procedures; and the desired psychometric properties of the items and test, such as the distribution of item difficulty and discrimination indices.

Validity
An overall evaluation of the degree to which accumulated evidence and theory support specific interpretations of test scores.