Glossary
This glossary provides definitions of terms as used in this report. Note that technical usage may differ from common usage. For many of the terms, multiple definitions can be found in the literature. Words set in italics are defined elsewhere in the Glossary.
Achievement levels/proficiency levels
Descriptions of student or adult competency in a particular subject area, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," that constitute broad ranges for classifying performance. NAEP defines three achievement levels for each subject and grade being assessed: basic, proficient, and advanced. NAGB describes the knowledge and skills demonstrated by students at or above each of these three levels of achievement, and provides exemplars of performance for each. NAGB also reports the percentage of students in each of the four ranges of achievement defined by the three levels. These achievement categories are generally labeled below basic, basic, proficient, and advanced. NAGB does not provide a description for the below basic category.
ACT
American College Testing Assessment. A set of tests designed to predict college performance from current achievement, used in college admissions and produced by the American College Testing Program.
Alternate forms
Two or more versions of a test that are considered interchangeable, in that they measure the same constructs, are intended for the same purposes, and are administered using the same directions. Alternate forms is a generic term used to refer to any of three categories. Parallel forms have equal raw score means, equal standard deviations, and equal correlations with other measures for any given population. Equivalent forms do not have the statistical similarity of parallel forms, but the dissimilarities in raw score statistics are compensated for in the conversions to derived scores or in form-specific norm tables. Comparable forms are highly similar in content, but the degree of statistical similarity has not been demonstrated; also called equivalent forms.
Anchor test
A common set of items administered with each of two or more different tests for the purpose of equating the scores of these tests.
Assessment
Any systematic method of obtaining evidence from tests and collateral sources used to draw inferences about characteristics of people, objects, or programs for a specific purpose; often used interchangeably with test.
ASVAB
Armed Services Vocational Aptitude Battery. A set of 10 tests used for entrance into U.S. military service.
Bias
In a test, a systematic error in a test score. In a linkage, a systematic difference in linked values for different subgroups of test takers. Bias usually favors one group of test takers over another.
Calibration
The process of setting a test score scale, including the mean, standard deviation, and possibly the shape of the score distribution, so that scores on the scale have the same relative meaning as scores on a related score scale.
CCSSO
Council of Chief State School Officers. A nationwide, non-profit organization of public officials who head departments of elementary and secondary education. Through standing and special committees, CCSSO responds to a broad range of education concerns.
Classical test theory
The view that an individual's observed score on a test is the sum of a true score component for the test taker, plus an independent measurement error component. A few simple premises about these components lead to important relationships among validity, reliability, and other test score statistics.
Comparable forms
See alternate forms.
Composite score
A score that combines several scores by a specified formula.
Confidence interval
An interval between two values on a score scale within which, with specified probability, a score or parameter of interest lies.
Content congruence
The extent of similarity of content in two or more tests.
Content domain
The set of behaviors, knowledge, skills, abilities, attitudes or other characteristics measured by a test, represented in a detailed specification, and often organized into categories by which items are classified.
Content standard
A statement of a broad goal describing expectations for students in a subject matter at a particular grade range or at the completion of a level of schooling.
Constructed-response item
An exercise for which examinees must create their own responses or products rather than choose a response from an enumerated set.
Correlation
A measure of the degree of relationship between two paired sets of values on two variables. In this report, it usually refers to the relationship of scores on two tests, taken by a set of students. The index ranges from 1.0, signifying perfect agreement, through 0.0, representing no agreement at all, to -1.0, representing perfect negative agreement, with high scores on one variable associated with low scores on the other.
Criterion-referenced test
A test that allows users to estimate the amount of a specified content domain that an individual has learned. Domains may be based on sets of instructional objectives, for example. Also called domain-referenced tests.
Cutscore
A specified point on a score scale, such that scores at or above that point are interpreted differently from scores below that point. Sometimes there is only one cutscore, dividing the range of possible scores into "passing" and "failing" or "mastery" and "nonmastery" regions. Sometimes two or more cutscores may be used to define three or more score categories, as in establishing performance standards. See performance standard.
Distribution
The number, or the percentage, of cases having each possible data value on a scale of data values. (In testing, data values are usually test scores.) Distributions are often reported in terms of grouped ranges of data values. A distribution can be characterized by its mean and standard deviation.
Distribution matching
Equipercentile equating, but with possibly different populations.
Domain-referenced test
See criterion-referenced test.
Domain
The full array of a particular subject matter being addressed by an assessment.
Domain sampling
The process of selecting test items to represent a specified universe of performance.
Effect size
A measure of the practical effect of a statistical difference, usually a difference of the means of two distributions. The mean difference between two distributions, or an equivalent difference, is expressed in units of the standard deviation of the dominant distribution or of some average of the two standard deviations. For example, if two distributions had means of 50 and 54, and both had standard deviations of 10, the effect size of their mean difference would be 4/10, or 0.4. The effect size is sometimes called the standardized mean difference. In other contexts, other ways are sometimes used to express the practical size of an observed statistical difference.
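The worked example above can be reproduced with a short sketch. The function name `effect_size` is illustrative, and taking a simple average of the two standard deviations is only one of the possible choices the definition allows.

```python
def effect_size(mean_a, mean_b, sd_a, sd_b):
    """Standardized mean difference between two distributions.

    Assumption: the denominator is the simple average of the two
    standard deviations, one of several conventions in use.
    """
    average_sd = (sd_a + sd_b) / 2
    return (mean_b - mean_a) / average_sd


# Means of 50 and 54, both standard deviations equal to 10.
print(effect_size(50, 54, 10, 10))  # 0.4
```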
Equating
The process of statistical adjustments by which the scores on two or more alternate forms are placed on a common scale. The process assumes that the test forms have been constructed to the same explicit content and statistical specifications and administered under identical procedures.
Equipercentile
A type of nonlinear equating in which the entire score distribution of one test is adjusted to match the entire score distribution of the other for a given population. Scores at the same percentile on two different test forms are made equivalent. See distribution matching.
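A minimal sketch of the idea, assuming small samples of raw scores from the same population on two forms. The function name `equipercentile_equate` and the data are hypothetical, and operational equating typically adds smoothing and interpolation that this sketch omits.

```python
from bisect import bisect_right


def equipercentile_equate(x, form_x_scores, form_y_scores):
    """Map score x on form X to the form Y score at the same percentile."""
    xs = sorted(form_x_scores)
    ys = sorted(form_y_scores)
    # Number of form X scores at or below x.
    count = bisect_right(xs, x)
    # Matching position in the form Y distribution, using integer
    # ceiling division to avoid floating-point rounding.
    position = -(-count * len(ys) // len(xs))
    return ys[max(position, 1) - 1]


form_x = [10, 12, 14, 16, 18]  # made-up scores on form X
form_y = [20, 25, 30, 35, 40]  # made-up scores on form Y
print(equipercentile_equate(14, form_x, form_y))  # 30
```

A score of 14 sits at the 60th percentile of form X, so it is matched to 30, the score at the 60th percentile of form Y.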
Equivalency scale
A term used to refer to a score scale that has been linked to the scale of another measure.
Equivalent forms
See alternate forms.
Error of measurement
The amount of variation in a measured value, such as a score, due to unknown, random factors. In testing, measurement error is viewed as the difference between an observed score and a corresponding theoretical true score or proficiency. See standard error of measurement.
ETS
Educational Testing Service. A not-for-profit organization that produces tests for many testing programs, including the College Entrance Examination Board's Scholastic Assessment Test (SAT).
Form
In testing, a particular test in a set of tests, all of which have the same test specifications, and are mutually equated.
Framework
The detailed description of the test domain in the way that it will be represented by a test.
High-stakes test
A test whose results have important, direct consequences for examinees, programs, or institutions tested.
ITBS
Iowa Tests of Basic Skills. A series of commercial achievement tests in various school subjects, authored at the University of Iowa and published by Riverside Publishing Company, Inc.
Item
A generic term used to refer to a question or an exercise on a test or assessment. The test taker must respond to the item in some way. Since many test questions have the grammatical form of a statement, the neutral term item is preferred.
Item format
The form in which a question is posed on a test and the form in which the response is to be made. Formats include, among others, selected-response (multiple-choice) and constructed-response formats; the latter may be either short-answer or extended-response items.
Item pool
The aggregate of items from which a test's items are selected during test development, or the total set of items from which a particular test is selected for a test taker during adaptive testing.
Item response theory (IRT)
A theory of test performance that emphasizes the relationship between mean item score (P) and level (θ) of the ability or trait measured by the item. In the case of an item scored 0 (incorrect response) or 1 (correct response), the mean item score equals the proportion of correct responses. In most applications, the mathematical function relating P to θ is assumed to be a logistic function that closely resembles the cumulative normal distribution.
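As an illustration, the logistic function can be sketched as follows, with the ability level written as `theta`. The `difficulty` and `discrimination` parameters are assumptions taken from the common two-parameter logistic model; the glossary itself does not name a specific model.

```python
import math


def item_response(theta, difficulty=0.0, discrimination=1.0):
    """Expected item score P at ability level theta.

    Assumption: a two-parameter logistic model; both parameters are
    illustrative and do not come from the glossary.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A test taker whose ability equals the item difficulty has an even
# chance of answering correctly.
print(item_response(0.0))  # 0.5
```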
Linkage
The result of placing two or more tests on the same scale so that scores can be used interchangeably. Linking methods include equating, calibration, statistical moderation, and social moderation.
KIRIS
Kentucky Instructional Results Information System. An assessment developed by the Kentucky Department of Education, which primarily uses performance tasks.
Linear equating
A form of equating in which the scores on one test are transformed linearly so that their mean and standard deviation equal those of another test. Sometimes both sets of test scores are transformed so that each has a common mean and standard deviation.
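A minimal sketch of the transformation, assuming both score sets come from the same group of test takers. The function name `linear_equate` and the data are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, from_scores, to_scores):
    """Transform score x so that it has the same standardized position
    on the target test as it has on the source test."""
    z = (x - mean(from_scores)) / pstdev(from_scores)
    return mean(to_scores) + z * pstdev(to_scores)


test_1 = [40, 50, 60]  # made-up scores: mean 50
test_2 = [45, 60, 75]  # made-up scores: mean 60, wider spread
print(linear_equate(50, test_1, test_2))  # the mean of test_1 maps to the mean of test_2
```

Applying the function to every score on the source test yields a distribution with the mean and standard deviation of the target test.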
Low-stakes test
A test whose results have only minor or indirect consequences for the examinees, programs, or institutions tested.
LSAT
Law School Admission Test. A large-scale test administered to applicants for admission to law schools.
Matrix sampling
A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a subsample of test takers, thereby avoiding the need to administer all items to all examinees.
Mean
The numerical average of a set of data values, such as test scores.
Measurement error variance
That portion of the observed score variance attributable to one or more sources of measurement error; the square of the standard error of measurement.
Metric
The units in which the values on a scale are expressed.
Moderation
See statistical moderation, social moderation. Used without a modifier, the term usually means statistical moderation.
MSPAP
Maryland School Performance Assessment Program. A state-produced assessment in several school subjects, containing only extended performance tasks. Some matrix sampling is used in its administration.
NAEP
National Assessment of Educational Progress. An assessment given periodically to a representative sample of U.S. students in 4th, 8th, and 12th grades in reading, mathematics, social studies, and science, and in other subjects on an occasional basis. Since 1990, a separate state-by-state assessment has also been conducted.
NAGB
National Assessment Governing Board, responsible for policy governing the NAEP.
Normal distribution
A particular form of data distribution, with a definite mathematical form. A normal distribution is symmetric in shape, with relatively many values concentrated near the mean, and relatively few that depart greatly from the mean. A normal distribution is specified by its mean and standard deviation. About 68 percent of the values are within 1 standard deviation of the mean, about 95 percent are within 2 standard deviations of the mean, and nearly all values are within 3 standard deviations of the mean. Many distributions of test scores are approximately normal in shape. The term "normal" is used to connote customary, or related to the norm, not ideal.
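The share of values within a given number of standard deviations of the mean can be checked with the error function from the Python standard library. The function name `normal_coverage` is illustrative.

```python
import math


def normal_coverage(k):
    """Proportion of a normal distribution within k standard
    deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```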
Normalized standard score
A derived test score in which a numerical transformation has been chosen so that the score distribution closely approximates a normal distribution for some specific population.
Norm-referenced test
A test on which scores are interpreted on the basis of a comparison of a test taker's performance to the performance of other people in a specified reference population.
Norms
Statistics or tabular data that summarize the distribution of test performance for one or more specified groups, such as test takers of various ages or grades. Norms are usually designed to represent some larger population, such as all test takers in the country. The group of examinees represented by the norms is referred to as the reference population.
Parallel forms
See alternate forms.
Percentile
The score on a test below which a given percentage of test takers' scores fall.
Percentile rank
The percentage of scores in a specified distribution that fall below the point at which a given score lies.
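A minimal sketch of this definition, counting only scores strictly below the given score. The function name and the data are hypothetical; other conventions, such as counting half of the tied scores, are also in use.

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution that fall below score."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # made-up distribution
print(percentile_rank(80, scores))  # 50.0
```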
Performance assessments
Product- and behavior-based measurements based on settings designed to emulate real-life contexts or conditions in which specific knowledge or skills are actually applied.
Performance standard
An objective definition of a certain level of performance in some domain in terms of a cutscore or a range of scores on the score scale of a test measuring proficiency in that domain. Also, sometimes, a statement or description of a set of operational tasks exemplifying a level of performance associated with a more general content standard; the statement may be used to guide judgments about the location of a cutscore on a score scale.
Pilot test
A test administered to a representative sample of test takers solely for the purpose of determining the properties of the test.
Precision of measurement
A general term that refers to the reliability of a measure, or its sensitivity to error of measurement.
Projection
A method of linking based on the regression of scores from one test (test B) onto the scores of another test (test A). The projected score is the average B score for all persons with a given A score. The projection of test B to test A is different from the projection of test A to test B. See regression.
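The asymmetry can be seen in a small sketch that fits a least-squares line in each direction. The helper `fit_line` and the paired scores are made up for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


test_a = [1, 2, 3, 4, 5]  # made-up paired scores on test A
test_b = [2, 1, 4, 3, 5]  # made-up paired scores on test B

slope_ab, intercept_ab = fit_line(test_a, test_b)  # predict B from A
slope_ba, intercept_ba = fit_line(test_b, test_a)  # predict A from B

projected_b = slope_ab * 5 + intercept_ab        # B projected from A = 5
back_to_a = slope_ba * projected_b + intercept_ba  # A projected from that B
print(projected_b, back_to_a)
```

Projecting an A score of 5 to test B and then projecting the result back does not return 5, because each regression line pulls predictions toward the mean. An equating, by contrast, is symmetric.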
Random error
An unsystematic error; a quantity (often assessed indirectly) that appears to have no relationship to any other variable.
Raw score
The unadjusted score on a test, often determined by counting the number of correct answers, but more generally a sum or other combination of item scores.
Reference population
The population of test takers represented by test norms. The sample on which the test norms are based is intended to permit accurate estimation of the test score distribution for the reference population. The reference population may be defined in terms of the test taker's age, grade, clinical status at time of testing, or other characteristics.
Regression
A statistical procedure for estimating the value associated with an entity on one variable, called the dependent variable, from the values of that entity on one or more other variables, called independent variables. The term without modification usually refers to linear least-squares regression, in which the values for an entity on the independent variables are combined linearly to form an estimate of the dependent variable. The linear combination is developed using values for a sample of entities on all the variables and finding the linear combination that minimizes the average squared discrepancy between the estimated value and the actual value for the sample.
Regression coefficient
A multiplier of an independent variable in a linear equation that relates a dependent variable to a set of independent variables. Can also be understood as the marginal effect of a change in an independent variable on the value of the dependent variable. The coefficient is said to be standardized or unstandardized as the variable it multiplies has been scaled to a standard deviation of 1.0 or has some other standard deviation, respectively.
Relative score interpretations
The meaning of a score for an individual, or the average score for a definable group, derived from the rank of the score or average within one or more reference distributions of scores.
Reliability
The degree to which the scores are consistent over repeated applications of a measurement procedure and hence are dependable and repeatable; the degree to which scores are free of errors of measurement. Reliability is usually expressed by a unit-free index that either is, or resembles, a product-moment correlation. In classical test theory, the term represents the ratio of true score variance to observed score variance for a particular examinee population. The conditions under which the coefficient is estimated may involve variation in test forms, measurement occasions, raters, or scorers, and may entail multiple examinee products or performances. These and other variations in conditions give rise to qualifying adjectives, such as alternate-forms reliability, internal-consistency reliability, test-retest reliability, etc.
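The classical test theory ratio can be sketched as follows, assuming that observed score variance is the sum of true score variance and an independent error variance. The function names and the numbers are illustrative.

```python
import math


def reliability(true_variance, error_variance):
    """Ratio of true score variance to observed score variance,
    assuming observed variance = true variance + error variance."""
    return true_variance / (true_variance + error_variance)


def standard_error_of_measurement(observed_sd, rel):
    """Square root of the error variance implied by the reliability."""
    return observed_sd * math.sqrt(1 - rel)


rel = reliability(true_variance=75, error_variance=25)
print(rel)  # 0.75
print(standard_error_of_measurement(10, rel))  # 5.0
```

With an observed standard deviation of 10 (variance 100) and a reliability of 0.75, the error variance is 25, so the standard error of measurement is 5.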
SAT
(1) Scholastic Assessment Test, the College Entrance Examination Board's test designed to predict college performance. The test battery contains a verbal section, and a mathematics section, as well as specialized subject tests. It is produced by ETS. (2) Stanford Achievement Test, a set of achievement tests used for student assessment in some states, produced by Harcourt Brace Educational Measurement.
Scale score
A score on a test that is expressed on some defined scale of measurement. See scaling.
SCASS
State Collaborative on Assessment and Student Standards. A project of the Council of Chief State School Officers designed to help states develop student standards and assessments by working together with other states with similar needs.
Scaling
The process of creating a scale score. Scaling may enhance test score interpretation by placing scores from different tests or test forms onto a common scale or by producing scale scores designed to support criterion-referenced or norm-referenced score interpretations.
Score
Any specific number resulting from the assessment of an individual; a generic term applied for convenience to such diverse measures as test scores, production counts, absence records, course grades, ratings, and so forth.
Scoring rubric
The principles, rules, and standards used in scoring an examinee's performance, product, or constructed response to a test item. Scoring rubrics vary in the degree of judgment entailed, in the number of distinct score levels defined, in the latitude given scorers for assigning intermediate or fractional score values, and in other ways.
Selected-response item
A test item for which the test taker selects a response from the choices provided; also known as a multiple-choice item.
Social moderation
An adjustment in the values of test scores to account for known social factors affecting test scores for a group of test takers.
Standard error of measurement
The standard deviation of the distribution of errors of measurement that is associated with the test scores for a specified group of test takers.
Standard score
A type of derived score such that the distribution of these scores for a specified population has convenient, known values for the mean and standard deviation.
Standard deviation
An index of the degree to which a set of data values is concentrated about its mean. Sometimes referred to as "spread." The standard deviation measures the variability in a distribution of quantities. Distributions with relatively small standard deviations are relatively concentrated; larger standard deviations signify greater variability. In common distributions, like the mathematically defined "normal distribution," roughly 68 percent of the quantities are within (plus or minus) 1 standard deviation of the mean; about 95 percent are within (plus or minus) 2 standard deviations; nearly all are within (plus or minus) 3 standard deviations. See also distribution, effect size, normal distribution, variance.
Standardization
In test administration, maintaining a constant testing environment and conducting the test according to detailed rules and specifications so that testing conditions are the same for all test takers. In statistical analysis, transforming a variable so that its standard deviation is 1.0 for some specified population or sample.
Statistical moderation
An adjustment of the score scale of one test, usually by transforming the scores so that their mean and standard deviation are equal to the mean and standard deviation of another distribution of test scores. It is statistically equivalent to linear equating, the simplest form of linking. See also social moderation.
Systematic error
A score component (often observed indirectly), not related to the test performance, that appears to be related to some salient variable or subgrouping of cases in an analysis. See bias.
Test
A set of items given under prescribed and standardized conditions for the purpose of measuring the knowledge, skill, or ability of a person. The person's responses to the items yield a score, which is a numerical evaluation of the person's performance on the test.
Test development
The process through which a test is planned, constructed, evaluated and modified, including consideration of the content, format, administration, scoring, item properties, scaling, and technical quality for its intended purpose.
Test specifications
A framework that specifies the proportion of items that assess each content and process or skill area; the format of items, responses, and scoring protocols and procedures; and the desired psychometric properties of the items and test, such as the distribution of item difficulty and discrimination indices.
Test user
The person(s) or agency responsible for the choice and administration of a test, the interpretation of test scores produced in a given context, and any decisions or actions that are based, in part, on test scores.
TIMSS
Third International Mathematics and Science Study. An assessment given in 1995 to samples of students in a large number of countries.
Unbiased
The obverse of biased. See bias.
Validation
The process of investigation by which the validity of the proposed interpretation of test scores is evaluated.
Validity
When applied to a test, an overall evaluation of the degree to which accumulated evidence and theory support specific interpretations of test scores. When applied to a linkage of two or more tests, the extent to which the scores from one test can be interpreted in the same way as the scores from others.
Variance
A measure of the spread of data values, such as test scores; the square of the standard deviation. The variance is the mean of the squared deviations of the data values from their mean.
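A short sketch of this definition, treating the data as a complete population so that the divisor is the number of values. The function name and the scores are made up.

```python
import math


def variance(values):
    """Mean of the squared deviations of the values from their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores with mean 5
print(variance(scores))             # 4.0
print(math.sqrt(variance(scores)))  # 2.0, the standard deviation
```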
VNT
Voluntary National Tests. Achievement tests, proposed by President Clinton in 1997, that states could choose to give to assess the performance of 4th-grade students in reading and 8th-grade students in mathematics. Intended as nationally sponsored tests yielding individual student scores that can be compared with national (and international) benchmarks.