EQUIVALENCY AND LINKAGE OF EDUCATIONAL TESTS: INTERIM REPORT

Coming to Terms: Assumptions, Definitions, and Goals of Linkage

LINKING AND EQUATING

Consistent with the scientific literature on psychological and educational measurement, the committee interprets the legislation's reference to “equivalency scale” to mean the result of equating or linking the results of different tests and assessments (see, e.g., Holland and Rubin, 1982). Throughout this report we use the term “linkage” to mean various well-established statistical methods for connecting scores on different tests and assessments to each other and for reporting them on a common scale. The goal of these methods is to enable the performance of one student on one test to be compared with performances of other students on different tests (Mislevy, 1992; Linn, 1993). Box 1 describes these linkage methods.

Whatever the method used, there are two main technical concerns with linkages: accuracy and consistency. Accuracy is analogous to the “margin for error” in opinion polling and depends on the amount of data used in the calculations. The more test takers in a study, the higher the accuracy of the linkage. Consistency refers to the consistency of the linkage found across all of the relevant subpopulations of test takers, which depends (among other things) on the details of the tests themselves and on their relationship to the educational experiences of the test takers. In the dynamic situation of educational reform today, the relevant subpopulations may even include those students to be tested in the next few years for whom there are no data currently available in a linking study done today. The consequence of inaccuracy is a decrease in the reliability of the scores when they are placed on the scale of the test not taken by the test taker. The consequence of inconsistency is a degree of “bias,” which can create disadvantages for some students and advantages for others.
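
The polling analogy for accuracy can be made concrete with a small calculation. The following is a minimal sketch, assuming the quantity being estimated is a mean score from a simple random sample of test takers and that a normal approximation applies. The function name and the score standard deviation of 100 are illustrative assumptions, not values from any linking study.

```python
import math


def margin_of_error(score_sd: float, n_test_takers: int, z: float = 1.96) -> float:
    """Approximate 95 percent margin of error for a mean score from n test takers."""
    if n_test_takers <= 0:
        raise ValueError("n_test_takers must be positive")
    return z * score_sd / math.sqrt(n_test_takers)


# Hypothetical score standard deviation of 100 points.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(100.0, n), 2))
```

Under these assumptions, quadrupling the number of test takers halves the margin of error. The calculation speaks only to accuracy; a large sample does nothing to remove the bias that arises when a linkage is inconsistent across subpopulations.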

Because the technical aspects of testing are unfamiliar to many readers of this report, analogies with measuring temperature may be useful. For example, test results are often reported on “scales” that are arbitrary in the same way that the Fahrenheit temperature scale is arbitrary in placing freezing at 32 degrees and boiling at 212 degrees; examples are the 200 to 800 scale of the Scholastic Assessment Test and the 0 to 500 scale of NAEP. The analogy of the link between different scales for measuring temperature, for example,
Fahrenheit and Celsius, is also a useful starting point for understanding what it means for test scores or scales to be linked; see Box 2.

FEASIBILITY

The committee interprets feasibility to encompass both practicality and validity.

By practicality we mean not only whether a linkage can be created, in the mathematical or statistical sense, but also whether the costs of doing so (the financial cost and logistical burden of collecting and analyzing the data necessary for the linkages) are reasonable and manageable. Because of the short time frame of this study, a comprehensive cost analysis could not be conducted. However, the committee has reviewed the Anchor Test Study (Loret et al., 1972), which developed an equivalency scale for eight reading subtests, a number representing almost 90 percent of reading tests used at that time, at the cost of more than $1 million. There is greater diversity and less stability among testing programs today, leading us to assume that the costs of linkage would be substantially higher. The committee's final report will address this issue in more detail; the issue will need more concentrated attention if policy makers decide to proceed with a research, development, and implementation program to link existing tests.

Statisticians and other measurement experts link results of different tests through various methods, which include evaluating the content and character of tests being linked, collecting large amounts of data on student performance on each test, and carrying out statistical computations to identify accurate and consistent relationships between these test results (see, e.g., Linn, 1993; Mislevy, 1992). These methods will add to the expected cost of linking.

By validity we mean whether any linkage that is created can support the meaning and interpretation (the inferences) that users are likely to draw (see, e.g., Messick, 1989). The validity of an inference based on a linkage hinges on several considerations, including the similarity of content of the linked scales. If two tests both measure the degree to which a student has mastered a particular subject (domain of content), then the link permits comparisons of the results from both tests in similar terms. The validity of a link is supported by a strong statistical correlation of the scores from the two tests and by the consistency of linkage across population groups, such as boys and girls, African Americans and whites, or residents of New York and California.

There are many possible purposes for establishing linkages among tests. The committee recognizes that the validity of a linkage varies depending on its purpose and use and on whether it is being used as intended. Linkages that provide valid and useful information for some purposes may nonetheless be inadequate, and so invalid, for others. For example, a link that is sufficiently accurate to categorize the educational quality of schools or school districts may not be sufficiently accurate and stable to classify the proficiency of individual students (Williams et al., 1995).

USES OF LINKAGES

The committee realizes that with careful planning it is possible to establish valid links between tests that meet certain conditions when such links will be used for well-defined purposes. These linkages frequently involve tests that are intended to be equated and are therefore created to identical specifications; are highly similar in content emphasis, difficulty, and format; are equally reliable; and are expected to be administered under the same conditions. The equating of different forms of college admissions tests by the College Board or the American College Testing Program is an example of this type of linkage. The committee is reviewing studies of this type of linkage because they may shed light on our

Box 1 Linking Methods

Equating. The strongest kind of linking, and the one with the most technical support, is equating. Equating is most frequently used when comparing the results of different forms of a single test that have been designed to be parallel. The College Board equates different forms of the Scholastic Assessment Test (SAT) and treats the results as interchangeable. Equating is possible if test content, format, purpose, administration, item difficulty, and populations are equivalent.

In linear equating, the mean and standard deviation of one test are adjusted so that they are the same as the mean and standard deviation of another. Equipercentile equating adjusts the entire score distribution of one test to the entire score distribution of the other. In this case, scores at the same percentile on two different test forms are equivalent. Thus, if a score of 122 on one test, X, is at the 75th percentile and a score of 257 on another test, Y, is also at the 75th percentile for the same population of test takers, then 122 and 257 are linked by the equipercentile method. This means that 75 percent of the test takers in this population would score 122 or less on test X or would score 257 or less on test Y. The linked scores, 122 and 257, have the same meaning in this very specific and widely used sense, and we would place the X score of 122 onto the scale of test Y by using the value of 257 for it. By following this procedure for each percentile value from 1 to 99, tests X and Y are linked. Equipercentile equating is not the only method used to link tests, but it is a basic one and is closely connected to the other methods (Holland and Rubin, 1982; Peterson et al., 1989; Kolen and Brennan, 1995).

Two tests can also be equated using a third test as an anchor. This “anchor test” should have similar content to the original tests, although it is typically shorter than the two original tests.
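
The linear and equipercentile methods described in this box can be sketched in a few lines. This is a minimal textbook-style illustration, assuming both tests were taken by the same population and ignoring the smoothing and interpolation that operational equating programs typically apply. The function names and the eight scores per test are hypothetical, chosen so that 122 on test X and 257 on test Y both fall at the 75th percentile, as in the example in the text.

```python
import math
from bisect import bisect_right
from statistics import mean, pstdev


def linear_equate(x, x_scores, y_scores):
    """Place score x from test X on the scale of test Y by matching mean and SD."""
    slope = pstdev(y_scores) / pstdev(x_scores)
    return mean(y_scores) + slope * (x - mean(x_scores))


def percentile_rank(x, scores):
    """Percent of scores at or below x."""
    ordered = sorted(scores)
    return 100.0 * bisect_right(ordered, x) / len(ordered)


def score_at_percentile(p, scores):
    """Smallest score whose percentile rank is at least p."""
    ordered = sorted(scores)
    # Number of scores that must fall at or below the answer; the small
    # tolerance guards against floating-point round-off.
    k = math.ceil(p / 100.0 * len(ordered) - 1e-9)
    k = min(max(k, 1), len(ordered))
    return ordered[k - 1]


def equipercentile_equate(x, x_scores, y_scores):
    """Place score x from test X on the scale of test Y by matching percentiles."""
    return score_at_percentile(percentile_rank(x, x_scores), y_scores)


# Hypothetical scores for one population on two tests.
test_x = [95, 104, 110, 113, 118, 122, 127, 135]
test_y = [180, 205, 221, 233, 246, 257, 270, 291]

print(percentile_rank(122, test_x))                # 75.0
print(equipercentile_equate(122, test_x, test_y))  # 257
print(linear_equate(122, test_x, test_y))
```

The two methods generally give somewhat different answers for the same score, because linear equating uses only the mean and standard deviation while equipercentile equating uses the whole distribution.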
Often the anchor test is a separately timed section of the original tests. Sometimes, however, the items on the anchor test are interspersed with the items on the main tests. A separate score is computed for the responses to those items as if they were a separate test.

An assumption of the equipercentile equating methodology is that the linking function found in this manner is consistent across the various populations that could be chosen for the equating. For example, the same linking function should be obtained if the population is restricted only to boys or only to girls. However, the research literature shows that this consistency is to be expected only when the tests being linked are very similar in a variety of ways that are discussed in the body of this report.

Calibration. Tests or assessments that are constructed for different purposes, using different content frameworks or test specifications, will almost always violate the conditions required for equating. When scores from two different tests are put on the same scale, the results are said to be comparable, or calibrated. Most of the statistical methods used in equating can be used in calibration, but it is not expected that the results will be consistent across different populations.

Two types of empirical data support equating and calibration of scores between two tests. In one type, the two tests are given to a single group of test takers. When the same group takes both tests, the intercorrelation of the tests provides some empirical evidence of equivalent content. In a second design, two tests are given to equivalent groups of test takers. Equivalent groups are often formed by giving both tests at the same time to a large group, with some of the examinees taking one test and some the other. When the tests are given at different times to different groups of test takers, equivalence is harder to assert.

Two tests can be equated or calibrated using a third test as an anchor. This method requires that one group of students takes tests A and C, while another group takes tests B and C. Tests A and B are then calibrated through the anchor test, C. For this method to be valid, the anchor test should have the same content as the original tests, although it is typically shorter than the other tests.

One relatively new equating procedure, used extensively in NAEP and many other large testing programs, depends on being able to calibrate the individual items that make up a test, rather than the test itself. Each of a large number of items about a given subject is related or calibrated to a scale measuring that subject, using a statistical theory called item response theory (IRT). The method works only when the items are all assessing the same material, and requires that a large number of items be administered to a large representative set of test takers. Once all items are calibrated, a test can be formed from a subset of the items, and be assured of being equated automatically to another test formed from a selection of different items.

Projection. A special unidirectional form of linking can be used to predict or “project” scores on one test from scores on another test without any expectation that exactly the same things are being measured. Usually, both tests are given to a sample of students and then statistical regression methods are applied. It is important to note that projecting Test A onto Test B gives different results from projecting Test B onto Test A.

Moderation. Moderation is the weakest form of linking. It is used when the tests have different blueprints and are given to different, nonequivalent groups of examinees. Procedures that match distributions using scores are called statistical moderation links, while others that match distributions using subjective judgments are referred to as social moderation links. In either case, the resulting links are only valid for making some very general comparisons (Mislevy, 1992; Linn, 1993).

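
The one-directional nature of projection can be seen in a small example. The sketch below assumes that projection is carried out with ordinary least-squares regression on paired scores from students who took both tests; the function name and the scores are hypothetical.

```python
from statistics import mean


def fit_projection(from_scores, to_scores):
    """Least-squares line predicting to_scores from from_scores.

    Returns (slope, intercept).
    """
    m_from, m_to = mean(from_scores), mean(to_scores)
    s_xx = sum((x - m_from) ** 2 for x in from_scores)
    s_xy = sum((x - m_from) * (y - m_to) for x, y in zip(from_scores, to_scores))
    slope = s_xy / s_xx
    return slope, m_to - slope * m_from


# Hypothetical paired scores: the same eight students took both tests.
test_a = [10, 12, 15, 18, 20, 23, 25, 29]
test_b = [31, 30, 38, 36, 45, 41, 50, 49]

slope_ab, intercept_ab = fit_projection(test_a, test_b)  # project A onto B
slope_ba, intercept_ba = fit_projection(test_b, test_a)  # project B onto A

a_score = 25
projected_b = slope_ab * a_score + intercept_ab
round_trip_a = slope_ba * projected_b + intercept_ba

print(projected_b)   # Test A score of 25 projected onto the Test B scale
print(round_trip_a)  # that value projected back onto the Test A scale
```

Unless the two tests are perfectly correlated, the round trip does not return the original score: it is pulled toward the Test A mean. This is one reason a projection in one direction cannot simply be inverted to obtain the projection in the other direction, in contrast to an equating function.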
charge, but we underscore that our conclusions do not apply to the linking of different forms of the same test.

There are other situations in which it is fairly routine for two tests to be linked and the results of the linkage to be used for well-defined purposes. For example, when a new test is introduced into a product line, a test publisher will establish links between the new product and the old one so that results obtained from the two tests can be compared. For instance, CTB/McGraw Hill has linked the California Test of Basic Skills with its newer Terra Nova test; Harcourt Brace Educational Measurement has linked the Stanford Achievement Test 8 with the Stanford Achievement Test 9; Riverside Publishing has linked the Iowa Tests of Basic Skills M with earlier versions of the test. Sometimes the test specifications may have changed in response to shifts in educational emphases, and the old and new versions will not be as similar as two different versions of a test made to the same specifications; however, old and new versions can generally be successfully calibrated and put on the same scale.

Another routine use of linking occurs when states or schools change from one testing program to another. In these cases it is not uncommon for a test publisher to conduct a study to link the two testing programs, even when the instruments were created by different publishers (Wendy Yen, personal communication). For example, when Virginia changed from the Iowa Tests of Basic Skills to the Stanford Achievement Test 9, the publisher of the Stanford 9, Harcourt Brace Educational Measurement, conducted a linkage study for Virginia that allowed trend lines for school and state data to be maintained. Such calibrations are not as robust as links of equivalent forms, but they suffice for comparing aggregate data.

The committee is also reviewing studies that describe state efforts to link their assessment results to NAEP, to estimate how schools or districts (but not individuals) might have performed if their students had participated in NAEP. There are also studies that compare trends on NAEP with trends on state tests, in order to evaluate states' progress against a national benchmark (Williams et al., 1995; Ercikan, 1997). Most recently, the National Center for Education Statistics of the Department of Education completed a study designed to link 4th and 8th grade mathematics and science results on NAEP and TIMSS; their aim was to estimate how groups of students who participated in the 1996 NAEP would have performed on the 1995 TIMSS (U.S. Department of Education, 1998). The committee is reviewing these studies (and others) in an effort to be comprehensive; we are aware, however, of fundamental differences between links across aggregates (states, districts) and links involving scores of individual
students. For example, in linking to produce aggregate summary statistics for school districts or schools it is reasonable to incorporate important demographic information about the test takers into the linking function; such information would not be appropriate when reporting linked scores for individuals.

Box 2 Fahrenheit, Celsius, and Educational Tests

There is a well-known formula for linking Fahrenheit and Celsius temperatures: °F = (9/5)°C + 32. Thus, if one reads that Paris is suffering from a 35-degree heat wave (which may not seem very hot), one needs to multiply 35 by 9 and divide that result by 5 to get 63 and then add 32 to get a very recognizably hot 95, in degrees Fahrenheit.

This formula is an example of a linking function and is analogous to what is meant by linking two test score scales. Just as one placed the Celsius value of 35 on the Fahrenheit scale and got 95 (which may be more meaningful to some people), linking can allow one to place the scores from one test on the scale of another and interpret that score or to compare it to those of test takers who took the other test. Other uses of linking assessments are to estimate how schools or districts would have performed had their students taken an assessment, such as NAEP, that they did not take.

Although the temperature measurement analogy is useful for understanding some aspects of tests, it is only a partial analogy because temperature measurement is very simple compared with the assessment of complex cognitive activities, such as reading or mathematics.

DISTINCT CHARACTER OF NAEP

NAEP is a periodically administered, federally sponsored survey of a nationally representative sample of American students that assesses student achievement in key subjects. It combines the data from all test takers and uses the resulting aggregate information to monitor and report on the academic performance of U.S.
children as a group, as well as by specific subgroups of the student population. NAEP was not designed to provide achievement information about individual students. Rather, NAEP reports the aggregate, or collective, performance of students in two ways, scale scores and achievement levels: the scale score results provide information about the distribution of student achievement for groups and subgroups in terms of a continuous scale; achievement levels are used to categorize student achievement as basic, proficient, or advanced (U.S. Department of Education, 1997).

NAEP makes use of a technique called matrix sampling, which enables it to achieve two goals. First, students are asked to answer a relatively small number of test questions so that the testing task given to students takes a relatively short time. Second, by asking different sets of questions of different students, the assessments cover a much larger array of questions than those given to any one student. NAEP's statistical design makes it possible to estimate the distribution of student scores by pooling data across subjects (Mislevy et al., 1992; Beaton and Gonzalez, 1995). The price paid for this flexibility is the inability of these assessments to collect enough data from any single student to allow valid individual scores to be reported.

NAEP's distinctive characteristics present special challenges of content comparability with other tests (e.g., Kenney and Silver, 1997). First, NAEP content is determined through a rigorous and lengthy consensus process that culminates in “frameworks” deemed relevant to NAEP's principal goal of monitoring aggregate student performance for the nation as a whole. NAEP content is not supposed to reflect particular state or local curricular goals, but rather a broad national consensus on what is or should be taught; by design, its content is different from that of many state assessments (Campbell et al., 1994).

Second, NAEP's structure is unique: each student in the NAEP national sample takes only one booklet that contains a few short blocks of NAEP items in a single subject area (generally, three 15-minute or two 25-minute blocks), and no student's test booklet is fully representative of the entire NAEP assessment in that subject area. The scores for the blocks a student takes are used to predict his or her performance on the entire assessment. Thus, the portion of NAEP any one student takes is unlikely to be comparable in content to the full knowledge domain covered by an individual test taker in a state or commercial test (see, e.g., U.S. Department of Education, 1997; National Research Council, 1996; Beaton and Gonzalez, 1995; U.S. Congress, 1992). These issues greatly increase the difficulty of establishing valid and reliable links between commercial or state tests and NAEP.
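
The idea behind matrix sampling, described above, can be illustrated with a toy booklet assignment. This is a simplified sketch and not NAEP's actual booklet design: the six item blocks, the three-blocks-per-booklet rule, the 40 students, and the function names are all assumptions made for illustration.

```python
from itertools import combinations, cycle


def build_booklets(block_ids, blocks_per_booklet):
    """Form one booklet from every distinct combination of item blocks."""
    return list(combinations(block_ids, blocks_per_booklet))


def assign_booklets(student_ids, booklets):
    """Hand out booklets in rotation so each is used about equally often."""
    return dict(zip(student_ids, cycle(booklets)))


# Hypothetical assessment with six blocks of items; each student sees three.
blocks = ["B1", "B2", "B3", "B4", "B5", "B6"]
booklets = build_booklets(blocks, 3)
assignment = assign_booklets(range(40), booklets)

# Blocks covered by the sample as a whole versus by any one student.
covered = set().union(*assignment.values())
share_per_student = 3 / len(blocks)

print(len(booklets))        # number of distinct booklets
print(sorted(covered))      # blocks answered by at least one student
print(share_per_student)    # fraction of the item pool any one student sees
```

In this sketch the sample as a whole answers every block, while each student answers only half of the item pool. That is why such a design can describe group performance across the full assessment but does not yield a score for an individual student that represents the whole assessment.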