2
Technical Aspects of Links

Throughout this report we use the term "linkage" to mean various well-established statistical methods (see, e.g., Mislevy, 1992; Linn, 1993) for connecting scores on different tests and assessments with each other and for reporting them on a common scale. In this chapter we explain how linkage works.

Because the technical aspects of testing are unfamiliar to many readers of this report, analogies with measuring temperature may be useful. For one thing, like test results, temperatures are reported on scales that are somewhat arbitrary, such as the 32-212 degrees of the Fahrenheit scale and the 0-500 scale for the National Assessment of Educational Progress (NAEP). Points on each scale represent a quality of what is being measured: 95 degrees Fahrenheit is hot; a 350 NAEP score is high performance. Different temperature scales, say, Fahrenheit and Celsius, can be linked by using a simple formula. In that way, one would know that 30 degrees Celsius is hot, not 2 degrees colder than freezing.
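Because the Fahrenheit-Celsius link is an exact linear function, the conversion just described can be written in a few lines. The sketch below is purely illustrative (the function names are ours):

```python
def celsius_to_fahrenheit(c):
    """Exact linear link between two temperature scales: F = 9/5 * C + 32."""
    return 9.0 / 5.0 * c + 32.0

def fahrenheit_to_celsius(f):
    """The inverse link; an exact linkage loses no information."""
    return (f - 32.0) * 5.0 / 9.0
```

Applying the first function to 30 degrees Celsius gives 86 degrees Fahrenheit, confirming that 30 degrees Celsius is indeed hot, not 2 degrees colder than freezing.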

Whether one is discussing links of temperature or scores from different tests, a linkage provides a method for adjusting the results from one instrument to be comparable with another instrument. In some cases, a student's score on one test can be adjusted and then substituted for a score from a test not taken by the student. In other cases, only aggregate or group-level results can be linked and compared; for example, the average





reading proficiency of 4th-grade students in Maine compared with those in Vermont, derived from different tests.

In the case of the Fahrenheit and Celsius scales, the linkage or formula for converting a temperature value from one scale to the other is exact; for example, 35 degrees Celsius is always equivalent to 95 degrees Fahrenheit. In contrast, linking different educational assessments is not exact. As noted in Chapter 1, different testing instruments may purport to assess similar general domains but may place differing emphases on specific aspects of those domains. Even when the content and format of tests are perfectly aligned, any linkage between them can only be estimated and, therefore, will always contain at least some estimation error.

The rest of this chapter discusses approaches that have been used to link educational assessments and the problems one might encounter, or the potential problems one should try to uncover, in such linking.

Constructing Links

Statistical Methods for Linking

Most linking methods are based on statistical analyses of the score distributions on the tests or test forms being linked. Various study designs can be used to collect the data needed for linking.

A common method is the single-group design, in which a single group of people takes both tests or test forms. In this design, the intercorrelation of the tests provides some empirical evidence of the extent to which the two tests have equivalent content. However, the single-group design has the disadvantage that each student must take two tests: fatigue may affect the scores on the second test, some part of the first test may suggest how to answer an item on the second test, or there may be a large time lag between the two test administrations.

A second method for collecting data for linking is the equivalent-group design, in which two tests are given to equivalent groups of test takers. Equivalent groups are often formed by giving both tests at the same time to a large group, with one randomly selected half of the examinees taking one test and the remaining half taking the other. When the tests are given at different times to different groups of test takers, the equivalence of the two groups is harder to guarantee.

A third method for collecting data for linking involves an anchor test.

Two tests can be equated or calibrated using a third test as an anchor. This method requires that one group of students takes tests A and C, while another group takes tests B and C. Tests A and B can then be linked through various statistical computations involving the anchor test, C. For this method to be valid, the anchor test has to have the same content as the original tests, although it is typically shorter, and therefore less reliable, than the other tests.

Forms of Linking

Despite efforts by Mislevy (1992) and Linn (1993) to bring coherence to the definitions of linking, the literature is not completely consistent in its use of the terminology. The term equating is often used as the general term; in this report we use the term linking as the general term. Many of the statistical methods are applicable to all forms of linking, but some are applicable only to certain types of linking. This section defines and discusses equating, calibration, projection, and moderation as used in this report.

Equating

The term equating is reserved for situations in which two or more different forms of a single test have been constructed according to the same blueprint—that is, the forms adhere to the same test specifications, are of about the same difficulty and reliability, are given under the same standardized conditions, and are to be used for the same purposes (see, e.g., Holland and Rubin, 1982; Kolen and Brennan, 1995). In linear equating, the scores on one test are adjusted so that their mean and standard deviation match the mean and standard deviation of the other test. Equipercentile equating adjusts the entire score distribution of one test to that of the other, so that scores at the same percentile on two different test forms are equivalent.
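These two adjustments can be sketched numerically. The following is a bare-bones illustration (the function names are ours, and operational equating uses large samples and smoothed score distributions, not the crude percentile lookup shown here):

```python
from statistics import mean, pstdev

def linear_equate(x, scores_x, scores_y):
    """Map a score x from form X onto form Y's scale so that the
    linked scores take on form Y's mean and standard deviation."""
    mx, sx = mean(scores_x), pstdev(scores_x)
    my, sy = mean(scores_y), pstdev(scores_y)
    return my + sy * (x - mx) / sx

def equipercentile_equate(x, scores_x, scores_y):
    """Map x to the form Y score at the same percentile rank
    (a crude, unsmoothed version)."""
    n = len(scores_x)
    p = (sum(s < x for s in scores_x)
         + 0.5 * sum(s == x for s in scores_x)) / n
    ys = sorted(scores_y)
    return ys[min(int(p * len(ys)), len(ys) - 1)]
```

With real data the two methods give similar results only when the two score distributions have similar shapes; equipercentile equating makes no such assumption, which is why it is often preferred for full-length forms.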
Calibration

If two assessments have the same framework but have different test specifications (including differing lengths) and different statistical characteristics, then linking the scores for comparability is called calibration. Sometimes a short form of a test is used for screening purposes: its scores are calibrated with the scores from the long form. Sometimes tests designed for different grade levels are calibrated to a common scale; this process is also called vertical equating.
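Calibration of this kind is usually carried out at the level of individual items rather than whole tests, as the discussion of item response theory below explains. The basic building block is an item characteristic curve; a minimal sketch (the parameter names follow the common two-parameter logistic convention, and the values are illustrative only):

```python
import math

def item_prob(theta, a, b):
    """Two-parameter logistic item response function: the probability
    that an examinee with proficiency theta answers the item correctly,
    given discrimination a and difficulty b on a common scale."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Once every item's parameters are estimated on the common scale, any subset of items forms a test whose scores can be expressed in terms of theta, so two tests built from different subsets are automatically on the same scale.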

A common calibration approach is to apply item response theory (IRT) methods to obtain individual proficiency values for the common domain being measured. The IRT procedure, used extensively in NAEP and many other large testing programs, depends on the ability to calibrate the individual items that make up a test, rather than the test itself. Each of a large number of items in a given domain is related, or calibrated, to a scale measuring that subject, using IRT methods. This method is applicable only when the items all assess the same material, and it requires that a large number of items be administered to a large and representative group of test takers, generally using some variant of the anchor test data collection design. After all the items are calibrated, a test can be formed from a subset of the items, and it will then equate automatically to another test formed from a selection of different items.

Projection

A special unidirectional form of linking can be used to predict or project scores on one test from scores on another test, without any expectation that the same things are being measured. The single-group data collection design is required, and statistical regression methods are used. It is important to note that projecting test A onto test B gives different results from projecting test B onto test A. Also, the distribution of scores projected from test A onto test B will have a smaller standard deviation than the actual scores on test B. For these reasons, projection is not used in the strict equating of test forms.

Moderation

Moderation is the weakest form of linking, used for tests with different specifications that are given to different, nonequivalent groups of examinees. Procedures that match distributions using scores are called statistical moderation links; procedures that match distributions using judgments are called social moderation links. Social moderation generally relies on information external to the testing situation.
In either case, the resulting links are valid only for making some very general comparisons (Mislevy, 1992; Linn, 1993).

Examples

Major tests, such as the Armed Services Vocational Aptitude Battery (ASVAB), the Scholastic Assessment Test (SAT), the American College Test (ACT), and the Law School Admission Test (LSAT), use the same blueprint for all forms of their tests. New forms are regularly equated with
past forms, so that the scores on any form mean the same as the scores on any other form in the series. Many statistical methods can be used for equating. Because of the routine nature of equating and its unambiguous meaning in the appropriate circumstances, this type of linking is not discussed further in this report.

There are several situations in which it is fairly routine for two tests to be linked and the results of the linkage to be used for well-defined purposes. One such situation arises when a new edition of a test is introduced into a product line: a test publisher will establish links between the new edition and the old one so that results obtained from the two tests can be compared. For example, CTB/McGraw-Hill linked the Comprehensive Test of Basic Skills with its newer TerraNova test; Harcourt Brace Educational Measurement linked the Stanford Achievement Test 8th Edition with the Stanford Achievement Test 9th Edition; and Riverside Publishing linked the Iowa Tests of Basic Skills M with earlier editions of the test. Sometimes the test specifications may have changed in response to shifts in educational emphases, and the old and new editions will not be as similar as two different forms of a test made to the same specifications; however, old and new editions generally can be calibrated successfully and put on the same scale.

Another routine use of linking occurs when states or schools change from one publisher's testing program to another. In these cases it is not uncommon for the publisher of the new test to conduct a study to link the two testing programs (Wendy Yen, personal communication). For example, in 1997 the state of Virginia switched from one commercial test, the ITBS, to another, the Stanford 9. To effect the switch smoothly, the scores on the two achievement tests were linked, using a same-group linking design.
The tests in some subjects were judged to have such different content that a link would not be meaningful, but in mathematics, reading, and language, the content was sufficiently similar that a link would be useful. For example, a correlation of .81 was found between the two tests of 8th-grade mathematics for a representative sample of 596 students. This correlation is high, but it still indicates some substantive differences between the two tests. The results of the linking were checked by comparing the observed performance on the 1996-1997 administration of the Stanford 9 with the performance that was predicted on the basis of scores from the 1995-1996 administration of the ITBS (Virginia Department of Education, 1997). The correspondence at the state level was high, in spite of substantial variation at the district level.
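The prediction check just described is an instance of projection. A minimal sketch using regression on synthetic scores (not the Virginia study's actual data or computation) also demonstrates the shrinkage noted earlier: the projected scores' standard deviation is only r times that of the actual scores on the target test.

```python
from statistics import mean, pstdev, pvariance

def project(scores_a, scores_b):
    """Least-squares regression of test B scores on test A scores;
    returns a function mapping a test A score to a predicted B score."""
    ma, mb = mean(scores_a), mean(scores_b)
    cov = mean((a - ma) * (b - mb) for a, b in zip(scores_a, scores_b))
    slope = cov / pvariance(scores_a)
    return lambda a: mb + slope * (a - ma)

# Synthetic scores for the same group of examinees on two tests
test_a = [1, 2, 3, 4, 5]
test_b = [2, 1, 4, 3, 5]
a_to_b = project(test_a, test_b)
projected = [a_to_b(x) for x in test_a]

# The projected distribution shrinks: its standard deviation is only
# r times that of the actual test B scores (r = 0.8 for these data)
shrinkage = pstdev(projected) / pstdev(test_b)
```

Note that `project(test_b, test_a)` would in general yield a different function, which is one reason projection is unsuitable for strict equating.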

In 1972 and 1973, at the request of Congress, the Anchor Test Study (Loret et al., 1972, 1973; Bianchini et al., 1974, 1975) was undertaken to link eight major commercial achievement tests. The purpose of the study was to measure student achievement gains regardless of the test that students took, in order to evaluate the impact of Title I of the Elementary and Secondary Education Act. Reading was chosen because of its centrality in Title I and because it was expected to permit linkages more easily than other subjects. The study provided nationally representative norms on these tests and also developed equivalence tables so that a student's standardized reading score on any one of the tests could be put on a single scale. Technically, all tests in the study were calibrated to the scale of the Metropolitan Achievement Test (MAT), using linear or equipercentile procedures. The results indicated that the tests could be linked sufficiently closely that they produced comparable scores on vocabulary and reading comprehension. Although the Anchor Test Study was obsolete by the time it was released, primarily because of changes in the tests, it remains a model of linkage development.

A proposal for a similar study was recently made for use in the assessment of California schools. The proposed study would have enabled a local school district to assess student achievement with any of a list of acceptable commercial tests; the tests would be linked to each other as were the tests in the Anchor Test Study. Results on the achievement tests, aggregated for the students in a school, would be used to rate the performance of each school. However, many experts said that satisfactory links could not be developed among the available commercial tests (Haertel, 1996; Yen, 1996), partly because today's achievement tests differ more in content and format than did the reading tests in the 1970s (see also Chapter 4).
Many feared that because schools could earn financial rewards for high scores, districts could manipulate the system by selecting the test that most closely conformed to their curriculum. Ultimately, the plan was scrapped; instead, a single commercial test was chosen for use in the entire state.

Another study linked the MAT to the Connecticut Mastery Test in mathematics (Behuniak and Tucker, 1992). The MAT, which offers separate versions for grades 4, 6, and 8, was chosen as the commercial achievement test series that most closely matched Connecticut's curriculum objectives. In each grade, a sample of students took both the appropriate MAT and the Connecticut tests. The link was apparently determined by statistical moderation and was evaluated by using an index
based on the correlation of the two tests, as proposed by Gulliksen (1950). The study reported that the two tests were as highly correlated as possible, given their respective reliabilities. This study is similar in many respects to the Anchor Test Study. Unfortunately, the authors did not report detailed analyses of the scores, such as subgroup analyses or other evaluations. Table 2-1 presents summaries of prior linkage studies.

Common Problems in Links

If two tests (A and B) measure different aspects of the performance of the examinees, either because they measure different domains or because they measure the same domain differently, then the examinees are likely to exhibit different patterns of proficiency on the tests. Thus, the scores on test A will not provide accurate and unbiased estimates of scores on test B. As noted in Chapter 1, the domains of reading and mathematics are large: two tests of 4th-grade reading may measure different arrays of skills and knowledge, and an individual student may perform very differently on them. Linking the scores between such tests would have little utility.

A difference in overall reliability is also troublesome in linking tests. It is desirable that tests being linked have errors of measurement of similar magnitude at equivalent score points. If two tests differ in reliability, then their scores should not be used interchangeably. Calibration or projection can be used to link unequally reliable tests for some purposes, but the linked scores must be interpreted with great care. Scores from the less reliable test will still have a large margin of error, even if reported on the scale of the more reliable test. Some test users may ignore the larger margin of error and misinterpret the scores. When linking forms of a single test, forms with different reliability cannot be equated. Strict equating requires that it be a matter of indifference to a test taker which of two equated test forms is used.
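Reliability enters such evaluations directly. One classical way to formalize "as highly correlated as possible, given their reliabilities"—the criterion used with the Gulliksen index above—is the correction for attenuation, which estimates how highly two tests would correlate if both were perfectly reliable. A sketch (the function name is ours, and we assume the standard textbook formula; whether the study used exactly this computation is our inference):

```python
import math

def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Estimate the correlation between two tests after correcting for
    their unreliability: r_xy / sqrt(rel_x * rel_y).  A value near 1.0
    means the tests correlate as highly as their reliabilities allow."""
    return r_xy / math.sqrt(rel_x * rel_y)
```

For example, an observed correlation of 0.72 between two tests each with reliability 0.80 yields a disattenuated value of 0.90—high, but short of the 1.0 that would suggest the two tests measure the same thing as well as their precision permits.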
The lore of equating holds that test takers who have a choice of two forms and who expect to do poorly should, if they are willing to take a chance, choose the less reliable form, for there is a greater chance of getting an unrealistically high score by chance on that form, because of its larger margin of error. Of course, there is also a greater chance of getting an unrealistically low score on the same form, but some may be willing to take that chance. No such gamesmanship is possible with equally reliable test forms. This colorful description of the dilemma is widely quoted (see, e.g., Lord and Novick, 1968; Peterson et al., 1993) and is intended to force consideration of the more likely situation, in which test takers have no choice and may have to take the less reliable test form.

Differences in item and response formats or in administration can also affect the validity of a linkage. In some cases, seemingly small changes in conditions can have large effects. For example, in 1984, several changes were made in NAEP's method of measuring reading, with important effects on the results: the 1986 results showed large losses in performance among 9- and 17-year-olds. A series of studies of the "NAEP reading anomaly" found a number of very small effects, none of which, by itself, could have caused the problem, but which together made a substantial change (Beaton and Zwick, 1990). Context effects may also arise, in which the difficulty or reliability of a test or block of items is affected by preceding tests or blocks of items (see, e.g., Williams et al., 1995). Some format changes, such as those between paper-and-pencil and computer-based testing, have so far shown little effect on what tests measure or the ways they can be linked (Mead and Drasgow, 1993). Other format changes, such as those between hands-on and computer-simulated performance tasks, show large differences (Shavelson et al., 1992). Since linkage problems may or may not arise when tests to be linked have different formats, prudence suggests that attention be given to format issues in linkage.

Differences in the context of test administration, which include the consequences associated with test outcomes (the stakes), are widely believed to affect the stability of test linkages over time. For example, in both Kentucky and North Carolina, where the relationship between the state assessment and NAEP has been examined, there is evidence that scores have improved somewhat more on the state assessments than on NAEP.
This effect suggests that as classroom activities become more aligned, over time, with the state's intended curriculum and with the constructs measured by the state's assessment, the state assessment will become, in effect, easier for each successive cohort of students. For NAEP, by contrast, there is no corresponding change in curriculum or instruction in the classroom (because it is a low-stakes test), and its difficulty remains essentially the same over time. The result is that a linkage between a state assessment and NAEP that is established at or near the introduction of sanctions or rewards associated with the state assessment will become out of date and will no longer accurately reflect equivalent performance on the two assessments. This effect

will result in inaccurate predictions of what NAEP performance would be, based on the results of the later state assessments.

Table 2-1 Abridged Summaries of Prior Linkage Research

The Anchor Test Study (Loret et al., 1972, 1973)
Purpose: To develop an equivalency scale to compare reading test results for Title I program evaluation. The study was sponsored by a $1,000,000 contract with the U.S. Office of Education.
Methodology: 200,000 students participated in the norming phase; 21 sample groups of approximately 5,000 students each participated in the equating phase. Eight tests, representing almost 90 percent of reading tests being administered in the states at that time, were selected for the study. Participants took two tests. The study created new national norms for one test and, through equating, for all eight tests. Different combinations of standardized reading tests were administered to different subjects, taking into account the need to balance demographic factors and instructional differences.
Key Findings: Tests with similar content can be linked together with reasonable accuracy. Relationships between tests were determined to be reasonably similar for male and female students but not for racial groups. The equivalency scale was accurate for individuals, but aggregated results (e.g., for a school or district) would have increased error stemming from combining results. Every time a new test is introduced, the procedure has to be replicated for that test. The stability of the linkage has to be reestablished regularly, because instruction on one test but not on others can invalidate the linkage.

Projecting to the NAEP Scale: Results from the North Carolina End-of-Grade Testing Program (Williams et al., 1995)
Purpose: To link a comprehensive state achievement test to the NAEP scale for mathematics so that the more frequently administered state tests could be used to monitor the progress of North Carolina students with respect to national achievement standards.
Methodology: A total of 2,824 students from 99 schools were tested using 78 items from a short form of the North Carolina End-of-Grade Test and two blocks of released 1992 NAEP items that were embedded in the test. Test booklets were spiraled so that some students took NAEP items first and others took North Carolina End-of-Grade Test items first. The final linkage to the NAEP scale used projection: scores from the NAEP blocks were determined from student responses using NAEP parameters but not the conditioning analysis used by NAEP, and regular scores from the North Carolina test were used.
Key Findings: A satisfactory linkage was obtained for statewide statistics as a whole, accurate enough to predict NAEP means or quartile distributions with only modest error. The linkages had to be adjusted separately, by 0.1 to 0.2 standard deviations, for different ethnic groups, demonstrating that the linking was inappropriate for predicting individual scores from the North Carolina test on the NAEP scale. The following were considered important factors in establishing a strong link: content on the North Carolina test was closely aligned with the state curriculum and NAEP's was not; student performance was affected by the order of the items in the test booklets; and motivation or fatigue affected performance for some students.

Linking Statewide Tests to NAEP (Ercikan, 1997)
Purpose: To examine the accuracy of linking statewide test results to NAEP by comparing the results of four states' assessment programs with the NAEP results for those states.
Methodology: Compared each state's assessment data to its NAEP data using equipercentile comparisons of score distributions. Since none of the four states used exactly the same form of the California Achievement Test for its state testing program, state results had to be converted to a common scale, developed by the publisher of the California Achievement Test series.
Key Findings: The link from separate tests to NAEP varies from one state to the next, with effect sizes ranging from 0.18 to 0.6 standard deviation. It was not possible to determine whether the state-to-state differences were due to the different tests, the moderate content alignment, the motivation of the students, or the nature of the student population. Linking state tests to NAEP (by matching distributions) is so imprecise that results should not be used for high-stakes purposes.

Toward World-Class Standards: A Research Study Linking International and National Assessments (Pashley and Phillips, 1993)
Purpose: To pilot test a method for obtaining accurate links between the International Assessment of Educational Progress (IAEP) and NAEP so that other countries can be compared with the United States, both nationally and at the state level, in terms of NAEP performance standards.
Methodology: A sample of 1,609 U.S. 8th-grade students was assessed with both IAEP and NAEP instruments in 1992 to establish a link between the assessments. Based on the test results from the sample, the relationships between IAEP and NAEP proficiency estimates were investigated. Projection methodology was used to estimate the percentages of students from the 20 countries assessed with the IAEP who could perform at or above the three performance levels established for NAEP; establishing this link required a sample of students who took both assessments. Various sources of statistical error were assessed.
Key Findings: Differences of proportions of students at basic or above, proficient or above, and advanced ranged from 0.01 to 0.03, corresponding to differences of 0.1 to 0.10 standard deviation on the NAEP scale. The methods researchers use to establish links between tests (at least partially) determine how valid the link is for drawing particular inferences about performance. It is possible to establish an accurate statistical link between the IAEP and NAEP, but policy makers, among others, should proceed with caution when interpreting results from such a link. IAEP and NAEP were fairly similar in construction and scoring, which made linking easier. The effects of unexplored sources of nonstatistical error, such as motivation levels, were not determined.

Using Performance Standards to Link Statewide Achievement Results to NAEP (Waltman, 1997)
Purpose: To investigate the comparability of performance standards obtained by using both statistical and social moderation to link NAEP standards to the Iowa Tests of Basic Skills (ITBS).
Methodology: Compared the 1992 NAEP Trial State Assessment with the ITBS for Iowa 4th-grade public school students, using two different types of linking for separate facets of the study. A socially moderated linkage was obtained by setting standards independently on the ITBS using the same achievement-level descriptions used to set the NAEP achievement levels. An equipercentile procedure was used to establish a statistically moderated link.
Key Findings: For students who took both assessments, the corresponding achievement regions on the NAEP and ITBS scales produced low to moderate percentages of agreement in student classification. Agreement was particularly low for students at the advanced level; two-thirds or more were classified differently. Cutscores on the ITBS scale established by moderation were lower than those used by NAEP, resulting in more students being classified as basic, proficient, or advanced on the ITBS than estimated by NAEP, possibly because of a mismatch between the content and skills standards of the ITBS and NAEP. The equipercentile linkage was reasonably invariant across types of communities, in terms of the percentages of students classified at each level. Regardless of the method used to establish the ITBS cutscores or the criteria used to classify students, the inconsistency of the student-level match limits even many inferences about group performance.

Study of the Linkages of 1996 NAEP and State Mathematics Assessments in Four States (McLaughlin, 1998)
Purpose: To address the need for clear, rigorous standards for linkage; to provide the foundation for developing practical guidelines for states to use in linking state assessments to NAEP; and to demonstrate that it is important for educational policy makers to be aware that linkages that support one use may not be valid for another.
Methodology: A sample of four states that had participated in the 1996 state NAEP mathematics assessment and whose state mathematics tests could potentially be linked to NAEP at the individual student level participated in the study. Participating states used different assessments in their state testing programs. There were eight linkage samples, ranging in size from 1,852 to 2,444 students. The study matched students who participated in the NAEP assessment in their states with their scores on the state assessment instrument, using projection with multilevel regression.
Key Findings: Linked scores had a 95 percent confidence interval of almost 2.0 standard deviations, which was not sufficiently accurate to permit reporting individual student proficiency on NAEP based on the state assessment score. Links differed noticeably by minority status and school district in all four states; students with the same state assessment score would be projected to have different standings on the NAEP proficiency scale, depending on their minority status and school district.

The Maryland School Performance Assessment Program: Performance Assessment with Psychometric Quality Suitable for High-Stakes Usage (Yen and Ferrara, 1997)
Purpose: To compare the Maryland School Performance Assessment Program (MSPAP) with the Comprehensive Test of Basic Skills (CTBS) in order to establish the validity of the state test in reference to national norms.
Methodology: Compared results from a group of 5th-grade students who took both MSPAP and CTBS; correlations were obtained. The intent was to establish the validity of MSPAP, so a link was not obtained.
Key Findings: Intercorrelations of the two tests indicated that the two measures were assessing somewhat different aspects of achievement.

Linking the National Assessment of Educational Progress and the Third International Mathematics and Science Study: Eighth Grade Results (National Center for Education Statistics, 1998)
Purpose: To provide useful information about the performance of states relative to other countries. The study broadly compares state 8th-grade mathematics and science performance for each of 44 states and jurisdictions participating in NAEP with the 41 nations that participated in the Third International Mathematics and Science Study (TIMSS), and it provides predicted TIMSS results for the 44 states and jurisdictions, based on their actual NAEP results.
Methodology: A statistically moderated link between NAEP and TIMSS was established by applying formal linear equating procedures, using reported results from the 1995 administration of TIMSS in the United States and the 1996 NAEP and matching characteristics of the score distributions for the two assessments. The linking functions were validated using data provided by states that participated in both state-level NAEP and state-level TIMSS but were not included in the development of the original linking function.
Key Findings: For one state (Minnesota), an excellent link was obtained for 8th-grade mathematics and science: the percentages of students actually scoring in the top 10 percent and in the top 50 percent internationally were within 2 to 5 percent of the results predicted by the NAEP-TIMSS link. The 4th-grade results and results in other states have yet to be released, so the evaluation of the NAEP-TIMSS link must be considered incomplete.

Evaluating Links

How can one evaluate the quality of a link? How small can the effects of any particular problem be and still invalidate the linkage? How large can the effects be and still permit a useful linkage?
The answers to these questions depend on how the linked scores will be used and what inferences will be drawn. The basic question is: If one administers test A and uses the results to infer what the results on test B would have been, will those inferred results produce the same interpretations as would have resulted if test B had been administered?
This section considers general empirical approaches to the evaluation of linkages. Specific evaluations should also examine the match in the contexts of the tests, and content experts should assess the degree to which the tests measure the same domain.

General Considerations

To evaluate a link, one should first clearly define the objectives and purposes that the link will serve. Will the link compare individual student performance, group performance, or both? How long will the link be in effect? Will subgroup differences be considered? These objectives and purposes, or stated uses, should guide the evaluation of the quality of the link, including estimating the size of possible statistical errors and assessing the link's robustness to changes over time.
How Accurate Must the Linkage Be?

Developers of a linkage should set targets for the level of accuracy required to support its stated uses. An example might be: "We want to compare mean student performance in two districts using different assessments. We would like the standard error of the district B mean projected onto the scale used by district A to be less than .05 standard deviations." Another example might be: "We want to know whether an individual student taking assessment B is achieving at or above the proficient level on NAEP, and we want this classification to be correct for at least 95 percent of the students who take assessment B."

Over What Subgroups Must the Linkage Be Stable?

In specifying intended uses, linkage developers should describe the different units (states, districts, schools, and individual students) that will be compared with each other or with fixed standards. It is important to understand the ways in which these units might differ and how those differences could affect the linkage between the two tests. For example, states might use different curricula, which could affect the linkage. If the students in one state are exposed to a curriculum that is more closely aligned with test A than with test B, they will probably score relatively better on test A than will students from another state whose curriculum does not favor test A. If the linkage from test A to test B is developed on students from the first state and then applied to students in the second state, there will be a constant bias: the test B performance of students from the second state will be underestimated on the basis of their test A performance. Such a bias could lead to significant error, even in estimating aggregate means. Because the bias is present for all students, the size of the error will not decrease as the sample size increases.
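The accuracy targets described under "How Accurate Must the Linkage Be?" can be made concrete with a small simulation. Everything below is invented for illustration (the score scale, the size of the linking error, the cut score); it is a sketch of how one might check such targets, not a procedure from this report.

```python
import numpy as np

# Illustrative check of two hypothetical accuracy targets for a link.
# All numbers (scale, error SD, cut score) are invented for this sketch.
rng = np.random.default_rng(0)
n = 2000

true_b = rng.normal(250.0, 35.0, n)          # "true" scores on test B's scale
linked_b = true_b + rng.normal(0.0, 8.0, n)  # scores projected from test A

# Target 1: standard error of the projected mean, in SD units of test B.
se_mean = linked_b.std(ddof=1) / np.sqrt(n)
se_in_sd_units = se_mean / linked_b.std(ddof=1)  # simplifies to 1/sqrt(n)

# Target 2: rate at which the linked score places a student on the same
# side of a proficiency cut as the actual test B score would.
cut = 280.0
agreement = np.mean((linked_b >= cut) == (true_b >= cut))

print(f"SE of projected mean: {se_in_sd_units:.3f} SD")
print(f"classification agreement: {agreement:.3f}")
```

With a sample of 2,000, the standard error of the mean is about 0.022 standard deviations, comfortably inside a .05 target; whether a 95 percent classification target is met depends heavily on how close the cut score sits to the bulk of the score distribution.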
When a linkage is to be used with scores for individual students, it is particularly difficult to identify all the relevant dimensions over which the linkage must be stable. Students differ in many ways (e.g., preparation, motivation, curricular exposure, and other background characteristics) that can differentially affect scores on the two tests being linked. The linkage must be stable across all of these characteristics for the linked individual results to be valid.

What Data Are Needed?

To assess the accuracy and stability of a linkage, and thus determine empirically whether a given linkage is adequate to support its intended uses, the first step is to gather evidence of
the relationship of the scores from the two tests. Such evidence provides an indication of the degree to which the two tests measure the same domain (or the correlation of the different domains measured by the two tests). The evidence will also provide an indication of the relative accuracy (reliability and measurement error) with which each test measures its underlying domain. In all cases, empirical data are required to establish a linkage, and the same data can provide much useful information on the accuracy and stability of the linkage. In most cases, additional data are required to determine stability over time or over variation in other factors on which the two tests differ (e.g., administration conditions; use; subgroup membership; and examinee motivation, differential preparation, or exposure to curriculum).

Sample Designs

Two Tests

As noted above, the most direct method for establishing and evaluating a linkage is the single-group design, in which both tests are administered to a common set of examinees. In designing a single-group data collection, the following design issues should be addressed:

- The same conditions of administration should be followed for each test insofar as possible. These include time of year, factors (such as use) that affect student motivation, test length, and breaks.
- The two samples should include sufficient numbers of examinees from the different groups across which the linkage is to apply.
- Adequate sample size should be obtained both overall and for each examinee group to be analyzed separately. Sample size requirements will depend on the accuracy targets of the linkage. The power to detect key differences and the standard error of relevant estimates should be determined in advance.

Once a linkage study has been designed and the data have been collected, several analyses should be performed:

- Examine the data for outliers that could distort results, using scatterplots or other means. Eliminate discrepant cases if necessary.
- Estimate score reliability for each test.
- Determine the correlation between the observed (estimated) scores for the two tests. Also examine the "disattenuated" correlation, that is, an estimate of the correlation of the underlying true scores, adjusted for the fact that the observed-score correlation is attenuated by the imperfect reliability of the two measures.
- Examine the linearity of the relationship between the two tests, either through examination of scatterplots or by fitting regression equations with nonlinear terms and testing the significance of the coefficients on the nonlinear terms.
- Establish the overall linkage. If the relationship is linear and the score distributions have approximately the same shape, linear linking can be used; otherwise, equipercentile linking is preferred. If the projection method is chosen, ordinary regression methods can be used.
- Estimate linkage error. Various estimation errors in statistical linking have been identified in previous linking studies. For example, Pashley and Phillips (1993) and Johnson and Siegendorf (National Center for Education Statistics, 1998) attempted to determine the size of potential statistical errors due to regression estimation, sample-to-population estimation, design effects, and measurement error. The types and sizes of linkage errors will depend on the design of the linking study.
- Estimate linkage stability across relevant groups. The methods used in establishing the overall link must be applied separately for each relevant group. The variation in the results for different groups at each score level can then be computed and averaged across score levels to estimate overall linkage stability. (In doing so, it may be useful to weight the different score levels in proportion to the number of examinees in the different groups at each level.)
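Two of the quantities in the list above, the disattenuated correlation and a linear linking function, can be sketched as follows. The scores and reliabilities are simulated; the formulas (the standard correction for attenuation and mean-and-SD matching) are conventional, but none of the numbers come from a real assessment.

```python
import numpy as np

# Simulated single-group data: every examinee has a score on both tests.
rng = np.random.default_rng(1)
n = 1000
a = rng.normal(50.0, 10.0, n)
b = 0.8 * a + rng.normal(0.0, 6.0, n)

# Observed-score correlation between the two tests.
r_ab = np.corrcoef(a, b)[0, 1]

# Disattenuated correlation: divide by the square root of the product of
# the score reliabilities (here simply assumed, not estimated from data).
rel_a, rel_b = 0.90, 0.85
r_true = r_ab / np.sqrt(rel_a * rel_b)

# Linear linking: map test A scores onto test B's scale by matching
# the means and standard deviations of the two score distributions.
def linear_link(x):
    return b.mean() + (b.std(ddof=1) / a.std(ddof=1)) * (x - a.mean())

linked = linear_link(a)
```

By construction, the linked scores have the same mean and standard deviation as the test B scores, and the disattenuated correlation is always at least as large as the observed one.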
If linear projection is used, standard statistical techniques such as analysis of covariance can be used to estimate the extent to which regression slopes and intercepts are constant across different groups of students within the linkage sample.

Anchor Test Designs

Sometimes it is not feasible to administer two tests in their entirety to a single group of students. Anchor test designs are commonly used in equating studies, but they cause significant difficulties in developing and evaluating linkages. It is not possible with this design, for example, to correlate the two tests directly. The best that can be done is to examine
the similarity of the correlations of each test with the anchor. Moreover, additional variation may be introduced in assessing linkage stability across relevant groups, because the groups could differ in significant ways on the anchor test even if they do not differ on the two tests being linked.

Additional Data Collections

It may or may not be possible to include sufficient numbers of examinees from all relevant groups in the initial linkage development data collection. Even when it is, one or more additional data collections will be required to establish the stability of a linkage over time. A number of more or less subtle changes, from changes in instruction to changes in cohort characteristics, may affect a linkage, leading to changes in the extent to which one test is relatively easier or harder than the other for specific groups of students. Linkages that look good at first often fail to hold up over even short periods of time, as shown in the North Carolina and Kentucky studies mentioned above (Williams et al., 1995; Koretz, 1998). If a linkage is to be used beyond the initial development sample, some effort should be made to assess its stability over time. The design issues and analysis procedures outlined above also apply to follow-up data collections.

One Final Caution

Errors in the formation of linkages between tests can remain hidden from immediate view unless serious efforts are made to ferret them out. Statistical procedures can be applied to data that look reasonable and that pass various checks on their quality, and errors can still remain hidden. Even if two tests appear to content experts to measure similar things and "pass" a careful statistical evaluation, it is important to explicitly examine the stability of a linkage across the important subgroups of test takers and to check its stability over time.
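The subgroup-stability check urged above can be sketched by fitting the A-to-B regression separately within each group and comparing slopes and intercepts, a simple stand-in for a full analysis of covariance. The groups and scores below are simulated, with a hypothetical curriculum effect modeled as an intercept shift in one group.

```python
import numpy as np

# Simulated two-group linkage data. Group 1's curriculum favors test A,
# modeled (hypothetically) as an intercept shift in the A-to-B relationship.
rng = np.random.default_rng(2)
n = 600
group = rng.integers(0, 2, n)  # two hypothetical states
a = rng.normal(50.0, 10.0, n)
b = 0.9 * a + 5.0 * group + rng.normal(0.0, 4.0, n)

def fit_by_group(a, b, group):
    """Least-squares slope and intercept of B on A, fit within each group."""
    fits = {}
    for g in np.unique(group):
        mask = group == g
        slope, intercept = np.polyfit(a[mask], b[mask], 1)
        fits[int(g)] = (slope, intercept)
    return fits

fits = fit_by_group(a, b, group)
# A linkage developed on group 0 and applied to group 1 would carry a
# constant bias roughly equal to the intercept difference.
bias = fits[1][1] - fits[0][1]
```

Here the slopes come out similar across groups while the intercepts differ, reflecting the planted curriculum effect; this is exactly the kind of constant bias, invisible in a pooled fit, that does not shrink as the sample size grows.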