Suggested Citation:"Findings." National Research Council. 1998. Equivalency and Linkage of Educational Tests: Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9525.

Findings

COMPARABILITY: CONTENT, FORMAT, AND RELATED FEATURES

The content of a test is shaped by the kinds of knowledge and skills addressed in its questions (“items”). The committee's review indicates that content is not generally comparable among various state assessments and commercial tests, even when they are testing the same subjects. Middle-school mathematics, for instance, covers several areas of knowledge, such as arithmetic, algebra, and geometry: the content of one state's 8th grade mathematics test might focus largely on multiplication, division, and other number operations skills, while another test may stress pattern recognition and other pre-algebra skills (Bond and Jaeger, 1993). In reading, one 4th grade test may emphasize vocabulary and basic comprehension, while another may give greater weight to critical evaluation of an author's themes (Afflerbach et al., 1995).

A related content issue pertains to the skills and cognitive processes required to answer items. Off-the-shelf commercial tests and tests that are custom developed for states are increasingly constructed as mixed-model assessments that contain different types of items, including multiple-choice items and various kinds of open-ended questions for which students construct their own responses by filling in a blank, solving a problem, writing a short answer, writing a longer response, or completing a graph or diagram (see, e.g., Shavelson, Baxter, and Pine, 1992). Colorado, Connecticut, North Carolina, and Maryland are examples of states with mixed-model assessments. Some item types are very useful for testing student recall of factual material (a claim often made for certain types of multiple-choice items); other item types are better suited to eliciting direct evidence of how well a student can solve problems.

The effect of format differences on linkages can be substantial. For example, the 1991 NAEP trial state assessment in mathematics contained both multiple-choice and short-answer formats. Linn, Shepard, and Hartka (1992) found that when the two formats were scored separately, there was enough difference between the scores to change the rank order of the states in the mathematics assessment. For items with constructed responses (that is, not multiple choice), variations in scoring may also influence the validity of linkages because different scoring guides may credit different aspects of performance, even when the items appear similar (Linn, 1993). Issues such as how the scorers are trained and which scoring guidelines they use can affect the objectivity and consistency of scoring (Frederickson and Collins, 1989). Some states, including Vermont and New Mexico, are trying out new assessment formats, such as systematically evaluating collections (“portfolios”) of a student's work, that raise even more complex issues about comparability and scoring (Valencia and Au, 1997; Webb, 1995); see Box 3 for a discussion of format issues.

Box 3

Format Differences in Maryland

The following description from Yen (1996:20) exemplifies the wide differences between two assessments of the same content area, administered to the same students in Maryland. They are the Maryland School Performance Assessment Program (MSPAP), a performance assessment used in high-stakes school evaluations, and the Comprehensive Test of Basic Skills, Fourth Edition, or CTBS/4, published by CTB/McMillan McGraw-Hill in 1989.

MSPAP and CTBS/4 differ in many ways. MSPAP is entirely performance based, and each student is given a limited number of reading selections or scenarios that require in-depth constructed and extended responses. In contrast, CTBS/4 samples a broader range of traditional objectives with a selected-response format and is a more indirect measure of student classroom performance. MSPAP is intended to “guide and goad” classroom instruction, while CTBS/4 is not intended as a model of instruction. MSPAP, which is targeted at raising student performance, contains many challenging items; CTBS/4 contains items that measure the full range of student performance. Each year three new forms of MSPAP are administered, with random assignment of forms to students; the same form of CTBS/4 is administered to all students every year. MSPAP results are used as part of a high-stakes program of evaluating schools; CTBS/4 results are part of the public reporting of school performance but are not included in the Maryland School Performance Index, which is used in school evaluations. Schools make individual decisions in terms of striking a balance between focusing on the material assessed with MSPAP and that assessed with CTBS/4.

In short, content, format, and related issues are vitally important in linking, and existing commercially developed achievement tests and state assessments differ substantially among themselves and from NAEP on these dimensions. The committee finds that the lack of strong comparability in these areas prevents the development of reliable and valid linkages. In addition, the committee finds that, in the cases that are germane to our concerns here, statistical linkages between tests with substantial differences in content and degrees of difficulty will not be accurate in the sense that they will not be consistent across subpopulations. This lack of consistency, a problem to which we return below, is directly due to the differences in content and test difficulty.

DIVERSITY AND MULTIPLICITY OF TESTING PROGRAMS

Educational testing in the United States is diverse, reflecting the nation's history of state and local control over education policy. The number and variety of existing state and commercial tests pose formidable barriers to developing a single linking scale. State and commercial tests vary not only in content and format, but also in their target ages or grades, sampling techniques, policies for testing students with disabilities or with limited English proficiency, alignment with state and local curricula, score reporting procedures, and other factors (Bond et al., 1997).

Although commercially developed subject-matter achievement tests, especially the most widely used tests in U.S. schools, appear on the surface to be more similar than many existing state assessments, they, too, have significant differences that reflect the publishers' efforts to capture specialized markets and meet state and local demands for tests with particular features (Yen, 1998). The substantial variation that exists among commercially produced tests challenges the notion of selecting tests “off-the-shelf” and linking them: Box 4 illustrates a recent attempt to link existing off-the-shelf tests and the difficulties that were encountered.

Box 4

California Comparability

In 1996 California lawmakers determined that they wanted achievement information for monitoring school effectiveness, but, in the interest of respecting local control of educational issues, they did not want to mandate that all school districts use the same test. They therefore passed a bill encouraging school districts to choose achievement tests from a reviewed and approved list and directed the California Department of Education to develop a comparability scale that would allow lawmakers to accurately compare results from different assessments (see Haertel, 1996; Wilson, 1996; Yen, 1996). Two methodologies were explored in some depth. The first proposal suggested the development of a short list of acceptable commercial tests, any of which could be selected and administered by a local school district. These few tests would be linked in a manner similar to the Anchor Test Study (Loret et al., 1972). The second proposal was to develop a core reference test that comprehensively reflected California curriculum and to use that reference test as an anchor to which all other tests could be linked. In the end, the project was deemed too complex because more than 40 tests were submitted for comparison, and it was determined that it was too difficult to develop satisfactory links that would be stable over time. California decided to scrap the linkage proposal and selected one test to be administered in all of the state's schools.

The complexity of linking even a small subset of existing tests could quickly render the task infeasible. For example, if the goal is to link just 15 different state assessments, it would be necessary to construct comparisons of more than 100 potential pairs of tests; each pair would require data collection, statistical analyses, and empirical validation (see also Los Angeles County Office of Education, 1997). An additional complicating factor could be the changing relationships between some of the tests. Frequent changes could necessitate continual updates to the development and validation of the equivalency scale (Loret et al., 1972; Linn, 1975; Wilson, 1996).

One might argue that pairwise comparisons are not necessary if all tests can be linked to a common scale, such as NAEP. Linking to NAEP simplifies the task in one respect by reducing the number of linkages that would have to be constructed. However, the design and purpose of NAEP complicate the task of linkage in another respect and cast doubt on the validity of inferences that could be drawn from the link (see, e.g., McLaughlin, 1998). This is true because NAEP, by intent, does not produce scores for individuals and because individual students complete different parts of an entire NAEP assessment.
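The counting argument in the two preceding paragraphs can be sketched in a few lines. This is only an illustration of the arithmetic: the function names are hypothetical, and the count says nothing about the cost or validity of any individual link.

```python
from math import comb


def pairwise_links(n_tests: int) -> int:
    """Distinct pairs of tests when every test is linked directly to every other."""
    return comb(n_tests, 2)


def hub_links(n_tests: int) -> int:
    """Links needed when each test is linked once to a single common scale."""
    return n_tests


# 15 is the number of state assessments used in the example in the text;
# 40 is the approximate number of tests submitted in California (Box 4).
for n in (5, 15, 40):
    print(n, pairwise_links(n), hub_links(n))
```

For 15 tests the pairwise count is 105, which matches the "more than 100" figure cited above, while a common scale would require 15 links. The pairwise count grows roughly with the square of the number of tests; the common-scale count grows only linearly.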

STABILITY OF RESULTS

The testing landscape in the United States is not only diverse but also dynamic: states and districts have moved rapidly, especially during the last 10 years, to adopt new educational goals, new models of testing and assessment, and new strategies for aligning tests and assessments to state content standards (National Research Council, 1997).

Moreover, although there is some similarity and stability among the largest commercial testing programs, states that use commercial programs use them in very different ways. Many states have changed the design of their statewide assessments several times in the last decade and are continuing to do so. For example, some states are developing hybrids of commercial and state-developed tests or customizing available off-the-shelf tests. Other states do not use commercial tests as part of their statewide assessment system (Roeber, Bond, and Braskamp, 1997). The diversity of the testing programs currently in the nation's schools is depicted in Table 1. (The committee realizes that information such as that tabulated here changes frequently, and may be summarized differently in different surveys. However, the main point is that the states' testing programs are extremely diverse in content, difficulty, or format (Jaeger, 1996).)

This continual change in educational goals and in the content of tests and assessments, which many people believe reflects a healthy dynamism in American education, makes linkage a moving target. Prior research has consistently shown that even if linkages between tests can be made at one time, they are difficult to maintain (Linn and Kiplinger, 1995). For example, suppose a link could be generated between a test in state A and another test in state B. Conducting the necessary analyses to establish the link takes time. It is quite possible that once the linkage methods are ready to be applied, one or both states will have changed their test format, content, or target group.

While NAEP does not change as frequently or as dramatically as state and commercial assessments, it, too, is not static. The content and nature of the NAEP instruments evolve gradually to reflect changing educational and assessment practices (National Research Council, 1996). These modifications in NAEP make it complicated to maintain stable linkages with state and commercial assessments, which are themselves evolving, and would reduce the validity of inferences from the linkages.

TEST USES AND EFFECTS ON TEACHER AND STUDENT BEHAVIOR

Many states use assessments for multiple purposes related to educational improvement, such as program evaluation, curriculum planning, school performance reporting, and student diagnosis (U.S. Congress, 1992). More and more states are using (or are contemplating using) their assessment programs to make “high-stakes” decisions about people and programs, such as promoting students to the next grade, determining whether students will graduate from high school, grouping students for instructional purposes, making decisions about teacher tenure or bonuses, allocating resources to schools, or imposing sanctions on schools and districts (see, e.g., McLaughlin et al., 1995; McDonnell, 1997). Table 2 shows many of the varied uses of tests in our nation's schools today. (A companion report on appropriate test use will be issued by the National Research Council's Committee on the Fair and Appropriate Use of Educational Tests later this year.)

An important factor in testing goes under the heading of “stakes.” When students are tested, various parties can have different concerns with, or stakes in, the outcomes. For example, a national survey of achievement, like NAEP, is a very low-stakes test for many of the parties concerned: the test takers, their parents, their teachers, and the district administrators. If NAEP is high-stakes for anyone, it is for policy makers who want to use NAEP data to assess the effectiveness of various educational reforms as they vary across the states or regions of the country. In contrast, tests that are used for high-school graduation or college admission are high-stakes for the students taking them and for their parents. Tests that are high-stakes for the test takers affect their motivation to do their best. Tests that are high-stakes for district administrators can result in various kinds of efforts to assist students in performing better than they would had the same test been of low stakes to those administrators. These examples do not exhaust the possibilities of the effect of “stakes” on test results. When tests carrying different stakes for different parties are linked, one expects different linking functions to result than would be found if the stakes were similar.

Although forms of test-based educational accountability vary across states and districts, changes in how tests are used inevitably lead to changes in how teachers and students react to them (Koretz, 1998).

TABLE 1 State Testing: A Snapshot of Diversity

State	Use of Commercial Tests	Use of Other Assessments
Alabama	Stanford Achievement Test 9, Otis-Lennon School Ability Test	Alabama Kindergarten Assessment, Alabama Direct Assessment of Writing, Differential Aptitude Test, Basic Competency Test, Career Interest Inventory, End-of-Course Algebra and Geometry Test, Alabama High School Basic Skills Exit Exam
Alaska	California Achievement Test 5
Arizona	Stanford Achievement Test 9
Arkansas	Stanford Achievement Test 9	High School Proficiency Test
California	Stanford Achievement Test 9	Golden State Examinations
Colorado	Custom developed	CTB item banks, NAEP items and state items
Connecticut	Custom developed	Connecticut Mastery Test, Connecticut Academic Performance Test
Delaware	Custom developed	State-developed writing assessment
Florida	Custom developed	High School Competency Test, Florida Writing Assessment Program
Georgia	Iowa Test of Basic Skills, Test of Achievement Proficiency	Curriculum-based Assessments, Georgia High School Graduation Tests, Georgia Kindergarten Assessment Program, Writing Assessment
Hawaii	Stanford Achievement Test 8	Hawaii State Test of Essential Competencies, Credit by Examination
Idaho	Iowa Test of Basic Skills Form K, Test of Achievement Proficiency	Direct Writing Assessment, Direct Mathematics Assessment
Illinois	Custom developed	Illinois Goals Assessment Program
Indiana	Custom developed	Indiana Statewide Testing for Educational Progress Plus
Iowa	No mandated statewide testing program; approximately 99 percent of all districts participate in the Iowa Test of Basic Skills on a voluntary basis
Kansas	Custom developed	Kansas Assessment Program (Kansas University Center for Educational Testing and Evaluation)
Kentucky	Custom developed	Kentucky Instructional Results Information System
Louisiana	California Achievement Test 5	Louisiana Educational Assessment Program
Maine	Custom developed	Maine Educational Assessment (Advanced Systems in Measurement Inc.)
Maryland	Custom developed, Comprehensive Test of Basic Skills 5	Maryland Student Performance Assessment Program, Maryland Functional Tests, Maryland Writing Test
Massachusetts	Iowa Test of Basic Skills, Iowa Test of Educational Development
Michigan	Custom developed	Michigan Educational Assessment Program: Criterion-referenced tests of 4th-, 7th-, and 11th-grade students in mathematics and reading and 5th-, 8th-, and 11th-grade students in science and writing. Michigan High School Proficiency Test
Minnesota	Custom developed	96-97 students took minimum competency literacy tests in reading and mathematics
Mississippi	Iowa Test of Basic Skills, Test of Achievement Proficiency	Functional Literacy Examination, Subject Area Testing Program
Missouri	Custom developed, TerraNova	Missouri Mastery and Achievement Test
Montana	Stanford Achievement Test, Iowa Test of Basic Skills, Comprehensive Test of Basic Skills
Nebraska	No statewide assessment program in 96-97
Nevada	TerraNova	Grade 8 Writing Proficiency Exam, Grade 11 Proficiency Exam
New Hampshire	Custom developed	New Hampshire Education Improvement and Assessment Program (Advanced Systems in Measurement and Evaluation, Inc.)
New Jersey	Custom developed	Grade 11 High School Proficiency Test, Grade 8 Early Warning Test
New Mexico	Iowa Test of Basic Skills, Form K	New Mexico High School Competency Exam, Portfolio Writing Assessment, Reading Assessment for Grades 1 and 2
New York	Custom developed	Occupational Education Proficiency Examinations, Preliminary Competency Tests, Program Evaluation Tests, Pupil Evaluation Program Tests, Regents Competency Tests, Regents Examination Program, Second Language Proficiency Examinations
North Carolina	Iowa Test of Basic Skills	North Carolina End of Grade
North Dakota	Comprehensive Test of Basic Skills/4, TCS
Ohio	Custom developed	Fourth-, Sixth-, Ninth-, and Twelfth-Grade Proficiency Tests
Oklahoma	Iowa Test of Basic Skills	Oklahoma Core Curriculum Tests
Oregon	Custom developed	Reading, Writing, and Mathematics Assessment
Pennsylvania	Custom developed	Writing, Reading, and Mathematics Assessment
Rhode Island	Metropolitan Achievement Test 7, Custom developed	Health Performance Assessment, Mathematics Performance Assessment, Writing Performance Assessment
South Carolina	Metropolitan Achievement Test 7, Custom developed	Basic Skills Assessment Program
South Dakota	Stanford Achievement Test 9, Metropolitan Achievement Test 7
Tennessee	Custom developed	Tennessee Comprehensive Assessment Program (TCAP) Achievement Test Grades 2-8, TCAP Competency Graduation Test, TCAP Writing Assessment Grades 4, 8, and 11
Texas	Custom developed	Texas Assessment of Academic Skills, Texas End-of-Course Test
Utah	Stanford Achievement Test 9, Custom developed	Core Curriculum Assessment Program
Vermont	Has a voluntary state assessment program	New Standards reference exams in math, Portfolio assessment in math and writing
Virginia	Customized off the shelf	Literacy Passport Test, Degrees of Reading Power, Standards of Learning Assessments, Virginia State Assessment Program
Washington	Comprehensive Test of Basic Skills 4, Curriculum Frameworks Assessment System
West Virginia	Comprehensive Test of Basic Skills	Writing Assessment, Metropolitan Readiness Test
Wisconsin	TerraNova, Custom developed	Knowledge and Concepts Tests, Wisconsin Reading Comprehension Test at Grade 3
Wyoming	State assessment program in vocational education only for students grades 9-12
NOTES: Custom developed assessments result from a joint venture between a state and a commercial test publisher to design a test to the state's specification, perhaps to more closely match the state's curriculum than an off-the-shelf test does. Customized off-the-shelf assessments result from modifications to a commercial test publisher's existing product. SOURCE: Data from 1997 Council of Chief State School Officers Fall State Student Assessment Program Survey.

Indeed, one of the underlying rationales for test-based accountability is to spur changes in teaching and learning. These uses are hotly debated and beyond the scope of this report (Jones, 1997). For our purposes, it is sufficient to note that the difficulty of maintaining linkages between tests is exacerbated when test results have significant consequences for individuals or schools.

In these situations, teachers may change what and how they teach to help students respond to the content and problems on the test (Shepard, 1991), schools and districts may align curriculum more closely with test content, and test takers may have stronger motivation to do well (e.g., Koretz et al., 1991). Performance gains on tests used for accountability (high-stakes tests) will often not be reflected in scores on tests used for monitoring or other non-accountability (low-stakes) purposes. The resulting differences in student performance could alter the relationship between linked tests (Shepard et al., 1996; Yen, 1996). Hence, any valid linkages created initially would have to be reestablished regularly, which would raise important questions about any hoped-for cost-effectiveness advantages of linkage.

The effects of test use on student and teacher behavior pose a special problem for linkage with NAEP. To protect its historical purpose as a monitor of educational progress, NAEP was designed expressly with safeguards to prevent it from becoming a high-stakes test. As a result, the motivation level of students who participate in NAEP may be low (O'Neil et al., 1992; Kiplinger and Linn, 1996), and they may not always exhibit peak performance. Linkages between a low-stakes instrument like NAEP and high-stakes state or commercial tests are likely to be misleading because students are likely to put forth more effort for the latter kinds of tests than for the former.

POPULATION OR SUBGROUP DIFFERENCES

When the function that links Test A with Test B differs for different groups, for example, boys and girls, it does not indicate that one group is “better” than the other. Rather, it means that a boy and a girl with the same score on Test A would be expected to have different scores on Test B, and that this

TABLE 2 Student Testing: Diversity of Purpose

State	Decisions for Students	Decisions for Schools	Instructional Purposes
Alabama	High school graduation	School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Alaska		School performance reporting	Improve instruction
Arizona		School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Arkansas		School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
California	Student diagnosis or placement		Student diagnosis or placement
Colorado^a
Connecticut	Student diagnosis or placement	Awards or recognition; School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Delaware			Student diagnosis or placement; Improve instruction; Program evaluation
Florida	High school graduation		Improve instruction; Program evaluation
Georgia	High school graduation	School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Hawaii	High school graduation	Awards or recognition; School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Idaho		School performance reporting	Improve instruction
Iowa^a
Illinois		Accreditation
Indiana		Awards or recognition; School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Kansas		School performance reporting; Accreditation	Student diagnosis or placement; Improve instruction; Program evaluation
Kentucky		Awards or recognition	Improve instruction; Program evaluation
Louisiana	Student Promotion; High school graduation	Awards or recognition; School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Maine	Student diagnosis or placement		Improve instruction; Program evaluation
Maryland	High school graduation	School performance reporting; Skills guarantee; Accreditation	Student diagnosis or placement; Improve instruction; Program evaluation
Massachusetts		School performance reporting	Improve instruction
Michigan	Student diagnosis or placement; Endorsed diploma	Awards or recognition; School performance reporting; Accreditation	Improve instruction; Program evaluation
Minnesota^a
Mississippi	High school graduation	School performance reporting; Skills guarantee; Accreditation	Student diagnosis or placement; Improve instruction; Program evaluation
Missouri		School performance reporting; Accreditation	Improve instruction; Program evaluation
Montana			Improve instruction; Program evaluation
Nebraska^a
Nevada	High school graduation	School performance reporting; Accreditation	Improve instruction; Program evaluation
New Hampshire			Improve instruction; Program evaluation
New Jersey	High school graduation	School performance reporting; Accreditation	Student diagnosis or placement; Improve instruction
New Mexico	High school graduation	School performance reporting; Accreditation	Improve instruction; Program evaluation
New York	Student diagnosis or placement; Student Promotion; Honors diploma; Endorsed diploma; High school graduation	School performance reporting	Improve instruction; Program evaluation
North Carolina	Student diagnosis or placement; Student Promotion; High school graduation		Improve instruction; Program evaluation
North Dakota	Student diagnosis or placement		Student diagnosis or placement; Improve instruction; Program evaluation
Ohio	High school graduation	Awards or recognition; School performance reporting	Improve instruction; Program evaluation
Oklahoma		School performance reporting; Accreditation	Student diagnosis or placement; Improve instruction; Program evaluation
Oregon		School performance reporting	Improve instruction; Program evaluation
Pennsylvania		School performance reporting	Student diagnosis or placement; Program evaluation
Rhode Island		School performance reporting	Improve instruction; Program evaluation
South Carolina	Student promotion; High school graduation	Awards or recognition; School performance reporting; Skills guarantee	Student diagnosis or placement; Improve instruction; Program evaluation
South Dakota			Improve instruction; Program evaluation
Tennessee	Endorsed diploma; High school graduation		Student diagnosis or placement; Improve instruction; Program evaluation
Texas	Student diagnosis or placement; High school graduation		Student diagnosis or placement; Improve instruction; Program evaluation
Utah	Student diagnosis or placement	School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Vermont		School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Virginia	Student diagnosis or placement; Student promotion; High school graduation	School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
Washington		School performance reporting	Student diagnosis or placement; Improve instruction; Program evaluation
West Virginia		Skills guarantee; Accreditation	Improve instruction
Wisconsin		School performance reporting	Program evaluation
Wyoming			Improve instruction; Program evaluation
^a Colorado, Minnesota, and Nebraska did not administer any statewide assessments in 1995-96. Iowa does not administer a statewide assessment. SOURCE: Data from 1996 Council of Chief State School Officers Fall State Student Assessment Program Survey.

effect is consistent for members of the two groups. Researchers generally suppose that group differences occur because of differing test content or format, different motivation levels, or differences in prior exposure to relevant learning opportunities. Perhaps the material in Test A is more familiar to one group than the other, while the material in Test B is equally familiar to both groups. Alternatively, one group might be more motivated than the other to perform well on Test A, while both groups are equally motivated on Test B.

Simply put, it is often the case that the relative differences among the test performances of different groups of students will vary from test to test, depending on a host of factors that are subtle but important. For example, on mathematics tests boys may do better on word problems while girls may do better solving equations. When this is true, overall estimates of gender differences in 8th grade mathematics performance will depend on the relative emphasis a test gives to these two areas. Unless the two tests are very closely aligned in content, linking them might require separate formulas for boys and girls because a single linking formula would underestimate performance for one group and overestimate it for the other. Another example is that student achievement on two tests with differing emphases on algebra could vary widely across the states as a function of when and to what extent algebra is introduced into the middle school curriculum. As a result, students from different states who obtain the same score on one test (e.g., a commercial test) might have different estimated (linked) scores on a second test, such as NAEP (e.g., McLaughlin, 1998). These problems are attributable in part to the tests themselves, but linkage magnifies them and increases the risk of unfair inferences about individual achievement.
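The point about separate formulas can be made concrete with a small sketch. It assumes a simple linear linking that matches means and standard deviations, which is only one of several possible linking methods and is not presented here as the committee's procedure. All scores are invented for illustration, and the group labels and function name are hypothetical.

```python
from statistics import mean, pstdev


def linear_link(a_scores, b_scores):
    """Return a function that maps Test A scores onto the Test B scale
    by matching the means and standard deviations of the two score sets."""
    mean_a, mean_b = mean(a_scores), mean(b_scores)
    sd_a, sd_b = pstdev(a_scores), pstdev(b_scores)
    return lambda a: mean_b + (sd_b / sd_a) * (a - mean_a)


# Hypothetical data: both groups have the same Test A scores, but
# group 1 scores 5 points higher on Test B and group 2 scores 5 points lower.
group1_a = [40, 50, 60, 70, 80]
group1_b = [45, 55, 65, 75, 85]
group2_a = [40, 50, 60, 70, 80]
group2_b = [35, 45, 55, 65, 75]

link_g1 = linear_link(group1_a, group1_b)
link_g2 = linear_link(group2_a, group2_b)
link_pooled = linear_link(group1_a + group2_a, group1_b + group2_b)

# Linked Test B scores for a Test A score of 60 under each formula.
print(link_g1(60), link_g2(60), link_pooled(60))
```

With these invented numbers, a Test A score of 60 links to 65 for group 1 and to 55 for group 2, while the single pooled formula gives 60 for everyone. The pooled link therefore understates expected Test B performance for one group and overstates it for the other, which is the inconsistency described above.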

REPORTING RESULTS IN TERMS OF THE NAEP ACHIEVEMENT LEVELS

Linking other tests to NAEP raises the possibility of reporting individual student scores on state and commercial tests in terms of the NAEP achievement levels. The committee explored this issue and finds that such links would raise new and significant methodological problems (see Wu, Royal, and McLaughlin, 1997).

To understand them, one must recognize that all test scores carry with them some amount of uncertainty or “noise,” an issue usually treated in the testing literature under the heading “reliability” (see, e.g., Feldt and Brennan, 1989; American Educational Research Association, American Psychological Association, 1985). Because scores are less than 100 percent reliable, it is never possible to assign students to achievement levels with complete certainty (Johnson and Mazzeo, 1998). The two key issues are the likelihood that students will be misclassified and the degree of error in the classification. Clearly, the more reliable the test, the less ambiguity there will be in the assignment of students to categories of performance. Unfortunately, the empirical evidence that the committee has reviewed to date suggests that transforming performances on selected existing assessments to the NAEP achievement levels produces results with substantial practical ambiguity.

For example, consider a 4th grade student with a reasonably good score on a state or commercial reading test. Transforming this child's score into the NAEP achievement levels could easily produce the following type of report: “Sally scored [x] on the [State Reading Assessment]. Of 100 students with the same score, 10 are likely to be in the ‘below basic’ category; 60 are likely to be ‘basic’; 38 are likely to be ‘proficient’; and 2 are likely to be in the highest, or ‘advanced,’ category.” Alternatively, the report could be issued in terms of Sally's probabilities of falling in the various categories (Johnson and Mazzeo, 1998). This ambiguity will be due to measurement error in the student's score on the state assessment; to measurement error in NAEP; to the less than perfect correlation between the state assessment scores and NAEP scores; to potential differences in linking functions by different subgroups; and to other unidentified sources of measurement error.
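A report of this kind can be sketched computationally. The example below is only an illustration under strong assumptions that are not taken from the report: all sources of uncertainty are lumped into a single normally distributed error around the linked score, and the cut scores, linked score, and error size are hypothetical values chosen for the example, not actual NAEP figures.

```python
from math import erf, inf, sqrt

LEVELS = ("below basic", "basic", "proficient", "advanced")


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    if x == inf:
        return 1.0
    if x == -inf:
        return 0.0
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def level_probabilities(linked_score, error_sd, cuts):
    """Probability of each achievement level for one linked score.

    cuts holds the three lower bounds of basic, proficient, and advanced.
    error_sd is an assumed combined standard deviation of all error sources.
    """
    bounds = (-inf, *cuts, inf)
    probabilities = {}
    for level, low, high in zip(LEVELS, bounds[:-1], bounds[1:]):
        probabilities[level] = (
            normal_cdf(high, linked_score, error_sd)
            - normal_cdf(low, linked_score, error_sd)
        )
    return probabilities


# Hypothetical cut scores and a hypothetical linked score for one student.
HYPOTHETICAL_CUTS = (210.0, 240.0, 270.0)
report = level_probabilities(linked_score=232.0, error_sd=20.0,
                             cuts=HYPOTHETICAL_CUTS)

for level in LEVELS:
    print(f"{level}: about {round(100 * report[level])} of 100 students")
```

Under these assumptions, shrinking `error_sd` concentrates nearly all of the probability in a single level, which is the sense in which a more reliable test and a tighter link leave less ambiguity in the classification.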

The committee has not been able to conduct a thorough study of parental and public reaction to this kind of scenario, but we caution that one of the more important putative purposes of linkage—providing clear and relevant information about the performance of individual students—might be severely undermined by the need to report information which, in order to be faithful to the underlying statistics, must be ambiguous in its meaning.

Table 3 presents a summary of some major studies of linkage between different tests and assessments. Some of the studies describe methods for linking to NAEP and the implications for such a link.

TABLE 3 Abridged Summaries of Prior Linkage Research

Columns: Study; Purpose; Methodology; Key Findings

The Anchor Test Study (Loret et al., 1972, 1973)

Purpose: To develop an equivalency scale to compare reading test results for Title I program evaluation.

The study was sponsored by a $1,000,000 contract with the U.S. Office of Education.

Methodology:

Number of participants: 200,000 students for the norming phase; 21 sample groups of approximately 5,000 students each for the equating phase.

Eight tests, representing almost 90% of reading tests being administered in the states at that time, were selected for the study.

Participants took two tests.

Created new national norms for one test and, through equating, for all eight tests.

Administered different combinations of standardized reading tests to different subjects, taking into account the need to balance demographic factors and instructional differences.

Key Findings:

Tests with similar content can be linked together with reasonable accuracy.

Relationships between tests were determined to be reasonably similar for male and female students but not for racial groups.

The equivalency scale was accurate for individuals, but aggregated results (e.g., school or district) would have increased error stemming from combining results.

Every time a new test is introduced, the procedure has to be replicated for that test.

The stability of the linkage has to be reestablished regularly because instruction on one test but not on others can invalidate the linkage.

Projecting to the NAEP Scale: Results from the North Carolina End-of-Grade Testing Program (Williams, Valerie et al., 1995)

Purpose: To link a comprehensive state achievement test to the NAEP scale for mathematics so that the more frequently administered state tests could be used for purposes of monitoring progress of North Carolina students with respect to national achievement standards.

Methodology:

A total of 2,824 students from 99 schools was tested using 78 items from a short form of the North Carolina End-of-Grade Test and two blocks of released 1992 NAEP items that were embedded in the test.

Test booklets were spiraled so that some students took NAEP items first and others took North Carolina End-of-Grade items first.

The final linkage to the NAEP scale used projection. Scores from the NAEP blocks were determined from student responses using NAEP parameters but not the conditioning analysis used by NAEP. Regular scores from the North Carolina test were used.

Key Findings:

A satisfactory linking was obtained for statewide statistics as a whole that were accurate enough to predict NAEP means or quartile distributions with only modest error.

The linkages had to be adjusted separately for different ethnic groups, demonstrating that the linking was inappropriate for predicting individual scores from the North Carolina Test to the NAEP scale.

The following were considered important factors in establishing a strong link: content on the North Carolina Test was closely aligned with state curriculum and NAEP's was not; student performance was affected by the order of the items in their test booklets; motivation or fatigue affected performance for some students.

Linking Statewide Tests to NAEP (Ercikan, 1997)

Purpose: To examine the accuracy of linking statewide test results to NAEP by comparing the results of four states' assessment programs with the NAEP results for those states.

Methodology:

Compared each state's assessment data to its NAEP data using equipercentile comparisons of score distributions. Since none of the four states used exactly the same form of the California Achievement Test for their state testing program, state results had to be converted to a common scale. This scale was developed by the publisher of the California Achievement Test series.

Key Findings:

The link from separate tests to NAEP varies from one state to the next.

It was not possible to determine whether the state-to-state differences were due to the different test(s), the moderate content alignment, the motivation of the students, or the nature of the student population.

Linking state tests to NAEP (by matching distributions) is so imprecise that results should not be used for high-stakes purposes.

Toward World-Class Standards: A Research Study Linking International and National Assessments (Pashley and Phillips, 1993)

Purpose: To pilot test a method for obtaining accurate links between the International Assessment of Educational Progress (IAEP) and NAEP so that other countries can be compared with the United States, both nationally and at the state level, in terms of NAEP performance standards.

Methodology:

A sample of 1,609 U.S. grade eight students was assessed with both IAEP and NAEP instruments in 1992 to establish a link between these assessments.

Based on test results from the sample testing, the relationships between IAEP and NAEP proficiency estimates were investigated.

Projection methodology was used to estimate the percentages of students from the 20 countries, assessed with the IAEP, who could perform at or above the three performance levels established for NAEP.

Various sources of statistical error were assessed.

Key Findings:

The methods researchers use to establish links between tests determine, at least partially, how valid the link is for drawing particular inferences about performance.

Establishing this link required a linking sample of students who took both assessments.

While it is possible to establish an accurate statistical link between the IAEP and NAEP assessments, policymakers, among others, should proceed with caution when interpreting results from such a link.

The fact that the IAEP and NAEP were fairly similar in construction and scoring made the linking easier.

The effects of unexplored sources of nonstatistical error, such as motivation levels, on the results were not determined.

Comparing the NAEP Trial State Assessment (TSA) Results with the IAEP International Results (Beaton & Gonzalez, 1993)

Purpose: To determine how American students compare to foreign students in mathematics, and how well foreign students meet the NAGB mathematics standards.

Methodology:

At that time data were not available for examinees who took both assessments, so the researchers relied on a simple distribution-matching procedure.

Rescaled scores to produce a common mean and standard deviation on the two tests.

Translated IAEP scores into NAEP scores by aligning the means and standard deviations for the two tests.

Transformed the IAEP scores for students in the IAEP samples in each participating country into equivalent NAEP scores.

Key Findings:

Moderation procedures are sensitive to age/grade differences.

The IAEP and NAEP have many similarities but are not identical and differ in some significant ways.

Results of the linking were different for countries with high average IAEP scores.

Different methods of linking the IAEP and NAEP can produce different results, and further study is necessary to determine which method is best.

Linking to a Large-Scale Assessment: An Empirical Evaluation (Bloxom, Pashley, Nicewander and Yan, 1995)

Purpose: To compare the mathematics achievement of new military recruits with that of the general United States student population, using a link between the Armed Services Vocational Aptitude Battery (ASVAB) and NAEP. The emphasis of the study was to provide and illustrate an approach for empirically evaluating the statistical accuracy of such a linkage.

Methodology:

A sample of 8,239 applicants for military service was administered an operational ASVAB and a NAEP survey in 1992. These applicants were told that there were no stakes attached to the NAEP survey.

ASVAB scores were projected onto the NAEP scale in mathematics to allow comparison of the achievement of military applicants with that of the general U.S. population of 12th grade students.

Statistical checks were made by constructing the link separately for low-scoring candidates and for high-scoring candidates.

Key Findings:

Statistically, an accurate distribution of recruit achievement can be found by projecting onto the NAEP scale.

Factors related to motivation may have led to underestimation of the assessment-based proficiency distribution of recruits in this study, meaning that in spite of the statistical precision of the linkage, the resulting estimates may not be valid for practical purposes.

The Potential of Criterion-Referenced Tests with Projected Norms (Behuniak and Tucker, 1992)

Purpose: To determine if norm-referenced scores could be provided, for the purpose of Chapter 1 program evaluation, by linking the Connecticut Mastery Test, a criterion-referenced test closely aligned with state curriculum, and a national “off-the-shelf” norm-referenced achievement test. The purpose of the linking was to meet Federal guidelines for Chapter 1 reporting without requiring students to take two tests.

Methodology:

Compared two tests, the Metropolitan Achievement Test 6 (MAT 6) and the Stanford Achievement Test 7 (SAT 7), to determine which was more closely aligned with state content standards. Selected the MAT 6 for the study.

For a relevant population, calibrated the items from the two instruments in a given subject as a single IRT calibration, then used the results to calibrate the tests.

Linked results using equipercentile equating.

Examined changes over two years to check the stability of the link.

Key Findings:

There were enough content differences between the two norm-referenced tests and the Connecticut Mastery Test to decide that one test would make a better, if not perfect, candidate for linking to the state test than the other.

It was possible to develop a link between the MAT 6 and the Connecticut Mastery Test (CMT) that accurately predicted Normal Curve Equivalent scores for the MAT 6 from the CMT, but no good validity checks were used.

The linking function changed somewhat over time, and the authors believed that this divergence would continue because teachers were gearing instruction to state standards, which were more closely aligned with the Connecticut test than with the Metropolitan. Thus, the linking would have to be reestablished regularly to remain valid for the purposes that it was intended to serve.

Linking Statewide Tests to the National Assessment of Educational Progress: Stability of Results (Linn & Kiplinger, 1995)

Purpose: To investigate the adequacy of linking statewide standardized test results to the National Assessment of Educational Progress (NAEP) to allow for accurate comparisons between state academic performance and the national performance levels measured by NAEP.

Methodology:

Obtained two years of results (1990 and 1992) from four states' testing programs and corresponding results from the NAEP-TSA for the same two years.

Standardized tests used in the four states were different.

Used equipercentile equating procedures to compare data from state tests and NAEP.

The standardized test results were converted to the NAEP scale using the 1990 data, and the resulting conversion tables were then applied to the 1992 data.

Examined content match between standardized tests and NAEP and re-analyzed data using subsections of the standardized tests and NAEP.

Key Findings:

The link could estimate average state performance on NAEP, but was not accurate for scores at the top or bottom of the scale.

The equating function diverged for males and females, meaning that NAEP scores for a state would have been overpredicted if the equating function for males was used rather than the equating function for females.

Linking standardized tests to NAEP using equipercentile equating procedures is not sufficiently trustworthy to use for other than rough approximations.

Designing tests in accordance with a common framework might make linking more feasible.

Using Performance Standards to Link Statewide Achievement Results to NAEP (Waltman, 1997)

Purpose: To investigate the comparability of performance standards obtained by using both statistical and social moderation to link NAEP standards to the ITBS.

Methodology:

Compared 1992 NAEP-TSA with ITBS for Iowa 4th grade public school students.

Used two different types of linking for separate facets of the study.

A socially moderated linkage was obtained by setting standards independently on the ITBS using the same achievement-level descriptions used to set the NAEP achievement levels.

An equipercentile procedure was used to establish a statistically moderated link.

Key Findings:

For students who took both assessments, the corresponding achievement regions on the NAEP and ITBS scales produced low to moderate percentages of agreement in student classification. Agreement was particularly low for students at the advanced level: two-thirds or more were classified differently.

Cutscores on the ITBS scale, established by moderation, were lower than those used by NAEP, resulting in more students being classified as basic, proficient, or advanced on the ITBS than estimated by NAEP, possibly due to content and skills-standards mismatch between the ITBS and NAEP.

The equipercentile linkage was reasonably invariant across types of communities, in terms of percentages of students classified at each level.

Regardless of the method used to establish the ITBS cut-scores or the criteria used to classify students, the inconsistency of student-level match limits even many performance

Study of the Linkages of 1996 NAEP and State Mathematics Assessments in Four States (McLaughlin, 1998)

Purpose: To address the need for clear, rigorous standards for linkage; to provide the foundation for developing practical guidelines for states to use in linking state assessments to NAEP; and to demonstrate that it is important for educational policymakers to be aware that linkages that support one use may not be valid for another.

Methodology:

A sample of four states that had participated in the 1996 State NAEP mathematics assessment and whose state assessment mathematics tests could potentially be linked to NAEP at the individual student level participated in this study. Participating states used different assessments in their state testing programs.

There were eight linkage samples, ranging in size from 1,852 to 2,444 students.

The study matched students who participated in the NAEP assessment in their states with their scores on the state assessment instrument, using projection with multilevel regression.

Key Findings:

Links were not sufficiently accurate to permit reporting individual student proficiency on NAEP based on the state assessment score.

Links differed noticeably by minority status and school district in all four states. Students with the same state assessment score would be projected to have different standings on the NAEP proficiency scale, depending on their minority status and school district.

The Maryland School Performance Program: Performance Assessment with Psychometric Quality Suitable for High Stakes Usage (Yen & Ferrara, 1997)

Purpose: To compare the Maryland State Performance Assessment (MSPAP) with the Comprehensive Tests of Basic Skills (CTBS) in order to establish the validity of the state test in reference to national norms.

Methodology:

Compared results from a group of 5th grade students who took both the MSPAP and the CTBS; correlations were obtained.

The intent was to establish the validity of the MSPAP, so a link was not obtained.

Key Findings:

Intercorrelations of the two tests indicated that the two measures were assessing somewhat different aspects of achievement.

A TIMSS-NAEP Link (Johnson, 1998)

Purpose: To provide useful information about the performance of states relative to other countries. The study broadly compares state eighth-grade mathematics and science performance for each of 44 states and jurisdictions participating in the NAEP with the 41 nations that participated in TIMSS.

Methodology:

The study provides predicted TIMSS results for 44 states and jurisdictions, based on their actual NAEP results.

A statistically moderated link between NAEP and TIMSS was established by applying formal linear equating procedures.

The link was established using reported results from the 1995 administration of TIMSS in the U.S. and the 1996 NAEP and matching characteristics of the score distributions for the two assessments.

Validated the linking functions using data provided by states that participated in both state-level NAEP and state-level TIMSS but were not included in the development of the original linking function.

Key Findings:

Although all of the findings have not yet been released, apparently some links were satisfactory and others were not.
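Two of the distribution-matching approaches that recur in Table 3, aligning means and standard deviations and equipercentile linking, can be sketched as follows. This is a simplified illustration rather than the procedure used in any of the studies summarized: the function names and the tiny score lists are hypothetical, and operational work involves smoothing, sampling designs, and error estimation that are omitted here.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, pstdev


def linear_link(x, scores_x, scores_y):
    """Map x to the Y scale by aligning means and standard deviations."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))


def percentile_rank(x, sorted_scores):
    """Mid-percentile rank of x: share below plus half the share tied."""
    below = bisect_left(sorted_scores, x)
    at_or_below = bisect_right(sorted_scores, x)
    return (below + at_or_below) / (2 * len(sorted_scores))


def quantile(p, sorted_scores):
    """Score at cumulative proportion p, interpolating between neighbors."""
    last = len(sorted_scores) - 1
    position = min(max(p * len(sorted_scores) - 0.5, 0.0), float(last))
    low = int(position)
    high = min(low + 1, last)
    fraction = position - low
    return sorted_scores[low] + fraction * (sorted_scores[high] - sorted_scores[low])


def equipercentile_link(x, scores_x, scores_y):
    """Map x to the Y score that has the same percentile rank."""
    rank = percentile_rank(x, sorted(scores_x))
    return quantile(rank, sorted(scores_y))


# Hypothetical score samples: test X is evenly spread, test Y is skewed.
test_x = [10.0, 20.0, 30.0, 40.0, 50.0]
test_y = [100.0, 110.0, 120.0, 150.0, 300.0]

linear_estimate = linear_link(40.0, test_x, test_y)
equipercentile_estimate = equipercentile_link(40.0, test_x, test_y)
print(f"linear: {linear_estimate:.1f}")
print(f"equipercentile: {equipercentile_estimate:.1f}")
```

Because the hypothetical Y distribution is skewed, the two methods give noticeably different linked scores for the same X score, which is one simple way in which the choice of linking method can change the result.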
