
Findings

COMPARABILITY: CONTENT, FORMAT, AND RELATED FEATURES

The content of a test is shaped by the kinds of knowledge and skills addressed in its questions (“items”). The committee's review indicates that content is not generally comparable among various state assessments and commercial tests, even when they are testing the same subjects. Middle-school mathematics, for instance, covers several areas of knowledge, such as arithmetic, algebra, and geometry: the content of one state's 8th grade mathematics test might focus largely on multiplication, division, and other number operations skills, while another test may stress pattern recognition and other pre-algebra skills (Bond and Jaeger, 1993). In reading, one 4th grade test may emphasize vocabulary and basic comprehension, while another may give greater weight to critical evaluation of an author's themes (Afflerbach et al., 1995).

A related content issue pertains to the skills and cognitive processes required to answer items. Off-the-shelf commercial tests and tests that are custom developed for states are increasingly constructed as mixed-model assessments that contain different types of items, including multiple-choice items and various kinds of open-ended questions for which students construct their own responses by filling in a blank, solving a problem, writing a short answer, writing a longer response, or completing a graph or diagram (see, e.g., Shavelson, Baxter, and Pine, 1992); Colorado, Connecticut, North Carolina, and Maryland are examples of states with mixed-model assessments. Some item types are very useful for testing student recall of factual material (a claim often made for certain types of multiple-choice items); other item types are better suited to eliciting direct evidence of how well a student can solve problems.

The effect of format differences on linkages can be substantial. For example, the 1991 NAEP trial state assessment in mathematics contained both multiple-choice and short-answer formats. Linn, Shepard, and Hartka (1992) found that when the two formats were scored separately, there was enough difference between the scores to change the rank order of the states in the mathematics assessment. For items with constructed responses (that is, not multiple choice), variations in scoring may also influence the validity of linkages because different scoring guides may credit different aspects of performance, even when the items appear similar (Linn, 1993). Issues such as how the scorers are trained and which scoring guidelines they use can affect the objectivity and consistency of scoring (Frederiksen and Collins, 1989). Some states, including Vermont and New Mexico, are trying out new assessment formats, such as systematically evaluating collections (“portfolios”) of a student's work, that raise even more complex issues about comparability and scoring (Valencia and Au, 1997; Webb, 1995); see Box 3 for a discussion of format issues.

BOX 3. Format Differences in Maryland

The following description from Yen (1996:20) exemplifies the wide differences between two assessments of the same content area, administered to the same students in Maryland. They are the Maryland School Performance Assessment Program (MSPAP), a performance assessment used in high-stakes school evaluations, and the Comprehensive Test of Basic Skills, Fourth Edition, or CTBS/4, published by CTB/Macmillan/McGraw-Hill in 1989.

MSPAP and CTBS/4 differ in many ways. MSPAP is entirely performance based, and each student is given a limited number of reading selections or scenarios that require in-depth constructed and extended responses. In contrast, CTBS/4 samples a broader range of traditional objectives with a selected-response format and is a more indirect measure of student classroom performance. MSPAP is intended to “guide and goad” classroom instruction, while CTBS/4 is not intended as a model of instruction. MSPAP, which is targeted at raising student performance, contains many challenging items; CTBS/4 contains items that measure the full range of student performance. Each year three new forms of MSPAP are administered, with random assignment of forms to students; the same form of CTBS/4 is administered to all students every year. MSPAP results are used as part of a high-stakes program of evaluating schools; CTBS/4 results are part of the public reporting of school performance but are not included in the Maryland School Performance Index, which is used in school evaluations. Schools make individual decisions in terms of striking a balance between focusing on the material assessed with MSPAP and that assessed with CTBS/4.

In short, content, format, and related issues are vitally important in linking, and existing commercially developed achievement tests and state assessments differ substantially both among themselves and from NAEP on these dimensions. The committee finds that the lack of strong comparability in these areas prevents the development of reliable and valid linkages. In addition, the committee finds that, in the cases that are germane to our concerns here, statistical linkages between tests with substantial differences in content and degrees of difficulty will not be accurate, in the sense that they will not be consistent across subpopulations. This lack of consistency, a problem to which we return below, is directly due to the differences in content and test difficulty.

DIVERSITY AND MULTIPLICITY OF TESTING PROGRAMS

Educational testing in the United States is diverse, reflecting the nation's history of state and local control over education policy. The number and variety of existing state and commercial tests pose formidable barriers to developing a single linking scale. State and commercial tests vary not only in content and format, but also in their target ages or grades, sampling techniques, policies for testing students with disabilities or with limited English proficiency, alignment with state and local curricula, score reporting procedures, and other factors (Bond et al., 1997).

Although commercially developed subject-matter achievement tests, especially the most widely used tests in U.S. schools, appear on the surface to be more similar than many existing state assessments, they, too, have significant differences that reflect the publishers' efforts to capture specialized markets and meet state and local demands for tests with particular features (Yen, 1998). The substantial variation that exists among commercially produced tests challenges the notion of selecting tests “off-the-shelf” and linking them; Box 4 describes a recent attempt to link existing off-the-shelf tests and the difficulties that were encountered.

BOX 4. California Comparability

In 1996 California lawmakers determined that they wanted achievement information for monitoring school effectiveness, but, in the interest of respecting local control of educational issues, they did not want to mandate that all school districts use the same test. They therefore passed a bill encouraging school districts to choose achievement tests from a reviewed and approved list and directed the California Department of Education to develop a comparability scale that would allow lawmakers to accurately compare results from different assessments (see Haertel, 1996; Wilson, 1996; Yen, 1996). Two different methodologies were explored in some depth. The first proposal suggested the development of a short list of acceptable commercial tests, any of which could be selected and administered by a local school district. These few tests would be linked in a manner similar to the Anchor Test Study (Loret et al., 1972). The second proposal was to develop a core reference test that comprehensively reflected the California curriculum and to use that reference test as an anchor to which all other tests could be linked. In the end, the project was deemed too complex because more than 40 tests were submitted for comparison, and it was determined that it was too difficult to develop satisfactory links that would be stable over time. California decided to scrap the linkage proposal and selected one test to be administered in all of the state's schools.

The complexity of linking even a small subset of existing tests could quickly render the task infeasible. For example, if the goal is to link just 15 different state assessments, it would be necessary to construct comparisons of more than 100 potential pairs of tests (15 × 14 / 2 = 105 pairs); each pair would require data collection, statistical analyses, and empirical validation (see also Los Angeles County Office of Education, 1997). An additional complicating factor could be the changing relationships between some of the tests. Frequent changes could necessitate continual updates to the development and validation of the equivalency scale (Loret et al., 1972; Linn, 1975; Wilson, 1996).

One might argue that pairwise comparisons are not necessary if all tests can be linked to a common scale, such as NAEP. Linking to NAEP simplifies the task in one respect by reducing the number of linkages that would have to be constructed. However, the design and purpose of NAEP complicates the task of linkage in another respect, and casts doubt on the validity of inferences that could be drawn from the link (see, e.g., McLaughlin, 1998). This is true because NAEP, by intent, does not produce scores for individuals and because individual students complete different parts of an entire NAEP assessment.
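To make the counting argument in the two preceding paragraphs concrete, the following minimal sketch (not from the report; the 15 test names are hypothetical placeholders) contrasts the number of linking studies required when every test is linked to every other test with the number required when each test is linked only to a single common anchor such as NAEP.

```python
# Illustrative sketch: counting the linkages implied by the two designs
# discussed above. Test names and the count of 15 are hypothetical.
from itertools import combinations

def pairwise_links(tests):
    """Every test linked directly to every other test."""
    return list(combinations(tests, 2))

def anchor_links(tests, anchor="NAEP"):
    """Every test linked only to a single common anchor."""
    return [(test, anchor) for test in tests]

state_tests = [f"State_{i}" for i in range(1, 16)]  # 15 hypothetical state assessments

print(len(pairwise_links(state_tests)))  # 105 separate pairwise linking studies
print(len(anchor_links(state_tests)))    # 15 studies, one per test, if NAEP is the hub
```

Even the smaller number understates the work, since each link would have to be revalidated whenever a test or its use changes, and linking to NAEP raises the individual-score problems noted above.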

STABILITY OF RESULTS

The testing landscape in the United States is not only diverse, but it is dynamic: states and districts have moved rapidly, especially during the last 10 years, to adopt new educational goals, new models of testing and assessment, and new strategies for aligning tests and assessments to state content standards (National Research Council, 1997).

Moreover, although there is some similarity and stability among the largest commercial testing programs, states that use commercial programs use them in very different ways. Many states have changed the design of their statewide assessments several times in the last decade and are continuing to do so. For example, some states are developing hybrids of commercial and state-developed tests or customizing available off-the-shelf tests. Other states do not use commercial tests as part of their statewide assessment system (Roeber, Bond, and Braskamp, 1997). The diversity of the testing programs currently in the nation's schools is depicted in Table 1. (The committee realizes that information such as that tabulated here changes frequently, and may be summarized differently in different surveys. However, the main point is that the states' testing programs are extremely diverse in content, difficulty, and format (Jaeger, 1996).)

This continual change in educational goals and in the content of tests and assessments, which many people believe reflects a healthy dynamism in American education, makes linkage a moving target. Prior research has consistently shown that even if linkages between tests can be made at one time, they are difficult to maintain (Linn and Kiplinger, 1995). For example, suppose a link could be generated between a test in state A and another test in state B. Conducting the necessary analyses to establish the link takes time. It is quite possible that once the linkage methods are ready to be applied, one or both states will have changed their test format, content, or target group.

While NAEP does not change as frequently or as dramatically as state and commercial assessments, it, too, is not static. The content and nature of the NAEP instruments evolve gradually to reflect changing educational and assessment practices (National Research Council, 1996). These modifications in NAEP make it complicated to maintain stable linkages with state and commercial assessments, which are themselves evolving, and would diminish the validity of inferences from the linkages.

TEST USES AND EFFECTS ON TEACHER AND STUDENT BEHAVIOR

Many states use assessments for multiple purposes related to educational improvement, such as program evaluation, curriculum planning, school performance reporting, and student diagnosis (U.S. Congress, 1992). More and more states are using (or are contemplating using) their assessment programs to make “high-stakes” decisions about people and programs, such as promoting students to the next grade, determining whether students will graduate from high school, grouping students for instructional purposes, making decisions about teacher tenure or bonuses, allocating resources to schools, or imposing sanctions on schools and districts (see, e.g., McLaughlin et al., 1995; McDonnell, 1997). Table 2 shows many of the varied uses of tests in our nation's schools today. (A companion report on appropriate test use will be issued by the National Research Council's Committee on the Fair and Appropriate Use of Educational Tests later this year.)

An important factor in testing goes under the heading of “stakes.” When students are tested, various parties can have different concerns with, or stakes in, the outcomes. For example, a national survey of achievement, like NAEP, is a very low-stakes test for many of the parties concerned—the test takers, their parents, their teachers, and the district administrators. If NAEP is high-stakes for anyone, it is for policy makers who want to use NAEP data to assess the effectiveness of various educational reforms as they vary across the states or regions of the country. In contrast, tests that are used for high-school graduation or college admission are high-stakes for the students taking them and for their parents. Tests that are high-stakes for the test takers affect their motivation to do their best. Tests that are high-stakes for district administrators can result in various kinds of efforts to assist students in performing better than they would had the same test been of low stakes to those administrators. These examples do not exhaust the possibilities of the effect of “stakes” on test results. When tests carrying different stakes for different parties are linked, one expects different linking functions to result than would be found if the stakes were similar.

Although forms of test-based educational accountability vary across states and districts, changes in how tests are used inevitably lead to changes in how teachers and students react to them (Koretz, 1998).


TABLE 1 State Testing: A Snapshot of Diversity

State

Use of Commercial Tests

Use of Other Assessments

Alabama

Stanford Achievement Test 9, Otis-Lennon School Ability Test

Alabama Kindergarten Assessment, Alabama Direct Assessment of Writing, Differential Aptitude Test, Basic Competency Test, Career Interest Inventory, End-of-Course Algebra and Geometry Test, Alabama High School Basic Skills Exit Exam

Alaska

California Achievement Test 5

Arizona

Stanford Achievement Test 9

Arkansas

Stanford Achievement Test 9

High School Proficiency Test

California

Stanford Achievement Test 9

Golden State Examinations

Colorado

Custom developed

CTB item banks, NAEP items, and state items

Connecticut

Custom developed

Connecticut Mastery Test, Connecticut Academic Performance Test

Delaware

Custom developed

State-developed writing assessment

Florida

Custom developed

High School Competency Test, Florida Writing Assessment Program

Georgia

Iowa Test of Basic Skills, Test of Achievement Proficiency

Curriculum-based Assessments, Georgia High School Graduation Tests, Georgia Kindergarten Assessment Program, Writing Assessment

Hawaii

Stanford Achievement Test 8

Hawaii State Test of Essential Competencies, Credit by Examination

Idaho

Iowa Test of Basic Skills Form K, Test of Achievement Proficiency

Direct Writing Assessment, Direct Mathematics Assessment

Illinois

Custom developed

Illinois Goals Assessment Program

Indiana

Custom developed

Indiana Statewide Testing for Educational Progress Plus

Iowa

No mandated statewide testing program; approximately 99 percent of all districts participate in the Iowa Test of Basic Skills on a voluntary basis

Kansas

Custom developed

Kansas Assessment Program (Kansas University Center for Educational Testing and Evaluation)

Kentucky

Custom developed

Kentucky Instructional Results Information System

Louisiana

California Achievement Test 5

Louisiana Educational Assessment Program

Maine

Custom developed

Maine Educational Assessment (Advanced Systems in Measurement Inc.)

Maryland

Custom developed, Comprehensive Test of Basic Skills 5

Maryland School Performance Assessment Program, Maryland Functional Tests, Maryland Writing Test

Massachusetts

Iowa Test of Basic Skills, Iowa Test of Educational Development

Michigan

Custom developed

Michigan Educational Assessment Program: Criterion-referenced tests of 4th-, 7th-, and 11th-grade students in mathematics and reading and 5th-, 8th-, and 11th-grade students in science and writing; Michigan High School Proficiency Test

Minnesota

Custom developed

In 1996-97, students took minimum competency literacy tests in reading and mathematics

Mississippi

Iowa Test of Basic Skills, Test of Achievement Proficiency

Functional Literacy Examination, Subject Area Testing Program


Missouri

Custom developed, TerraNova

Missouri Mastery and Achievement Test

Montana

Stanford Achievement Test, Iowa Test of Basic Skills, Comprehensive Test of Basic Skills

Nebraska

No statewide assessment program in 1996-97

Nevada

TerraNova

Grade 8 Writing Proficiency Exam, Grade 11 Proficiency Exam

New Hampshire

Custom developed

New Hampshire Education Improvement and Assessment Program (Advanced Systems in Measurement and Evaluation, Inc.)

New Jersey

Custom developed

Grade 11 High School Proficiency Test, Grade 8 Early Warning Test

New Mexico

Iowa Test of Basic Skills, Form K

New Mexico High School Competency Exam, Portfolio Writing Assessment, Reading Assessment for Grades 1 and 2

New York

Custom developed

Occupational Education Proficiency Examinations, Preliminary Competency Tests, Program Evaluation Tests, Pupil Evaluation Program Tests, Regents Competency Tests, Regents Examination Program, Second Language Proficiency Examinations

North Carolina

Iowa Test of Basic Skills

North Carolina End of Grade

North Dakota

Comprehensive Test of Basic Skills/4, TCS

Ohio

Custom developed

Fourth-, Sixth-, Ninth-, and Twelfth-Grade Proficiency Tests

Oklahoma

Iowa Test of Basic Skills

Oklahoma Core Curriculum Tests

Oregon

Custom developed

Reading, Writing, and Mathematics Assessment

Pennsylvania

Custom developed

Writing, Reading, and Mathematics Assessment

Rhode Island

Metropolitan Achievement Test 7, Custom developed

Health Performance Assessment, Mathematics Performance Assessment, Writing Performance Assessment

South Carolina

Metropolitan Achievement Test 7, Custom developed

Basic Skills Assessment Program

South Dakota

Stanford Achievement Test 9, Metropolitan Achievement Test 7

Tennessee

Custom developed

Tennessee Comprehensive Assessment Program (TCAP) Achievement Test Grades 2-8, TCAP Competency Graduation Test, TCAP Writing Assessment Grades 4, 8, and 11

Texas

Custom developed

Texas Assessment of Academic Skills, Texas End-of-Course Test

Utah

Stanford Achievement Test 9, Custom developed

Core Curriculum Assessment Program

Vermont

Has a voluntary state assessment program

New Standards reference exams in math, Portfolio assessment in math and writing

Virginia

Customized off-the-shelf

Literacy Passport Test, Degrees of Reading Power, Standards of Learning Assessments, Virginia State Assessment Program


Washington

Comprehensive Test of Basic Skills 4, Curriculum Frameworks Assessment System

West Virginia

Comprehensive Test of Basic Skills

Writing Assessment, Metropolitan Readiness Test

Wisconsin

TerraNova, Custom developed

Knowledge and Concepts Tests, Wisconsin Reading Comprehension Test at Grade 3

Wyoming

State assessment program in vocational education only for students grades 9-12

NOTES: Custom developed assessments result from a joint venture between a state and a commercial test publisher to design a test to the state's specifications, perhaps to more closely match the state's curriculum than an off-the-shelf test does. Customized off-the-shelf assessments result from modifications to a commercial test publisher's existing product.

SOURCE: Data from 1997 Council of Chief State School Officers Fall State Student Assessment Program Survey.

Indeed, one of the underlying rationales for test-based accountability is to spur changes in teaching and learning. These uses are hotly debated and beyond the scope of this report (Jones, 1997). For our purposes, it is sufficient to note that the difficulty of maintaining linkages between tests is exacerbated when test results have significant consequences for individuals or schools.

In these situations, teachers may change what and how they teach to help students respond to the content and problems on the test (Shepard and Dougherty, 1991), schools and districts may align curriculum more closely with test content, and test takers may have stronger motivation to do well (e.g., Koretz et al., 1991). Performance gains on tests used for accountability (high-stakes tests) will often not be reflected in scores on tests used for monitoring or other non-accountability (low-stakes) purposes. The resulting differences in student performance could alter the relationship between linked tests (Shepard et al., 1996; Yen, 1996). Hence, any valid linkages created initially would have to be reestablished regularly, which would raise important questions about any hoped-for cost-effective advantages of linkage.
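To illustrate how such score inflation can erode a previously valid link, the following sketch (hypothetical numbers, not drawn from any of the studies cited) establishes a simple linear link in a baseline year by matching means and standard deviations, then applies it after scores on the high-stakes test have risen without a corresponding rise on the low-stakes monitoring test.

```python
# Illustrative sketch: a mean/SD (linear) link built in a baseline year drifts
# once high-stakes scores rise faster than low-stakes scores. All numbers are
# hypothetical.
def linear_link(mean_from, sd_from, mean_to, sd_to):
    """Return a function mapping scores on the 'from' test to the 'to' test scale."""
    return lambda x: mean_to + sd_to * (x - mean_from) / sd_from

# Baseline year: the two tests are linked when stakes are comparable.
link = linear_link(mean_from=500, sd_from=50, mean_to=250, sd_to=30)

# Later year: focused preparation raises the high-stakes mean by 20 points,
# while the low-stakes monitoring test rises by only 3 points.
high_stakes_score = 520          # a typical student's score on the high-stakes test
actual_low_stakes_mean = 253     # what the monitoring test actually shows

print(link(high_stakes_score))   # 262.0: the old link predicts a 12-point gain
print(actual_low_stakes_mean)    # 253:   the monitoring test shows a 3-point gain
```

In this toy case the baseline linking function overstates performance on the low-stakes scale, which is why, as noted above, any valid linkage would have to be reestablished regularly.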

The effects of test use on student and teacher behavior pose a special problem for linkage with NAEP. To protect its historical purpose as a monitor of educational progress, NAEP was designed expressly with safeguards to prevent it from becoming a high-stakes test. As a result, the motivation level of students who participate in NAEP may be low (O'Neil et al., 1992; Kiplinger and Linn, 1996), and they may not always exhibit peak performance. Linkages between a low-stakes instrument like NAEP and high-stakes state or commercial tests are likely to be misleading because students are likely to put forth more effort for the latter kinds of tests than for the former.

POPULATION OR SUBGROUP DIFFERENCES

When the function that links Test A with Test B differs for different groups, for example, boys and girls, it does not indicate that one group is “better” than the other. Rather, it means that a boy and a girl with the same score on Test A would be expected to have different scores on Test B, and that this effect is consistent for members of the two groups.


TABLE 2 Student Testing: Diversity of Purpose

State

Decisions for Students

Decisions for Schools

Instructional Purposes

Alabama

High school graduation

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Alaska

School performance reporting

Improve instruction

Arizona

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Arkansas

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

California

Student diagnosis or placement

Student diagnosis or placement

Colorado a

Connecticut

Student diagnosis or placement

Awards or recognition; school performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Delaware

Student diagnosis or placement; improve instruction; program evaluation

Florida

High school graduation

Improve instruction; program evaluation

Georgia

High school graduation

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Hawaii

High school graduation

Awards or recognition; school performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Idaho

School performance reporting

Improve instruction

Iowa a

Illinois

Accreditation

Indiana

Awards or recognition; school performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Kansas

School performance reporting; accreditation

Student diagnosis or placement; improve instruction; program evaluation

Kentucky

Awards or recognition

Improve instruction; program evaluation

Louisiana

Student promotion; high school graduation

Awards or recognition; school performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Maine

Student diagnosis or placement

Improve instruction; program evaluation


Maryland

High school graduation

School performance reporting; skills guarantee; accreditation

Student diagnosis or placement; improve instruction; program evaluation

Massachusetts

School performance reporting

Improve instruction

Michigan

Student diagnosis or placement; endorsed diploma

Awards or recognition; school performance reporting; accreditation

Improve instruction; program evaluation

Minnesota a

Mississippi

High school graduation

School performance reporting; skills guarantee; accreditation

Student diagnosis or placement; improve instruction; program evaluation

Missouri

School performance reporting; accreditation

Improve instruction; program evaluation

Montana

Improve instruction; program evaluation

Nebraska a

Nevada

High school graduation

School performance reporting; accreditation

Improve instruction; program evaluation

New Hampshire

Improve instruction; program evaluation

New Jersey

High school graduation

School performance reporting; accreditation

Student diagnosis or placement; improve instruction

New Mexico

High school graduation

School performance reporting; accreditation

Student diagnosis or placement; improve instruction; program evaluation

New York

Student diagnosis or placement; student promotion; honors diploma; endorsed diploma; high school graduation

School performance reporting

Improve instruction; program evaluation

North Carolina

Student diagnosis or placement; student promotion; high school graduation

Improve instruction; program evaluation

North Dakota

Student diagnosis or placement

Student diagnosis or placement; improve instruction; program evaluation

Ohio

High school graduation

Awards or recognition; school performance reporting

Improve instruction; program evaluation

Oklahoma

School performance reporting; accreditation

Student diagnosis or placement; improve instruction; program evaluation


Oregon

School performance reporting

Improve instruction; program evaluation

Pennsylvania

School performance reporting

Student diagnosis or placement; program evaluation

Rhode Island

School performance reporting

Improve instruction; program evaluation

South Carolina

Student promotion; high school graduation

Awards or recognition; school performance reporting; skills guarantee

Student diagnosis or placement; improve instruction; program evaluation

South Dakota

Improve instruction; program evaluation

Tennessee

Endorsed diploma; high school graduation

Student diagnosis or placement; improve instruction; program evaluation

Texas

Student diagnosis or placement; high school graduation

Student diagnosis or placement; improve instruction; program evaluation

Utah

Student diagnosis or placement

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Vermont

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Virginia

Student diagnosis or placement; student promotion; high school graduation

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

Washington

School performance reporting

Student diagnosis or placement; improve instruction; program evaluation

West Virginia

Skills guarantee; accreditation

Improve instruction

Wisconsin

School performance reporting

Program evaluation

Wyoming

Improve instruction; program evaluation

a  Colorado, Minnesota, and Nebraska did not administer any statewide assessments in 1995-96. Iowa does not administer a statewide assessment.

SOURCE: Data from 1996 Council of Chief State School Officers Fall State Student Assessment Program Survey.


Researchers generally suppose that group differences occur because of differing test content or format, different motivation levels, or differences in prior exposure to relevant learning opportunities. Perhaps the material in Test A is more familiar to one group than the other, while the material in Test B is equally familiar to both groups. Alternatively, one group might be more motivated to perform well on Test A, while both groups were equally motivated on Test B.

Simply put, it is often the case that the relative differences among the test performances of different groups of students will vary from test to test, depending on a host of factors that are subtle but important. For example, on mathematics tests boys may do better on word problems while girls may do better solving equations. When this is true, overall estimates of gender differences in 8th grade mathematics performance will depend on the relative emphasis a test gives to these two areas. Unless the two tests are very closely aligned in content, linking them might require separate formulas for boys and girls because a single linking formula would underestimate performance for one group and overestimate it for the other. Another example is that student achievement on two tests with differing emphases on algebra could vary widely across the states as a function of when and to what extent algebra is introduced into the middle school curriculum. As a result, students from different states who obtain the same score on one test (e.g., a commercial test) might have different estimated (linked) scores on a second test, such as NAEP (e.g., McLaughlin, 1998). These problems are attributable in part to the tests themselves, but linkage magnifies them and increases the risk of unfair inferences about individual achievement.
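The following simulated example (ours, not taken from the report or from McLaughlin, 1998; all distributions are invented) shows how an equipercentile link computed on a pooled population can differ from links computed separately for two subgroups whose relative strengths differ across the two tests.

```python
# Illustrative sketch: a toy equipercentile link computed for a pooled
# population and separately for two subgroups. All data are simulated.
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(n, mean_a, mean_b):
    """Correlated scores on Test A and Test B for one subgroup."""
    ability = rng.normal(0, 1, n)
    test_a = mean_a + 10 * ability + rng.normal(0, 4, n)
    test_b = mean_b + 10 * ability + rng.normal(0, 4, n)
    return test_a, test_b

def equipercentile_link(a_scores, b_scores, x):
    """Map score x on Test A to the Test B score at the same percentile rank."""
    pct = np.mean(a_scores <= x) * 100
    return np.percentile(b_scores, pct)

# Group 1 is relatively stronger on Test A; Group 2 is stronger on Test B.
a1, b1 = simulate_group(5000, mean_a=55, mean_b=50)
a2, b2 = simulate_group(5000, mean_a=50, mean_b=55)
pooled_a, pooled_b = np.concatenate([a1, a2]), np.concatenate([b1, b2])

x = 52.0
print("pooled link :", round(equipercentile_link(pooled_a, pooled_b, x), 1))
print("group 1 link:", round(equipercentile_link(a1, b1, x), 1))
print("group 2 link:", round(equipercentile_link(a2, b2, x), 1))
# The single pooled function over-predicts Test B for one group and
# under-predicts it for the other, as described in the text above.
```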

REPORTING RESULTS IN TERMS OF NAEP ACHIEVEMENT LEVELS

Linking other tests to NAEP raises the possibility of reporting individual student scores on state and commercial tests in terms of the NAEP achievement levels. The committee explored this issue and finds that such links would raise new and significant methodological problems (see Wu, Royal, and McLaughlin, 1997).

To understand them, one must recognize that all test scores carry with them some amount of uncertainty or “noise,” an issue usually treated in the testing literature under the heading “reliability” (see, e.g., Feldt and Brennan, 1989; American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1985). Because scores are less than 100 percent reliable, it is never possible to assign students to achievement levels with complete certainty (Johnson and Mazzeo, 1998). The two key issues are the likelihood that students will be misclassified and the degree of error in the classification. Clearly, the more reliable the test, the less ambiguity there will be in the assignment of students to categories of performance. Unfortunately, the empirical evidence that the committee has reviewed to date suggests that transforming performances on selected existing assessments to the NAEP achievement levels produces results with substantial practical ambiguity.

For example, consider a 4th grade student with a reasonably good score on a state or commercial reading test. Transforming this child's score into the NAEP achievement levels could easily produce the following type of report: “Sally scored [x] on the [State Reading Assessment]. Of 100 students with the same score, 10 are likely to be in the ‘below basic' category; 60 are likely to be ‘basic;' 28 are likely to be ‘proficient;' and 2 are likely to be in the highest, or ‘advanced,' category.” Alternatively, the report could be issued in terms of Sally's probabilities of falling in the various categories (Johnson and Mazzeo, 1998). This ambiguity will be due to measurement error in the student's score on the state assessment; to measurement error in NAEP; to the less than perfect correlation between the state assessment scores and NAEP scores; to potential differences in linking functions for different subgroups; and to other unidentified sources of measurement error.
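The sketch below (hypothetical cut scores and standard error of measurement, not NAEP's actual values or procedures) shows how even one of these error sources, measurement error around a single linked score, produces a probabilistic report of the kind described for Sally; the other sources listed above would only widen the spread.

```python
# Illustrative sketch: turning a linked score plus measurement error into
# achievement-level probabilities. Cut scores and SEM are hypothetical, and
# only one source of error (the score's own unreliability) is modeled.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def level_probabilities(score, sem, cuts):
    """Probability of falling in each level, assuming normally distributed
    error with standard deviation `sem` around the linked score."""
    bounds = [float("-inf")] + cuts + [float("inf")]
    return [normal_cdf((hi - score) / sem) - normal_cdf((lo - score) / sem)
            for lo, hi in zip(bounds[:-1], bounds[1:])]

cuts = [208, 238, 268]  # hypothetical basic / proficient / advanced cut scores
levels = ["below basic", "basic", "proficient", "advanced"]
for level, p in zip(levels, level_probabilities(score=230, sem=12, cuts=cuts)):
    print(f"{level:12s} {p:.0%}")
```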

The committee has not been able to conduct a thorough study of parental and public reaction to this kind of scenario, but we caution that one of the more important putative purposes of linkage— providing clear and relevant information about the performance of individual students—might be severely undermined by the need to report information which, in order to be faithful to the underlying statistics, must be ambiguous in its meaning.

Table 3 presents a summary of some major studies of linkage between different tests and assessments. Some of the studies describe methods for linking to NAEP and the implications for such a link.


TABLE 3 Abridged Summaries of Prior Linkage Research

Study

Purpose

Methodology

Key Findings

The Anchor Test Study (Loret et al., 1972, 1973)

To develop an equivalency scale to compare reading test results for Title I program evaluation.

The study was sponsored by a $1,000,000 contract with the U.S. Office of Education.

Number of participants: 200,000 students for norming phase; 21 sample groups of approximately 5,000 students each for the equating phase.

Eight tests, representing almost 90% of reading tests being administered in the states at that time, were selected for the study.

Participants took two tests.

Created new national norms for one test and through equating, all eight tests.

Administered different combinations of standardized reading tests to different subjects taking into account the need to balance demographic factors and instructional differences.

Tests with similar content can be linked together with reasonable accuracy.

Relationships between tests were determined to be reasonably similar for male and female students but not for racial groups.

The equivalency scale was accurate for individuals, but aggregated results (e.g., for a school or district) would have increased error stemming from combining results.

Every time a new test is introduced the procedure has to be replicated for that test.

The stability of the linkage has to be reestablished regularly because instruction on one test but not on others can invalidate the linkage.

Projecting to the NAEP Scale: Results from the North Carolina End-of-Grade Testing Program (Williams et al., 1995)

To link a comprehensive state achievement test to the NAEP scale for mathematics so that the more frequently administered state tests could be used for purposes of monitoring progress of North Carolina students with respect to national achievement standards.

A total of 2,824 students from 99 schools was tested using 78 items from a short form of the North Carolina End-of-Grade Test and two blocks of released 1992 NAEP items that were embedded in the test.

Test booklets were spiraled so that some students took NAEP items first, others took North Carolina End-of-Grade items first.

The final linkage to the NAEP scale used projection. Scores from the NAEP blocks were determined from student responses using NAEP parameters but not the conditioning analysis used by NAEP. Regular scores from the North Carolina test were used.

A satisfactory linking was obtained for statewide statistics as a whole; it was accurate enough to predict NAEP means or quartile distributions with only modest error.

The linkages had to be adjusted separately for different ethnic groups, demonstrating that the linking was inappropriate for predicting individual scores from the North Carolina test to the NAEP scale.

The following were considered important factors in establishing a strong link: content on the North Carolina test was closely aligned with state curriculum and NAEP's was not; student performance was affected by the order of the items in their test booklets; and motivation or fatigue affected performance for some students.


Linking Statewide Tests to NAEP (Ercikan, 1997)

To examine the accuracy of linking statewide test results to NAEP by comparing the results of four states' assessment programs with the NAEP results for those states.

Compared each state's assessment data to its NAEP data using equipercentile comparisons of score distributions. Since none of the four states used exactly the same form of the California Achievement Test for their state testing program, state results had to be converted to a common scale. This scale was developed by the publisher of the California Achievement Test series.

The link from separate tests to NAEP varies from one state to the next. It was not possible to determine whether the state-to-state differences were due to the different test(s), the moderate content alignment, the motivation of the students, or the nature of the student population.

Linking state tests to NAEP (by matching distributions) is so imprecise that results should not be used for high-stakes purposes.

Toward World Class Standards: A Research Study Linking International and National Assessments (Pashley and Phillips, 1993)

To pilot test a method for obtaining accurate links between the International Assessment of Educational Progress (IAEP) and NAEP so that other countries can be compared with the United States, both nationally and at the state level, in terms of NAEP performance standards.

A sample of 1,609 U.S. eighth-grade students was assessed with both IAEP and NAEP instruments in 1992 to establish a link between these assessments.

Based on test results from the sample testing, the relationships between IAEP and NAEP proficiency estimates were investigated.

Projection methodology was used to estimate the percentages of students from the 20 countries, assessed with the IAEP, who could perform at or above the three performance levels established for NAEP.

Various sources of statistical error were assessed.

The methods researchers use to establish links between tests (at least partially) determine how valid the link is for drawing particular inferences about performance.

Establishing this link required a linking sample of students who took both assessments.

While it is possible to establish an accurate statistical link between the IAEP and NAEP assessments, policy makers, among others, should proceed with caution when interpreting results from such a link.

The fact that the IAEP and NAEP were fairly similar in construction and scoring made the linking easier.

The effects that unexplored sources of non-statistical error, such as motivation levels, had on the results were not determined.


Comparing the NAEP Trial State Assessment (TSA) Results with the IAEP International Results (Beaton and Gonzalez, 1993)

To determine how American students compare to foreign students in mathematics, and how well foreign students meet the NAGB mathematics standards.

At that time data were not available for examinees who took both assessments; therefore, the researchers relied on a simple distribution-matching procedure.

Rescaled scores to produce a common mean and standard deviation on the two tests.

Translated IAEP scores into NAEP scores by aligning the means and standard deviations for the two tests.

Transformed the IAEP scores for students in the IAEP samples in each participating country into equivalent NAEP scores.

Moderation procedures are sensitive to age/grade differences.

The IAEP and NAEP have many similarities but are not identical and differ in some significant ways.

Results of the linking were different for countries with high average IAEP scores.

Different methods of linking the IAEP and NAEP can produce different results, and further study is necessary to determine which method is best.

Linking to a Large-Scale Assessment: An Empirical Evaluation (Bloxom et al., 1995)

To compare the mathematics achievement of new military recruits with that of the general U.S. student population, using a link between the Armed Services Vocational Aptitude Battery (ASVAB) and NAEP. The emphasis of the study was to provide and illustrate an approach for empirically evaluating the statistical accuracy of such a linkage.

A sample of 8,239 applicants for military service was administered an operational ASVAB and an NAEP survey in 1992. These applicants were told that there were no stakes attached to the NAEP survey.

ASVAB scores were projected on the NAEP scale in mathematics to allow for comparison of the achievement of military applicants with that of the general U.S. population of 12th grade students.

Statistical checks were made by constructing the link separately for low-scoring candidates and for high-scoring candidates.

Statistically, an accurate distribution of recruit achievement can be found by projecting onto the NAEP scale.

Factors related to motivation may have led to an underestimate of the assessment-based proficiency distribution of recruits in this study, meaning that in spite of the statistical precision of the linkage, the resulting estimates may not be valid for practical purposes.


The Potential of Criterion-Referenced Tests with Projected Norms (Behuniak and Tucker, 1992)

To determine if norm-referenced scores could be provided, for the purpose of Chapter 1 program evaluation, by linking the Connecticut Mastery Test, a criterion-referenced test closely aligned with state curriculum, and a national “off-the-shelf” norm-referenced achievement test. The purpose of the linking was to meet Federal guidelines for Chapter 1 reporting without requiring students to take two tests.

Compared two tests, the Metropolitan Achievement Test 6 (MAT 6) and the Stanford Achievement Test 7 (SAT 7) to determine which was more closely aligned with state content standards. Selected the MAT 6 for the study.

For a relevant population, calibrated the items from the two instruments in a given subject in a single IRT calibration and then used the results to calibrate the tests.

Linked results using equipercentile equating.

Examined changes over two years to check the stability of the link.

There were enough content differences between the two norm-referenced tests and the Connecticut Mastery Test to decide that one test would make a better, if not perfect, candidate for linking to the state test than the other.

It was possible to develop a link between the MAT 6 and the Connecticut Mastery Test that accurately predicted Normal Curve Equivalent scores for the MAT 6 from the CMT, but no good validity checks were used.

The linking function changed somewhat over time, and the authors believed that this divergence would continue because teachers were gearing instruction to state standards, which were more closely aligned with the Connecticut test than with the Metropolitan. Thus, the linking would have to be reestablished regularly to remain valid for the purposes it was intended to serve.


Linking Statewide Tests to the National Assessment of Educational Progress: Stability of Results (Linn and Kiplinger, 1995)

To investigate the adequacy of linking statewide standardized test results to the National Assessment of Educational Progress (NAEP) to allow for accurate comparisons between state academic performance and the national performance levels measured by NAEP.

Obtained two years of results (1990 and 1992) from four states' testing programs and corresponding results from the NAEP-TSA for the same two years. (Standardized tests used in the four states were different.)

Used equipercentile equating procedures to compare data from state tests and NAEP.

The standardized test results were converted to the NAEP scale using the 1990 data and resulting conversion tables were then applied to the 1992 data.

Examined content match between standardized tests and NAEP and re-analyzed data using subsections of the standardized tests and NAEP.

The link could estimate average state performance on NAEP, but was not accurate for scores at the top or bottom of the scale.

The equating function diverged for males and females, meaning that NAEP scores for a state would have been overpredicted if the equating function for males was used rather than the equating function for females.

Linking standardized tests to NAEP using equipercentile equating procedures is not sufficiently trustworthy to use for other than rough approximations.

Designing tests in accordance with a common framework might make linking more feasible.


Using Performance Standards to Link Statewide Achievement Results to NAEP (Waltman, 1997)

To investigate the comparability of performance standards obtained by using both statistical and social moderation to link NAEP standards to the ITBS.

Compared 1992 NAEP-TSA with ITBS for Iowa 4th grade public school students.

Used two different types of linking for separate facets of the study.

A socially moderated linkage was obtained by setting standards independently on the ITBS using the same achievement-level descriptions used to set the NAEP achievement levels.

An equipercentile procedure was used to establish a statistically moderated link.

For students who took both assessments, the corresponding achievement regions on the NAEP and ITBS scales produced low to moderate percentages of agreement in student classification. Agreement was particularly low for students at the advanced level; two-thirds or more were classified differently.

Cut-scores on the ITBS scale, established by moderation, were lower than those used by NAEP, resulting in more students being classified as basic, proficient, or advanced on the ITBS than estimated by NAEP, possibly due to content and skills-standards mismatch between the ITBS and NAEP.

The equipercentile linkage was reasonably invariant across types of communities, in terms of percentages of students classified at each level.

Regardless of the method used to establish the ITBS cut-scores or the criteria used to classify students, the inconsistency of student-level match limits even many inferences about group performances.


Study of the Linkages of 1996 NAEP and State Mathematics Assessments in Four States (McLaughlin, 1998)

To address the need for clear, rigorous standards for linkage; to provide the foundation for developing practical guidelines for states to use in linking state assessments to NAEP; and to demonstrate that it is important for educational policy makers to be aware that linkages that support one use may not be valid for another.

A sample of four states that had participated in the 1996 State NAEP mathematics assessment and whose state assessment mathematics tests could potentially be linked to NAEP at the individual student level participated in this study. Participating states used different assessments in their state testing programs.

There were eight linkage samples, ranging in size from 1,852 to 2,444 students.

The study matched students who participated in the NAEP assessment in their states with their scores on the state assessment instrument, using projection with multilevel regression.

Links were not sufficiently accurate to permit reporting individual student proficiency on NAEP based on the state assessment score.

Links differed noticeably by minority status and school district, in all four states. Students with the same state assessment score would be projected to have different standings on the NAEP proficiency scale, depending on their minority status and school district.

The Maryland School Performance Program: Performance Assessment with Psychometric Quality Suitable for High Stakes Usage (Yen and Ferrara, 1997)

To compare the Maryland School Performance Assessment Program (MSPAP) with the Comprehensive Test of Basic Skills (CTBS) in order to establish the validity of the state test in reference to national norms.

Compared results from a group of 5th grade students who took both the MSPAP and the CTBS—correlations were obtained.

The intent was to establish the validity of the MSPAP, so a link was not obtained.

Intercorrelations of the two tests indicated that the two measures were assessing somewhat different aspects of achievement.


A TIMSS-NAEP Link (Johnson, 1998)

To provide useful information about the performance of states relative to other countries. The study broadly compares state eighth-grade mathematics and science performance for each of the 44 states and jurisdictions participating in NAEP with the 41 nations that participated in TIMSS.

The study provides predicted TIMSS results for 44 states and jurisdictions, based on their actual NAEP results.

Statistical moderation, based on formal linear equating procedures, was used to establish the link between NAEP and TIMSS.

The link was established using reported results from the 1995 administration of TIMSS in the U.S. and the 1996 NAEP and matching characteristics of the score distributions for the two assessments.

Validated the linking functions using data provided by states that participated in both state-level NAEP and state-level TIMSS but were not included in the development of the original linking function.

Although not all of the findings have yet been released, apparently some links were satisfactory and others were not.
