Findings

COMPARABILITY: CONTENT, FORMAT, AND RELATED FEATURES

The content of a test is shaped by the kinds of knowledge and skills addressed in its questions (“items”). The committee's review indicates that content is not generally comparable among various state assessments and commercial tests, even when they test the same subjects. Middle-school mathematics, for instance, covers several areas of knowledge, such as arithmetic, algebra, and geometry: the content of one state's 8th-grade mathematics test might focus largely on multiplication, division, and other number-operations skills, while another test may stress pattern recognition and other pre-algebra skills (Bond and Jaeger, 1993). In reading, one 4th-grade test may emphasize vocabulary and basic comprehension, while another may give greater weight to critical evaluation of an author's themes (Afflerbach et al., 1995).

A related content issue concerns the skills and cognitive processes required to answer items. Off-the-shelf commercial tests and tests that are custom developed for states are increasingly constructed as mixed-model assessments that contain different types of items, including multiple-choice items and various kinds of open-ended questions for which students construct their own responses by filling in a blank, solving a problem, writing a short answer, writing a longer response, or completing a graph or diagram (see, e.g., Shavelson, Baxter, and Pine, 1992); Colorado, Connecticut, North Carolina, and Maryland are examples of states with mixed-model assessments. Some item types are very useful for testing student recall of factual material (a claim often made for certain types of multiple-choice items); other item types are better suited to eliciting direct evidence of how well a student can solve problems.

The effect of format differences on linkages can be substantial. For example, the 1991 NAEP trial state assessment in mathematics contained both multiple-choice and short-answer formats. Linn, Shepard, and Hartka (1992) found that when the two formats were scored separately, the difference between the scores was large enough to change the rank order of the states in the mathematics assessment. For items with constructed responses (that is, not multiple choice), variations in scoring may also influence the validity of linkages because different scoring guides may credit different aspects of performance,



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



even when the items appear similar (Linn, 1993). Issues such as how the scorers are trained and which scoring guidelines they use can affect the objectivity and consistency of scoring (Frederiksen and Collins, 1989).

BOX 3. Format Differences in Maryland

The following description from Yen (1996:20) exemplifies the wide differences between two assessments of the same content area, administered to the same students in Maryland: the Maryland School Performance Assessment Program (MSPAP), a performance assessment used in high-stakes school evaluations, and the Comprehensive Test of Basic Skills, Fourth Edition (CTBS/4), published by CTB/Macmillan/McGraw-Hill in 1989.

MSPAP and CTBS/4 differ in many ways. MSPAP is entirely performance based, and each student is given a limited number of reading selections or scenarios that require in-depth constructed and extended responses. In contrast, CTBS/4 samples a broader range of traditional objectives with a selected-response format and is a more indirect measure of student classroom performance. MSPAP is intended to “guide and goad” classroom instruction, while CTBS/4 is not intended as a model of instruction. MSPAP, which is targeted at raising student performance, contains many challenging items; CTBS/4 contains items that measure the full range of student performance. Each year three new forms of MSPAP are administered, with random assignment of forms to students; the same form of CTBS/4 is administered to all students every year. MSPAP results are used as part of a high-stakes program of evaluating schools; CTBS/4 results are part of the public reporting of school performance but are not included in the Maryland School Performance Index, which is used in school evaluations. Each school strikes its own balance between focusing on the material assessed by MSPAP and the material assessed by CTBS/4.
Some states, including Vermont and New Mexico, are trying out new assessment formats, such as systematically evaluating collections (“portfolios”) of a student's work, that raise even more complex issues about comparability and scoring (Valencia and Au, 1997; Webb, 1995); see Box 3 for a discussion of format issues.

In short, content, format, and related issues are vitally important in linking, and existing commercially developed achievement tests and state assessments differ substantially among themselves and from NAEP on these dimensions. The committee finds that the lack of strong comparability in these areas prevents the development of reliable and valid linkages. In addition, the committee finds that, in the cases that are germane to our concerns here, statistical linkages between tests with substantial differences in content and degrees of difficulty will not be accurate in the sense that they will not be consistent across subpopulations. This lack of consistency, a problem to which we return below, is directly due to the differences in content and test difficulty.

DIVERSITY AND MULTIPLICITY OF TESTING PROGRAMS

Educational testing in the United States is diverse, reflecting the nation's history of state and local control over education policy. The number and variety of existing state and commercial tests pose formidable barriers to developing a single linking scale. State and commercial tests vary not only in content and format, but also in their target ages or grades, sampling techniques, policies for testing students with disabilities or with limited English proficiency, alignment with state and local curricula, score-reporting procedures, and other factors (Bond et al., 1997). Although commercially developed subject-matter achievement tests, especially the most widely used tests in U.S. schools, appear on the surface to be more similar than many existing state assessments, they, too, have significant differences that reflect the publishers' efforts to capture specialized

markets and meet state and local demands for tests with particular features (Yen, 1998). The substantial variation among commercially produced tests challenges the notion of selecting tests “off-the-shelf” and linking them; Box 4 describes a recent attempt to link existing off-the-shelf tests and the difficulties that were encountered.

BOX 4. California Comparability

In 1996 California lawmakers determined that they wanted achievement information for monitoring school effectiveness, but, in the interest of respecting local control of educational issues, they did not want to mandate that all school districts use the same test. They therefore passed a bill encouraging school districts to choose achievement tests from a reviewed and approved list and mandated that the California Department of Education develop a comparability scale that would allow lawmakers to accurately compare results from different assessments (see Haertel, 1996; Wilson, 1996; Yen, 1996). Two different methodologies were explored in some depth. The first proposal suggested the development of a short list of acceptable commercial tests, any of which could be selected and administered by a local school district; these few tests would be linked in a manner similar to the Anchor Test Study (Loret et al., 1972). The second proposal was to develop a core reference test that comprehensively reflected the California curriculum and to use that reference test as an anchor to which all other tests could be linked. In the end, the project was deemed too complex because more than 40 tests were submitted for comparison, and it was determined that it was too difficult to develop satisfactory links that would be stable over time. California abandoned the linkage proposal and instead selected a single test to be administered in all of the state's schools.
The complexity of linking even a small subset of existing tests could quickly render the task infeasible. For example, if the goal were to link just 15 different state assessments, it would be necessary to construct comparisons of more than 100 potential pairs of tests; each pair would require data collection, statistical analyses, and empirical validation (see also Los Angeles County Office of Education, 1997). An additional complicating factor is that the relationships between some of the tests may themselves change; frequent changes could necessitate continual updates to the development and validation of the equivalency scale (Loret et al., 1972; Linn, 1975; Wilson, 1996).

One might argue that pairwise comparisons are not necessary if all tests can be linked to a common scale, such as NAEP. Linking to NAEP simplifies the task in one respect by reducing the number of linkages that would have to be constructed. However, the design and purpose of NAEP complicate the task of linkage in another respect and cast doubt on the validity of inferences that could be drawn from the link (see, e.g., McLaughlin, 1998). This is true because NAEP, by intent, does not produce scores for individuals and because individual students complete different parts of an entire NAEP assessment.

STABILITY OF RESULTS

The testing landscape in the United States is not only diverse but dynamic: states and districts have moved rapidly, especially during the last 10 years, to adopt new educational goals, new models of testing and assessment, and new strategies for aligning tests and assessments to state content standards (National Research Council, 1997). Moreover, although there is some similarity and stability among the largest commercial testing programs, states that use commercial programs use them in very different ways. Many states have changed the design of their statewide assessments several times in the last decade and are continuing to do so. For example, some states are developing hybrids of commercial and state-developed tests or customizing available off-the-shelf tests. Other states do not use commercial tests as part of their statewide assessment system (Roeber, Bond, and Braskamp, 1997). The diversity of the testing programs currently in the nation's schools is depicted in Table 1. (The committee realizes that information such as that tabulated here changes frequently and may be summarized differently in different surveys; the main point is that the states' testing programs are extremely diverse in content, difficulty, and format (Jaeger, 1996).)

This continual change in educational goals and in the content of tests and assessments, which many people believe reflects a healthy dynamism in American education, makes linkage a moving target. Prior research has consistently shown that even if linkages between tests can be made at one time, they are difficult to maintain (Linn and Kiplinger, 1995). For example, suppose a link could be generated between a test in state A and another test in state B. Conducting the necessary analyses to establish the link takes time, and it is quite possible that once the linkage methods are ready to be applied, one or both states will have changed their test format, content, or target group.

While NAEP does not change as frequently or as dramatically as state and commercial assessments, it, too, is not static. The content and nature of the NAEP instruments evolve gradually to reflect changing educational and assessment practices (National Research Council, 1996). These modifications in NAEP make it complicated to maintain stable linkages with state and commercial assessments, which are themselves evolving, and would diminish the validity of inferences from the linkages.
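The scale of the pairwise task is simple combinatorics: linking n tests pairwise requires n(n-1)/2 separate linking studies, each with its own data collection, analysis, and validation. A minimal sketch (the function name is ours; the inputs match the committee's example of 15 state assessments and the more than 40 tests submitted in California):

```python
from math import comb

def pairwise_linkings(n_tests: int) -> int:
    """Number of distinct test pairs, each of which would need its own
    data collection, statistical analysis, and empirical validation."""
    return comb(n_tests, 2)  # n(n-1)/2

# 15 state assessments -> 105 pairwise linking studies
print(pairwise_linkings(15))   # 105
# The 40-plus tests submitted in California -> 780 pairs
print(pairwise_linkings(40))   # 780
```

By contrast, linking every test to a single common anchor requires only n links, which is why a common scale such as NAEP is attractive despite the validity problems noted above.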
TABLE 1 State Testing: A Snapshot of Diversity

State | Use of Commercial Tests | Use of Other Assessments
Alabama | Stanford Achievement Test 9; Otis-Lennon School Ability Test | Alabama Kindergarten Assessment; Alabama Direct Assessment of Writing; Differential Aptitude Test; Basic Competency Test; Career Interest Inventory; End-of-Course Algebra and Geometry Test; Alabama High School Basic Skills Exit Exam
Alaska | California Achievement Test 5 |
Arizona | Stanford Achievement Test 9 |
Arkansas | Stanford Achievement Test 9 | High School Proficiency Test
California | Stanford Achievement Test 9 | Golden State Examinations
Colorado | Custom developed | CTB item banks, NAEP items, and state items
Connecticut | Custom developed | Connecticut Mastery Test; Connecticut Academic Performance Test
Delaware | Custom developed | State-developed writing assessment
Florida | Custom developed | High School Competency Test; Florida Writing Assessment Program
Georgia | Iowa Test of Basic Skills; Tests of Achievement and Proficiency | Curriculum-based Assessments; Georgia High School Graduation Tests; Georgia Kindergarten Assessment Program; Writing Assessment
Hawaii | Stanford Achievement Test 8 | Hawaii State Test of Essential Competencies; Credit by Examination
Idaho | Iowa Test of Basic Skills, Form K; Tests of Achievement and Proficiency | Direct Writing Assessment; Direct Mathematics Assessment
Illinois | Custom developed | Illinois Goals Assessment Program
Indiana | Custom developed | Indiana Statewide Testing for Educational Progress Plus
Iowa | No mandated statewide testing program; approximately 99 percent of all districts participate in the Iowa Test of Basic Skills on a voluntary basis |
Kansas | Custom developed | Kansas Assessment Program (Kansas University Center for Educational Testing and Evaluation)
Kentucky | Custom developed | Kentucky Instructional Results Information System
Louisiana | California Achievement Test 5 | Louisiana Educational Assessment Program
Maine | Custom developed | Maine Educational Assessment (Advanced Systems in Measurement, Inc.)
Maryland | Custom developed; Comprehensive Test of Basic Skills 5 | Maryland School Performance Assessment Program; Maryland Functional Tests; Maryland Writing Test
Massachusetts | Iowa Test of Basic Skills; Iowa Tests of Educational Development |
Michigan | Custom developed | Michigan Educational Assessment Program (criterion-referenced tests of 4th-, 7th-, and 11th-grade students in mathematics and reading and of 5th-, 8th-, and 11th-grade students in science and writing); Michigan High School Proficiency Test
Minnesota | Custom developed | In 1996-97 students took minimum competency literacy tests in reading and mathematics
Mississippi | Iowa Test of Basic Skills; Tests of Achievement and Proficiency | Functional Literacy Examination; Subject Area Testing Program
Missouri | Custom developed; TerraNova | Missouri Mastery and Achievement Test
Montana | Stanford Achievement Test; Iowa Test of Basic Skills; Comprehensive Test of Basic Skills |
Nebraska | No statewide assessment program in 1996-97 |
Nevada | TerraNova | Grade 8 Writing Proficiency Exam; Grade 11 Proficiency Exam
New Hampshire | Custom developed | New Hampshire Education Improvement and Assessment Program (Advanced Systems in Measurement and Evaluation, Inc.)
New Jersey | Custom developed | Grade 11 High School Proficiency Test; Grade 8 Early Warning Test
New Mexico | Iowa Test of Basic Skills, Form K | New Mexico High School Competency Exam; Portfolio Writing Assessment; Reading Assessment for Grades 1 and 2
New York | Custom developed | Occupational Education Proficiency Examinations; Preliminary Competency Tests; Program Evaluation Tests; Pupil Evaluation Program Tests; Regents Competency Tests; Regents Examination Program; Second Language Proficiency Examinations
North Carolina | Iowa Test of Basic Skills | North Carolina End of Grade
North Dakota | Comprehensive Test of Basic Skills/4; TCS |
Ohio | Custom developed | Fourth-, Sixth-, Ninth-, and Twelfth-Grade Proficiency Tests
Oklahoma | Iowa Test of Basic Skills | Oklahoma Core Curriculum Tests
Oregon | Custom developed | Reading, Writing, and Mathematics Assessment
Pennsylvania | Custom developed | Writing, Reading, and Mathematics Assessment
Rhode Island | Metropolitan Achievement Test 7; Custom developed | Health Performance Assessment; Mathematics Performance Assessment; Writing Performance Assessment
South Carolina | Metropolitan Achievement Test 7; Custom developed | Basic Skills Assessment Program
South Dakota | Stanford Achievement Test 9; Metropolitan Achievement Test 7 |
Tennessee | Custom developed | Tennessee Comprehensive Assessment Program (TCAP) Achievement Test, Grades 2-8; TCAP Competency Graduation Test; TCAP Writing Assessment, Grades 4, 8, and 11
Texas | Custom developed | Texas Assessment of Academic Skills; Texas End-of-Course Test
Utah | Stanford Achievement Test 9; Custom developed | Core Curriculum Assessment Program
Vermont | Voluntary state assessment program | New Standards reference exams in math; Portfolio assessment in math and writing
Virginia | Customized off-the-shelf | Literacy Passport Test; Degrees of Reading Power; Standards of Learning Assessments; Virginia State Assessment Program
Washington | Comprehensive Test of Basic Skills 4 | Curriculum Frameworks Assessment System
West Virginia | Comprehensive Test of Basic Skills | Writing Assessment; Metropolitan Readiness Test
Wisconsin | TerraNova; Custom developed | Knowledge and Concepts Tests; Wisconsin Reading Comprehension Test at Grade 3
Wyoming | State assessment program in vocational education only, for students in grades 9-12 |

NOTES: Custom developed assessments result from a joint venture between a state and a commercial test publisher to design a test to the state's specifications, perhaps to match the state's curriculum more closely than an off-the-shelf test does. Customized off-the-shelf assessments result from modifications to a commercial test publisher's existing product.
SOURCE: Data from the 1997 Council of Chief State School Officers Fall State Student Assessment Program Survey.

TEST USES AND EFFECTS ON TEACHER AND STUDENT BEHAVIOR

Many states use assessments for multiple purposes related to educational improvement, such as program evaluation, curriculum planning, school performance reporting, and student diagnosis (U.S. Congress, 1992). More and more states are using (or are contemplating using) their assessment programs to make “high-stakes” decisions about people and programs, such as promoting students to the next grade, determining whether students will graduate from high school, grouping students for instructional purposes, making decisions about teacher tenure or bonuses, allocating resources to schools, or imposing sanctions on schools and districts (see, e.g., McLaughlin et al., 1995; McDonnell, 1997). Table 2 shows many of the varied uses of tests in our nation's schools today. (A companion report on appropriate test use will be issued by the National Research Council's Committee on the Fair and Appropriate Use of Educational Tests later this year.)

An important factor in testing goes under the heading of “stakes.” When students are tested, various parties can have different concerns with, or stakes in, the outcomes. For example, a national survey of achievement, like NAEP, is a very low-stakes test for many of the parties concerned: the test takers, their parents, their teachers, and the district administrators. If NAEP is high stakes for anyone, it is for policy makers who want to use NAEP data to assess the effectiveness of various educational reforms as they vary across the states or regions of the country. In contrast, tests that are used for high school graduation or college admission are high stakes for the students taking them and for their parents. Tests that are high stakes for the test takers affect their motivation to do their best. Tests that are high stakes for district administrators can prompt various kinds of efforts to help students perform better than they would have if the same test carried low stakes for those administrators. These examples do not exhaust the possible effects of “stakes” on test results. When tests carrying different stakes for different parties are linked, one expects different linking functions to result than would be found if the stakes were similar.

Although forms of test-based educational accountability vary across states and districts, changes in how tests are used inevitably lead to changes in how teachers and students react to them (Koretz, 1998). Indeed, one of the underlying rationales for test-based accountability is to spur changes in teaching and learning. These uses are hotly debated and beyond the scope of this report (Jones, 1997). For our purposes, it is sufficient to note that the difficulty of maintaining linkages between tests is exacerbated when test results have significant consequences for individuals or schools. In these situations, teachers may change what and how they teach to help students respond to the content and problems on the test (Shepard and Dougherty, 1991), schools and districts may align curriculum more closely with test content, and test takers may have stronger motivation to do well (e.g., Koretz et al., 1991). Performance gains on tests used for accountability (high-stakes tests) will often not be reflected in scores on tests used for monitoring or other non-accountability (low-stakes) purposes. The resulting differences in student performance could alter the relationship between linked tests (Shepard et al., 1996; Yen, 1996).
Hence, any valid linkages created initially would have to be reestablished regularly, which would raise important questions about any hoped-for cost-effective advantages of linkage. The effects of test use on student and teacher behavior pose a special problem for linkage with NAEP. To protect its historical purpose as a monitor of educational progress, NAEP was designed expressly with safeguards to prevent it from becoming a high-stakes test. As a result, the motivation level of students who participate in NAEP may be low (O'Neil et al., 1992; Kiplinger and Linn, 1996), and they may not always exhibit peak performance. Linkages between a low-stakes instrument like NAEP and high-stakes state or commercial tests are likely to be misleading because students are likely to put forth more effort for the latter kinds of tests than for the former. POPULATION OR SUBGROUP DIFFERENCES When the function that links Test A with Test B differs for different groups, for example, boys and girls, it does not indicate that one group is “better” than the other. Rather, it means that a boy and a girl with the same score on Test A would be expected to have different scores on Test B, and that this

OCR for page 13
Equivalency and Linkage of Educational Tests: INTERIM REPORT TABLE 2 Student Testing: Diversity of Purpose State Decisions for Students Decisions for Schools Instructional Purposes Alabama High school graduation School performance reporting Student diagnosis or placement; improve instruction; program evaluation Alaska School performance reporting Improve instruction Arizona School performance reporting Student diagnosis or placement; improve instruction; program evaluation Arkansas School performance reporting Student diagnosis or placement; improve instruction; program evaluation California Student diagnosis or placement Student diagnosis or placement Colorado a Connecticut Student diagnosis or placement Awards or recognition; school performance reporting Student diagnosis or placement; improve instruction; program evaluation Delaware Student diagnosis or placement; improve instruction; program evaluation Florida High school graduation Improve instruction; program evaluation Georgia High school graduation School performance reporting Student diagnosis or placement; improve instruction; program evaluation Hawaii High school graduation Awards or recognition; school performance reporting Student diagnosis or placement; improve instruction; program evaluation Idaho School performance reporting Improve instruction Iowa a Illinois Accreditation Indiana Awards or recognition; school performance reporting Student diagnosis or placement; improve instruction; program evaluation Kansas School performance; reporting; accreditation Student diagnosis or placement; improve instruction; program evaluation Kentucky Awards or recognition Improve instruction; program evaluation Louisiana Student promotion; high school graduation Awards or recognition; school performance reporting Student diagnosis or placement; improve instruction; program evaluation Maine Student diagnosis or placement Improve instruction; program evaluation

OCR for page 13
Equivalency and Linkage of Educational Tests: INTERIM REPORT Maryland High school graduation School performance reporting; skills guarantee; accreditation Student diagnosis or placement; improve instruction; program evaluation Massachusetts School performance reporting Improve instruction Michigan Student diagnosis or placement; endorsed diploma Awards or recognition; school performance reporting; accreditation Improve instruction; program evaluation Minnesota a Mississippi High school graduation School performance reporting; skills guarantee; accreditation Student diagnosis or placement; improve instruction; program evaluation Missouri School performance reporting; accreditation Improve instruction; program evaluation Montana Improve instruction; program evaluation Nebraska a Nevada High school graduation School performance reporting; accreditation Improve instruction; program evaluation New Hampshire Improve instruction; program evaluation New Jersey High school graduation School performance reporting; accreditation Student diagnosis or placement; improve instruction New Mexico High school graduation School performance reporting; accreditation Student diagnosis or placement; improve instruction; program evaluation New York Student diagnosis or placement; student promotion; honors diploma; endorsed diploma; high school graduation School performance reporting Improve instruction; program evaluation North Carolina Student diagnosis or placement; student Promotion; high school graduation Improve instruction; program evaluation North Dakota Student diagnosis or placement Student diagnosis or placement; improve instruction; program evaluation Ohio High school graduation Awards or recognition; school performance reporting Improve instruction; program evaluation Oklahoma School performance reporting; accreditation Student diagnosis or placement; improve instruction; program evaluation

OCR for page 13
Equivalency and Linkage of Educational Tests: INTERIM REPORT Oregon School performance reporting Improve instruction; program evaluation Pennsylvania School performance reporting Student diagnosis or placement; program evaluation Rhode Island School performance reporting Improve instruction; program evaluation South Carolina Student promotion; high school graduation Awards or recognition; school performance reporting; skills guarantee Student diagnosis or placement; improve instruction; program evaluation South Dakota Improve instruction; program evaluation Tennessee Endorsed diploma; high school graduation Student diagnosis or placement; improve instruction; program evaluation Texas Student diagnosis or placement; high school graduation Student diagnosis or placement; improve instruction; program evaluation Utah Student diagnosis or placement School performance reporting Student diagnosis or placement; improve instruction; program evaluation Vermont School performance reporting Student diagnosis or placement; improve instruction; program evaluation Virginia Student diagnosis or placement; student promotion; high school graduation School performance reporting Student diagnosis or placement; improve instruction; program evaluation Washington School performance reporting Student diagnosis or placement; improve instruction; program evaluation West Virginia Skills guarantee; Accreditation Improve instruction Wisconsin School performance reporting Program evaluation Wyoming Improve instruction; program evaluation a  Colorado, Minnesota, and Nebraska did not administer any statewide assessments in 1995-96. Iowa does not administer a statewide assessment. SOURCE: Data from 1996 Council of Chief State School Officers FallState Student Assessment Program Survey.

effect is consistent for members of the two groups. Researchers generally suppose that group differences occur because of differing test content or format, different levels of motivation, or differences in prior exposure to relevant learning opportunities. Perhaps the material in Test A is more familiar to one group than to the other, while the material in Test B is equally familiar to both groups. Alternatively, one group might be more motivated than the other to perform well on Test A, while both groups are equally motivated on Test B. Simply put, the relative differences among the test performances of different groups of students often vary from test to test, depending on a host of subtle but important factors.

For example, on mathematics tests boys may do better on word problems while girls may do better at solving equations. When this is true, overall estimates of gender differences in 8th grade mathematics performance will depend on the relative emphasis a test gives to these two areas. Unless the two tests are very closely aligned in content, linking them might require separate formulas for boys and girls, because a single linking formula would underestimate performance for one group and overestimate it for the other. Similarly, student achievement on two tests with differing emphases on algebra could vary widely across states as a function of when and to what extent algebra is introduced into the middle school curriculum. As a result, students from different states who obtain the same score on one test (e.g., a commercial test) might have different estimated (linked) scores on a second test, such as NAEP (e.g., McLaughlin, 1998). These problems are attributable in part to the tests themselves, but linkage magnifies them and increases the risk of unfair inferences about individual achievement.
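The danger of a single linking formula can be illustrated with a small numerical sketch. All scores below are invented: the code fits one least-squares linking line to pooled data from two hypothetical groups whose Test A-to-Test B relationships differ, and shows that the pooled function misstates each group's actual standing.

```python
# Hypothetical illustration of why one linking formula can misestimate
# subgroup performance. All scores are invented; Test A and Test B stand
# in for any two tests whose content emphases differ.

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Group 1: Test A's emphasis favors this group, so a given Test A score
# corresponds to a lower Test B score than it does for Group 2.
a1, b1 = [40, 50, 60, 70], [35, 44, 53, 62]
a2, b2 = [40, 50, 60, 70], [45, 54, 63, 72]

# A single linking line fit to the pooled data splits the difference.
pooled_slope, pooled_int = fit_line(a1 + a2, b1 + b2)

def link(a):
    """Pooled Test A -> Test B linking function."""
    return pooled_slope * a + pooled_int

# For a Test A score of 60, the pooled link overestimates Group 1's
# expected Test B score and underestimates Group 2's, each by five points.
print(round(link(60), 1), b1[2], b2[2])
```

With these invented numbers, both students scoring 60 on Test A receive the same linked score, even though the data show their expected Test B scores differ by ten points; this is the committee's point about group-dependent linking functions in miniature.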
REPORTING RESULTS IN TERMS OF NAEP ACHIEVEMENT LEVELS

Linking other tests to NAEP raises the possibility of reporting individual student scores on state and commercial tests in terms of the NAEP achievement levels. The committee explored this issue and finds that such links would raise new and significant methodological problems (see Wu, Royal, and McLaughlin, 1997). To understand them, one must recognize that all test scores carry with them some amount of uncertainty or “noise,” an issue usually treated in the testing literature under the heading “reliability” (see, e.g., Feldt and Brennan, 1989; American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1985). Because scores are less than 100 percent reliable, it is never possible to assign students to achievement levels with complete certainty (Johnson and Mazzeo, 1998). The two key issues are the likelihood that students will be misclassified and the degree of error in the classification. Clearly, the more reliable the test, the less ambiguity there will be in the assignment of students to categories of performance. Unfortunately, the empirical evidence that the committee has reviewed to date suggests that transforming performances on selected existing assessments to the NAEP achievement levels produces results with substantial practical ambiguity. For example, consider a 4th grade student with a reasonably good score on a state or commercial reading test. Transforming this child's score into the NAEP achievement levels could easily produce the following type of report: “Sally scored [x] on the [State Reading Assessment].
Of 100 students with the same score, 10 are likely to be in the ‘below basic' category; 60 are likely to be ‘basic;' 28 are likely to be ‘proficient;' and 2 are likely to be in the highest, or ‘advanced,' category.” Alternatively, the report could be issued in terms of Sally's probabilities of falling in the various categories (Johnson and Mazzeo, 1998). This ambiguity will be due to measurement error in the student's score on the state assessment; to measurement error in NAEP; to the less than perfect correlation between the state

assessment scores and NAEP scores; to potential differences in linking functions across subgroups; and to other unidentified sources of measurement error. The committee has not been able to conduct a thorough study of parental and public reaction to this kind of scenario, but we caution that one of the more important putative purposes of linkage—providing clear and relevant information about the performance of individual students—might be severely undermined by the need to report information that, in order to be faithful to the underlying statistics, must be ambiguous in its meaning.

Table 3 presents a summary of some major studies of linkage between different tests and assessments. Some of the studies describe methods for linking to NAEP and the implications of such a link.
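The kind of probabilistic achievement-level report illustrated by Sally's example can be produced mechanically once cut scores and a standard error for the linked score are in hand. The sketch below is illustrative only: the cut scores, the linked score of 230, and its standard error of 15 are all invented, and a real report would estimate these quantities from the reliabilities of both tests and the strength of the link.

```python
# Sketch of how a probabilistic achievement-level report like Sally's could
# be produced. The cut scores, the linked score of 230, and the standard
# error of 15 are all invented for illustration.
from statistics import NormalDist

CUTS = {"basic": 208, "proficient": 238, "advanced": 268}  # hypothetical cuts

def level_probabilities(linked_score, std_error):
    """Probability that the student's NAEP standing falls in each level,
    treating the uncertainty around the linked score as roughly normal."""
    dist = NormalDist(mu=linked_score, sigma=std_error)
    at_least = {level: 1 - dist.cdf(cut) for level, cut in CUTS.items()}
    return {
        "below basic": 1 - at_least["basic"],
        "basic": at_least["basic"] - at_least["proficient"],
        "proficient": at_least["proficient"] - at_least["advanced"],
        "advanced": at_least["advanced"],
    }

probs = level_probabilities(linked_score=230, std_error=15)
for level, p in probs.items():
    print(f"{level}: {p:.0%}")
```

With these assumed numbers the output has the same shape as Sally's report: most of the probability mass falls on "basic," but non-trivial probability spills into the adjacent levels, which is exactly the ambiguity the committee describes.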

TABLE 3 Abridged Summaries of Prior Linkage Research

Study: The Anchor Test Study (Loret et al., 1972, 1973)
Purpose: To develop an equivalency scale to compare reading test results for Title I program evaluation. The study was sponsored by a $1,000,000 contract with the U.S. Office of Education.
Methodology: 200,000 students participated in the norming phase; 21 sample groups of approximately 5,000 students each participated in the equating phase. Eight tests, representing almost 90% of the reading tests being administered in the states at that time, were selected for the study, and each participant took two of them. Different combinations of standardized reading tests were administered to different subjects, taking into account the need to balance demographic factors and instructional differences. The study created new national norms for one test and, through equating, for all eight tests.
Key Findings: Tests with similar content can be linked with reasonable accuracy. Relationships between tests were reasonably similar for male and female students but not for racial groups. The equivalency scale was accurate for individuals, but aggregated results (e.g., for a school or district) would have increased error stemming from combining results. Every time a new test is introduced, the procedure has to be replicated for that test, and the stability of the linkage has to be reestablished regularly, because instruction geared to one test but not the others can invalidate the linkage.

Study: Projecting to the NAEP Scale: Results from the North Carolina End-of-Grade Testing Program (Williams et al., 1995)
Purpose: To link a comprehensive state achievement test to the NAEP scale for mathematics, so that the more frequently administered state tests could be used to monitor the progress of North Carolina students with respect to national achievement standards.
Methodology: A total of 2,824 students from 99 schools were tested using 78 items from a short form of the North Carolina End-of-Grade Test and two blocks of released 1992 NAEP items embedded in the test. Test booklets were spiraled so that some students took the NAEP items first and others took the North Carolina End-of-Grade items first. The final linkage to the NAEP scale used projection. Scores on the NAEP blocks were determined from student responses using NAEP parameters but not the conditioning analysis used by NAEP; regular scores from the North Carolina test were used.
Key Findings: A satisfactory linking was obtained for statewide statistics as a whole, accurate enough to predict NAEP means or quartile distributions with only modest error. The linkages had to be adjusted separately for different ethnic groups, however, demonstrating that the linking was inappropriate for predicting individual scores from the North Carolina test on the NAEP scale. Important factors in establishing a strong link included the following: content on the North Carolina test was closely aligned with the state curriculum while NAEP's was not; student performance was affected by the order of the items in the test booklets; and motivation or fatigue affected performance for some students.

Study: Linking Statewide Tests to NAEP (Ercikan, 1997)
Purpose: To examine the accuracy of linking statewide test results to NAEP by comparing the results of four states' assessment programs with the NAEP results for those states.
Methodology: Compared each state's assessment data to its NAEP data using equipercentile comparisons of score distributions. Because none of the four states used exactly the same form of the California Achievement Test for its state testing program, state results had to be converted to a common scale developed by the publisher of the California Achievement Test series.
Key Findings: The link from separate tests to NAEP varies from one state to the next. It was not possible to determine whether the state-to-state differences were due to the different tests, the moderate content alignment, the motivation of the students, or the nature of the student population. Linking state tests to NAEP by matching distributions is so imprecise that the results should not be used for high-stakes purposes.

Study: Toward World-Class Standards: A Research Study Linking International and National Assessments (Pashley and Phillips, 1993)
Purpose: To pilot test a method for obtaining accurate links between the International Assessment of Educational Progress (IAEP) and NAEP, so that other countries can be compared with the United States, both nationally and at the state level, in terms of NAEP performance standards.
Methodology: A sample of 1,609 U.S. 8th grade students was assessed with both IAEP and NAEP instruments in 1992 to establish a link between the assessments, and the relationships between IAEP and NAEP proficiency estimates were investigated. Projection methodology was used to estimate the percentages of students from the 20 countries assessed with the IAEP who could perform at or above the three performance levels established for NAEP. Various sources of statistical error were assessed.
Key Findings: The methods researchers use to establish links between tests at least partially determine how valid the link is for drawing particular inferences about performance. Establishing this link required a linking sample of students who took both assessments. While it is possible to establish an accurate statistical link between the IAEP and NAEP, policy makers, among others, should proceed with caution when interpreting results from such a link. The fact that the IAEP and NAEP were fairly similar in construction and scoring made the linking easier. The effects that unexplored sources of non-statistical error, such as motivation levels, had on the results were not determined.

Study: Comparing the NAEP Trial State Assessment (TSA) Results with the IAEP International Results (Beaton and Gonzalez, 1993)
Purpose: To determine how American students compare with foreign students in mathematics, and how well foreign students meet the NAGB mathematics standards.
Methodology: Because data were not available at that time for examinees who took both assessments, the researchers relied on a simple distribution-matching procedure. Scores were rescaled to produce a common mean and standard deviation on the two tests, IAEP scores were translated into NAEP scores by aligning the means and standard deviations, and the IAEP scores for students in each participating country's IAEP sample were transformed into equivalent NAEP scores.
Key Findings: Moderation procedures are sensitive to age/grade differences. The IAEP and NAEP have many similarities but are not identical and differ in some significant ways. Results of the linking were different for countries with high average IAEP scores. Different methods of linking the IAEP and NAEP can produce different results, and further study is necessary to determine which method is best.

Study: Linking to a Large-Scale Assessment: An Empirical Evaluation (Bloxom et al., 1995)
Purpose: To compare the mathematics achievement of new military recruits with that of the general U.S. student population, using a link between the Armed Services Vocational Aptitude Battery (ASVAB) and NAEP. The emphasis of the study was to provide and illustrate an approach for empirically evaluating the statistical accuracy of such a linkage.
Methodology: A sample of 8,239 applicants for military service was administered an operational ASVAB and an NAEP survey in 1992; the applicants were told that there were no stakes attached to the NAEP survey. ASVAB scores were projected onto the NAEP mathematics scale to allow comparison between the achievement of military applicants and that of the general U.S. population of 12th grade students. Statistical checks were made by constructing the link separately for low-scoring and high-scoring candidates.
Key Findings: Statistically, an accurate distribution of recruit achievement can be found by projecting onto the NAEP scale. Factors related to motivation, however, may have led to underestimation of the assessment-based proficiency distribution of recruits, meaning that in spite of the statistical precision of the linkage, the resulting estimates may not be valid for practical purposes.

Study: The Potential of Criterion-Referenced Tests with Projected Norms (Behuniak and Tucker, 1992)
Purpose: To determine whether norm-referenced scores could be provided, for the purpose of Chapter 1 program evaluation, by linking the Connecticut Mastery Test (CMT), a criterion-referenced test closely aligned with the state curriculum, to a national “off-the-shelf” norm-referenced achievement test. The purpose of the linking was to meet federal guidelines for Chapter 1 reporting without requiring students to take two tests.
Methodology: Compared two tests, the Metropolitan Achievement Test 6 (MAT 6) and the Stanford Achievement Test 7 (SAT 7), to determine which was more closely aligned with state content standards, and selected the MAT 6 for the study. For a relevant population, calibrated the items from the two instruments in a given subject in a single IRT calibration, then used the results to calibrate the tests. Linked results using equipercentile equating, and examined changes over two years to check the stability of the link.
Key Findings: There were enough content differences between the two norm-referenced tests and the CMT to decide that one would make a better, if not perfect, candidate for linking to the state test than the other. It was possible to develop a link that accurately predicted Normal Curve Equivalent scores for the MAT 6 from the CMT, but no good validity checks were used. The linking function changed somewhat over time, and the authors believed that this divergence would continue because teachers were gearing instruction to state standards that were more closely aligned with the Connecticut test than with the Metropolitan. Thus, the linking would have to be reestablished regularly to remain valid for its intended purposes.

Study: Linking Statewide Tests to the National Assessment of Educational Progress: Stability of Results (Linn and Kiplinger, 1995)
Purpose: To investigate the adequacy of linking statewide standardized test results to NAEP to allow for accurate comparisons between state academic performance and the national performance levels measured by NAEP.
Methodology: Obtained two years (1990 and 1992) of results from four states' testing programs and the corresponding NAEP-TSA results for the same two years. (The standardized tests used in the four states were different.) Used equipercentile equating procedures to compare data from the state tests and NAEP. The standardized test results were converted to the NAEP scale using the 1990 data, and the resulting conversion tables were then applied to the 1992 data. Examined the content match between the standardized tests and NAEP, and re-analyzed the data using subsections of the standardized tests and NAEP.
Key Findings: The link could estimate average state performance on NAEP but was not accurate for scores at the top or bottom of the scale. The equating function diverged for males and females: NAEP scores for a state would have been overpredicted if the equating function for males was used rather than the equating function for females. Linking standardized tests to NAEP using equipercentile equating procedures is not sufficiently trustworthy to use for anything other than rough approximations. Designing tests in accordance with a common framework might make linking more feasible.

Study: Using Performance Standards to Link Statewide Achievement Results to NAEP (Waltman, 1997)
Purpose: To investigate the comparability of performance standards obtained by using both statistical and social moderation to link NAEP standards to the Iowa Tests of Basic Skills (ITBS).
Methodology: Compared the 1992 NAEP-TSA with the ITBS for Iowa 4th grade public school students. Used two different types of linking for separate facets of the study: a socially moderated linkage, obtained by setting standards independently on the ITBS using the same achievement-level descriptions used to set the NAEP achievement levels, and a statistically moderated link, established with an equipercentile procedure.
Key Findings: For students who took both assessments, the corresponding achievement regions on the NAEP and ITBS scales produced low to moderate percentages of agreement in student classification. Agreement was particularly low for students at the advanced level, two-thirds or more of whom were classified differently. Cut scores on the ITBS scale established by moderation were lower than those used by NAEP, resulting in more students being classified as basic, proficient, or advanced on the ITBS than estimated by NAEP, possibly because of a mismatch in content and skills standards between the ITBS and NAEP. The equipercentile linkage was reasonably invariant across types of communities in terms of the percentages of students classified at each level. Regardless of the method used to establish the ITBS cut scores or the criteria used to classify students, the inconsistency of the student-level match limits even many inferences about group performance.

Study: Study of the Linkages of 1996 NAEP and State Mathematics Assessments in Four States (McLaughlin, 1998)
Purpose: To address the need for clear, rigorous standards for linkage; to provide the foundation for developing practical guidelines for states to use in linking state assessments to NAEP; and to demonstrate to educational policy makers that linkages that support one use may not be valid for another.
Methodology: Four states that had participated in the 1996 State NAEP mathematics assessment, and whose state mathematics tests could potentially be linked to NAEP at the individual student level, took part in the study; the participating states used different assessments in their state testing programs. There were eight linkage samples, ranging in size from 1,852 to 2,444 students. The study matched students who participated in the NAEP assessment in their states with their scores on the state assessment instrument, using projection with multilevel regression.
Key Findings: The links were not sufficiently accurate to permit reporting individual student proficiency on NAEP based on a state assessment score. The links differed noticeably by minority status and school district in all four states: students with the same state assessment score would be projected to have different standings on the NAEP proficiency scale depending on their minority status and school district.

Study: The Maryland School Performance Assessment Program: Performance Assessment with Psychometric Quality Suitable for High Stakes Usage (Yen and Ferrara, 1997)
Purpose: To compare the Maryland School Performance Assessment Program (MSPAP) with the Comprehensive Tests of Basic Skills (CTBS) in order to establish the validity of the state test in reference to national norms.
Methodology: Compared results from a group of 5th grade students who took both the MSPAP and the CTBS, and obtained correlations between the two. Because the intent was to establish the validity of the MSPAP, a link was not obtained.
Key Findings: Intercorrelations of the two tests indicated that the two measures were assessing somewhat different aspects of achievement.

Study: A TIMSS-NAEP Link (Johnson, 1998)
Purpose: To provide useful information about the performance of states relative to other countries. The study broadly compares eighth-grade mathematics and science performance for each of 44 states and jurisdictions participating in NAEP with that of the 41 nations that participated in TIMSS, and provides predicted TIMSS results for the 44 states and jurisdictions based on their actual NAEP results.
Methodology: A statistically moderated link between NAEP and TIMSS was established by applying formal linear equating procedures to the reported results from the 1995 U.S. administration of TIMSS and the 1996 NAEP, matching characteristics of the score distributions for the two assessments. The linking functions were validated using data provided by states that participated in both state-level NAEP and state-level TIMSS but were not included in the development of the original linking function.
Key Findings: Although not all of the findings have yet been released, apparently some links were satisfactory and others were not.
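Several of the studies summarized in Table 3 (Ercikan, 1997; Linn and Kiplinger, 1995; Waltman, 1997) relied on equipercentile procedures. The sketch below shows only the core idea with a tiny invented linking sample: a state-test score is mapped to the NAEP-style score occupying the same percentile rank. Operational studies use large samples, smoothed score distributions, and interpolation, none of which is attempted here.

```python
# Minimal equipercentile linking sketch: a score on Test X is mapped to the
# Test Y score at the same percentile rank in a linking sample. All scores
# below are invented for illustration.
from bisect import bisect_right

def equipercentile_link(score, x_scores, y_scores):
    """Return the y score whose percentile rank matches `score`'s rank in x."""
    xs, ys = sorted(x_scores), sorted(y_scores)
    rank = bisect_right(xs, score)              # x scores at or below `score`
    idx = min(rank * len(ys) // len(xs), len(ys) - 1)
    return ys[idx]

# Hypothetical linking sample: ten students' state-test and NAEP-style scores.
state = [31, 35, 38, 40, 42, 44, 46, 48, 51, 55]
naep = [190, 205, 214, 220, 226, 231, 237, 243, 252, 266]

# A state score of 44 sits at the 60th percentile of the state sample, so it
# maps to the NAEP-style score at the 60th percentile of the NAEP sample.
print(equipercentile_link(44, state, naep))  # -> 237
```

As the studies above found, the mapping produced this way depends entirely on the linking sample: if the rank ordering of students differs by subgroup, state, or year, the resulting conversion table differs as well, which is why equipercentile links to NAEP proved unstable for anything beyond rough group-level approximations.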